Firefox marketshare revisited

Why building a better browser doesn’t translate to a better marketshare

I posted a couple weeks ago about Chrome effectively having won the browser wars. The market share observations in the blog post were based on data provided by StatCounter. Several commenters criticized the StatCounter data as inaccurate so I decided to take a look at raw installation data Mozilla publishes to see whether it aligns with the StatCounter data.

Active Firefox Installs

Mozilla’s public data shows that the number of active Firefox Desktop installs running the most recent version of Firefox has been declining for several years. Based on this data, 22% fewer Firefox Desktop installations are active today than a year ago. This is a loss of 16 million Firefox installs in a year. The year over year decline used to be below 10% but accelerated to 14% in 2016. It returned to a more modest 10% year over year loss late 2016, which could be the result of a successful marketing campaign (Mozilla’s biggest marketing campaigns are often in the fall). That effect was temporary as the accelerating decline this year shows (Philipp suggests that the two recent drops could be the result of support for older machines and Windows versions being removed and those users continuing to use previous versions of Firefox, see comments).

Year over Year Firefox Active Daily Installs (Desktop). The Y axis is not zero-based. Click on the graph to enlarge.

Obtaining the data

Mozilla publishes aggregated Firefox usage data in form of Active Daily Installs (ADIs) here (Update: looks like the site requires a login now. It used to be available publicly for years and was public until a few days ago). The site is a bit clumsy and you can look at individual days only so I wrote some code to fetch the data for the last 3 years so its easier to analyze (link). The raw ADI data is pretty noisy as you can see here:

Desktop Firefox Daily Active Installs. The Y axis is not zero-based. Click on the graph to enlarge.

During any given week the ADI number can vary substantially. For the last week the peak was around 80 million users and the low was around 53 million users. To understand why the data is so variable it’s necessary to understand how Active Daily Installs are calculated.

Firefox tries to contact Mozilla once a day to check for security updates. This is called the “updater ping”. The ADI number is the aggregate number of these pings that were seen on a given day and can be understood as the number of running Firefox installs on that day.

The main reason that ADI data is so noisy is that work machines are switched off on the weekend. Those Firefox installs don’t check over the weekend, so the ADI number drops significantly. This also explains why ADIs don’t map 1:1 to Active Daily Users (ADUs). A user may be briefly active on a given day but then switches off the machine before Firefox had a chance to phone home. The ADI count can miss this user. Inversely, Firefox may be active on a day but the user actually wasn’t. Mozilla has a disclaimer on the site that publishes ADI data to point out that ADI data is imprecise, and from data I have seen actual Active Daily Users are about 10% higher than ADIs but this is just a ballpark estimate.

The graphs above also only look at the most recent version of Firefox. A subset of users are often stranded on older versions of Firefox. This subset tends to be relatively small since Mozilla is doing a good job these days converting as many users as possible to the most recently/most secure/most performant version of Firefox.

The first graph in this post was obtained by sliding a 90 day window over the data and comparing for each 90 day window the total number of active daily installs to the 90 day window a year prior. This helps eliminate some of the variability in the data and shows are clearer trend. In addition to weekly swings there is also a strong seasonality. College students being on break and people spending time with family over Christmas are some of the effects you can see in the raw data that the sliding window mechanism can filter to a degree.

If you want to explore the data yourself and see what effect shorter window parameters have you can use this link. If you see any mistakes or have ideas how to improve the visualization please send a pull request.

What about Mobile?

Mozilla doesn’t publish ADI data for Firefox for iOS and Firefox Focus, but since none of these appear in any browser statistics it means their market share is probably very small. ADI data is available for Firefox for Android and that graph looks quite a bit different from Desktop:

Firefox for Android Active Daily Installs. The Y axis is not zero-based. Click on the graph to enlarge.

Firefox for Android has been growing pretty consistently over the last few years. There is a big drop in early 2015 which is likely when Mozilla stopped support for very old versions of Android. The otherwise pretty consistent albeit slow growth seems to have stopped this year but it’s too early still to tell whether this trend will hold.

As you can see ADI data for mobile is not as noisy as desktop. This makes sense because people are much less likely to switch of their phones than their PCs.


A lot of commenters asked why Firefox marketshare is falling off a cliff. I think that question can be best answered with a few screenshots Mozilla engineer Chris Lord posted:

Google is aggressively using its monopoly position in Internet services such as Google Mail, Google Calendar and YouTube to advertise Chrome. Browsers are a mature product and its hard to compete in a mature market if your main competitor has access to billions of dollars worth of free marketing.

Google’s incentives here are pretty clear. The Desktop market is not growing much any more, so Google can’t acquire new users easily which threatens Google’s revenue growth. Instead, Google is going after Firefox and other browsers to grow. Chrome allows Google to lock in a user and make sure that that user heads to Google services first. No wonder Google is so aggressively converting everyone to Chrome, especially if the marketing for that is essentially free to them.

This explains why the market share decline of Firefox has accelerated so dramatically the last 12 months despite Firefox getting much better during the same time window. The Firefox engineering team at Mozilla has made amazing improvements to Firefox and is rebuilding many parts of Firefox with state of the art technology based on Mozilla’s futuristic rendering engine Servo. Firefox is today as good as Chrome in most ways, and better in some (memory use for example). However, this simply doesn’t matter in this market.

Firefox’s decline is not an engineering problem. Its a market disruption (Desktop to Mobile shift) and monopoly problem. There are no engineering solutions to these market problems. The only way to escape this is to pivot to a different market (Firefox OS tried to pivot Mozilla into mobile), or create a new market. The latter is what Brendan’s Brave is all about: be the browser for a better less ad infested Web instead of a traditional Desktop browser.

What makes today very different from the founding days of Mozilla is that Google isn’t neglecting Chrome and the Web the way Microsoft did during the Internet Explorer 6 days. Google continues to advance Chrome and the Web at breakneck pace. Despite this silver lining it is still very concerning that we are headed towards a Web monoculture dominated by Chrome.

What about Mozilla?

Mozilla helped the Web win but Firefox is now losing an unwinnable marketing fight against Google. This does not mean Firefox is not a great browser. Firefox is losing despite being a great browser, and getting better all the time. Firefox is simply the victim of Google’s need to increase profit in a relatively stagnant market. And it’s also important to note that while Firefox Desktop is probably headed for extinction over the next couple years, today it’s still a product used by some 90 million people, and still generating significant revenue for Mozilla for some time.

While I no longer work for Mozilla and no longer have insight into their future plans, I firmly believe that the decline of Firefox won’t necessarily mean the decline of Mozilla. There is a lot of important work beyond Firefox that Mozilla can do and is doing for the Web. Mozilla’s Rust programming language has crossed into the mainstream and is growing steadily and Rust might become Mozilla’s second most lasting contribution to the world.

Mozilla’s engineering team is also building a futuristic rendering engine Servo which is a fascinating piece of technology. If you are interested in the internals of a modern rendering engine, you should definitely take a look. Finding a relevant product to use Servo in will be a challenge, but that doesn’t diminish Servo’s role in pushing the envelope of how fast the Web can be.

And, last but not least, Mozilla is also still actively engaged in Web standards (WebAssembly and WebVR for example), even though it has to rely more on good will than market might these days. The battle for the open web is far from over.

Chrome won

Disclaimer: I worked for 7 years at Mozilla and was Mozilla’s Chief Technology Officer before leaving 2 years ago to found an embedded AI startup.

Mozilla published a blog post two days ago highlighting its efforts to make the Desktop Firefox browser competitive again. I used to closely follow the browser market but haven’t looked in a few years, so I figured it’s time to look at some numbers:


The chart above shows the percentage market share of the 4 major browsers over the last 6 years, across all devices. The data is from StatCounter and you can argue that the data is biased in a bunch of different ways, but at the macro level it’s safe to say that Chrome is eating the browser market, and everyone else except Safari is getting obliterated.


I tried a couple different ways to plot a trendline and an exponential fit seems to work best. This aligns pretty well with theories around the explosive diffusion of innovation, and the slow decline of legacy technologies. If the 6 year trend holds, IE should be pretty much dead in 2 or 3 years. Firefox is not faring much better, unfortunately, and is headed towards a 2-3% market share. For both IE and Firefox these low market share numbers further accelerate the decline because Web authors don’t test for browsers with a small market share. Broken content makes users switch browsers, which causes more users to depart. A vicious cycle.

Chrome and Safari don’t fit as well as IE and Firefox. The explanation for Chrome is likely that the market share is so large that Chrome is running out of users to acquire. Some people are stuck on old operating systems that don’t support Chrome. Safari’s recent growth is underperforming its trend most likely because iOS device growth has slowed.

Desktop market share

Looking at all devices blends mobile and desktop market shares, which can be misleading. Safari/iOS is dominant on mobile whereas on Desktop Safari has a very small share. Firefox in turn is essentially not present on mobile. So let’s look at the Desktop numbers only.


The Desktop-only graph unfortunately doesn’t predict a different fate for IE and Firefox either. The overall desktop PC market is growing slightly (most sales are replacement PCs, but new users are added as well). Despite an expanding market both IE and Firefox are declining unsustainably.

Adding users?

Eric mentioned in the blog post that Firefox added users last year. The relative Firefox market share declined from 16% to 14.85% during that period. For comparison, Safari Desktop is relatively flat, which likely means Safari market share is keeping up with the (slow) growth of the PC/Laptop market. Two possible theories are that Eric meant in his blog post that browser installs were added. People often re-install the browser on a new machine, which could be called an “added user”, but it comes usually at the expense of the previous machine becoming disused. It’s also possible that the absolute daily active user count has indeed increased due to the growth of the PC/laptop market, despite the steep decline in relative market share. Firefox ADUs aren’t public so it’s hard to tell.

From these graphs it’s pretty clear that Firefox is not going anywhere. That means that the esteemed Fox will be around for many many years, albeit with an ever diminishing market share. It also, unfortunately, means that a turnaround is all but impossible.

With a CEO transition about 3 years ago there was a major strategic shift at Mozilla to re-focus efforts on Firefox and thus the Desktop. Prior to 2014 Mozilla heavily invested in building a Mobile OS to compete with Android: Firefox OS. I started the Firefox OS project and brought it to scale. While we made quite a splash and sold several million devices, in the end we were a bit too late and we didn’t manage to catch up with Android’s explosive growth. Mozilla’s strategic rationale for building Firefox OS was often misunderstood. Mozilla’s founding mission was to build the Web by building a browser. Mobile thoroughly disrupted this mission. On mobile browsers are much less relevant–even more so third party mobile browsers. On mobile browsers are a feature of the Facebook and Twitter apps, not a product. To influence the Web on mobile, Mozilla had to build a whole stack with the Web at its core. Building mobile browsers (Firefox Android) or browser-like apps (Firefox Focus) is unlikely to capture a meaningful share of use cases. Both Firefox for Android and Firefox Focus have a market share close to 0%.

The strategic shift in 2014, back to Firefox, and with that back to Desktop, was significant for Mozilla. As Eric describes in his article, a lot of amazing technical work has gone into Firefox for Desktop the last years. The Desktop-focused teams were expanded, and mobile-focused efforts curtailed. Firefox Desktop today is technically competitive with Chrome Desktop in many areas, and even better than Chrome in some. Unfortunately, looking at the graphs, none of this has had any effect on market trends. Browsers are a commodity product. They all pretty much look the same and feel the same. All browsers work pretty well, and being slightly faster or using slightly less memory is unlikely to sway users. If even Eric–who heads Mozilla’s marketing team–uses Chrome every day as he mentioned in the first sentence, it’s not surprising that almost 65% of desktop users are doing the same.

What does this mean for the Web?

I started Firefox OS in 2011 because already back then I was convinced that desktops and browsers were dead. Not immediately–here we are 6 years later and both are still around–but both are legacy technologies that are not particularly influential going forward. I don’t think there will be a new browser war where Firefox or some other competitor re-captures market share from Chrome. It’s like launching a new and improved horse in the year 2017. We all drive cars now. Some people still use horses, and there is value to horses, but technology has moved on when it comes to transportation.

Does this mean Google owns the Web if they own Chrome? No. Absolutely not. Browsers are what the Web looked like in the first decades of the Internet. Mobile disrupted the Web, but the Web embraced mobile and at the heart of most apps beats a lot of JavaScript and HTTPS and REST these days. The future Web will look yet again completely different. Much will survive, and some parts of it will get disrupted. I left Mozilla because I became curious what the Web looks like once it consists predominantly of devices instead of desktops and mobile phones. At Silk we created an IoT platform built around open Web technologies such as JavaScript, and we do a lot of work around democratizing data ownership through embedding AI in devices instead of sending everything to the cloud.

So while Google won the browser wars, they haven’t won the Web. To stick with the transportation metaphor: Google makes the best horses in the world and they clearly won the horse race. I just don’t think that race matters much going forward.

Update: A lot of good comments in a HackerNews thread here. My favorite was this one: “Mozilla won the browser war. Firefox lost the browser fight. But there’s many wars left to fight, and I hope Mozilla dives into a new one.” Couldn’t agree more.

Brendan is back to save the Web

Brendan is back, and he has a plan to save the Web. Its a big and bold plan, and it may just work. I am pretty excited about this. If you have 5 minutes to read along I’ll explain why I think you should be as well.

The Web is broken

Lets face it, the Web today is a mess. Everywhere we go online we are constantly inundated with annoying ads. Often pages are more ads than content, and the more ads the industry throws at us, the more we ignore them, the more obnoxious ads get, trying to catch our attention. As Brendan explains in his blog post, the browser used to be on the user’s side—we call browsers the user agent for a reason. Part of the early success of Firefox was that it blocked popup ads. But somewhere over the last 10 years of modern Web browsers, browsers lost their way and stopped being the user’s agent alone. Why?

Browsers aren’t free

Making a modern Web browser is not free. It takes hundreds of engineers to make a competitive modern browser engine. Someone has to pay for that, and that someone needs to have a reason to pay for it. Google doesn’t make Chrome for the good of mankind. Google makes Chrome so you can consume more Web and along with it, more Google ads. Each time you click on one, Google makes more money. Chrome is a billion dollar business for Google. And the same is true for pretty much every other browser. Every major browser out there is funded through advertisement. No browser maker can escape this dilemma. Maybe now you understand why no major browser ships with a builtin enabled by default ad-blocker, even though ad-blockers are by far the most popular add-ons.

Our privacy is at stake

It’s not just the unregulated flood of advertisement that needs a solution. Every ad you see is often selected based on sensitive private information advertisement networks have extracted from your browsing behavior through tracking. Remember how the FBI used to track what books Americans read at the library, and it was a big scandal? Today the Googles and Facebooks of the world know almost every site you visit, everything you buy online, and they use this data to target you with advertisement. I am often puzzled why people are so afraid of the NSA spying on us but show so little concern about all the deeply personal data Google and Facebook are amassing about everyone.

Blocking alone doesn’t scale

I wish the solution was as easy as just blocking all ads. There is a lot of great Web content out there: news, entertainment, educational content. It’s not free to make all this content, but we have gotten used to consuming it “for free”. Banning all ads without an alternative mechanism would break the economic backbone of the Web. This dilemma has existed for many years, and the big browser vendors seem to have given up on it. It’s hard to blame them. How do you disrupt the status quo without sawing off the (ad revenue) branch you are sitting on?

It takes an newcomer to fix this mess

I think its unlikely that the incumbent browser vendors will make any bold moves to solve this mess. There is too much money at stake. I am excited to see a startup take a swipe at this problem, because they have little to lose (seed money aside). Brave is getting the user agent back into the game. Browsers have intentionally remained silent onlookers to the ad industry invading users’ privacy. With Brave, Brendan makes the user agent step up and fight for the user as it was always intended to do.

Brave basically consists of two parts: part one blocks third party ad content and tracking signals. Instead of these Brave inserts alternative ad content. Sites can sign up to get a fair share of any ads that Brave displays for them. The big change in comparison to the status quo is that the Brave user agent is in control and can regulate what you see. It’s like a speed limit for advertisement on the Web, with the goal to restore balance and give sites a fair way to monetize while giving the user control through the user agent.

Making money with a better Web

The ironic part of Brave is that its for-profit. Brave can make money by reducing obnoxious ads and protecting your privacy at the same time. If Brave succeeds, it’s going to drain money away from the crappy privacy-invasive obnoxious advertisement world we have today, and publishers and sites will start transacting in the new Brave world that is regulated by the user agent. Brave will take a cut of these transactions. And I think this is key. It aligns the incentives right. The current funding structure of major browsers encourages them to keep things as they are. Brave’s incentive is to bring down the whole diseased temple and usher in a better Web. Exciting.

Quick update: I had a chance to look over the Brave GitHub repo. It looks like the Brave Desktop browser is based on Chromium, not Gecko. Yes, you read that right. Brave is using Google’s rendering engine, not Mozilla’s. Much to write about this one, but it will definitely help Brave “hide” better in the large volume of Chrome users, making it harder for sites to identify and block Brave users. Brave for iOS seems to be a fork of Firefox for iOS, but it manages to block ads (Mozilla says they can’t).

Oracle sinks its claws into Android

This is my first blog post since leaving my role as Mozilla’s CTO 6 months ago. As you may have read in the press, a good chunk of the original Firefox OS founding team has moved on from mobile and we created a startup to work on some cool products and technologies for the Internet of Things. You’ll hear more about what we are up to next month.

While I am no longer working directly on mobile, a curious event got my attention: A commit appeared in the Android code base that indicates that Google is abandoning its own re-implementation of Java in favor of Oracle’s original Java implementation. I’ll try to explain why I think this is a huge change and will have far-reaching implications for Android and the Android ecosystem.

Why did Google create its own Java clone?

To run a Java app, you need a runtime library written in Java called the Java standard classes. This library implements basic language constructs like hash tables or strings.

Since the early days, Android didn’t use Sun’s version of the Java standard classes. Instead, the Android team enhanced the open source Apache Harmony Java standard libraries. Harmony is an independent “clean room” open-source implementation of the Java standard libraries maintained by the Apache Foundation.

There is basically no technical advantage in using Harmony. Its a strictly less complete and less correct version of Sun’s original implementation. Why did Android invest all this effort to duplicate Sun’s open source Java standard classes?

Apache vs GPL

Over the course of Android’s meteoric rise, the powers behind Android demonstrated a deep strategic understanding of different classes of open source licenses, their strength, and their weaknesses. Android has from its early days successfully used open source licenses to enable proprietary technology. Sounds counter-intuitive, but it explains why Google rewrote so much open technology for Android.

Java is actually not the only major open technology piece Google reinvented. Since Android 1.0 Google uses bionic as its standard C library. There were very few strong technical reasons to use bionic over open source alternatives such as the GNU libc. Quite to the contrary, at Mozilla in the earlier days of Android we had to constantly fight deficiencies in bionic in comparison to existing open source standard C libraries. Famously, bionic was not thread safe in many cases, crashing multi-threaded applications.

Writing a standard C library from scratch is crazy. Its one of the most commoditized pieces of software. Its almost impossible to do it significantly better than existing implementations, and it costs a ton of time and money and compatibility is a huge pain. Why did Google do it anyway? There is a simple answer: Licensing.

Bionic (as Google’s Java implementation) is licensed under the non-viral Apache 2 (APL) license. You can use and modify APL code without having to publish the changes. In other words, you can make proprietary changes and improvements. This is not possible with the GNU libc, which is under the LGPL. I am pretty sure I know why Google thought that this is important, because as part of launching Firefox OS I talked to many of the same chipset vendors and OEMs that Google works with. Silicon vendors and OEMs like to differentiate at the software level, trying to improve Android code all over the stack. Especially silicon vendors often modify library code to take advantage of their proprietary silicon, and they don’t want to share these changes with the world. Its their competitive moat–their proprietary advantage. Google rewrote bionic — and Java standard classes — because silicon vendors and OEMs probably demanded that most parts of Android are open (ironic use of the word in this context) to this sort of proprietary approach.

UpdateBob Lee who worked for Google at the time commented below that OpenJDK didn’t exist yet when Android 1.0 launched (2007), so Google couldn’t use OpenJDK back then. GNU Classpath (LGPL) did exist since 2004, however. Google still chose Harmony, and stuck with it even after Apache abandoned the Harmony project in 2011. The switch now is clearly driven by the Oracle vs Google lawsuit.


OpenJDK is the name for Oracle’s Java that you can obtain under the GPL2, a viral open source license with strong protections. Any changes to OpenJDK have to be published as source code (except if you are Oracle).

Because Oracle has means to control Java beyond source code, OpenJDK is about as open as a prison. You can vote on how high the walls are, and you can even help build the walls, but if you are ever forced to walk into it, Oracle alone will decided when and whether you can leave. Oracle owns much of the roadmap of OpenJDK, and via compatibility requirements, trademarks, existing agreements, and API copyright lawsuits (Oracle vs Google) Oracle is pretty much in full control where OpenJDK is headed. What does this mean for Android?

In short: there is a new sheriff in town. The app ecosystem is at the heart of every mobile OS. Its what made Android and iOS successful, and its what made Firefox OS struggle. The app ecosystem rests on the app stack, in Android’s case Harmony in the past, and going forward OpenJDK. In other words, Oracle has now at least one hand at the steering wheel as well.

Its anyone’s guess what Oracle will do with it, but Google and Oracle have a long history of not getting along, so its going to be quite curious to watch. Java itself aims to be a platform, and it is similarly vast in scope as Android. Java includes its own user interface (UI) library Swing, for example. Google has of course its own Android UI framework. Swing will now sit on every Android phone, using up resources. Its unlikely that Oracle will try to force Google to actually use Swing, but Google has to make sure it works and is present and apps can use it. And, Oracle can easily force Google to include pretty much any other code or service that pleases Oracle. How about Java Push Notifications, specified and operated by Oracle? All Oracle has to do is add it to OpenJDK, and it will make its way into Android. Google is now on Oracle’s Hamster wheel.

A rough year ahead

In the short term Google’s biggest challenge will be to rip out Harmony and replace it with OpenJDK. They actually have been working on this for a while. It seems this project started in secret already 11 months ago, and is now being merged into the public repository.

All this code and technology churn will have massive implications for Android at a tactical level. Literally millions of lines of code are changing, and the new OpenJDK implementation will often have subtly different correctness or performance behavior than the Harmony code Google used previously. Here you can see Google updating a test for a specific boundary condition in date handling. Harmony had a different behavior than Oracle’s OpenJDK, and the test had to be fixed.

The app ecosystem runs on top of these shifting sands. The Android app store has millions of apps that rely on the Java standard classes, and just as tests have to be fixed, apps will randomly break due to the subtle changes the OpenJDK transition brings. Breakage will not be limited to correctness. Performance variances will be even harder to track down. Past performance workarounds will be obsolete but it will be hard to tell which ones, and entirely new performance problems will pop up. Fun.

And of course, licensing is changing as you can see here. The core of Android’s ecosystem runtime is now powered by GPL2 library code, copyright Oracle.

I also have a very hard time imagining Android N coming out on time with this magnitude of change happening behind the scenes. Google is changing engines mid-flight. The top priority will be to not crash. They won’t have much time to worry about arriving on schedule.

The winner

No matter how you look at this, this is a huge victory for Oracle. Oracle never had much of a mobile game, and all the sudden Oracle gained a good amount of roadmap and technology influence over the most important mobile ecosystem by scale. Oracle is a mobile titan now. I didn’t see that one coming.

The losers

Google, and silicon vendors. The entire middle part of the Android stack will be subject to proprietary Oracle control. Google calls this “reduced fragmentation” in their press release. That’s true, kind of. There will be less fragmentation because Oracle will control anything Java, including Android.

Silicon vendors will be still allowed to do proprietary enhancements if they obtain the same library code under a difference (non-viral, non-GPL2) license from Oracle — for a fee. Oracle has actually already a history of up-selling OpenJDK. They are offering certain components of the Java VM only for royalty payments. You get a basic garbage collector for free. If you want the really good one, it’ll cost you. I expect Oracle to attempt to monetize in similar ways the billions of mobile users it just stumbled upon.

New Adventure

It is with mixed emotions that I end my almost 7 year long journey with Mozilla next week. Working with this team has been one of the peak experiences of my professional life.

I am also extremely excited about the next chapter in that life. I am departing Mozilla to create a new venture in the Internet of Things space, an open field that presents many of the types of challenges and opportunities that drive our passion for the Web.

I feel deeply humbled and honored that I had the chance to be part of such an amazing and passionate group of people for the last many years, building together the Web we want. I leave with fond memories and great respect for this organization and the people who build it each day. It has been a great honor to be your colleague and friend.

Data is at the heart of search. But who has access to it?

In my February 23 blog post, I gave a brief overview of how search engines have evolved over the years and how today’s search engines learn from past searches to anticipate which results will be most relevant to a given query. This means that who succeeds in the $50 billion search business and who doesn’t mostly depends on who has access to search data. In this blog post, I will explore how search engines have obtained queries in the past and how (and why) that’s changing.

For some 90% of searches, a modern search engine analyzes and learns from past queries, rather than searching the Web itself, to deliver the most relevant results. Most the time, this approach yields better results than full text search. The Web has become so vast that searches often find millions or billions of result pages that are difficult to rank algorithmically.

One important way a search engine obtains data about past queries is by logging and retaining search results from its own users. For a search engine with many users, there’s enough data to learn from and make informed predictions. It’s a different story for a search engine that wants to enter a new market (and thus has no past search data!) or compete in a market where one search engine is very dominant.

In Germany, for example, where Google has over 95% market share, competing search engines don’t have access to adequate past search data to deliver search results that are as relevant as Google’s. And, because their search results aren’t as relevant as Google’s, it’s difficult for them to attract new users. You could call it a vicious circle.

Search engines with small user bases can acquire search traffic by working with large Internet Service providers (also called ISPs, think Comcast, Verizon, etc.) to capture searches that go from users’ browsers to competing search engines. This is one option that was available in the past to Google’s competitors such as Yahoo and Bing as they attempted to become competitive with Google’s results.

In an effort to improve privacy, Google began using encrypted connections to make searches unintelligible to ISPs. One side effect was that an important avenue was blocked for competing search engines to obtain data that would improve their products.

An alternative to working with ISPs is to work with popular content sites to track where visitors are coming from. In Web lingo this is called a “referer header.” When a user clicks on a link, the browser tells the target site where the user was before (what site “referred” the user). If the user was referred by a search result page, that address contains the query string, making it possible to associate the original search with the result link. Because the vast majority of Web traffic goes to a few thousand top sites, it is possible to reconstruct a pretty good model of what people frequently search for and what results they follow.

Until late 2011, that is, when Google began encrypting the query in the referer header. Today, it’s no longer possible for the target site to reconstruct the user’s original query. This is of course good for user privacy—the target site knows only that a user was referred from Google after searching for something. At the same time, though, query encryption also locked out everyone (except Google) from accessing the underlying query data.

This chain of events has led to a “winner take all” situation in search, as a commenter on my previous blog post noted: a successful search engine is likely to get more and more successful, leaving in the dust the competitors who lack access to vital data.

These days, the search box in the browser is essentially the last remaining place where Google’s competitors can access a large volume of search queries. In 2011, Google famously accused Microsoft’s Bing search engine of doing exactly that: logging Google search traffic in Microsoft’s own Internet Explorer browser in order to improve the quality of Bing results. Having almost tripled the market share of Chrome since then, this is something Google has to worry much less about in the future. Its competitors will not be able to use Chrome’s search box to obtain data the way Microsoft did with Internet Explorer in the past.

So, if you have ever wondered why, in most markets, Google’s search results are so much better than their competitors’, don’t assume it’s because Google has a better search engine. The real reason is that Google has access to so much more search data. And, the company has worked diligently over the past few years to make sure it stays that way.

Search works differently than you may think

Search is the main way we all navigate the Web, but it works very differently than you may think. In this blog post I will try to explain how it worked in the past, why it works differently today and what role you play in the process.

The services you use for searching, like Google, Yahoo and Bing, are called a search engines. The very name suggests that they go through a huge index of Web pages to find every one that contains the words you are searching for. 20 years ago search engines indeed worked this way. They would “crawl” the Web and index it, making the content available for text searches.

As the Web grew larger, searches would often find the same word or phrase on more and more pages. This was starting to make search results less and less useful because humans don’t like to read through huge lists to manually find the page that best matches their search. A search for the word “door” on Google, for example, gives you more than 1.9 billion results. It’s impractical — even impossible — for anyone look through all of them to find the most relevant page.


Google finds about 1.9 billion results for the search query “door”.

To help navigate the ever growing Web, search engines introduced algorithms to rank results by their relevance. In 1996, two Stanford graduate students, Larry Page and Sergey Brin, discovered a way to use the information available on the Web itself to rank results. They called it PageRank.

Pages on the Web are connected by links. Each link contains anchor text that explains to readers why they should follow the link. The link itself points to another page that the author of the source page felt was relevant to the anchor text. Page and Brin discovered that they could rank results by analyzing the incoming links to a page and treating each one as a vote for its quality. A result is more likely to be relevant if many links point to it using anchor text that is similar to the search terms. Page and Brin founded a search engine company in 1998 to commercialize the idea: Google.

PageRank worked so well that it completely changed the way people interact with search results. Because PageRank correctly offered the most relevant results at the top of the page, users started to pay less attention to anything below that. This also meant that pages that didn’t appear on top of the results page essentially started to become “invisible”: users stopped finding and visiting them.

To experience the “invisible Web” for yourself, head over to Google and try to look through more than just the first page of results. So few users ever wander beyond the first page that Google doesn’t even bother displaying all the 1.9 billion search results it claims to have found for “door.” Instead, the list just stops at page 63, about a 100 million pages short of what you would have expected.

Despite reporting over 1.9 billion results, in reality Google’s search results for “door” are quite finite and end at page 63.

With publishers and online commerce sites competing for that small number of top search results, a new business was born: search engine optimization (or SEO). There are many different methods of SEO, but the principal goal is to game the PageRank algorithm in your favor by increasing the number of incoming links to your own page and tuning the anchor text. With sites competing for visitors — and billions in online revenue at stake — PageRank eventually lost this arms race. Today, links and anchor text are no longer useful to determine the most relevant results and, as a result, the importance of PageRank has dramatically decreased.

Search engines have since evolved to use machine learning to rank results. People perform 1.2 trillion searches a year on Google alone  — that’s about 3 billion a day and 40,000 a second. Each search becomes part of this massive query stream as the search engine simultaneously “sees” what billions of people are searching for all over the world. For each search, it offers a range of results and remembers which one you considered most relevant. It then uses these past searches to learn what’s most relevant to the average user to provide the most relevant results for future searches.

Machine learning has made text search all but obsolete. Search engines can answer 90% or so of searches by looking at previous search terms and results. They no longer search the Web in most cases — they instead search past searches and respond based on the preferred result of previous users.

This shift from PageRank to machine learning also changed your role in the process. Without your searches — and your choice of results — a search engine couldn’t learn and provide future answers to others. Every time you use a search engine, the search engine uses you to rank its results on a massive scale. That makes you its most important asset.