Data is at the heart of search. But who has access to it?

In my February 23 blog post, I gave a brief overview of how search engines have evolved over the years and how today’s search engines learn from past searches to anticipate which results will be most relevant to a given query. This means that who succeeds in the $50 billion search business and who doesn’t mostly depends on who has access to search data. In this blog post, I will explore how search engines have obtained queries in the past and how (and why) that’s changing.

For some 90% of searches, a modern search engine analyzes and learns from past queries, rather than searching the Web itself, to deliver the most relevant results. Most of the time, this approach yields better results than full-text search. The Web has become so vast that searches often find millions or billions of result pages that are difficult to rank algorithmically.

One important way a search engine obtains data about past queries is by logging and retaining search results from its own users. For a search engine with many users, there’s enough data to learn from and make informed predictions. It’s a different story for a search engine that wants to enter a new market (and thus has no past search data!) or compete in a market where one search engine is very dominant.
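To make the idea of learning from past queries concrete, here is a minimal sketch in Python. It assumes a hypothetical log of (query, clicked URL) pairs and ranks results purely by how often they were clicked before. The log format, the function names and the example URLs are all invented for illustration; a real search engine combines many more signals than this.

```python
from collections import Counter, defaultdict


def build_click_model(log):
    """Count how often each URL was clicked for each normalized query."""
    model = defaultdict(Counter)
    for query, clicked_url in log:
        model[query.strip().lower()][clicked_url] += 1
    return model


def rank_results(model, query, limit=3):
    """Return the URLs most often clicked for this query in the past."""
    counts = model.get(query.strip().lower())
    if not counts:
        return []  # no past data for this query, so nothing to learn from
    return [url for url, _ in counts.most_common(limit)]


# Hypothetical log entries, invented for illustration.
SAMPLE_LOG = [
    ("berlin weather", "https://weather.example/berlin"),
    ("Berlin weather", "https://weather.example/berlin"),
    ("berlin weather", "https://news.example/berlin-forecast"),
    ("python tutorial", "https://docs.example/python"),
]

model = build_click_model(SAMPLE_LOG)
print(rank_results(model, "Berlin Weather"))
print(rank_results(model, "hamburg weather"))
```

The second lookup returns an empty list, which is exactly the position of a search engine with no past search data for a market: it has nothing to rank from.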

In Germany, for example, where Google has over 95% market share, competing search engines don’t have access to adequate past search data to deliver search results that are as relevant as Google’s. And, because their search results aren’t as relevant as Google’s, it’s difficult for them to attract new users. You could call it a vicious circle.

Search engines with small user bases used to be able to acquire search traffic by working with large Internet service providers (ISPs; think Comcast, Verizon, etc.) to capture searches going from users’ browsers to competing search engines. This option was available in the past to Google’s competitors, such as Yahoo and Bing, as they tried to make their results competitive with Google’s.

In an effort to improve privacy, Google began using encrypted connections to make searches unintelligible to ISPs. One side effect was to block an important avenue by which competing search engines had obtained data to improve their products.

An alternative to working with ISPs is to work with popular content sites to track where visitors are coming from. The mechanism behind this is what Web lingo calls the “referer header.” When a user clicks on a link, the browser tells the target site where the user was before (what site “referred” the user). If the user was referred by a search result page, that address contains the query string, making it possible to associate the original search with the result link. Because the vast majority of Web traffic goes to a few thousand top sites, it is possible to reconstruct a pretty good model of what people frequently search for and what results they follow.
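As a rough illustration of how a content site could recover a search from the referer header, here is a short Python sketch using only the standard library. The function name and the sample URLs are made up for this example, and it assumes the search engine puts the query in a URL parameter named "q"; other engines may use a different parameter name.

```python
from urllib.parse import parse_qs, urlsplit


def query_from_referer(referer, param="q"):
    """Return (search_host, query) if the referer URL carries a query, else None."""
    parts = urlsplit(referer)
    values = parse_qs(parts.query).get(param)
    if not parts.hostname or not values:
        return None
    return parts.hostname, values[0]


# Illustrative referer values, not taken from real traffic.
print(query_from_referer("http://www.google.com/search?q=cheap+flights+berlin&hl=en"))
print(query_from_referer("https://www.google.com/"))
```

The first call yields the referring host together with the decoded query "cheap flights berlin". The second returns None because the referer carries no query parameter, so the target site learns only which site the visitor came from.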

Until late 2011, that is, when Google began encrypting the query in the referer header. Today, it’s no longer possible for the target site to reconstruct the user’s original query. This is of course good for user privacy—the target site knows only that a user was referred from Google after searching for something. At the same time, though, query encryption also locked out everyone (except Google) from accessing the underlying query data.

This chain of events has led to a “winner take all” situation in search, as a commenter on my previous blog post noted: a successful search engine is likely to get more and more successful, leaving in the dust the competitors who lack access to vital data.

These days, the search box in the browser is essentially the last remaining place where Google’s competitors can access a large volume of search queries. In 2011, Google famously accused Microsoft’s Bing search engine of doing exactly that: logging Google search traffic in Microsoft’s own Internet Explorer browser in order to improve the quality of Bing results. Having almost tripled Chrome’s market share since then, Google has much less to worry about on this front in the future. Its competitors will not be able to use Chrome’s search box to obtain data the way Microsoft did with Internet Explorer in the past.

So, if you have ever wondered why, in most markets, Google’s search results are so much better than its competitors’, don’t assume it’s because Google has a better search engine. The real reason is that Google has access to so much more search data. And the company has worked diligently over the past few years to make sure it stays that way.

3 thoughts on “Data is at the heart of search. But who has access to it?”

  1. You say “a modern search engine analyzes and learns from past queries” but give no hint where this info comes from or what the actual mechanism is supposed to be. How does a search engine learn from queries? By looking at what gets clicked on? How does a click indicate that the page was a good hit? The user did not see the site when he clicked.

    • Search engines provide the most relevant result, not the highest quality one. Relevance is indeed determined by what the user clicks on when presented a choice of results (before actually reading the full page). As an example, many searches these days terminate in Wikipedia, because people think Wikipedia is relevant. Not because Wikipedia is a particularly high quality or authoritative source of information.

  2. “Not because Wikipedia is a particularly high quality or authoritative source of information.”

    I would like to add a “citation needed” to that statement.

    This is exactly the issue I see with this article. More background info is needed on where the information comes from. Maybe click tracking is not used at all in improving search. How can we – the readers – know?

    I think it’s an interesting topic. But without giving any background you could just state the opposite of everything you wrote and it would be just as justified.

    Another point: You write “These days, the search box in the browser is essentially the last remaining place where Google’s competitors can access a large volume of search queries. In 2011, Google famously accused Microsoft’s Bing search engine of doing exactly that”.
    That is not what Google criticised. They criticised that Bing copied their results. They might have done so by tracking what users search in Google and which Google results they clicked on. That means tracking the user around the web. Not just logging what he puts in his search box.
