Search is the main way we all navigate the Web, but it works very differently than you may think. In this blog post I will try to explain how it worked in the past, why it works differently today and what role you play in the process.
The services you use for searching, like Google, Yahoo and Bing, are called search engines. The very name suggests that they go through a huge index of Web pages to find every one that contains the words you are searching for. 20 years ago, search engines did indeed work this way. They would “crawl” the Web and index it, making the content available for text searches.
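To make the crawl-and-index idea concrete, here is a minimal sketch of the kind of word index such an engine might build. The page URLs, the page text and the function names are all invented for illustration; a real index is far larger and more sophisticated.

```python
from collections import defaultdict

# Hypothetical mini "Web": URL -> page text (illustrative data only).
PAGES = {
    "a.example/doors": "how to paint a wooden door",
    "b.example/windows": "how to clean a window",
    "c.example/home": "door and window repair",
}

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return, in sorted order, every page that contains all query words."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

index = build_index(PAGES)
print(search(index, "door"))
print(search(index, "door repair"))
```

Note that this kind of search returns every matching page and says nothing about which one is best.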
As the Web grew larger, searches would often find the same word or phrase on more and more pages. This was starting to make search results less and less useful because humans don’t like to read through huge lists to manually find the page that best matches their search. A search for the word “door” on Google, for example, gives you more than 1.9 billion results. It’s impractical, even impossible, for anyone to look through all of them to find the most relevant page.
To help navigate the ever-growing Web, search engines introduced algorithms to rank results by their relevance. In 1996, two Stanford graduate students, Larry Page and Sergey Brin, discovered a way to use the information available on the Web itself to rank results. They called it PageRank.
Pages on the Web are connected by links. Each link contains anchor text that explains to readers why they should follow the link. The link itself points to another page that the author of the source page felt was relevant to the anchor text. Page and Brin discovered that they could rank results by analyzing the incoming links to a page and treating each one as a vote for its quality. A result is more likely to be relevant if many links point to it using anchor text that is similar to the search terms. Page and Brin founded a search engine company in 1998 to commercialize the idea: Google.
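The “links as votes” idea can be sketched in a few lines. This is a simplified textbook-style version, not Google's actual implementation: the four-page link graph is made up, the damping value of 0.85 is just the commonly cited default, and anchor text is ignored entirely.

```python
# Hypothetical link graph: page -> pages it links to (illustrative only).
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page splits its score among its links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # A page without links spreads its score evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share  # one "vote" per link
        rank = new_rank
    return rank

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page C collects the most incoming links and ends up on top, while page D, which nobody links to, ends up at the bottom.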
PageRank worked so well that it completely changed the way people interact with search results. Because PageRank correctly offered the most relevant results at the top of the page, users started to pay less attention to anything below that. This also meant that pages that didn’t appear on top of the results page essentially started to become “invisible”: users stopped finding and visiting them.
To experience the “invisible Web” for yourself, head over to Google and try to look through more than just the first page of results. So few users ever wander beyond the first page that Google doesn’t even bother displaying all of the 1.9 billion search results it claims to have found for “door.” Instead, the list just stops at page 63, well over 100 million pages short of what you would have expected.
With publishers and online commerce sites competing for that small number of top search results, a new business was born: search engine optimization (or SEO). There are many different methods of SEO, but the principal goal is to game the PageRank algorithm in your favor by increasing the number of incoming links to your own page and tuning the anchor text. With sites competing for visitors — and billions in online revenue at stake — PageRank eventually lost this arms race. Today, links and anchor text are no longer useful to determine the most relevant results and, as a result, the importance of PageRank has dramatically decreased.
Search engines have since evolved to use machine learning to rank results. People perform 1.2 trillion searches a year on Google alone — that’s about 3 billion a day and 40,000 a second. Each search becomes part of this massive query stream as the search engine simultaneously “sees” what billions of people are searching for all over the world. For each search, it offers a range of results and remembers which one you considered most relevant. It then uses these past searches to learn what’s most relevant to the average user to provide the most relevant results for future searches.
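As a rough illustration of learning from the query stream, the sketch below simply counts which result users clicked for each query and ranks future results by those counts. The class name, the URLs and the click data are hypothetical, and real systems use far richer signals and models than raw click counts.

```python
from collections import defaultdict

class ClickRanker:
    """Toy ranker that orders results by how often users clicked them."""

    def __init__(self):
        # query -> url -> number of clicks seen so far
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, url):
        """Remember which result a user chose for a query."""
        self.clicks[query.lower()][url] += 1

    def rank(self, query, candidates):
        """Order candidates by past clicks; ties keep their original order."""
        counts = self.clicks.get(query.lower(), {})
        return sorted(candidates, key=lambda url: -counts.get(url, 0))

ranker = ClickRanker()
candidates = ["doors.example", "thedoors.example", "hardware.example"]

# Simulated choices made by earlier users searching for "door".
for url in ["thedoors.example", "thedoors.example", "hardware.example"]:
    ranker.record_click("door", url)

print(ranker.rank("door", candidates))
```

Every click recorded here changes what the next user sees, which is the feedback loop described above.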
Machine learning has made text search all but obsolete. Search engines can answer 90% or so of searches by looking at previous search terms and results. They no longer search the Web in most cases — they instead search past searches and respond based on the preferred result of previous users.
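One way to picture “searching past searches” is a lookup that only falls back to a text search when a query has not been seen before. Everything in this sketch is an assumption made for illustration: the stored queries, the result URLs and the stand-in `full_text_search` function.

```python
# Hypothetical store of past searches: query -> result most users preferred.
PAST_SEARCHES = {
    "door": "hardware.example/doors",
    "window": "hardware.example/windows",
}

def full_text_search(query):
    """Stand-in for a slower search over the whole index (assumption)."""
    return f"text-search result for {query!r}"

def answer(query):
    """Return (result, source): past searches first, text search as fallback."""
    key = query.strip().lower()
    if key in PAST_SEARCHES:
        return PAST_SEARCHES[key], "past searches"
    return full_text_search(key), "text search"

for q in ["door", "Window", "left-handed door hinge"]:
    print(q, "->", answer(q))
```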
This shift from PageRank to machine learning also changed your role in the process. Without your searches — and your choice of results — a search engine couldn’t learn and provide future answers to others. Every time you use a search engine, the search engine uses you to rank its results on a massive scale. That makes you its most important asset.
Even though this is probably mostly true, Google still *has* to crawl the web to identify new content. As much as history is getting written in real time, the stuff people look up and expect to find also evolves over time.
So, more than the way Google (or other search engines) gets the data, it’s the signal that they use to rank results that changed the most.
It’s also very scary because it accentuates the “winner takes all” trend on the web. If we assume that they only use the user signal, then it means that a search engine with 95% market share will literally *starve* its competitors in terms of usage data…
That’s correct. Some fresh external signal is still necessary to avoid the search engine turning into an echo chamber, just stewing in its own stale juices. Your second point about a “winner takes all” trend is also exactly on point. I will write more about that in the future.
Another parameter: the bubble. Users who have a Google account also have a Google profile for searches. If you do not block all cookies coming from Google, little by little you will be served, as search results, what Google thinks should match your taste.
It also tends to impoverish the cultural sphere we are evolving in. That was already the case with PageRank, which doesn’t link to the most pertinent page but to the most popular one. It’s basically People magazine for the Web.