Brendan is back to save the Web

Brendan is back, and he has a plan to save the Web. Its a big and bold plan, and it may just work. I am pretty excited about this. If you have 5 minutes to read along I’ll explain why I think you should be as well.

The Web is broken

Lets face it, the Web today is a mess. Everywhere we go online we are constantly inundated with annoying ads. Often pages are more ads than content, and the more ads the industry throws at us, the more we ignore them, the more obnoxious ads get, trying to catch our attention. As Brendan explains in his blog post, the browser used to be on the user’s side—we call browsers the user agent for a reason. Part of the early success of Firefox was that it blocked popup ads. But somewhere over the last 10 years of modern Web browsers, browsers lost their way and stopped being the user’s agent alone. Why?

Browsers aren’t free

Making a modern Web browser is not free. It takes hundreds of engineers to make a competitive modern browser engine. Someone has to pay for that, and that someone needs to have a reason to pay for it. Google doesn’t make Chrome for the good of mankind. Google makes Chrome so you can consume more Web and along with it, more Google ads. Each time you click on one, Google makes more money. Chrome is a billion dollar business for Google. And the same is true for pretty much every other browser. Every major browser out there is funded through advertisement. No browser maker can escape this dilemma. Maybe now you understand why no major browser ships with a builtin enabled by default ad-blocker, even though ad-blockers are by far the most popular add-ons.

Our privacy is at stake

It’s not just the unregulated flood of advertisement that needs a solution. Every ad you see is often selected based on sensitive private information advertisement networks have extracted from your browsing behavior through tracking. Remember how the FBI used to track what books Americans read at the library, and it was a big scandal? Today the Googles and Facebooks of the world know almost every site you visit, everything you buy online, and they use this data to target you with advertisement. I am often puzzled why people are so afraid of the NSA spying on us but show so little concern about all the deeply personal data Google and Facebook are amassing about everyone.

Blocking alone doesn’t scale

I wish the solution was as easy as just blocking all ads. There is a lot of great Web content out there: news, entertainment, educational content. It’s not free to make all this content, but we have gotten used to consuming it “for free”. Banning all ads without an alternative mechanism would break the economic backbone of the Web. This dilemma has existed for many years, and the big browser vendors seem to have given up on it. It’s hard to blame them. How do you disrupt the status quo without sawing off the (ad revenue) branch you are sitting on?

It takes an newcomer to fix this mess

I think its unlikely that the incumbent browser vendors will make any bold moves to solve this mess. There is too much money at stake. I am excited to see a startup take a swipe at this problem, because they have little to lose (seed money aside). Brave is getting the user agent back into the game. Browsers have intentionally remained silent onlookers to the ad industry invading users’ privacy. With Brave, Brendan makes the user agent step up and fight for the user as it was always intended to do.

Brave basically consists of two parts: part one blocks third party ad content and tracking signals. Instead of these Brave inserts alternative ad content. Sites can sign up to get a fair share of any ads that Brave displays for them. The big change in comparison to the status quo is that the Brave user agent is in control and can regulate what you see. It’s like a speed limit for advertisement on the Web, with the goal to restore balance and give sites a fair way to monetize while giving the user control through the user agent.

Making money with a better Web

The ironic part of Brave is that its for-profit. Brave can make money by reducing obnoxious ads and protecting your privacy at the same time. If Brave succeeds, it’s going to drain money away from the crappy privacy-invasive obnoxious advertisement world we have today, and publishers and sites will start transacting in the new Brave world that is regulated by the user agent. Brave will take a cut of these transactions. And I think this is key. It aligns the incentives right. The current funding structure of major browsers encourages them to keep things as they are. Brave’s incentive is to bring down the whole diseased temple and usher in a better Web. Exciting.

Quick update: I had a chance to look over the Brave GitHub repo. It looks like the Brave Desktop browser is based on Chromium, not Gecko. Yes, you read that right. Brave is using Google’s rendering engine, not Mozilla’s. Much to write about this one, but it will definitely help Brave “hide” better in the large volume of Chrome users, making it harder for sites to identify and block Brave users. Brave for iOS seems to be a fork of Firefox for iOS, but it manages to block ads (Mozilla says they can’t).

Oracle sinks its claws into Android

This is my first blog post since leaving my role as Mozilla’s CTO 6 months ago. As you may have read in the press, a good chunk of the original Firefox OS founding team has moved on from mobile and we created a startup to work on some cool products and technologies for the Internet of Things. You’ll hear more about what we are up to next month.

While I am no longer working directly on mobile, a curious event got my attention: A commit appeared in the Android code base that indicates that Google is abandoning its own re-implementation of Java in favor of Oracle’s original Java implementation. I’ll try to explain why I think this is a huge change and will have far-reaching implications for Android and the Android ecosystem.

Why did Google create its own Java clone?

To run a Java app, you need a runtime library written in Java called the Java standard classes. This library implements basic language constructs like hash tables or strings.

Since the early days, Android didn’t use Sun’s version of the Java standard classes. Instead, the Android team enhanced the open source Apache Harmony Java standard libraries. Harmony is an independent “clean room” open-source implementation of the Java standard libraries maintained by the Apache Foundation.

There is basically no technical advantage in using Harmony. Its a strictly less complete and less correct version of Sun’s original implementation. Why did Android invest all this effort to duplicate Sun’s open source Java standard classes?

Apache vs GPL

Over the course of Android’s meteoric rise, the powers behind Android demonstrated a deep strategic understanding of different classes of open source licenses, their strength, and their weaknesses. Android has from its early days successfully used open source licenses to enable proprietary technology. Sounds counter-intuitive, but it explains why Google rewrote so much open technology for Android.

Java is actually not the only major open technology piece Google reinvented. Since Android 1.0 Google uses bionic as its standard C library. There were very few strong technical reasons to use bionic over open source alternatives such as the GNU libc. Quite to the contrary, at Mozilla in the earlier days of Android we had to constantly fight deficiencies in bionic in comparison to existing open source standard C libraries. Famously, bionic was not thread safe in many cases, crashing multi-threaded applications.

Writing a standard C library from scratch is crazy. Its one of the most commoditized pieces of software. Its almost impossible to do it significantly better than existing implementations, and it costs a ton of time and money and compatibility is a huge pain. Why did Google do it anyway? There is a simple answer: Licensing.

Bionic (as Google’s Java implementation) is licensed under the non-viral Apache 2 (APL) license. You can use and modify APL code without having to publish the changes. In other words, you can make proprietary changes and improvements. This is not possible with the GNU libc, which is under the LGPL. I am pretty sure I know why Google thought that this is important, because as part of launching Firefox OS I talked to many of the same chipset vendors and OEMs that Google works with. Silicon vendors and OEMs like to differentiate at the software level, trying to improve Android code all over the stack. Especially silicon vendors often modify library code to take advantage of their proprietary silicon, and they don’t want to share these changes with the world. Its their competitive moat–their proprietary advantage. Google rewrote bionic — and Java standard classes — because silicon vendors and OEMs probably demanded that most parts of Android are open (ironic use of the word in this context) to this sort of proprietary approach.

UpdateBob Lee who worked for Google at the time commented below that OpenJDK didn’t exist yet when Android 1.0 launched (2007), so Google couldn’t use OpenJDK back then. GNU Classpath (LGPL) did exist since 2004, however. Google still chose Harmony, and stuck with it even after Apache abandoned the Harmony project in 2011. The switch now is clearly driven by the Oracle vs Google lawsuit.

“Open”JDK

OpenJDK is the name for Oracle’s Java that you can obtain under the GPL2, a viral open source license with strong protections. Any changes to OpenJDK have to be published as source code (except if you are Oracle).

Because Oracle has means to control Java beyond source code, OpenJDK is about as open as a prison. You can vote on how high the walls are, and you can even help build the walls, but if you are ever forced to walk into it, Oracle alone will decided when and whether you can leave. Oracle owns much of the roadmap of OpenJDK, and via compatibility requirements, trademarks, existing agreements, and API copyright lawsuits (Oracle vs Google) Oracle is pretty much in full control where OpenJDK is headed. What does this mean for Android?

In short: there is a new sheriff in town. The app ecosystem is at the heart of every mobile OS. Its what made Android and iOS successful, and its what made Firefox OS struggle. The app ecosystem rests on the app stack, in Android’s case Harmony in the past, and going forward OpenJDK. In other words, Oracle has now at least one hand at the steering wheel as well.

Its anyone’s guess what Oracle will do with it, but Google and Oracle have a long history of not getting along, so its going to be quite curious to watch. Java itself aims to be a platform, and it is similarly vast in scope as Android. Java includes its own user interface (UI) library Swing, for example. Google has of course its own Android UI framework. Swing will now sit on every Android phone, using up resources. Its unlikely that Oracle will try to force Google to actually use Swing, but Google has to make sure it works and is present and apps can use it. And, Oracle can easily force Google to include pretty much any other code or service that pleases Oracle. How about Java Push Notifications, specified and operated by Oracle? All Oracle has to do is add it to OpenJDK, and it will make its way into Android. Google is now on Oracle’s Hamster wheel.

A rough year ahead

In the short term Google’s biggest challenge will be to rip out Harmony and replace it with OpenJDK. They actually have been working on this for a while. It seems this project started in secret already 11 months ago, and is now being merged into the public repository.

All this code and technology churn will have massive implications for Android at a tactical level. Literally millions of lines of code are changing, and the new OpenJDK implementation will often have subtly different correctness or performance behavior than the Harmony code Google used previously. Here you can see Google updating a test for a specific boundary condition in date handling. Harmony had a different behavior than Oracle’s OpenJDK, and the test had to be fixed.

The app ecosystem runs on top of these shifting sands. The Android app store has millions of apps that rely on the Java standard classes, and just as tests have to be fixed, apps will randomly break due to the subtle changes the OpenJDK transition brings. Breakage will not be limited to correctness. Performance variances will be even harder to track down. Past performance workarounds will be obsolete but it will be hard to tell which ones, and entirely new performance problems will pop up. Fun.

And of course, licensing is changing as you can see here. The core of Android’s ecosystem runtime is now powered by GPL2 library code, copyright Oracle.

I also have a very hard time imagining Android N coming out on time with this magnitude of change happening behind the scenes. Google is changing engines mid-flight. The top priority will be to not crash. They won’t have much time to worry about arriving on schedule.

The winner

No matter how you look at this, this is a huge victory for Oracle. Oracle never had much of a mobile game, and all the sudden Oracle gained a good amount of roadmap and technology influence over the most important mobile ecosystem by scale. Oracle is a mobile titan now. I didn’t see that one coming.

The losers

Google, and silicon vendors. The entire middle part of the Android stack will be subject to proprietary Oracle control. Google calls this “reduced fragmentation” in their press release. That’s true, kind of. There will be less fragmentation because Oracle will control anything Java, including Android.

Silicon vendors will be still allowed to do proprietary enhancements if they obtain the same library code under a difference (non-viral, non-GPL2) license from Oracle — for a fee. Oracle has actually already a history of up-selling OpenJDK. They are offering certain components of the Java VM only for royalty payments. You get a basic garbage collector for free. If you want the really good one, it’ll cost you. I expect Oracle to attempt to monetize in similar ways the billions of mobile users it just stumbled upon.

New Adventure

It is with mixed emotions that I end my almost 7 year long journey with Mozilla next week. Working with this team has been one of the peak experiences of my professional life.

I am also extremely excited about the next chapter in that life. I am departing Mozilla to create a new venture in the Internet of Things space, an open field that presents many of the types of challenges and opportunities that drive our passion for the Web.

I feel deeply humbled and honored that I had the chance to be part of such an amazing and passionate group of people for the last many years, building together the Web we want. I leave with fond memories and great respect for this organization and the people who build it each day. It has been a great honor to be your colleague and friend.

Data is at the heart of search. But who has access to it?

In my February 23 blog post, I gave a brief overview of how search engines have evolved over the years and how today’s search engines learn from past searches to anticipate which results will be most relevant to a given query. This means that who succeeds in the $50 billion search business and who doesn’t mostly depends on who has access to search data. In this blog post, I will explore how search engines have obtained queries in the past and how (and why) that’s changing.

For some 90% of searches, a modern search engine analyzes and learns from past queries, rather than searching the Web itself, to deliver the most relevant results. Most the time, this approach yields better results than full text search. The Web has become so vast that searches often find millions or billions of result pages that are difficult to rank algorithmically.

One important way a search engine obtains data about past queries is by logging and retaining search results from its own users. For a search engine with many users, there’s enough data to learn from and make informed predictions. It’s a different story for a search engine that wants to enter a new market (and thus has no past search data!) or compete in a market where one search engine is very dominant.

In Germany, for example, where Google has over 95% market share, competing search engines don’t have access to adequate past search data to deliver search results that are as relevant as Google’s. And, because their search results aren’t as relevant as Google’s, it’s difficult for them to attract new users. You could call it a vicious circle.

Search engines with small user bases can acquire search traffic by working with large Internet Service providers (also called ISPs, think Comcast, Verizon, etc.) to capture searches that go from users’ browsers to competing search engines. This is one option that was available in the past to Google’s competitors such as Yahoo and Bing as they attempted to become competitive with Google’s results.

In an effort to improve privacy, Google began using encrypted connections to make searches unintelligible to ISPs. One side effect was that an important avenue was blocked for competing search engines to obtain data that would improve their products.

An alternative to working with ISPs is to work with popular content sites to track where visitors are coming from. In Web lingo this is called a “referer header.” When a user clicks on a link, the browser tells the target site where the user was before (what site “referred” the user). If the user was referred by a search result page, that address contains the query string, making it possible to associate the original search with the result link. Because the vast majority of Web traffic goes to a few thousand top sites, it is possible to reconstruct a pretty good model of what people frequently search for and what results they follow.

Until late 2011, that is, when Google began encrypting the query in the referer header. Today, it’s no longer possible for the target site to reconstruct the user’s original query. This is of course good for user privacy—the target site knows only that a user was referred from Google after searching for something. At the same time, though, query encryption also locked out everyone (except Google) from accessing the underlying query data.

This chain of events has led to a “winner take all” situation in search, as a commenter on my previous blog post noted: a successful search engine is likely to get more and more successful, leaving in the dust the competitors who lack access to vital data.

These days, the search box in the browser is essentially the last remaining place where Google’s competitors can access a large volume of search queries. In 2011, Google famously accused Microsoft’s Bing search engine of doing exactly that: logging Google search traffic in Microsoft’s own Internet Explorer browser in order to improve the quality of Bing results. Having almost tripled the market share of Chrome since then, this is something Google has to worry much less about in the future. Its competitors will not be able to use Chrome’s search box to obtain data the way Microsoft did with Internet Explorer in the past.

So, if you have ever wondered why, in most markets, Google’s search results are so much better than their competitors’, don’t assume it’s because Google has a better search engine. The real reason is that Google has access to so much more search data. And, the company has worked diligently over the past few years to make sure it stays that way.

Search works differently than you may think

Search is the main way we all navigate the Web, but it works very differently than you may think. In this blog post I will try to explain how it worked in the past, why it works differently today and what role you play in the process.

The services you use for searching, like Google, Yahoo and Bing, are called a search engines. The very name suggests that they go through a huge index of Web pages to find every one that contains the words you are searching for. 20 years ago search engines indeed worked this way. They would “crawl” the Web and index it, making the content available for text searches.

As the Web grew larger, searches would often find the same word or phrase on more and more pages. This was starting to make search results less and less useful because humans don’t like to read through huge lists to manually find the page that best matches their search. A search for the word “door” on Google, for example, gives you more than 1.9 billion results. It’s impractical — even impossible — for anyone look through all of them to find the most relevant page.

search-1

Google finds about 1.9 billion results for the search query “door”.

To help navigate the ever growing Web, search engines introduced algorithms to rank results by their relevance. In 1996, two Stanford graduate students, Larry Page and Sergey Brin, discovered a way to use the information available on the Web itself to rank results. They called it PageRank.

Pages on the Web are connected by links. Each link contains anchor text that explains to readers why they should follow the link. The link itself points to another page that the author of the source page felt was relevant to the anchor text. Page and Brin discovered that they could rank results by analyzing the incoming links to a page and treating each one as a vote for its quality. A result is more likely to be relevant if many links point to it using anchor text that is similar to the search terms. Page and Brin founded a search engine company in 1998 to commercialize the idea: Google.

PageRank worked so well that it completely changed the way people interact with search results. Because PageRank correctly offered the most relevant results at the top of the page, users started to pay less attention to anything below that. This also meant that pages that didn’t appear on top of the results page essentially started to become “invisible”: users stopped finding and visiting them.

To experience the “invisible Web” for yourself, head over to Google and try to look through more than just the first page of results. So few users ever wander beyond the first page that Google doesn’t even bother displaying all the 1.9 billion search results it claims to have found for “door.” Instead, the list just stops at page 63, about a 100 million pages short of what you would have expected.

Despite reporting over 1.9 billion results, in reality Google’s search results for “door” are quite finite and end at page 63.

With publishers and online commerce sites competing for that small number of top search results, a new business was born: search engine optimization (or SEO). There are many different methods of SEO, but the principal goal is to game the PageRank algorithm in your favor by increasing the number of incoming links to your own page and tuning the anchor text. With sites competing for visitors — and billions in online revenue at stake — PageRank eventually lost this arms race. Today, links and anchor text are no longer useful to determine the most relevant results and, as a result, the importance of PageRank has dramatically decreased.

Search engines have since evolved to use machine learning to rank results. People perform 1.2 trillion searches a year on Google alone  — that’s about 3 billion a day and 40,000 a second. Each search becomes part of this massive query stream as the search engine simultaneously “sees” what billions of people are searching for all over the world. For each search, it offers a range of results and remembers which one you considered most relevant. It then uses these past searches to learn what’s most relevant to the average user to provide the most relevant results for future searches.

Machine learning has made text search all but obsolete. Search engines can answer 90% or so of searches by looking at previous search terms and results. They no longer search the Web in most cases — they instead search past searches and respond based on the preferred result of previous users.

This shift from PageRank to machine learning also changed your role in the process. Without your searches — and your choice of results — a search engine couldn’t learn and provide future answers to others. Every time you use a search engine, the search engine uses you to rank its results on a massive scale. That makes you its most important asset.

WebVR is coming to Firefox Nightly

In 2014 Mozilla started working on adding VR capabilities to the Web. Our VR team proposed a number of new Web APIs and made an experimental VR build of Firefox available that supports rendering VR content using the Web to Oculus Rift headsets.

Consumer VR products are still in a nascent state, but clearly there is great promise for this technology. We have enough confidence in the new APIs we have proposed that we are today taking the step of integrating them into our regular nightly Firefox builds. Head over to MozVR for all the details, and if you own an Oculus Rift headset or mobile VR-capable hardware we support, give it a spin!

 

It takes many to build the Web we want

Mozilla is announcing today the creation of a WebRTC competency center jointly with Telenor.

Mozilla’s purpose is to build the Web. We do so by building Firefox and Firefox OS. The Web is pretty unusual when it comes to interoperable technology stacks, because it is not built by standards bodies. Instead, the Web is built by browser vendors that implement browsers that implement the Web, which in the end pretty much defines what the Web is.

The Web adds new technologies whenever a majority of browser vendors agree to extend it in an interoperable way. Standards bodies merely help coordinating this process. Very rarely do new Web capabilities originate in a standards body. New Web capabilities merely end up there eventually, once there is sufficient interest by multiple browser vendors to warrant standardization.

Mozilla doesn’t — and can’t — build the Web alone. What makes the Web unique is that it is owned by no-one, and cannot be held back by anyone. It doesn’t take unanimous consent to extend the Web. A mere majority of browser vendors can popularize a new Web capability, forcing the rest of the browser vendors to eventually come along.

While several browser vendors build the Web, Mozilla has a unique vision for the Web that is driven by our mission as a non-profit foundation. Whereas all other browser vendors are for-profit corporations, advancing the Web in the interest of their shareholders, Mozilla advances the Web for users.

The primary browser vendors today are Google, Apple, Microsoft and Mozilla. These four organizations have a direct path to bring new technologies to the Web. While many other technology companies have a strong interest in the Web, they lack the ability to directly move the Web ahead because only these four browser vendors develop a rendering engine that implements the Web stack.

There is one more aspect that sets Mozilla apart from its browser vendor competitors. We are several orders of magnitude smaller than our peers. While this might appear as a market disadvantage at first, combined with our neutral and non-profit status it actually creates a unique opportunity. Many more technology companies have an interest in working on the Web, but if you aren’t Google, Apple, or Microsoft its very difficult to contribute core technologies to the Web. These three companies have direct control over a rendering engine. No other technology company can equally influence the Web. Mozilla is looking to change that.

Jointly with Telenor we are launching a new initiative that will allow parties with a strong technology interest in WebRTC to participate as an equal in the development process of the WebRTC standard. Since standards are really just a result of delivering new Web technologies in a rendering engine, Telenor will assign Telenor engineering staff to work on Mozilla’s implementation of WebRTC in Firefox and Firefox OS.

The goal of this new center is to implement WebRTC with a broad, neutral vision that captures the technology needs of many, not just the technology needs of individual browser vendors.

Mozilla is an open source project where every opinion and technical contribution matters. The WebRTC Competency Center will accelerate the development of WebRTC, and ensure that WebRTC serves the diverse technology interests of many. If you would like to see WebRTC (or any other part of the Web) grow capabilities that are important to you, join us.