User data in the cloud: Lessons from the Sony debacle

Two weeks ago our friends at Sony managed to get personal information of 70 million users stolen from them. I got one of the notification emails a couple days ago myself (I must have signed up for the Playstation Network when I installed the PS3 we bought to do our Cell VM work back at UCI.) In this instance, Sony is a shining example of how NOT to handle user data.

Whats user data?

User data is any piece of identifiable data about a user. It can be all sorts of obvious stuff like your name, address, birth date, passwords (all of these Sony managed to lose), but also less obvious data such as usage history, communications, and what not. Whether Sony lost the latter is not clear. Some of the information Sony lost clearly should never have been stored in the first place. I understand why Sony asked me for my birth date when I signed up for PSN. Some jurisdictions want you to be a certain age before you can engage in virtual mayhem and violence that is modern video gaming. But why the hell did Sony store this information in a database? Why not just flag my account as “remembers the first techno song being played on the radio, definitely old enough”?


Storing data is always risky. If you store any data in the cloud, eventually someone will break into it. Investing a lot of money (expensive equipment, expensive practices, expensive staff) helps delaying that day of reconning, but only within limits. There are a lot of financial incentives to steal this kinda of data. Personal information of 70 million people is a fantastic starting point for all sorts of phishing attacks. And even if only one out of 10,000 people getting that Nigerian email falls for it, that’s still plenty of people to take advantage of. Sony took a risk storing user data. Unfortunately, it was not a well calculated risk. They could have easily reduces the fallout from a breach by storing less information, i.e. by not storing the birth date, or maybe even not storing personal information at all! Shocking proposition, I know.

Not knowing is bliss

Why does Sony need the names of its PSN subscribers in the PSN user database to begin with? Let people choose a handle and a password. If you want to personal the experience, let users chose their name. Dear PSN, you may call me Tracemonkey now. You really don’t need to know my real name and address. As for payment information, I think its ok if PSN asks me once a year to re-enter my credit card info, which is then briefly processed but never stored. Had they followed this simple principle of touching (and storing) as little user data as possible, they would have saved themselves a lot of legal trouble and liability.


I am not ranting about this topic out of thin air. Web browsers handle a lot of personal information, and its tempting for browser vendors to get in on the whole social networks online transactions online identity game. A number of people at Mozilla don’t seem too comfortable with hosting on our infrastructure any user data ever. Its really scary and risky after all (just ask Sony). I think that’s wrong. We absolutely should get into the key areas of social and identity. Why? Because the state of the art is crappy, and we can do better.

Microsoft Passport anyone?

Web identity is a total mess. I have at least 30 accounts in various places with different account names and passwords (well at least I try). Various organizations and services have tried to established a single login. Microsoft Passport was one of the earlier ones. I am really glad that didn’t work out. Can you imagine the evil empire owning all your personal data and online identity? Microsoft has a lot of incentives to use and abuse such a powerful position, and it certainly does. I still get emails from Microsoft about my Passport account on a regular basis, almost a decade later. Its usually an invitation to try this new Microsoft feature Y or maybe try chatting with my passport account or … well whatever. Microsoft sits on a lot of data, and its tempting to monetize it. And of course its not just them. Everyone else is just as bad. Ever noticed how Google and Facebook customize ads for you based on the data they have about you? Creepy. This is why I think Mozilla can do a lot better. We don’t have any hidden agenda. We don’t have any extra services we want to sell you. We don’t have to monetize data we store to turn a profit and make shareholders happy. We simply don’t have any shareholders. We are a company owned by a non-profit foundation that wants to make the web a better place. This puts us in a much better position to do whats best for our users, instead of whats best for our quarterly statement (we don’t publish any of those in case you didn’t notice.)

Playing it safe

We are currently in the process of figuring out how exactly Mozilla should handle user data. I have exactly zero authority when it comes to these kind of decisions, but Mozilla is a pretty open and democratic place and we tend to discuss this stuff pretty openly, giving anyone a voice who wants to speak up. I think its imperative that we follow a couple guiding principles as we explore ways to better serve our users using services such as identity or social:

  1. Always keep the users’ best interest in mind (and only the users’ best interests). I don’t care if we can ship a feature faster or cheaper if we store more user data (or maybe store it not encrypted instead of encrypted). Our new Sync service is a great example for this. Its a total pain in the butt to encrypt the browser history on the client before it is uploaded to our services, but its the right thing to do. It means that in case of Sync we can never see any of your browser history, even if we tried, and your data is safe by default.
  2. Always store as little data as possible in the cloud. If there is a way to implement a feature completely in the client without us ever having to see user data, that’s always the right approach, even if its harder. This is exactly the issue we are facing with our new F1 social browsing feature. It allows you to share websites on Facebook/Twitter/etc as you visit them. Its a really cool feature–I use it all the time. Unfortunately, the protocol Facebook/Twitter/etc offer to authenticate and access their APIs (oauth) is totally broken, and conceptually doesn’t really work for client applications. oauth requires the client (Firefox/F1) and Facebook/Twitter/etc to negotiate a shared secret (called the consumer key). With a pure client solution this secret can never be kept (someone could peek into Firefox/F1 and extract the key). It seems Facebook has blacklisted consumer keys before because people checked them into open-source repositories. The only alternative to this is to put the key behind a service Mozilla runs and then let Firefox/F1 post via that service, but that means we would be able to see (in theory, not intentionally of course) all the Facebook/Twitter/etc status updates of millions of people every day. That’s wrong. As tempting and quick as it would be to setup a Mozilla service that keeps the key hidden and posts for users, we should never put ourselves in a position where we handle user data without an overwhelming need for it. In this particular instance we should simply negotiate with Facebook/Twitter/etc to not enforce the shared secret rule (Twitter already doesn’t it seems, since there are so many Twitter client apps out there), and maybe in parallel we should work on better protocols than oauth.

Going fast

As we were discussing these various architectural aspects of how to handle user data (or how not to handle it) the last few weeks, some people were tempted to go the easy route and store a lot more user data (in particular in the clear) than necessary because it might get us to market faster. I think this is wrong for the above two principles, but its also wrong because it will NOT get us to market faster. Dealing with user data from an infrastructure perspective is a total pain. To handle or even store things like Facebook/Twitter/etc account authentication tokens or user contacts, we have to build out a serious security infrastructure. We need to hire expensive, highly trained personnel and we have to seriously tighten our security practices. That doesn’t mean we are unsafe right now. It just means our current practices match our current threat scenarios. For example we have external IT administrators who don’t even work for Mozilla administering our source code repository access controls. They are simply Mozilla project volunteers. Considering the limited risks, this is acceptable. When it comes to storing user data, entirely different standards will be needed. And getting all that sorted out and implemented will require a lot of audits, careful planning … and a lot of time. So if you want to go fast, go zero user data. Or encrypt the user data on the client so all we get to see are blobs of meaningless zeros and ones. That’s how Sync works, and we got it up and running within a couple months. That’s how you go fast.

What’s next?

We are currently discussing what our process will be to store user data. Expect people with actual authority to make decisions (and to talk about them) to start talking about this publicly in a few weeks. I already know that the result of our internal deliberations will be a policy that will focus on what’s best for our users, and that will minimize risks for them (and in the end, for us). And expect a safe and secure implementation of F1 to show up in your browser really soon. You can already try out the prototype now. It really rocks.

This blog post represents my personal opinion, not the official position of Mozilla.