Facebook, You're Doing It Wrong

Early today Facebook was down or unreachable for many of you for about 2.5 hours. This is the worst outage we've had in over four years, and we wanted first of all to apologize for it. We also wanted to provide more technical detail on what happened and share one big lesson learned.

The key flaw that made this outage so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it does not work when the persistent store itself holds the invalid value.
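The check-and-repair behavior described above can be sketched roughly as follows. This is a minimal illustration, not Facebook's actual code: the `ConfigClient` class, the `is_valid` predicate, and the use of plain dictionaries for the cache and the persistent store are all assumptions made for the example.

```python
class ConfigClient:
    """Hypothetical client that repairs invalid cached configuration values."""

    def __init__(self, cache, store, is_valid):
        self.cache = cache          # dict standing in for the cache tier
        self.store = store          # dict standing in for the persistent store
        self.is_valid = is_valid    # assumed validation predicate
        self.store_queries = 0      # counts reads that reach the store

    def get(self, key):
        value = self.cache.get(key)
        if value is not None and self.is_valid(value):
            return value
        # Cached value is missing or invalid: refresh it from the store.
        self.store_queries += 1
        fresh = self.store[key]
        self.cache[key] = fresh
        return fresh


def is_positive_int(value):
    return isinstance(value, int) and value > 0


# Transient cache problem: one store query repairs the cache for good.
healthy = ConfigClient({"limit": -1}, {"limit": 10}, is_positive_int)
healthy.get("limit")
healthy.get("limit")

# Invalid value in the store: every read falls through to the store.
broken = ConfigClient({}, {"limit": -1}, is_positive_int)
for _ in range(5):
    broken.get("limit")
```

In this sketch `healthy.store_queries` ends at 1, while `broken.store_queries` ends at 5: when the store itself holds the invalid value, the repair never succeeds and every read becomes a store query.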

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that did not allow the databases to recover.
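A toy simulation can make the feedback loop concrete. The model below is an assumption for illustration only: all clients share one cache entry, the database answers at most `capacity` queries per tick and fails the rest, and the `delete_on_error` flag switches between the behavior described above and a variant that leaves the cache alone when a query fails.

```python
def simulate(clients, capacity, ticks, delete_on_error):
    """Return the number of database queries issued in each tick (toy model)."""
    cached = None  # the entry starts out missing, as after the bad config change
    history = []
    for _ in range(ticks):
        # Every client that sees a missing entry queries the database.
        queries = clients if cached is None else 0
        history.append(queries)
        succeeded = min(queries, capacity)
        failed = queries - succeeded
        if succeeded:
            cached = "good-value"  # a successful query repopulates the cache
        if failed and delete_on_error:
            cached = None          # an error is mistaken for an invalid value
    return history


looping = simulate(clients=1000, capacity=100, ticks=4, delete_on_error=True)
# → [1000, 1000, 1000, 1000]
settled = simulate(clients=1000, capacity=100, ticks=4, delete_on_error=False)
# → [1000, 0, 0, 0]
```

With `delete_on_error=True`, the failed queries keep erasing the value that the successful ones just restored, so the load never drops even though the store now holds a good value. Treating an error as "unknown" rather than "invalid" lets the load fall to zero after the first tick in this model.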

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we have turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
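The post does not say which design patterns are meant. One widely used option for damping retry storms, shown here purely as an assumed example, is capped exponential backoff with random jitter. The function name `backoff_delay` and its default parameters are hypothetical.

```python
import random


def backoff_delay(attempt, base=0.1, cap=30.0, rng=None):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`. If `rng` is given,
    the delay is drawn uniformly from [0, ceiling] so that many clients
    retrying at once do not hit the database in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    if rng is None:
        return ceiling
    return rng.uniform(0, ceiling)


# Deterministic ceilings for the first few attempts: 0.1, 0.2, 0.4, 0.8 seconds.
ceilings = [backoff_delay(n) for n in range(4)]

# Jittered delay for one client; the seed is fixed only to keep the example repeatable.
jittered = backoff_delay(5, rng=random.Random(42))
```

A client using this kind of delay sends fewer retries the longer the database stays unhealthy, which is the opposite of the behavior that kept the outage going.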

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.