Facebook: Sorry, Something Went Wrong

Earlier today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we've had in over four years, and we want, first of all, to apologize for it. We also want to provide more technical detail on what happened and share one big lesson learned.

What's Wrong With Facebook

The key flaw that made this outage so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it does not work when the value in the persistent store is itself invalid.
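The repair logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual Facebook implementation: the names `get_config`, `load_from_store`, and `is_valid` are hypothetical, and the cache is modeled as a plain dictionary.

```python
from typing import Callable, Dict, Optional


def get_config(
    key: str,
    cache: Dict[str, str],
    load_from_store: Callable[[str], str],   # hypothetical: queries the backing store
    is_valid: Callable[[Optional[str]], bool],  # hypothetical validity check
) -> str:
    """Return the cached value, repairing the cache entry if it looks invalid."""
    value = cache.get(key)
    if is_valid(value):
        return value
    # The cache entry is missing or invalid: refetch from the backing store
    # and overwrite the cache with whatever comes back.
    fresh = load_from_store(key)
    cache[key] = fresh
    return fresh
```

With a valid value in the store, the first read repairs the cache and later reads never touch the store. If the stored value is itself invalid, the repaired cache entry fails validation again, so every read from every client turns into a store query.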

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
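To make the feedback loop concrete, here is a small sketch contrasting the error handling described above with one possible alternative. All names (`StoreUnavailable`, `repair_flawed`, `repair_safer`) are hypothetical, and the "safer" variant is only one way to break the loop under the assumption that serving a stale cached value is acceptable; it is not presented as the fix that was actually deployed.

```python
from typing import Callable, Dict, Optional


class StoreUnavailable(Exception):
    """Hypothetical error raised when the database cannot serve a query."""


def repair_flawed(
    key: str,
    cache: Dict[str, str],
    load_from_store: Callable[[str], str],
    is_valid: Callable[[Optional[str]], bool],
) -> Optional[str]:
    """Mirrors the behaviour described above: a query error counts as 'invalid'."""
    try:
        value = load_from_store(key)
    except StoreUnavailable:
        value = None  # the error becomes indistinguishable from a bad value
    if not is_valid(value):
        cache.pop(key, None)  # deleting the key forces the next read to query again
        return None
    cache[key] = value
    return value


def repair_safer(
    key: str,
    cache: Dict[str, str],
    load_from_store: Callable[[str], str],
    is_valid: Callable[[Optional[str]], bool],
) -> Optional[str]:
    """Assumed alternative: keep the cached value when the store cannot help."""
    try:
        value = load_from_store(key)
    except StoreUnavailable:
        return cache.get(key)  # leave the cache alone; do not add load
    if not is_valid(value):
        return cache.get(key)  # a bad stored value does not evict the cache entry
    cache[key] = value
    return value
```

In the flawed version, an overloaded database produces errors, errors delete cache keys, and deleted keys produce more database queries. The alternative treats "could not read" as a separate outcome from "read an invalid value", so a struggling database sees less traffic rather than more.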

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
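The post does not say which design patterns are under consideration. As a general illustration only, one widely used way for clients to avoid amplifying a transient spike is capped exponential backoff with random jitter between retries. The function name `backoff_delay` and its default parameters below are assumptions chosen for the example.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 0.1,   # assumed initial delay ceiling, in seconds
    cap: float = 30.0,   # assumed maximum delay ceiling, in seconds
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    source = rng if rng is not None else random
    # The ceiling doubles on each attempt but never exceeds the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Picking a random point below the ceiling spreads clients out in time,
    # so they do not all retry at the same instant.
    return source.uniform(0.0, ceiling)
```

A client would sleep for `backoff_delay(n)` seconds before its n-th retry, so repeated failures lead to fewer queries per second instead of a steady or growing stream.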

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.