What's Wrong with Facebook
The key flaw that made this outage so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it does not work when the persistent store itself is invalid.
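To make the mechanism concrete, here is a minimal sketch of that kind of repair logic. It is an illustration only, not the production code: `PersistentStore`, `is_valid`, `get_config`, the `max_connections` key, and the "positive integer" validity rule are all hypothetical names and assumptions chosen for the example.

```python
class PersistentStore:
    """Hypothetical stand-in for the persistent store; counts the queries it receives."""

    def __init__(self, values):
        self.values = dict(values)
        self.queries = 0

    def read(self, key):
        self.queries += 1
        return self.values[key]


def is_valid(value):
    # Assumed validity rule for this sketch: a positive integer.
    return isinstance(value, int) and value > 0


def get_config(key, cache, store):
    """Return the cached value, repairing it from the persistent store if it looks invalid."""
    value = cache.get(key)
    if is_valid(value):
        return value
    # Cached value is missing or invalid: fetch the persistent copy and cache it.
    fresh = store.read(key)
    cache[key] = fresh
    return fresh


# Case 1: transient cache problem. The store is fine, so one query repairs the cache.
good_store = PersistentStore({"max_connections": 100})
cache = {"max_connections": -1}
for _ in range(5):
    get_config("max_connections", cache, good_store)
transient_queries = good_store.queries

# Case 2: the persistent copy itself is invalid. Every read triggers another query.
bad_store = PersistentStore({"max_connections": -1})
cache = {}
for _ in range(5):
    get_config("max_connections", cache, bad_store)
persistent_queries = bad_store.queries
```

In the first case the store is queried once and the remaining reads are served from the cache. In the second case the "repaired" value is still invalid, so every read goes back to the store.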
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by many thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that did not allow the databases to recover.
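The feedback loop can be sketched with a toy simulation. Everything here is an assumption made for illustration: the `Database` class, its fixed per-tick `capacity`, the "all clients read the cache at the same moment" model of concurrency, and the client counts are hypothetical and greatly simplified compared with a real system.

```python
class Database:
    """Hypothetical database that can serve only `capacity` queries per tick."""

    def __init__(self, value, capacity):
        self.value = value
        self.capacity = capacity
        self.load = 0  # queries attempted in the current tick

    def start_tick(self):
        self.load = 0

    def query(self):
        self.load += 1
        if self.load > self.capacity:
            raise TimeoutError("database overloaded")
        return self.value


def is_valid(value):
    # Assumed validity rule for this sketch: a positive integer.
    return isinstance(value, int) and value > 0


def run_tick(cache, db, clients, key="setting"):
    """All clients read the cache at the same moment, then act on what they saw."""
    db.start_tick()
    seen = cache.get(key)
    for _ in range(clients):
        if is_valid(seen):
            continue  # cache hit: no database query needed
        try:
            cache[key] = db.query()
        except TimeoutError:
            # The flaw: a database error is treated like an invalid value,
            # so the cache key is deleted and the next tick misses again.
            cache.pop(key, None)
    return db.load


def simulate(clients, capacity, ticks=5):
    db = Database(value=100, capacity=capacity)  # the stored value is already correct
    cache = {}  # the key was deleted during the incident
    return [run_tick(cache, db, clients) for _ in range(ticks)]


overloaded = simulate(clients=1000, capacity=100)
healthy = simulate(clients=1000, capacity=1000)
```

In this model `overloaded` stays at 1000 attempted queries on every tick even though the stored value is valid, because each failed query deletes the key that a successful one just wrote. `healthy` drops to zero queries after the first tick, since every query succeeds and the cache stays populated.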
The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.