Change To Improve Real-Time Collaboration Caused September 7 Google Docs Outage

On Wednesday, September 7, Google Docs suffered an outage that lasted one hour, during which document lists, documents, drawings, and Apps Scripts were inaccessible to the majority of its users. In a blog post today, Sep 10, Alan Warren, Engineering Director, writes about the incident and what actually caused the outage.

Warren writes, "Our automated monitoring noticed that attempts to access documents were failing at an increased rate, and alerted us 60 seconds later after the failure rate increased sharply. The engineering teams diagnosed the problem, determined that it was correlated with the feature change, and started rolling it back 23 minutes after the first alert. In parallel, we doubled the capacity of the lookup service to mitigate the impact of the memory management bug. The rollback completed 24 minutes later, and 5 minutes after that the outage was effectively over as the additional capacity restored normal function."
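The intervals in that account can be added up to check them against the roughly one-hour outage. The short Python sketch below does only that arithmetic. The variable names are illustrative, and it assumes the "60 seconds" refers to the delay between the sharp rise in failures and the first alert.

```python
from datetime import timedelta

# Intervals quoted in Warren's account (variable names are illustrative).
alert_delay = timedelta(seconds=60)         # sharp failure increase -> first alert
rollback_started = timedelta(minutes=23)    # first alert -> rollback begins
rollback_duration = timedelta(minutes=24)   # rollback begins -> rollback completes
recovery_after = timedelta(minutes=5)       # rollback completes -> outage effectively over

# Elapsed time measured from the first alert, and from the sharp increase itself.
alert_to_recovery = rollback_started + rollback_duration + recovery_after
spike_to_recovery = alert_delay + alert_to_recovery

print(alert_to_recovery)  # 0:52:00
print(spike_to_recovery)  # 0:53:00
```

Under that reading, the quoted steps cover a little under an hour, which is consistent with the one-hour outage reported.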

He further noted that "Since resolution, we've been assembling and scrutinizing the timeline of this event, and have assembled a list of steps which will both reduce the chance of a future event, decrease the time required to notice and resolve a problem, and limit the scope which any single problem can affect."

So what happened? "The outage was caused by a change designed to improve real time collaboration within the document list. Unfortunately this change exposed a memory management bug which was only evident under heavy usage," said Warren.

"Every time a Google Doc is modified, a machine looks up the servers that need to be updated. Due to the memory management bug, the lookup machines didn't recycle their memory properly after each lookup, causing them to eventually run out of memory and restart. While they restarted, their load was picked up by the remaining lookup machines - making them run out of memory even faster. This meant that eventually the servers couldn't properly process a large fraction of the requests to access document lists, documents, drawings, and scripts which led to the outage you saw on Wednesday," Warren explained.
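The feedback loop Warren describes, where each restart pushes more load onto the surviving machines, can be illustrated with a toy simulation. The Python sketch below is not Google's system. The fleet size, load, leak rate, memory limit, and restart time are all made-up numbers, and `LookupMachine` and `simulate` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class LookupMachine:
    leaked: float = 0.0    # memory leaked so far (arbitrary units)
    restart_left: int = 0  # ticks remaining before the machine is back up


def simulate(n_machines=4, total_load=100.0, leak_per_lookup=0.01,
             memory_limit=10.0, restart_ticks=5, ticks=60):
    """Return the number of machines serving lookups at each tick.

    All parameters are illustrative assumptions, not figures from the incident.
    """
    # Stagger the starting leak so the machines do not all fail on the same tick.
    fleet = [LookupMachine(leaked=float(i)) for i in range(n_machines)]
    up_history = []
    for _ in range(ticks):
        live = [m for m in fleet if m.restart_left == 0]
        # Machines that are restarting move one tick closer to coming back.
        for m in fleet:
            if m.restart_left > 0:
                m.restart_left -= 1
        up_history.append(len(live))
        if not live:
            continue  # nobody can serve lookups this tick
        # The whole load is shared by whichever machines are still up.
        share = total_load / len(live)
        for m in live:
            m.leaked += share * leak_per_lookup  # memory is never recycled
            if m.leaked >= memory_limit:
                m.leaked = 0.0                   # restart clears the leak
                m.restart_left = restart_ticks


        # End of tick.
    return up_history


if __name__ == "__main__":
    print("with leak:   ", simulate())
    print("without leak:", simulate(leak_per_lookup=0.0))
```

In this model, a machine that goes down raises every survivor's share of the load, so each survivor leaks memory faster per tick and reaches its limit sooner. That is the "run out of memory even faster" effect in the quote. With the leak set to zero, the fleet stays fully available for the whole run.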