Microsoft: 'What Happened in the December 30 Hotmail Outage?'

Microsoft shared details about the causes of the recent Hotmail outage that started on December 30.

"On Dec 30th, we had an error in a script that inadvertently removed the directory records of a small number of real user accounts along with a set of test accounts. Please note that the email messages and folders of impacted users weren't deleted; only their inbox location in the directory servers was removed. Therefore when they logged in, a new mailbox was automatically created for them on a new storage server that didn't contain their old messages and folders. This is why the accounts received the 'Welcome to Hotmail' message."

According to the post, the first ticket was reported on Dec 30, but there were problems isolating the cause. On Jan 1, following continued reports, Microsoft escalated the priority of the ticket.

"Our first step was to restore these users' entries in the directory servers, which we did by early on the morning of Jan 2 PST. We then merged their old email messages and folders with any new mail they'd received throughout the day on Jan 2. This required multiple passes to capture all the accounts and messages, so for some users, service wasn't completely restored until Jan 5. We completed the merge for 16,035 users on Jan 2 and by Jan 5 had completed this for the remaining 1,320 users who were affected by this particular issue."

Microsoft reports that no user data was permanently lost in this particular incident: "we had 100% recovery of existing email and folders in the affected accounts."

To prevent similar problems in the future, Microsoft is updating its infrastructure "to use a separate code path for provisioning and removing test accounts, so that our testing no longer risks affecting real user accounts." It is also changing its issue alert process "so that when multiple users report missing data, these issues get a higher priority and immediate action."
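As a rough illustration of the second change, here is a minimal Python sketch of an escalation rule that raises a ticket's priority once several distinct users report missing data. It is not Microsoft's implementation: the `Ticket` class, the `record_report` function, the `"missing_data"` label, and the threshold of 2 are all hypothetical, since the post only says "multiple users" and gives no number.

```python
from dataclasses import dataclass, field

# Assumption: the post says "multiple users" without a number, so 2 is a placeholder.
ESCALATION_THRESHOLD = 2


@dataclass
class Ticket:
    """Hypothetical support ticket; not a real Hotmail data structure."""
    issue: str
    reporters: set = field(default_factory=set)
    priority: str = "normal"


def record_report(ticket: Ticket, user_id: str) -> Ticket:
    """Record a reporting user and escalate once enough distinct users report missing data."""
    ticket.reporters.add(user_id)  # a set, so repeat reports from one user count once
    if ticket.issue == "missing_data" and len(ticket.reporters) >= ESCALATION_THRESHOLD:
        ticket.priority = "high"
    return ticket
```

Counting distinct reporters rather than raw reports keeps one user who files the same complaint repeatedly from triggering the escalation alone.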

[tags]outage,new year[/tags]
