Microsoft BPOS Cloud Customers Hit with Email Outage; Service Health Dashboard Now Live!

Microsoft officials warned customers of its BPOS bundle of hosted Exchange, SharePoint and Lync on Monday, May 10, that an upgrade of Exchange Online was slated to begin on May 12. Microsoft didn't tell users to expect any downtime as a result of the upgrade. It seems something went wrong before the May 12 upgrade ever began. Users are reporting intermittent, multi-hour Exchange Online outages on May 10, 11 and 12 at different times in North America, Europe and Asia.

On May 12 at around 9 p.m. ET, CVP of Online Services Dave Thompson posted a detailed explanation of the problems that hit Exchange Online, acknowledging that there were problems on Tuesday as well. Thompson also apologized to customers and announced that Microsoft's health-status dashboard is now available to the public.

Thompson said:

As a result of Tuesday's incident, we feel we could've communicated earlier and been more specific. Effective today, we updated our communications procedures to be more extensive and timely. We understand that it's critical for customers to be as fully informed as possible during service impacting events. We'll continue to improve the timeliness and specificity of our communications. The primary mechanism for communicating to our customers on issues has been and will continue to be the Service Health Dashboard. For North America, that dashboard is at https://health.noam.microsoftonline.com/.

Here's Thompson's explanation of the issue:

On Tuesday at 9:30am PDT, BPOS-S Exchange service experienced an issue with one of the hub components due to malformed email traffic on the service. Exchange has the built-in capability to handle such traffic, but encountered an obscure case where that capability didn't work correctly. The result was a growing backlog of email. By 12:00am PDT, the malformed traffic was isolated and the mail queues cleared. The delays encountered by customers varied, on the order of 6-9 hours. Short term mitigation was implemented and a fix was under development.

At 9:10am PDT today, service monitoring again detected malformed email traffic on the service. The problem was resolved at 10:03am, but users experienced up to 45 minute email delays during this time. A second, but related issue was detected via monitoring at 11:35am PDT, resulting in email stuck in some end users' outboxes. The issue was remediated at 12:04pm PDT. During this time, more than 1.5 million messages had queued on the service awaiting delivery. The backlog was 90% clear by 4:12 PM, but because of this large backlog of email, customers may have experienced delays of as long as 3 hours. We're implementing a comprehensive fix to both problems.

In an unrelated incident, starting at 1:04am PDT, service monitoring detected a failure in the Domain Name Service (DNS) hosting the http://mail.microsoftonline.com domain. This failure, prevented users from accessing Outlook Web Access hosted in the Americas, and partially impacted some functionality of Outlook and Exchange ActiveSync devices. The team diagnosed, and fixed, an underlying problem in the servers hosting DNS for the http://mail.microsoftonline.com domain, and restored service at 4:52am PDT. The team identified a number of improvements in our handling of problems associated with DNS, and will provide a full post mortem of this incident available through Microsoft Support.

[Source: Microsoft Online Services Team Blog]