Two issues introduced as part of a code update in mid-November helped lead to the Nov. 19 outage on Microsoft’s Azure cloud platform for customers who had multi-factor authentication set up as a requirement.
The service outage lasted for about 14 hours and affected customers of Azure Active Directory who were trying to authenticate to Office 365, Dynamics, and other cloud services using MFA. Microsoft officials said there were three root causes for the problem: two were introduced by a software update at some of the company’s data centers, while the third came into play once the other two issues were triggered.
On Nov. 13, Microsoft started a code update at some of its Azure data centers and finished the rollout by the end of the week. On the following Monday morning, Azure AD customers who had MFA enabled began noticing timeouts when they tried to authenticate. Microsoft began investigating and was able to mitigate the issue by the end of the day. The problem began when customers in Europe came online on the morning of Nov. 19 and hit a traffic threshold in the Microsoft data centers, triggering the first problem introduced in the software update, which then led to the second issue and eventually to the third.
“The first root cause manifested as latency issue in the MFA frontend’s communication to its cache services. This issue began under high load once a certain traffic threshold was reached,” Microsoft said in an explanation of the event.
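Microsoft did not publish the details of that threshold, but basic queueing theory illustrates why latency problems can stay invisible at low load and then appear abruptly once traffic crosses a certain point. In a simple M/M/1 queue, average time in the system is 1 / (mu - lambda), which is nearly flat at light load and spikes sharply as the arrival rate approaches the service rate. The sketch below is illustrative only; the service rate figure is an assumption, not a Microsoft number.

```python
# Illustrative only: average time in system for a simple M/M/1 queue.
# Latency stays flat at low load and blows up as traffic nears capacity,
# which is one way a "traffic threshold" effect like the one Microsoft
# described can arise.

def avg_latency(lam, mu):
    """Average time in system (seconds) given arrival rate lam and
    service rate mu, both in requests per second."""
    assert lam < mu, "queue is unstable at or above the service rate"
    return 1.0 / (mu - lam)

mu = 100.0  # assumed service capacity: 100 requests/sec
for lam in (50, 90, 99):
    # 50/s -> 20 ms, 90/s -> 100 ms, 99/s -> 1000 ms
    print(f"load {lam}/s -> avg latency {avg_latency(lam, mu) * 1000:.0f} ms")
```

Doubling load from 50 to 99 requests per second raises latency fifty-fold in this model, which is consistent with a system that looks healthy right up until a regional morning peak pushes it past its threshold.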
When the traffic volume triggered the latency problem, it led to the second issue, which involved the way the Azure AD frontend processed responses from the MFA backend server.
“The second root cause is a race condition in processing responses from the MFA backend server that led to recycles of the MFA frontend server processes which can trigger additional latency and the third root cause on the MFA backend,” Microsoft said.
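Microsoft has not released the code involved, but a classic form of race condition in connection management is the check-then-act pattern: a worker checks shared state, another worker acts before the first one does, and both proceed on stale information, leaving the process in a bad state that forces a restart. The sketch below is a hypothetical, deterministic simulation of that interleaving, not Microsoft's actual frontend logic.

```python
# Hypothetical sketch of a check-then-act race in connection management.
# Two workers each check the pool's free count, then both act on what they
# saw -- the second one acts on stale state and over-claims, the kind of
# inconsistency that can force a process restart ("recycle").

class ConnectionPool:
    def __init__(self, size):
        self.free = size

    def check(self):
        # Step 1: read shared state (no lock held).
        return self.free > 0

    def claim(self):
        # Step 2: act on the (possibly stale) earlier read.
        self.free -= 1
        return self.free >= 0  # False means the pool was over-claimed

def interleaved_claims(pool):
    """Simulate two workers whose check/claim steps interleave."""
    a_sees_free = pool.check()  # worker A checks: one connection free
    b_sees_free = pool.check()  # worker B checks before A claims
    results = []
    if a_sees_free:
        results.append(pool.claim())  # A takes the last connection
    if b_sees_free:
        results.append(pool.claim())  # B acts on stale info: over-claim
    return results

pool = ConnectionPool(size=1)
print(interleaved_claims(pool))  # [True, False]: second claim is invalid
```

The usual fix is to make the check and the claim a single atomic operation, for example by holding a lock across both steps, which is exactly the kind of synchronization that is easy to miss under a rewrite of connection-handling code.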
The third factor in the outage was related to the MFA server running out of resources to process any more authentication requests, even though the servers showed up as healthy in Microsoft’s monitoring systems.
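That gap between "exhausted" and "reported healthy" typically comes from a shallow health probe: the monitor checks that the process is alive and answering, not whether it has capacity left to do real work. The sketch below is an assumed illustration of that blind spot, not Microsoft's monitoring setup; `BackendServer` and its methods are hypothetical names.

```python
# Hypothetical sketch of why a server can run out of request-processing
# capacity while still passing its health check: the probe tests liveness,
# not whether the work backlog has room for more requests.

from collections import deque

class BackendServer:
    def __init__(self, max_backlog):
        self.backlog = deque()
        self.max_backlog = max_backlog

    def health_check(self):
        # Shallow probe: "is the process up and responding?" -- always
        # True here, no matter how saturated the backlog is.
        return "healthy"

    def accept_request(self, req):
        if len(self.backlog) >= self.max_backlog:
            return "timeout"  # no capacity left to process the request
        self.backlog.append(req)
        return "accepted"

server = BackendServer(max_backlog=2)
statuses = [server.accept_request(i) for i in range(4)]
print(statuses)               # ['accepted', 'accepted', 'timeout', 'timeout']
print(server.health_check())  # 'healthy' -- monitoring sees no problem
```

A deeper readiness probe that inspects backlog depth or resource headroom would have flagged these servers as unhealthy before clients started timing out.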
The software update that led to the outage was designed to improve how Azure AD servers connect to caching services in order to speed up the service. The outage started in the European data centers and then spread as customers in other regions came online, while Microsoft’s engineers scrambled, without success, to reroute traffic and mitigate the problem.
“Unfortunately, this change introduced more latency and a race-condition in the new connection management code, under heavy load. This caused the MFA service to slow down processing of requests, initially impacting the West EU DCs (which services APAC and EMEA traffic). During this time, multiple mitigations were applied - including changes in the traffic patterns in the EU DCs, disablement of auto-mitigation systems to reduce traffic volumes and eventually traffic which was rerouted to East US DC,” Microsoft said.
“Our expectation was that a healthy cache service in the East US DC would mitigate the latency issues and allow the engineers to focus on other mitigations in the West EU DCs. However, the additional traffic to the East US DC caused the MFA frontend servers to experience the same issue as West EU, and eventually requests started to timeout.”
Once the requests began to time out, the second issue, the race condition on the frontend systems, was triggered, and the Azure MFA resources were exhausted, leaving customers unable to receive MFA messages.
“In order to restore the health of these datacenters, engineers rolled back the recent deployment, added capacity, increased throttling limits, recycled MFA cache servers and frontend servers and applied a hotfix to the frontend servers to bypass the cache. This mitigated the latency issue, but customers (inclusive of US Gov and China) were still reporting issues with MFA, therefore engineers increased their focus in looking for root causes other than the MFA frontend latency issue,” Microsoft said.
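One of those mitigations, the hotfix that let frontend servers bypass the cache, is a common pattern: when the caching layer itself is the source of latency, a toggle lets the service skip it and go straight to the backend until the cache is healthy again. The sketch below is an assumed illustration of that pattern; the flag name and lookup functions are hypothetical, not Microsoft's implementation.

```python
# Hypothetical sketch of a cache-bypass hotfix: a toggle lets the frontend
# skip an unhealthy cache layer and query the backend directly, trading
# some speed for correctness and predictability.

BYPASS_CACHE = True  # the "hotfix" toggle (assumed mechanism)

def lookup_backend(key):
    # Stand-in for the authoritative but slower backend path.
    return f"value-for-{key}"

cache = {}

def lookup(key):
    if not BYPASS_CACHE and key in cache:
        return cache[key]        # normal fast path through the cache
    value = lookup_backend(key)  # bypass: go straight to the backend
    cache[key] = value           # keep the cache warm for when it recovers
    return value

print(lookup("user-42"))  # served from the backend, cache not trusted
```

Pairing a bypass flag like this with the rollback, added capacity, and raised throttling limits gave engineers room to keep serving requests while they hunted for the remaining root causes.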
Microsoft engineers eventually identified all three root causes of the outage and restored the Azure MFA service 14 hours after the outage began. In the aftermath of the incident, Microsoft looked at ways to prevent future outages, including improving detection times and containment processes to keep outages from cascading.