Microsoft Azure Outage Stemmed From DDoS Defense Error

Microsoft said that a global outage of several Azure and Microsoft 365 services on Tuesday was exacerbated in part by “an error” in its response to a distributed denial-of-service (DDoS) attack.

The company said that the outage affected a number of Microsoft services, including Azure App Services, Application Insights, Azure IoT Central, Azure Log Search Alerts, Azure Policy and the Azure portal, as well as Microsoft 365 and Microsoft Purview services. The outage lasted nearly eight hours, from 11:45 UTC to 19:43 UTC on Tuesday.

While only an unspecified “subset of customers” was affected, the outage was global and reportedly hit a range of organizations, from water utilities like Cambridge Water to HM Courts and Tribunals Service, an executive agency of the UK Ministry of Justice.

“While the initial trigger event was a Distributed Denial-of-Service (DDoS) attack, which activated our DDoS protection mechanisms, initial investigations suggest that an error in the implementation of our defenses amplified the impact of the attack rather than mitigating it,” Microsoft said on its Azure status history page.

The incident led to an unexpected usage spike that “resulted in Azure Front Door (AFD) and Azure Content Delivery Network (CDN) components performing below acceptable thresholds, leading to intermittent errors, timeout, and latency spikes,” Microsoft said.

Once the company understood the nature of the usage spike, it implemented networking configuration changes that helped mitigate most of the impact, and it also failed over to alternate networking paths.

“We proceeded with an updated mitigation approach, first rolling this out across regions in Asia Pacific and Europe,” according to Microsoft. “After validating that this revised approach successfully eliminated the side effect impacts of the initial mitigation, we rolled it out to regions in the Americas.”

The company said it would complete an internal investigation to better understand the incident and would publish the findings within 72 hours to share more details.

The incident came almost two weeks after a major global outage that stemmed from a faulty update for versions of CrowdStrike’s Falcon EDR product, which caused Windows machines to crash and enter a boot loop. That incident caused widespread disruptions for companies and services across the Internet, including banks, airlines and media companies.
