A DNS outage affected many services in Microsoft Azure between 19:43 and 22:35 UTC on May 2, 2019, including Compute, Storage, App Service, Azure Active Directory, and Azure SQL Database. Additional Microsoft services were affected as well, including Microsoft 365, Dynamics, and Azure DevOps. In short, this was a large outage!
Preliminary Root Cause
The Azure status history page shows the preliminary root cause of the outage to be the following:
Engineers identified the underlying root cause as a nameserver delegation change affecting DNS resolution and resulting in downstream impact to Compute, Storage, App Service, AAD, and SQL Database services. During the migration of a legacy DNS system to Azure DNS, some domains for Microsoft services were incorrectly updated. No customer DNS records were impacted during this incident, and the availability of Azure DNS remained at 100% throughout the incident. The problem impacted only records for Microsoft services.
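To make "nameserver delegation" concrete: a zone's parent publishes NS records that point resolvers at the zone's nameservers, and trouble starts when those records point somewhere that cannot answer correctly. The following is a minimal sketch of that kind of consistency check, not Microsoft's tooling. The function names and the example.net hostnames are hypothetical, and the NS lists are hard-coded rather than looked up.

```python
def normalize(name: str) -> str:
    """Lowercase a DNS name and drop the trailing root dot."""
    return name.strip().lower().rstrip(".")


def delegation_mismatch(parent_ns, zone_ns):
    """Compare NS names published by the parent with those the zone itself lists.

    Returns (only_in_parent, only_in_zone); two empty lists mean they agree.
    """
    parent = {normalize(n) for n in parent_ns}
    zone = {normalize(n) for n in zone_ns}
    return sorted(parent - zone), sorted(zone - parent)


# Hypothetical data: the parent still points at a retired legacy server.
parent_side = ["ns1.legacy.example.net.", "NS2.new.example.net."]
zone_side = ["ns1.new.example.net.", "ns2.new.example.net."]

stale, missing = delegation_mismatch(parent_side, zone_side)
print("stale in parent:", stale)
print("missing from parent:", missing)
```

In a real check the two lists would come from querying the parent zone and the zone's own nameservers; here they are supplied by hand so the comparison logic stays visible.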
It appears that one of the reasons for the outage was that Microsoft wasn’t yet using the Azure DNS service for the DNS needs of Microsoft Azure itself. We often assume that Microsoft uses Azure for everything, including Azure. However, an outage like this is a good reminder that Microsoft is still migrating its own services to Azure. To be fair, Azure can’t run on an Azure service until that service exists in the first place. First Azure DNS needed to be created and made available; then, at some point, Microsoft needed to invest the time and money into migrating its legacy DNS system over to the new Azure DNS.
Mitigation and Next Steps
Naturally, when an outage occurs, Microsoft is quick to respond, assess, and mitigate the issue to get customers’ services working again as quickly as possible. After all, this is the whole reason they offer SLAs for their services: to guarantee the reliability of their service offerings. However, servers fail, and in this case migrations and deployments fail at times too.
The official mitigation statement from the Azure status history page reads as follows:
To mitigate, engineers corrected the nameserver delegation issue. Applications and services that accessed the incorrectly configured domains may have cached the incorrect information, leading to a longer restoration time until their cached information expired.
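The caching effect described in that statement comes down to simple arithmetic: a resolver that cached a bad answer keeps serving it until the record's TTL runs out, even after the source has been corrected. Here is a rough sketch of that calculation. The function names, timestamps, and TTL values are illustrative assumptions, not figures from the incident.

```python
from datetime import datetime, timedelta, timezone


def cache_expiry(cached_at: datetime, ttl_seconds: int) -> datetime:
    """Time at which a cached DNS answer stops being served."""
    return cached_at + timedelta(seconds=ttl_seconds)


def extra_impact(fix_time: datetime, cached_at: datetime, ttl_seconds: int) -> timedelta:
    """How long a cache keeps serving the bad answer after the fix lands."""
    remaining = cache_expiry(cached_at, ttl_seconds) - fix_time
    return max(remaining, timedelta(0))


# Hypothetical timeline: bad answer cached at 20:50 with a 1-hour TTL,
# delegation corrected at 21:00.
cached_at = datetime(2019, 5, 2, 20, 50, tzinfo=timezone.utc)
fix_time = datetime(2019, 5, 2, 21, 0, tzinfo=timezone.utc)

lingering = extra_impact(fix_time, cached_at, ttl_seconds=3600)
print("bad answer lingers for:", lingering)
```

Under these assumed numbers the cache would keep returning the wrong answer for 50 minutes after the fix, which is why different customers can see recovery at different times.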
Now that Microsoft has mitigated the DNS outage, their teams are working on putting together a more detailed Root Cause Analysis (RCA) so they can fully address the issue and prevent it from occurring again in the future. The detailed RCA will be made available within approximately 72 hours, and will be published to the Azure status history site.
While this outage is frustrating, and caused problems for many customers, it’s nice that Microsoft Azure is “dogfooding” its own services more and more as time goes by. Azure DNS is an extremely reliable global DNS service, and it’s awesome to know that Microsoft is now officially using it to host DNS for Microsoft Azure itself!