DNS network incident

Incident Report for Gandi.net

Postmortem

Summary

On July 16th, between 13:26 UTC and 13:48 UTC, we experienced an incident on our LiveDNS platform.

Two DNS nodes went down.

Customer impact

Queries sent to those two DNS nodes failed with a timeout.

Root cause analysis

Why did two DNS nodes go down?

A software bug triggered a fault, and the DNS server stopped.

Why did it have such an impact?

There are three problems:

First one:

We use anycast on our DNS infrastructure for redundancy.

Anycast means that several DNS nodes announce the same IP addresses over BGP, so a query is routed to one of the nodes announcing the address.

If the DNS server on a node crashes, that node should stop announcing its routes. The failsafe meant to handle this case failed.
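As an illustration only, here is a minimal Python sketch of the kind of decision such a failsafe has to make. It is not our actual implementation: `should_announce`, `reconcile` and the `probe` callable are hypothetical names, and the real probe (a DNS query against the local server) and the real BGP action depend on the software in use. The point it shows is that any probe failure, including a probe that itself crashes, must lead to withdrawing the route.

```python
from typing import Callable


def should_announce(probe: Callable[[], bool], attempts: int = 3) -> bool:
    """Return True only if the local DNS server answered at least one probe.

    `probe` is a hypothetical callable that sends one query to the local
    DNS server and returns True when it gets a valid answer.
    """
    for _ in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            # A probe that raises is treated as a failed probe, so a broken
            # health check can never keep a dead node announced.
            continue
    return False


def reconcile(announced: bool, healthy: bool) -> str:
    """Decide what to do with the BGP announcement for this node."""
    if healthy and not announced:
        return "announce"
    if not healthy and announced:
        return "withdraw"
    return "keep"
```

For example, `reconcile(announced=True, healthy=should_announce(probe))` returns `"withdraw"` whenever every probe attempt fails or raises.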


Second one:

A flaw in our configuration caused these two nodes to announce more IP addresses than expected. The nodes are supposed to announce these anycast IP addresses:

ns-206-a.gandi.net. - 173.246.100.207

ns-64-b.gandi.net. - 213.167.230.65

ns-110-c.gandi.net. - 217.70.187.111

Domain names that use LiveDNS are served by name servers named like so:

ns-x-a.gandi.net

ns-x-b.gandi.net

ns-x-c.gandi.net

In normal circumstances, if ns-x-a.gandi.net can't answer the request, ns-x-b.gandi.net and then ns-x-c.gandi.net will be tried next.

But if all three are announced by the same broken node, all queries and retries will fail (the redundancy doesn't work).
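The effect can be shown with a small Python simulation. This is a simplified sketch, not a model of any real resolver: `first_answering_server` and the node names `node-1` to `node-3` are hypothetical, and the mapping stands for "which node currently receives the traffic for this name server" from one client's point of view.

```python
from typing import Dict, Optional, Set


def first_answering_server(
    server_to_node: Dict[str, str], broken_nodes: Set[str]
) -> Optional[str]:
    """Try each name server in order; return the first one that can answer."""
    for server, node in server_to_node.items():
        if node not in broken_nodes:
            return server
    return None  # every attempt timed out


# Expected situation: each name server is reached on a different node.
independent = {
    "ns-x-a.gandi.net": "node-1",
    "ns-x-b.gandi.net": "node-2",
    "ns-x-c.gandi.net": "node-3",
}

# Situation during the incident: one node announces all three addresses.
collapsed = {server: "node-1" for server in independent}
```

With `node-1` broken, `first_answering_server(independent, {"node-1"})` falls back to `ns-x-b.gandi.net`, while `first_answering_server(collapsed, {"node-1"})` returns `None`.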

Third one:

Before the DNS incident, we were dealing with an internal network incident.

This incident generated a lot of noise in our monitoring systems, which led us to misinterpret the alerts triggered on the DNS servers: we thought they were false positives.

As a result, the issue lasted longer than it should have, because of the time it took us to re-evaluate the LiveDNS monitoring alerts.

Remediation

We will fix the way we disable a node when the DNS server is not operating correctly.
We will fix our node configuration and set up monitoring to make sure we don't have one node announcing several clusters.
We will rework our monitoring and our internal training/organisation.
We have already tracked down the bug in our DNS server software, and a fix is currently being deployed.
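
One possible shape for the configuration check mentioned above, as a hedged Python sketch. The exact definition of a cluster is not given here, so this version assumes a simpler rule: no single node should announce every address of one name server set. The function name, the node names and the addresses (taken from the documentation ranges) are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def nodes_covering_whole_set(
    announcements: Dict[str, Set[str]], server_sets: List[Set[str]]
) -> List[Tuple[str, Set[str]]]:
    """Return (node, server_set) pairs where one node announces a full set.

    `announcements` maps a node name to the IP addresses it announces.
    `server_sets` lists the address sets that must not share a single node.
    """
    findings = []
    for node, announced in sorted(announcements.items()):
        for server_set in server_sets:
            # A non-empty set fully contained in one node's announcements
            # means that node alone can break every retry for that set.
            if server_set and server_set <= announced:
                findings.append((node, server_set))
    return findings


# Hypothetical data using documentation-only addresses.
server_sets = [{"192.0.2.1", "198.51.100.1", "203.0.113.1"}]
announcements = {
    "node-1": {"192.0.2.1", "198.51.100.1", "203.0.113.1"},  # misconfigured
    "node-2": {"198.51.100.1"},
    "node-3": {"203.0.113.1"},
}
```

Here `nodes_covering_whole_set(announcements, server_sets)` reports `node-1`, and an empty result would mean that no node is a single point of failure for any set.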

Posted Jul 16, 2020 - 16:36 UTC

Resolved

This incident has been resolved.

Posted Jul 16, 2020 - 13:49 UTC

Investigating

Impact for FR customers

Posted Jul 16, 2020 - 13:26 UTC

This incident affected: DNS (LiveDNS, {abc}.dns.gandi.net).