On July 16th, between 13:26 UTC and 13:48 UTC, we experienced an incident on our LiveDNS platform.
Two DNS nodes went down.
Queries sent to those two DNS nodes failed with a timeout.
Root cause analysis
Why did two DNS nodes go down?
Why did it have such an impact?
There are three problems:
First one:
We use anycast on our DNS infrastructure, for redundancy.
With anycast, the same DNS service addresses are announced in BGP from several nodes, and each query is routed to the nearest one.
If the DNS server on a node crashes, that node should stop announcing its routes so that traffic fails over to the other nodes. The failsafe that handles this case failed.
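As a minimal sketch of such a failsafe, consider a health-check loop that tells the BGP daemon to withdraw the anycast route once the local DNS server stops answering. The function name, threshold, and check stream below are hypothetical, not Gandi's actual implementation; in production the decisions would drive a BGP daemon rather than be printed.

```python
# Hypothetical failsafe sketch: withdraw the anycast route when the
# local DNS server fails several consecutive health checks. The
# decision logic is isolated here so it can be reasoned about; a real
# deployment would feed these actions to a BGP daemon.

FAIL_THRESHOLD = 3  # consecutive failures before withdrawing the route

def route_decisions(health_results, threshold=FAIL_THRESHOLD):
    """Map a stream of health-check results (True = DNS answered) to
    BGP actions: 'announce' while healthy, 'withdraw' once the
    consecutive-failure threshold is reached."""
    decisions = []
    failures = 0
    for ok in health_results:
        failures = 0 if ok else failures + 1
        decisions.append("withdraw" if failures >= threshold else "announce")
    return decisions

# Incident scenario: the DNS server crashes mid-stream; the failsafe
# must withdraw the route once the threshold of failed checks is hit.
print(route_decisions([True, True, False, False, False, False]))
```

The key property is that a crashed DNS server eventually stops attracting traffic: once the route is withdrawn, BGP reroutes queries to the remaining healthy nodes.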
Second one:
A flaw in our configuration caused these two nodes to announce more IP addresses than expected. The nodes are supposed to announce anycasted IP addresses such as these:
ns-206-a.gandi.net. - 22.214.171.124
ns-64-b.gandi.net. - 126.96.36.199
ns-110-c.gandi.net. - 188.8.131.52
Domain names that use LiveDNS are served by name servers following this naming scheme: ns-x-a.gandi.net, ns-x-b.gandi.net, and ns-x-c.gandi.net.
Under normal circumstances, if ns-x-a.gandi.net can't answer a request, ns-x-b.gandi.net and then ns-x-c.gandi.net are tried next.
But if all three are announced by the same broken node, all queries and retries will fail (the redundancy doesn't work).
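The failure mode above can be sketched as follows. A resolver retries each of the zone's three name servers in turn, but because the misconfiguration routed all three anycast addresses to the same broken node, every retry lands on that node. The node names and routing tables below are illustrative, not the real topology.

```python
# Sketch: why NS-level redundancy fails when one node announces all
# three anycast addresses. The routing dicts stand in for what BGP
# decides; node names are hypothetical.

NAMESERVERS = ["ns-x-a.gandi.net", "ns-x-b.gandi.net", "ns-x-c.gandi.net"]

def resolve(route, broken_nodes):
    """Try each name server in order; return the node that answered,
    or None if every retry landed on a broken node."""
    for ns in NAMESERVERS:
        node = route[ns]  # which physical node BGP routed the query to
        if node not in broken_nodes:
            return node
    return None

# Normal case: each address routes to a different node, so one broken
# node still leaves two working fallbacks.
normal = {"ns-x-a.gandi.net": "node1",
          "ns-x-b.gandi.net": "node2",
          "ns-x-c.gandi.net": "node3"}
print(resolve(normal, broken_nodes={"node1"}))   # a fallback answers

# Incident case: the broken node announced all three addresses, so
# every retry reaches the same node and the query ultimately fails.
incident = {ns: "node1" for ns in NAMESERVERS}
print(resolve(incident, broken_nodes={"node1"}))  # no server answers
```

This is why the impact was a hard timeout rather than a degraded fallback: the retry logic on the resolver side was working, but all paths converged on the same failed host.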
Third one:
Before the DNS incident, we were dealing with an internal network incident.
This incident generated a lot of noise in our monitoring systems, leading us to misinterpret the alerts triggered on the DNS servers as false positives.
This led to the issue lasting longer than it should have, because of the time it took us to re-evaluate the LiveDNS monitoring alerts.