Named.conf crash report
Description
The ssh extension that we use to upload changes to our nameserver configuration file (named.conf) segfaulted during a transfer, corrupting the file. The corrupted configuration was then replicated to our other nameservers.
Timeline and resolution
Wed Aug 4 10:47 PST 2010 - The ssh extension used for transporting nameserver updates segfaulted.
The named.conf configuration file that was being transported was corrupted in the process, and was then synchronized to other nameservers.
Wed Aug 4 10:55 PST 2010 - A client reported an issue with DNS.
We found the corruption and started working on a fix.
Wed Aug 4 11:05 PST 2010 - We manually generated a new named.conf file, and uploaded it to the nameservers.
The new valid named.conf propagated to the other nameservers.
Prevention
To prevent this from happening in the future, we are taking the following action:
We will stop using ssh as the transport. Instead, a local daemon on each nameserver will update named.conf directly from our database. This daemon will be ready within the next 24 hours.
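One way a local updater can avoid leaving a half-written named.conf on disk is to write the new contents to a temporary file and then rename it over the old one. The sketch below is only an illustration of that idea, not the daemon described above; the function name install_config and the temporary file prefix are hypothetical, and it assumes the new configuration text has already been generated from the database.

```python
import os
import tempfile


def install_config(text: str, dest: str) -> None:
    """Write text to dest so readers never see a partially written file."""
    directory = os.path.dirname(os.path.abspath(dest))
    # Create the temp file next to the destination so that os.replace stays
    # on one filesystem, where the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".named.conf.")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, dest)  # swap the new file in as a single step
    except BaseException:
        # On any failure, keep the previous named.conf and drop the partial file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies before the rename, the previous named.conf stays in place untouched.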
Estimated Impact
18 minutes (10:47 to 11:05 PST).
Regards,
The Scalr Team
August 4th, 2010 - 16:12
Keep up the good work, and keep informing us like this when something goes wrong. Thanks a lot!
August 4th, 2010 - 17:14
Thanks for the report guys – nicely put. Keep doing the good work.
August 4th, 2010 - 17:29
There’s nothing wrong with ssh as a transport. Maybe it would be best to run named-checkconf before reloading the bind server?
August 11th, 2010 - 01:24
True, but our use of it (push) created a single point of failure. We’ve added integrity checks, and each nameserver now pulls the conf file.
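For readers wondering what checks like the ones discussed in this thread might look like, here is a minimal sketch. It assumes a SHA-256 digest is published alongside the pulled file, which is an assumption and not something stated above; the function names digest_matches and syntax_ok are hypothetical. The second function runs named-checkconf, as suggested in the earlier comment, and relies on its exit status being zero when the file parses.

```python
import hashlib
import shutil
import subprocess


def digest_matches(data: bytes, expected_sha256: str) -> bool:
    """Return True if data hashes to the published SHA-256 hex digest."""
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


def syntax_ok(path: str) -> bool:
    """Run named-checkconf on path; False means the tool rejected the file."""
    tool = shutil.which("named-checkconf")
    if tool is None:
        raise RuntimeError("named-checkconf not found on PATH")
    return subprocess.run([tool, path]).returncode == 0
```

A nameserver would only install and reload a pulled file when both checks pass, and would keep serving from its current named.conf otherwise.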