On the morning of April 19th, our phones started ringing and our inboxes started filling up quickly. Our software, Woople, was suddenly inaccessible to a large majority of our users. My initial reaction was that the application servers had come crashing down, since Woople’s usage has been growing fairly quickly over the past couple of weeks. New Relic RPM told a different story: usage was lower than normal, but our cluster was still responding as expected.
We quickly discovered that Media Temple was having a critical DNS outage, meaning that domains and email relying on Media Temple’s domain name service had disappeared. If it isn’t clear yet, we were using Media Temple’s provided name service, and Woople uses subdomains for individual accounts on the system, so every account was affected. What’s that sound, you ask? It’s shit hitting the fan.
We immediately set up a FREE plan on Zerigo and updated our name servers at the domain registrar. Almost instantly, the query counter on Zerigo started to climb, and users began regaining access to Woople. (Rest assured, we’ve since upgraded to a plan that can handle quite a bit more than the FREE offering.) The switch took effect so quickly because our original name servers were not responding at all. When a resolver gets no answer from one name server, its natural next step is to try the next one available.
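To make that "hop to the next name server" idea concrete, here is a minimal sketch in Python. It is an illustration of the general fallback pattern only, not how any particular resolver is implemented. The function `first_answer`, the injected `query` callable, the stand-in `fake_query`, and all host names and addresses are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_answer(
    name: str,
    name_servers: Iterable[str],
    query: Callable[[str, str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Ask each name server in turn; return (server, answer) from the first that responds."""
    for server in name_servers:
        try:
            answer = query(server, name)
        except OSError:
            # Timeouts and network errors are OSError subclasses: move on to the next server.
            continue
        if answer is not None:
            return server, answer
    # Every server failed or had no answer.
    return None


def fake_query(server: str, name: str) -> Optional[str]:
    """Hypothetical stand-in for a real DNS lookup, so the sketch runs offline."""
    if server.endswith("dead.example"):
        raise TimeoutError("no response")
    return "203.0.113.10"


result = first_answer(
    "account.example.com",
    ["ns1.dead.example", "ns1.alive.example"],
    fake_query,
)
print(result)
```

In a real check, `query` would send an actual DNS request with a short timeout. Injecting it keeps the fallback logic easy to test on its own.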
Our next step is to completely decommission our use of Media Temple’s name servers. To be fair, they have been fine for the past two years, but we need something we can trust, and a specialized service like Zerigo fits that requirement now. Taking this even further, setting up a master / slave relationship with a second provider is also a viable option, and I’ve been looking at Route 53 and the Dynect platform for this.
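One quick sanity check for that kind of setup is whether a domain's name servers actually span more than one provider. The sketch below assumes, naively, that the last two labels of a name server's host name identify the provider, which breaks for suffixes like `co.uk`. The function names and host names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_provider(name_servers: Iterable[str]) -> Dict[str, List[str]]:
    """Group name server host names by a naive provider key (their last two labels)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ns in name_servers:
        # Drop a trailing root dot and normalize case before splitting into labels.
        labels = ns.rstrip(".").lower().split(".")
        provider = ".".join(labels[-2:])
        groups[provider].append(ns)
    return dict(groups)


def has_provider_redundancy(name_servers: Iterable[str]) -> bool:
    """True when the name servers belong to at least two different providers."""
    return len(group_by_provider(name_servers)) >= 2


single = ["ns1.provider-a.example", "ns2.provider-a.example"]
mixed = single + ["ns1.provider-b.example"]

print(has_provider_redundancy(single))
print(has_provider_redundancy(mixed))
```

This only inspects host names. It says nothing about whether the second provider is actually serving an up-to-date copy of the zone.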
DNS redundancy comes cheap with the help of specialized service offerings available now, and it is often overlooked until it bites you in the ass.