Yesterday we installed extra RAM in all of our physical servers. We took this opportunity to also upgrade the BIOS and the kernel to the latest release. All of our servers are now up to date, and we don’t expect to need any more planned downtime for the foreseeable future.
After the physical upgrade, all of our customers’ virtual machines were immediately upgraded to our updated plans as well. Enjoy! :)
Some people might have noticed that we had some problems with our website yesterday and that DNS resolving on a VPS was seriously impacted as well. We’re all for open communication with our customers, not just when things are good, but also when things break down. So here’s what happened:
Our primary DNS server was down because the server it runs on was being upgraded as well. What we completely overlooked was that our webapp depends on the primary DNS database, because the webapp also manages DNS records. When that database was unavailable, the webapp broke down. Oops! That shouldn’t happen and we’re going to fix that. With regard to DNS resolving: our slave DNS server was available, but a VPS will (by default) only contact it after the primary nameserver times out. We’re also going to fix that. Read on.
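To illustrate why resolving felt so slow even though the slave was up, here is a minimal Python sketch of a resolver that tries nameservers strictly in order. The function name, the server labels, the 5-second timeout and the 20 ms answer time are all assumptions chosen for illustration; they are not measurements from our setup.

```python
def lookup_delay_ms(nameservers, is_up, timeout_ms=5000, answer_ms=20):
    """Return the milliseconds spent before the first answer arrives.

    Servers are tried strictly in order. A server that is down costs the
    full timeout before the next one is tried. Returns None if every
    server is down. All values are illustrative assumptions.
    """
    waited = 0
    for server in nameservers:
        if is_up[server]:
            return waited + answer_ms
        waited += timeout_ms  # wait out the timeout, then move on
    return None


servers = ["primary", "slave"]  # hypothetical labels, tried in this order

normal = lookup_delay_ms(servers, {"primary": True, "slave": True})
during_upgrade = lookup_delay_ms(servers, {"primary": False, "slave": True})
print(normal, during_upgrade)
```

Under these assumptions every single lookup pays the full timeout for the first server before the second one is even asked, which is why an available slave did not help much.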
It took some time to get our primary DNS server back up and running because we had a bit of a cyclic dependency problem, and it took some on-site hacking to get things going again. In a nutshell, we couldn’t fetch the VPS configuration for our primary DNS server from our webapp because the webapp was down. And the webapp was down because the primary DNS server was down. We had to hack the DNS bits out of our webapp to get it online. After that, things started to look better.
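A cycle like this is easy to miss by eye but simple to detect once the dependencies are written down. The following is a rough Python sketch, not a tool we actually run: the component names and the `find_cycle` helper are hypothetical, and the graph only models the two dependencies described above.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of names, or None if there is none.

    depends_on maps a component to the components it needs in order to start.
    This is a plain depth-first search that tracks the current path.
    """
    path, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in path:
            # We walked back onto the current path: that is a cycle.
            return path[path.index(node):] + [node]
        path.append(node)
        for dep in depends_on.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        done.add(node)
        return None

    for name in depends_on:
        cycle = visit(name)
        if cycle:
            return cycle
    return None


# Hypothetical names for the two components involved in the outage.
startup_dependencies = {
    "webapp": ["primary-dns"],   # webapp needs the DNS database
    "primary-dns": ["webapp"],   # VPS config for the DNS server comes from the webapp
}

print(find_cycle(startup_dependencies))
```

A check along these lines could be run whenever a new dependency is introduced, so that a loop shows up before the next maintenance window rather than during it.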
So here is what we’re going to do to avoid these issues in the future:
- We are going to move the DNS database to our database cluster, which will ensure better availability.
- Our webapp should not collapse when it can’t contact our DNS database. (It’s not a critical component and DNS records will be periodically synced anyway.)
- Our primary DNS server will become a hidden master and will only be used for replication to the slaves.
- We will add a new DNS slave and use floating/virtual IPs for the internal resolvers (10.0.0.6 and 10.0.0.8). This will fix the timeout problem when the first nameserver is down: the IP will simply move to the other slave if that happens.
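As an illustration of the second point in the list above, the webapp could queue DNS changes while the DNS database is unreachable and replay them during a later sync, instead of failing outright. This is only a sketch under assumptions: `DnsRecordService`, `DnsDatabaseUnavailable` and `FlakyBackend` are hypothetical names, not the actual webapp code.

```python
class DnsDatabaseUnavailable(Exception):
    """Raised by the (hypothetical) DNS backend when it cannot be reached."""


class DnsRecordService:
    def __init__(self, backend):
        self.backend = backend  # any object with a save(record) method
        self.pending = []       # changes waiting for the next sync

    def save_record(self, record):
        """Store a record, or queue it instead of failing if the backend is down."""
        try:
            self.backend.save(record)
            return "saved"
        except DnsDatabaseUnavailable:
            self.pending.append(record)
            return "queued"

    def sync_pending(self):
        """Replay queued records in order; stop at the first failure.

        Returns the number of records still waiting.
        """
        while self.pending:
            try:
                self.backend.save(self.pending[0])
            except DnsDatabaseUnavailable:
                break
            self.pending.pop(0)
        return len(self.pending)


class FlakyBackend:
    """Stand-in backend for the example; toggle .up to simulate an outage."""

    def __init__(self):
        self.up = False
        self.saved = []

    def save(self, record):
        if not self.up:
            raise DnsDatabaseUnavailable()
        self.saved.append(record)


backend = FlakyBackend()
service = DnsRecordService(backend)
status = service.save_record("www.example.com A 192.0.2.10")  # backend is down
backend.up = True
remaining = service.sync_pending()
print(status, remaining, backend.saved)
```

The design choice here is that a DNS outage degrades one feature (record changes are delayed) rather than taking the whole webapp down with it.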
So for all people impacted by our downtime: Sorry for the inconvenience and rest assured that all issues will be fixed!
Update 2010-09-08: Our new DNS infrastructure is in place. Any single points of failure with regard to DNS and our webapp have been eliminated.