Rants and Ruminations
Failure is not an option - if your system keeps on running...   13 Jan 05
It is nice to have Linux servers run for months on end. What is not so nice is that your system administration skills and policies have to match that…

I had moved willemvandenende.com over to a new high-bandwidth server outside my house. All was well, except for one strange idiosyncrasy: when starting the web server, it complained that it couldn’t find a script with a strange name (net.defaultroute~ - I couldn’t figure out what the tilde was doing). Oh, what the … I thought, I’ll figure it out when I have some time; the system doesn’t seem to be running any worse…

Until the folks operating the (virtual) server boxes decided to install a kernel patch (there was a security hole in the Linux kernel - very rare). They were nice enough to send us an advance warning that they were going to reboot our machine. I thought, well, not much can go wrong.

I was, of course, wrong! When rebooting, the machine started to look for net.defaultroute~ (notice the tilde), and it couldn’t find it (of course not). Now the machine couldn’t give itself a name, and was hence unreachable.
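A trailing tilde is the suffix that some editors (Emacs, for one) give to backup copies of files they save. Whether that is what happened on this server is only a guess. Under that assumption, a minimal sketch of hunting for such leftovers could look like this; the function name `find_backup_files` is made up for illustration, and the directory to scan is whatever holds your startup scripts.

```python
import os

def find_backup_files(root):
    """Return a sorted list of paths under root whose names end in '~'.

    Both file and directory entries are checked, so a stray symlink
    with a tilde in its name shows up as well.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            if name.endswith("~"):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Calling `find_backup_files("/etc")` returns the suspicious paths; what to do with each one (delete it, or fix whatever refers to it) still needs a human decision.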

As Murphy would have it, this happened in the busiest week of the year. And I still don’t know how to solve the defaultroute~ thingy, so it remains down. After a few days, I finally realized that I have a backup server (the one in my attic that runs all the other domains) and that I could simply point www.willemvandenende.com to that…

Then of course, when I woke up this morning… no network… the ADSL modem had given up (this happens about once a year).

Lessons learned:

  • Most of my system administration troubles happen in late November and throughout December (the whole thing seems to have an MTBF (Mean Time Between Failures) of about a year, and then everything comes at once). The rest of the year it is just simple power failures and such.
  • Having known flaws in your system administration, even if they are minor, is not an option. Just as with XP-style programming, it works best when there are no known defects.
  • (again) test rebooting your server, and see if all services come up again.
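
The last lesson can be partly automated. What follows is a minimal sketch, not a full monitoring setup: after a test reboot, run it from another machine to see whether the services accept TCP connections again. The function name `check_services` and the port numbers in the usage note are assumptions chosen for illustration.

```python
import socket

def check_services(host, ports, timeout=3.0):
    """Map each port to True if a TCP connection to host succeeds, else False."""
    status = {}
    for port in ports:
        try:
            # Open a connection and close it again straight away.
            with socket.create_connection((host, port), timeout=timeout):
                status[port] = True
        except OSError:
            # Refused, timed out, unreachable, or the name did not resolve.
            status[port] = False
    return status
```

For example, `check_services("www.willemvandenende.com", [22, 80])` would report whether SSH and the web server answer. A successful connection only shows that something is listening on the port, not that the service behind it works properly.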

Copyright © 2009 Willem van den Ende