Failure is not an option - if your system keeps on running...
13 Jan 05
It is nice to have Linux servers run for months on end. What is not so
nice is that your system administration skills and policies have to match
that…
I had moved willemvandenende.com over to a new high-bandwidth server
outside my house. All was well, except for one strange idiosyncrasy:
when starting the web server, it complained that it couldn’t find a script
with a strange name (net.defaultroute~ - I couldn’t figure out what
the tilde was doing). Oh, what the … I thought, I’ll figure it
out when I have some time; the system doesn’t seem to be running any
worse…
Until the folks operating the (virtual) server boxes decided to install a
kernel patch (there was a security hole in the Linux kernel - very rare).
They were nice enough to send us an advance warning that they were going
to reboot our machine. I thought, well, not much can go wrong.
I was, of course, wrong! When the machine rebooted, it started to look for
net.defaultroute~ (notice the tilde), and it couldn’t find it (of
course not). Now the machine couldn’t give itself a name, and was
hence unreachable.
As Murphy would have it, this happened in the busiest week of the year. And
I still don’t know how to solve the defaultroute~ thingy, so it
remains down. After a few days, I finally realized that I have a backup
server (the one in my attic that runs all the other domains) and that I
could simply point www.willemvandenende.com to
that…
Then of course, when I woke up this morning… no network… the
ADSL modem had given up (this happens about once a year).
Lessons learned:
- Most of my system administration troubles happen in late November and
throughout December. The whole thing seems to have an MTBF (Mean Time
Between Failures) of about a year, and then everything comes at once. The
rest of the year it is just simple power failures and such.
- Having known flaws in your system administration, even minor ones, is
not an option. It is just as with XP-style programming: it works best when
there are no known defects.
- (again) test rebooting your server, and see if all services come up again.