Failures Galore - Our hardware is starting to show its age at work. That, and something fishy is going on.
We've had several system failures in the last few months, some of them obviously related to our ancient systems (6+ now), and some just plain weird. With so many failures, the net effect is much worse than we'd like.
For example, our backup system has been failing intermittently (a loose wire in a drive caddy), which has thinned our 2-week backup set (a few backups are missing). One of our primary webservers has had some NIC instability, which further reduces the hit-rate of the backup system, and we're running low on off-site drives and have missed a few off-site runs recently. It's a recipe for disaster ... one that came to a head this week.
So we have a flakey, shallow backup set, some failing NICs (we still don't know why), and then we start seeing some strange drive failures. Two systems this week showed flakey behaviour, so we scanned them and swapped out the suspect hardware. On reboot the systems were dead; the RAID arrays have no MBR or partition table, and the rest of the data on the drives is severely borked.
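For the curious: a valid MBR ends with the signature bytes 0x55 0xAA at offset 510, so a missing one is quick to spot. A minimal sketch of that check, run here against a dummy file rather than a real disk (with a real device you'd read sector 0 of something like /dev/sda instead):

```shell
# Build a dummy 512-byte boot sector ending in the 55 AA signature
# (a stand-in for sector 0 of a real disk).
dd if=/dev/zero of=sector.img bs=510 count=1 2>/dev/null
printf '\125\252' >> sector.img    # octal escapes for 0x55 0xAA (portable printf)

# Read the last two bytes of the sector; a healthy MBR prints "55aa".
dd if=sector.img bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
```

If those two bytes are anything else, the BIOS won't treat the disk as bootable, which matches the symptom above.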
The logs and ClamAV scans for the systems are clean, and the custom tripwire scripts didn't detect an intrusion, but an intrusion is still the likely cause. The only clue hinting otherwise is that the systems are so old, leaving so many possible points of failure ... except that it happened to two systems in two weeks. I couldn't find enough data on the drives to prove anything either way, and none of our scanners caught anything obvious. But we're looking for rootkits on the rest of the servers, and we've cycled passwords/locks/keycodes for the entire building.
The kicker, of course, is that we didn't have recent backups for either system, as the backup system has its own problems. We had the primary partitions imaged (dd + netcat rule), but a lot of data has to be rebuilt from crumbs around the network. Nothing critical was lost, but it's still a pain.
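The dd + netcat imaging goes roughly like this (hostnames, ports, and devices here are hypothetical; the local part is simulated against a small file so the sketch runs without a second machine or real block devices):

```shell
# Receiving side (hypothetical host 'imagehost') would run something like:
#   nc -l 9000 | gzip -d > webserver-sda1.img
# Sending side streams the raw partition through gzip and netcat:
#   dd if=/dev/sda1 bs=64k | gzip -1 | nc imagehost 9000

# Local simulation: image a small fake "partition" and verify the copy.
dd if=/dev/zero of=fake-part.img bs=1k count=8 2>/dev/null
dd if=fake-part.img bs=64k 2>/dev/null | gzip -1 > fake-part.img.gz
gzip -dc fake-part.img.gz | cmp -s - fake-part.img && echo "image verified"
```

Note that netcat listener flags vary between implementations (GNU, BSD, and busybox netcat all differ), so the `nc -l` invocation above may need adjusting.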
What are we going to change?
1. Failed backups will result in someone making the backup by hand. Ignoring a broken backup will result in failure, eventually.
2. We're moving our main production sites to a managed farm. We write software, we don't manage servers. I've been pulling this IT group out of a hole that it may never get out of. Too few people, too little experience, too little time to fix it all.
3. Buying better hardware does not guarantee success. The previous incarnations of this IT group believed that spending more money meant better uptime. It just isn't true anymore ... these servers are dual-CPU, multi-NIC, multi-RAID-array machines ($10k CDN), and they fail every 1-3 years. And the failures are often hard, despite the RAID arrays. I'd kill for cheaper hardware, where I could swap in a new machine at will. Instead we're stuck troubleshooting old, expensive hardware, and replacing drives costs more than it should, since we're effectively obliged to keep buying high-end SCSI parts long after they stopped making sense.
One reality I'm learning is that legacy is always a problem. The principle I take from that is that decisions need to be as orthogonal as possible, to make future changes easier. Smaller, simpler, fewer, cheaper.
bender Lives - My pet blogging tool is shaping up. I've written a lot of requirements, and worked on some design, and am part way through converting my site data (my site is the prototype). I've written most of the components now, at least in basic form, and am testing various pieces of functionality.
The next release will likely contain only a few of the UI CGIs, so I can get my new site up. I'm really tired of the current backend, which uses Textpattern; it looked good at first ... but doesn't look like it's intended to remain free forever. I used to use blosxom, which was good -- but I found it had too few built-ins (absolutely everything was a plugin). Bender will be a lot like blosxom, but will contain the essentials as part of the core: things like configuration/meta, a text backend, some auto-markup stuff, and a basic webmin interface (as well as command-line tools).
And this is one of the reasons why I think diversity in software is good. A great project like blosxom gets people thinking, dreaming of how it could be better. The result is better software, as it's the stuff of our dreams.
zeenix - It was a weird epiphany for me that day, based on a whole bunch of reading that came to a head. It's one of those things that should be obvious, but I've been dense and naive. The "So" was a cheap trick, or my inability with the language. But thanks for the kudos.