darktrain.nuxx.net Server Issues and Disk Replacement

My current webserver, darktrain.nuxx.net, has been working well for a couple years, despite needing a proactive (due to bad BIOS chip) motherboard replacement and the normal quirks. This past Saturday morning, about 10am, one of the hard drives failed. Due to the use of a ZFS mirror pool for the root filesystem this shouldn’t have caused any problems, but it did. On top of that, due to not rebooting the server in 600-some days I ran into a few other quirks. Here’s what all happened, in chronological order, to get it running stable again:

  • Second hard disk, /dev/ada1, fails. ZFS throws up on itself and the storage basically falls out from under the OS. As a result, everything not in memory and database-backed websites fail.
  • An OS initiated reboot wouldn’t work (seemed to loop during sync) I powered off the server manually.
  • Upon powering the server up disk performance was really bad until /dev/ada1 was removed from the mirror pool. After this point disks settled out and all was good.
  • Outbound email from server wasn’t working due to DKIM-Milter / OpenDKIM failing to start. This could be bypassed, but this wasn’t a good solution because the MMBA Forum sends a fair bit of email notifications. DKIM-Milter failed to start because OpenSSL had been rebuilt due to Heartbleed  bug, but as I hadn’t restarted it since upgrading OpenSSL I didn’t notice the issue.
  • DKIM-Milter couldn’t be upgraded from Ports because FreeBSD 9.0-RELEASE (which was still running) had been depreciated and Ports intentionally broken on this release.
  • OS upgraded to FreeBSD 9.2-RELEASE-p6 using freebsd-update. DNS and mail broke, but this was fairly easy to fix. Update otherwise went smoothly.
  • Ports updated, OpenDKIM rebuilt, mail working again.
  • Upgraded ZFS on remaining disk with zpool upgrade -a command, then wrote new bootcode to ada0 using gpart bootcode -b /mnt2/boot/pmbr -p /mnt2/boot/gptzfsboot -i 1 ada0.

At this point the server was stable and I was able to replace the failed disk. The previous setup was with two Seagate ST1000DM003 disks (the mirror pool) and one Crucial M4 SSD (L2ARC). The biggest difficulty in replacing the disk is not the $54.44 cost of the replacement purchase; it’s setting up time to access the server in the data center. Since there was still one free disk bay in the server, instead of just replacing the one failed disk I decided to put two new ones in. These will then be configured into a three-way mirror pool with the SSD L2ARC. It cost a bit more, but now when the next magnetic disk dies (remember, all parts die eventually) I can drop it from the pool and still have two properly working drives, all without another data center visit.

During lunch today I headed over to the facility housing the server in Southfield (conveniently, only 15-20 minutes from work) and within the span of 12 minutes I’d met the escorts, downed the server, swapped the disks, and brought it back up confirming that they are in place and functional.

After getting the disks back I used hints from the FreeBSD Root on ZFS (Mirror) using GPT article to get the new disks partitioned for swap and boot, then added the /dev/ada1p3 and /dev/ada2p3 partitions to the mirror pool and made sure the L2ARC was working. Now everything’s (essentially) back to functionally normal, hopefully with better reliability than before.

So, what’s next? Probably a FreeBSD 10.0-RELEASE upgrade, and better staying on top of patch levels so I don’t suffer the same fate as last time. Being a whole version upgrade there’ll need to be a good bit more planning and testing than this go around, but so long as I’m doing it less urgently, all should be good.

Leave a comment