I recently reviewed one of my most spectacular (System Administration related) mistakes: I broke a few file systems and lost a good amount of data, and it was pretty much all my fault. Since I wrote it up anyway, I figured I might as well post it here, even though it's been a few years...
In early 2006, while I was working as a System Administrator at Stevens Institute of Technology, we bought a new Apple Xserve RAID storage device and populated its 14 disk slots with 400 GB drives. After RAID overhead, file system overhead, and the well-known discrepancy between binary and decimal units, this yielded two RAID5 sets with a capacity of 2.2 TB each -- an impressive amount of affordable storage space at the time.
The "left" RAID was dedicated to storing large amounts of video and audio data made available to clients running Mac OS X. We connected its RAID controller via FibreChannel to a SAN switch, and from there to an Apple Xserve server, which managed the HFS+ file system on this storage.
The second 2.2 TB of storage space, the "right" side of the array, was meant to become the central data space for all workstations in the Computer Science and Mathematics departments as well as their laboratories. Up until then, this file space had been provided via NFS from a two-module SGI Origin 200 server running IRIX, managing a few internal SCSI disks as well as some FibreChannel direct-attached storage. We intended to migrate the data onto the Xserve RAID and to have it served by a Solaris 10 server, allowing us to take advantage of several advanced features of the fairly new ZFS and to retire the aging IRIX box.
With everything neatly racked, I connected the second RAID controller and the new Solaris server to the SAN switch, and then proceeded to create a new ZFS file system. I connected the FibreChannel storage from the IRIX server and started to copy the data onto the new ZFS file system. Sitting in the server room, I could see the Xserve RAID; I noticed that the lights on the left side of the array indicated significant disk activity, but I initially dismissed this as nothing out of the ordinary. A few seconds later, when the right side still showed no I/O, it dawned on me: the Solaris host was writing data over the live file system instead of onto the new disks!
I immediately stopped the data transfer and even physically disconnected the Solaris server, but the damage was done: I had inadvertently created a new ZFS file system on the disks already containing (and serving) an HFS+ file system! As it turned out, I had not placed the Solaris server into the correct SAN zone on the FibreChannel switch, so the only storage device it could see was the left side of the array. But since both sides of the array were identical in size, RAID type, and manufacturer, the mix-up was easy to miss, and I proceeded on the assumption that, thanks to proper SAN zoning, it was safe to write to the device.
Interestingly, even as I was overwriting the live file system, data was still being written to and read from the HFS+ file system on the Apple server; I observed only intermittent I/O errors. Thinking I could still save the data, I made my next big mistake: I shut down the Apple server, hoping that a clean boot and a file system check could correct what I still believed was a minor problem.
Unfortunately, when the server came back up, it was unable to find a file system on the attached RAID array -- it simply could not identify the device. diskutil(8) displayed the disk's partition table as:
   /dev/disk3
   #:                     TYPE NAME                    SIZE       IDENTIFIER
   0:   GUID_partition_scheme                         *2.2 TB     disk3
   1:   6A85CF4D-1DD2-11B2-99A6-08002073               128.0 MB   disk3s1
   2:   6A87C46F-1DD2-11B2-99A6-08002073               128.0 MB   disk3s2
   3:   6A898CC3-1DD2-11B2-99A6-08002073               2.2 TB     disk3s7
   4:   6A945A3B-1DD2-11B2-99A6-08002073               8.0 MB     disk3s9
In retrospect, this is no surprise: the Solaris server had constructed a new (and different) file system on the device (the diskutil(8) output actually shows the Solaris partition table here) and destroyed the HFS+-specific file system metadata stored at the beginning of the disks. Unsurprisingly, mounting the disk failed:
   $ mount /dev/disk3s7 /mnt3
   /dev/disk3s7 on /mnt3: Incorrect super block.
That is, even though the blocks containing the data had likely not been overwritten, there was no way left to identify them. One of my many futile recovery attempts involved pulling one of the disks out of the RAID and inserting a new one -- which of course had no effect, since the RAID did exactly what it was supposed to do: it rebuilt the disk as it had been before the array was degraded, bogus disk label and all. Another attempt involved recreating an identical disk label from the right side of the Xserve RAID and applying it to the left side: a wonderful mistake, since using diskutil(8) to apply the disk label also triggered a newfs, further obliterating any chance of recovering the data.
After many hours of trying to recreate the HFS+ metadata, I had to face the fact that it was simply impossible. Worse, I had neglected to verify that backups of the server were in place before putting it into production use -- fatal mistake number three! The data was irrevocably lost; the only upside was that I had learned a lot about data recovery, SAN zoning, ZFS, HFS+, and file systems in general.
Looking back at this experience six years later, I'm still amazed that most of the damage could have been avoided had I realized that keeping the OS X server running was the right course of action: it could still access the file system, despite a second server having created new file system metadata on the disks. The ZFS metadata was small compared to the file system as a whole, and once writes to the ZFS file system had stopped, I should have been able to recover a lot of the existing files. Instead, I panicked and pulled the plug, making the situation much worse. It won't surprise you that I haven't made this particular series of mistakes again since...

2012-09-23