Fixing nuxx.net’s Backups

nuxx.net has a backup system which uses rsync to back up the server (and everything it hosts) to my desktop iMac, deduplicating the files by way of rsync’s --link-dest argument to create hard links (detailed info here). Maintaining a 30 day backup history it only uses the space of one full backup plus 29 days of change, which isn’t much given the utility allowed. It’s been running fine for a few years, but about a month or so ago I began getting errors where the target disk for the backups was full, which seemed a bit odd to me. As a stopgap I was able to clean up some other things and keep backups working, but the space was a lot more than normal.

Today I managed to get this resolved, but it took a bit of digging. Below the fold you can see what I found.

After a bit of initial investigation, this is what I knew:

· Some files, generally those written before a certain date, were appropriately being deduplicated via hard links and thus had the same inode.
· Newer files weren’t getting deduplicated and had different inodes.
· The lack of deduplication (especially with the large nightly database dumps) was contributing to the volume becoming full.
· Looking at the Number of files transferred: entries in each of the daily rsync logs the numbers weren’t increasing, as one would expect if existing files weren’t being found locally, so the backup must have been referencing the files in --link-dest. They generally stayed at 100-300 per day, for the last many months.

By setting up two test cases running rsync -vvvv (a very verbose mode) against one file that wasn’t being deduplicated and one that was, I saw these lines in the not-getting-deduplicated log file needed it’s UID and GID changed:

set uid of IMG_7118.jpg from 99 to 1001
set gid of IMG_7118.jpg from 99 to 1001

In both cases I saw that the file wasn’t being transferred. Both looked similar to this:

Number of files: 1
Number of files transferred: 0
Total file size: 4779530 bytes
Total transferred file size: 0 bytes

Checking this file on the target disk showed that it’s UID and GID were not changed as described in the verbose log, so that gave me a place to start.

I then looked at the disk from the machine itself, and while I could create files, I couldn’t change their ownership:

quicksilver:darktrain.nuxx.net root# touch blah
quicksilver:darktrain.nuxx.net root# ls -alsn blah
0 -rw-r--r-- 1 99 99 0 Nov 29 13:17 blah
quicksilver:darktrain.nuxx.net root# chown 1001:1001 blah
quicksilver:darktrain.nuxx.net root# ls -alsn blah
0 -rw-r--r-- 1 99 99 0 Nov 29 13:17 blah
quicksilver:darktrain.nuxx.net root#

Based on this, I figured I’d see what fsck had to say:

quicksilver:~ root# fsck_hfs -f /dev/disk3s3
** /dev/rdisk3s3 (NO WRITE)
        Using cacheBlockSize=32K cacheTotalBlock=65536 cacheSize=2097152K.
   Executing fsck_hfs (version diskdev_cmds-557~393).
** Verifying volume when it is mounted with write access.
   Journal need to be replayed but volume is read-only
** Checking Journaled HFS Plus volume.
** Detected a case-sensitive volume.
   The volume name is Cruft
** Checking extents overflow file.
** Checking catalog file.
** Checking multi-linked files.
   Incorrect number of file hard links
** Checking catalog hierarchy.
** Checking extended attributes file.
** Checking volume bitmap.
** Checking volume information.
** The volume Cruft was found corrupt and needs to be repaired.
quicksilver:~ root#

Well, that’s something… Maybe diskutil, Apple’s utility for repairing disks, can help out:

quicksilver:~ root# diskutil repairVolume /Volumes/Cruft/
Started file system repair on disk3s3 Cruft
Checking file system
Checking Journaled HFS Plus volume
Detected a case-sensitive volume
Checking extents overflow file
Checking catalog file
Checking multi-linked files
Incorrect number of file hard links
Checking catalog hierarchy
Checking extended attributes file
Checking volume bitmap
Checking volume information
Repairing volume
Orphaned file inode (id = 108285)
Orphaned file inode (id = 3609024)
Look for missing items in lost+found directory
Rechecking volume
Checking Journaled HFS Plus volume
Detected a case-sensitive volume
Checking extents overflow file
Checking catalog file
Checking multi-linked files
Checking catalog hierarchy
Incorrect folder count in a directory (id = 2)
(It should be 13 instead of 12)
Checking extended attributes file
Checking volume bitmap
Checking volume information
Repairing volume
Rechecking volume
Checking Journaled HFS Plus volume
Detected a case-sensitive volume
Checking extents overflow file
Checking catalog file
Checking multi-linked files
Checking catalog hierarchy
Checking extended attributes file
Checking volume bitmap
Checking volume information
The volume Cruft was repaired successfully
Volume repair complete
Updating boot support partitions for the volume as required
Finished file system repair on disk3s3 Cruft
quicksilver:~ root#

Unfortunately, that didn’t make it better either. After reading up a bit on Apple’s Ignore ownership setting for external drives (which I explicitly had to disable when setting this system up and had somewhat forgotten about) I found this setting is stored in /var/db/volinfo.database:

quicksilver:darktrain.nuxx.net root# cat /var/db/volinfo.database
AB8A61B83C455765: 00000001
D3572298D5E48D13: 00000001
66408CB4FDEFB639: 00000001
quicksilver:darktrain.nuxx.net root#

Hmm, okay. Only three entries… That seems odd. I have more drives than that. Apparently this setting is also be reflected as the noowners option on mount:

quicksilver:darktrain.nuxx.net root# mount
/dev/disk0s2 on / (hfs, local, journaled)
devfs on /dev (devfs, local, nobrowse)
/dev/disk1s2 on /Volumes/Thunder (hfs, local, journaled)
map -hosts on /net (autofs, nosuid, automounted, nobrowse)
map auto_home on /home (autofs, automounted, nobrowse)
/dev/disk2s2 on /Volumes/WIN7 (ntfs, local, nodev, nosuid, read-only, noowners)
/dev/disk5s2 on /Volumes/Time Machine (hfs, local, nodev, nosuid, journaled)
/dev/disk3s3 on /Volumes/Cruft (hfs, local, nodev, nosuid, journaled, noowners)
quicksilver:darktrain.nuxx.net root#

Ah hah! Now to get rid of noowners on /Volumes/Cruft…

quicksilver:~ root# diskutil enableOwnership /Volumes/Cruft/
File system user/group ownership enabled
quicksilver:~ root#

Are there now more entries in /var/db/volinfo.database?

quicksilver:darktrain.nuxx.net root# cat /var/db/volinfo.database
AB8A61B83C455765: 00000001
D3572298D5E48D13: 00000001
66408CB4FDEFB639: 00000001
A345AB3A480EFF2A: 00000001
quicksilver:darktrain.nuxx.net root#

Yep! So, did it work?

quicksilver:darktrain.nuxx.net root# mount
/dev/disk0s2 on / (hfs, local, journaled)
devfs on /dev (devfs, local, nobrowse)
/dev/disk1s2 on /Volumes/Thunder (hfs, local, journaled)
map -hosts on /net (autofs, nosuid, automounted, nobrowse)
map auto_home on /home (autofs, automounted, nobrowse)
/dev/disk2s2 on /Volumes/WIN7 (ntfs, local, nodev, nosuid, read-only, noowners)
/dev/disk5s2 on /Volumes/Time Machine (hfs, local, nodev, nosuid, journaled)
/dev/disk3s3 on /Volumes/Cruft (hfs, local, nodev, nosuid, journaled)
quicksilver:darktrain.nuxx.net root#

Mount looks good…

quicksilver:darktrain.nuxx.net root# ls -alsn blah
0 -rw-r--r-- 1 0 80 0 Nov 29 16:53 blah
quicksilver:darktrain.nuxx.net root# chown 1001:1001 blah
quicksilver:darktrain.nuxx.net root# ls -alsn blah
0 -rw-r--r-- 1 1001 1001 0 Nov 29 16:53 blah
quicksilver:darktrain.nuxx.net root#

Looks like it did! Now to run two backup jobs (one to ensure the data is all there with the right permissions, and the second to reference it) and see if the those two most recent backups (backup.0 and backup.1) have the previously-problematic file now sharing the same inode:

quicksilver:darktrain.nuxx.net root# ls -i backup.*/c0nsumer/gallery2_data/nuxx.net/albums/acquired_stuff/IMG_7118.jpg
80384029 backup.0/c0nsumer/gallery2_data/nuxx.net/albums/acquired_stuff/IMG_7118.jpg
80384029 backup.1/c0nsumer/gallery2_data/nuxx.net/albums/acquired_stuff/IMG_7118.jpg
80144307 backup.2/c0nsumer/gallery2_data/nuxx.net/albums/acquired_stuff/IMG_7118.jpg
79905924 backup.3/c0nsumer/gallery2_data/nuxx.net/albums/acquired_stuff/IMG_7118.jpg
[...snip...]
quicksilver:darktrain.nuxx.net root#

They do! So, that’s what the problem was… Because the ownership of the files on the destination (the iMac) couldn’t be changed, rsync apparently created a copy instead of a hard link. Thus the increase in space. But, apparently because the checksum of the files matched it was never re-transferred, which is why there was not a consistently increasing Number of files transfered in the rsync log.

That brings about the next question: how did this happen? I suspect that when I replaced my failed SSD in the iMac (where the OS resides) /var/db/volinfo.database didn’t get restored by Time Machine. Or, maybe it did and the external drive was given a new identifier? I’m not sure what the entries in volinfo.database refer to, as they aren’t the volume GUIDs, so I can’t really check that.

Still, I’m glad this problem is sorted out and I know why it wasn’t working. I’ll now just wait 30 days for the old, only-partially-deduplicated backups to rotate out and everything will be good again.