Using git-annex to Mirror My Music Collection

January 4, 2016

I’ve been interested in git-annex lately, as an offsite backup solution, a synchronization tool, and as a solution to bit rot. @encryptio recently wrote a ‘special remote’ backend for Backblaze’s ultra-cheap B2 storage API, so I figured it was time to give it a try.

Disclaimers

Before attempting any of this, make sure you have a good backup of whatever you’re switching over to git-annex. git-annex heavily modifies your working directory, and if something goes wrong, you could lose data.

I don’t have a proper full-disk offsite backup solution configured yet, partially due to my requirements for encryption and entirely incremental updates on a low budget. I’m working on a solution involving acd_cli and S3QL. Stay tuned for more updates on that later.

However, I do keep a full-disk incremental backup using duplicity. This means I can quickly jump back in time through daily snapshots for the last couple of months, so experimenting with different file management technologies becomes pretty safe.

My Current Music Setup

I use beets to manage music metadata, and mpd to play my music. I have 110 GiB of music, most of which is in flac format. This means the average file is 20-60 MiB in size, with some MP3s and other formats as outliers. Optimal settings may vary based on total repository size, number of files, and the average size of files.

I manage my music library on my desktop, which runs Debian Jessie. I use beets to compress the music as ogg files before syncing it to my portable devices using rsync. In this case, I don’t care about using git-annex for syncing between devices, though I may later. Instead I’m using it as a secondary backup and for bit rot protection.

Getting a Recent Version of git-annex

git-annex is available in the Debian repositories, which should come as no surprise, since Joey Hess, git-annex’s developer, was a Debian Developer until recently.

I’m on Debian stable, but I’m interested in some of the newer features of git-annex. Fortunately, the NeuroDebian team maintains a frequently updated git-annex-standalone package.

Initializing git-annex

This is easy.

$ cd ~/music
$ git init
$ git annex init
$ git annex assistant

Setting up git-annex-remote-b2

Installation is easy. Just download a binary, and toss it in your $PATH. I keep a bin directory in my $HOME for this purpose.
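
A minimal sketch of that install step, assuming the binary has already been downloaded somewhere on disk. The function names install_to_bin and add_to_path are made up for illustration; the only requirement from git-annex's side is that an executable named git-annex-remote-b2 can be found in $PATH.

```shell
# Copy a downloaded binary into a personal bin directory and make it executable.
# Usage: install_to_bin <path-to-binary> [bin-dir]   (bin-dir defaults to ~/bin)
install_to_bin() {
    src=$1
    bindir=${2:-$HOME/bin}
    mkdir -p "$bindir" || return 1
    cp "$src" "$bindir/" || return 1
    chmod +x "$bindir/$(basename "$src")"
}

# Prepend a directory to PATH only if it is not already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                          # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}
```

For example, `install_to_bin ./git-annex-remote-b2 && add_to_path "$HOME/bin"`, after which `command -v git-annex-remote-b2` should print the installed path. To make the PATH change permanent, it would need to go in your shell's startup file.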

Configuration is slightly more complicated.

git-annex-remote-b2 needs your Account ID and Application Key, as well as a bucket name.

Click on ‘My Account’ to get to the buckets list, where you can find your Account ID and Application Key

If you name a bucket that doesn’t exist yet, it’ll create one for you. Just understand the bucket naming restrictions:

When you create a bucket, you get to pick the name for the bucket. The name you pick must be a unique name that has not been used before, by you or by anybody else.

Bucket names can consist of upper-case letters, lower-case letters, numbers, and “-”. A bucket name must be at least 6 characters long, and can be at most 50 characters long. These are all allowed bucket names: myBucket, backblaze-images, and bucket-74358734. Bucket names that start with “b2-” are reserved for Backblaze use.

Source
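
Since a bad bucket name only fails once you talk to the API, it can be handy to check a candidate name locally first. This is a rough sketch of the quoted rules only; valid_bucket_name is a hypothetical helper, it treats the reserved "b2-" prefix as case-sensitive (an assumption), and it cannot check global uniqueness.

```shell
# Return 0 if the name satisfies the B2 bucket naming rules quoted above.
valid_bucket_name() {
    name=$1
    # Only letters, digits, and "-" are allowed; the empty string is rejected too.
    case $name in
        ''|*[!A-Za-z0-9-]*) return 1 ;;
    esac
    # Names starting with "b2-" are reserved for Backblaze.
    case $name in
        b2-*) return 1 ;;
    esac
    # Length must be between 6 and 50 characters, inclusive.
    len=${#name}
    [ "$len" -ge 6 ] && [ "$len" -le 50 ]
}
```

For example, `valid_bucket_name bgw-music && echo ok` prints "ok", while a name like `b2-music` or `music` is rejected.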

For encryption, I used the “shared” option. This stores the encryption key in the repository, but that’s not an issue, because we’ll be encrypting the non-annex portions of the git repository too.

Git-annex has support for chunking, which allows for resumable uploads, and ensures things work smoothly if an excessively large file gets dropped in the repository. The documentation suggests a chunk size of 1 MiB as a starting point, but because of how the b2 backend works, I find I get much better throughput with larger chunks. I chose 25 MiB chunks, which seems like a good compromise.

$ git annex initremote b2 type=external externaltype=b2 accountid=xxxxxxxxxxxx \
                          appkey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
                          encryption=shared chunk=25MiB bucket=bgw-music

Your machine may hang here for a minute as git-annex waits for enough entropy to generate a key.
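
To get a feel for the chunk size trade-off, here is a back-of-the-envelope sketch. Each chunk is stored as a separate object on the remote, so the chunk count is a rough proxy for the number of uploads per file. The file sizes are just the typical FLAC sizes mentioned earlier, and chunks_for is a hypothetical helper, not part of git-annex.

```shell
# Number of chunks needed to store a file of SIZE MiB using CHUNK MiB chunks
# (integer ceiling division).
chunks_for() {
    size_mib=$1
    chunk_mib=$2
    echo $(( (size_mib + chunk_mib - 1) / chunk_mib ))
}

for size in 20 40 60; do
    echo "${size} MiB file: $(chunks_for "$size" 1) chunks at 1 MiB, $(chunks_for "$size" 25) chunks at 25 MiB"
done
```

A 60 MiB file works out to 60 chunks at 1 MiB but only 3 at 25 MiB. The cost of larger chunks is that an interrupted upload has to redo more work when it resumes.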

Tidbit

As of writing, B2 won’t let you delete a bucket with anything in it, and the web interface won’t work with large buckets. This is annoying, but understandable as B2 is still in beta. To delete all the things in a bucket created by git-annex-remote-b2, you can install the B2 command line tool and run something like:

$ b2 ls --long bgw-music | awk '{print $6, $1}' | \
  parallel -j 100 -C ' ' b2 delete_file_version {1} {2}

Syncing Metadata

Your data is ready to be backed up to the cloud, but it’s useless without metadata. If you only care about bit rot protection and syncing, and not backup, you can skip this section. If you fully trust the remote you’re pushing to, you can avoid using gpg and git-remote-gcrypt.

If you aren’t using gpg, you’ll need to set it up first. Of importance, you should make sure that you set a default-key and turn on use-agent in your ~/.gnupg/gpg.conf. If your desktop environment doesn’t already provide it, you’ll also need to run a gpg-agent.
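
As a sketch, the two gpg.conf settings can be added idempotently like this. ensure_gpg_option is a hypothetical helper, and the key ID in the usage example is a placeholder you would replace with your own. It assumes the ~/.gnupg directory already exists.

```shell
# Append an option line to a gpg.conf-style file unless that exact line is
# already present.
# Usage: ensure_gpg_option <conf-file> <option line>
ensure_gpg_option() {
    conf=$1
    line=$2
    # grep fails if the line is absent or the file does not exist yet;
    # in both cases the line gets appended.
    if ! grep -qxF -- "$line" "$conf" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf"
    fi
}
```

For example, `ensure_gpg_option ~/.gnupg/gpg.conf "default-key 0xDEADBEEF"` followed by `ensure_gpg_option ~/.gnupg/gpg.conf use-agent`. Running either command a second time leaves the file unchanged.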

I like AWS CodeCommit for personal private repositories, as it costs $1/mo and is easier than self-hosting. However, any git or ssh server should work.

$ git remote add codecommit \
      gcrypt::ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/annex-music
$ git config gcrypt.publish-participants true     # avoids having to try all keys when decrypting
$ git config remote.codecommit.annex-ignore true  # metadata only
$ git annex sync codecommit

The remote.<name>.annex-ignore option ensures git-annex will only ever sync metadata to this remote. The man page states:

If set to true, prevents git-annex from storing file contents on this remote by default. (You can still request it be used by the --from and --to options.)

This is, for example, useful if the remote is located somewhere without git-annex-shell. (For example, if it’s on GitHub). Or, it could be used if the network connection between two repositories is too slow to be used normally.

This does not prevent git-annex sync (or the git-annex assistant) from syncing the git repository to the remote.

Bugs and Workarounds

I had a ton of issues getting git-remote-gcrypt, git-annex, and AWS CodeCommit to play nicely together. It turns out that CodeCommit doesn’t like ssh connection caching [1], but git-annex automatically turns it on for performance. git-remote-gcrypt makes multiple ssh connections within a short period of time, creating the perfect storm.

$ git config annex.sshcaching false

Presumably if you were using CodeCommit’s HTTPS clone url, this problem would also be avoided.

Configuring mpd and beets

Beets expects to be able to modify files, while git-annex uses symlinks by default to prevent this. This allows git-annex to keep track of how many copies of a file exist, and avoids duplicating files in the working directory, and in the .git directory. Fortunately, we can turn this off in git-annex by using direct mode:

$ git annex direct

FYI: Git-annex 6.x will contain support for “unlocked” files, which deprecates direct mode. I’m using direct mode here, because that functionality is still experimental.

Since git-annex adds a .git directory, I needed to add a .mpdignore file:

$ echo '.git' > ~/music/.mpdignore

No further configuration is needed for beets, because it ignores hidden files by default.

Assistant

While I can manually run git annex sync every time I update my music library, I’d like something more automatic. Fortunately, git annex assistant can monitor repositories using inotify, and sync automatically.

The assistant will automatically monitor repositories it knows about. You can check ~/.config/git-annex/autostart for a list of repository paths.
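
If a repository has been moved or deleted, its stale path can linger in that file. A small sketch for checking the list, assuming the file holds one repository path per line (check_autostart is a hypothetical helper, not a git-annex command):

```shell
# Print each repository path listed in an autostart file, noting whether the
# directory still exists.
# Usage: check_autostart [autostart-file]
check_autostart() {
    file=${1:-$HOME/.config/git-annex/autostart}
    # The "|| [ -n ... ]" part handles a final line without a trailing newline.
    while IFS= read -r repo || [ -n "$repo" ]; do
        [ -n "$repo" ] || continue      # skip blank lines
        if [ -d "$repo" ]; then
            echo "ok       $repo"
        else
            echo "missing  $repo"
        fi
    done < "$file"
}
```

Calling `check_autostart` with no arguments reads the default location mentioned above.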

To start the assistant on all known repositories, simply run

$ git annex assistant --autostart

which should then output something like

git-annex autostart in /home/bgw/music
ok

For desktop sessions that support it, the Debian package adds a /etc/xdg/autostart/git-annex.desktop file that runs this on login.

Webapp

Git-annex also comes with a built-in webserver that displays the current repositories and sync status. It also includes some configuration tools for making new remotes and repositories, though it lacks support for git-annex-remote-b2 and git-remote-gcrypt.

You can run it with

$ git annex webapp
The git-annex webapp.

Or, for desktop sessions that support it, the Debian package adds an entry to your application menu that runs this.

Additional Configuration

We can use the webapp to configure a few more options.

Consistency Checks

In the “configuration” tab, we can choose to enable consistency checks. These use git annex schedule to run git annex fsck periodically.

You can see how this happens under the hood by running

$ git annex schedule .

which should give something like

fsck self 1h weeks divisible by 2 at 1 AM

Standard Groups

Git-annex has some built-in settings for “preferred content”. Preferred content settings tell git-annex what content to automatically sync or get.

I placed my local repository in the “client” group, and my B2 special remote in the “full backup” group.

Placing a repository in a group sets the “wanted” option to “standard”, and applies the group.

$ git annex wanted .    # should output 'standard'
$ git annex wanted b2   # should output 'standard'
$ git annex group .     # should output 'client'
$ git annex group b2    # should output 'backup'

For more technical details, see the git annex help man pages for wanted, group, and preferred-content.
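
The same grouping should also be possible from the command line rather than the webapp, using git annex group and git annex wanted. The sketch below wraps those two commands; the run and set_standard_group helpers and the DRY_RUN variable are illustrative inventions that let you preview the commands before running them.

```shell
# Run a command, or just print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Put a repository in a standard group and have it follow that group's
# preferred-content expression.
# Usage: set_standard_group <repository> <group>
set_standard_group() {
    run git annex group "$1" "$2" &&
    run git annex wanted "$1" standard
}
```

For example, `DRY_RUN=1 set_standard_group b2 backup` only prints the two git annex commands; without DRY_RUN, `set_standard_group . client` and `set_standard_group b2 backup` would apply them inside the repository.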

Restoring

You can’t trust a backup until you’ve successfully restored from it.

$ cd ~/tmp
$ git clone gcrypt::ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/annex-music
$ cd annex-music
$ git annex enableremote b2
$ git annex get --from b2 'adele/21/01 rolling in the deep.flac'
get adele/21/01 rolling in the deep.flac (from b2...) 
(checksum...) ok        
(recording state in git...)
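
git-annex already verifies the checksum on download, but for extra peace of mind you can compare the restored file against the original yourself. A minimal sketch using cmp; verify_restore is a hypothetical helper, and it assumes the original has not been modified (for example, retagged by beets) since the last sync.

```shell
# Compare a restored file against the original, byte for byte.
# Usage: verify_restore <original> <restored>
verify_restore() {
    original=$1
    restored=$2
    if cmp -s "$original" "$restored"; then
        echo "match: $restored"
    else
        echo "MISMATCH: $restored" >&2
        return 1
    fi
}
```

For example: `verify_restore ~/music/'adele/21/01 rolling in the deep.flac' ~/tmp/annex-music/'adele/21/01 rolling in the deep.flac'`. In a fresh clone the restored path is a symlink into the annex, which cmp follows.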

What Does the Cloud See?

Backblaze gets some rough information on post-compression file size, somewhat obscured by chunking. Depending on use, they may also get some minor information on file access patterns.

File names are hashed and contents are encrypted.

AWS gets even less: encrypted git objects. This may reveal the rough size of the repository, and some usage patterns.

AWS gets encrypted git objects.

Conclusions

Git-annex is an impressive piece of software. However, its excellent customizability comes at great cost. Non-standard configurations, like mine with B2 and git-remote-gcrypt, can add a lot of complexity. The extensive documentation helps here, but is hindered by occasional issues, like the AWS CodeCommit and ssh caching bug I faced. [2]

The webapp and assistant make a lot of sense as a way to simplify the process: they satisfy the simpler, more straightforward use cases and paper over some of the complexities underneath. Unfortunately, the moment you veer off that path, it feels like learning git all over again.

I’m interested to see how git-annex evolves.


  1. This required hours of sleuthing and some reading of git-annex’s source code to figure out.

  2. Technically I think this is AWS’ fault, but it still hurts the usability.
