Using git-annex to Mirror My Music Collection
January 4, 2016
Update (2019): My requirements changed, and I no longer use this solution. I have since switched to a combination of restic and syncthing.
I’ve been interested in git-annex lately, as an offsite backup solution, a synchronization tool, and as a solution to bit rot. @encryptio recently wrote a ‘special remote’ backend for Backblaze’s ultra-cheap B2 storage API, so I figured it was time to give it a try.
Disclaimers
Before attempting any of this, make sure you have a good backup of whatever you’re switching over to git-annex. git-annex heavily modifies your working directory, and if something goes wrong, you could lose data.
My Current Music Setup
I use beets to manage music metadata, and mpd to play my music. I have 110 GiB of music, most of which is in flac format. This means the average file is 20-60 MiB in size, with some MP3s and other formats as outliers. Optimal settings may vary based on total repository size, number of files, and the average size of files.
I manage my music library on my desktop, which runs Debian Jessie. I use beets to compress the music as ogg files before syncing it to my portable devices using rsync. In this case, I don’t care about using git-annex for syncing between devices, though I may later. Instead I’m using it as a secondary backup and for bit rot protection.
Getting a Recent Version of git-annex
git-annex is available in the Debian repositories, which should come as no surprise, since Joey Hess, git-annex’s developer, was a Debian Developer until recently.
I’m on Debian stable, but I’m interested in some of the newer features of git-annex. Fortunately, the NeuroDebian team maintains a frequently updated git-annex-standalone package.
Initializing git-annex
This is easy.
$ cd ~/music
$ git init
$ git annex init
$ git annex assistant
Setting up git-annex-remote-b2
Installation is easy. Just download a binary, and toss it in your $PATH. I keep a bin directory in my $HOME for this purpose.
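As a rough sketch of that install step: the helper names below (install_to_bin, on_path) are my own, not part of git-annex or git-annex-remote-b2, and the default ~/bin location is just the convention described above.

```sh
# Hypothetical helper: copy a downloaded binary into a personal bin directory
# and mark it executable. The second argument defaults to ~/bin.
install_to_bin() {
    src=$1
    bindir=${2:-"$HOME/bin"}
    mkdir -p "$bindir" || return 1
    cp "$src" "$bindir/" || return 1
    chmod 755 "$bindir/$(basename "$src")"
}

# Succeeds if the given directory is listed in $PATH.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   install_to_bin ./git-annex-remote-b2
#   on_path "$HOME/bin" || echo 'add ~/bin to your PATH'
```

git-annex looks up external special remotes by name on $PATH, so the on_path check is worth running before moving on.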
Configuration is slightly more complicated. git-annex-remote-b2 needs your Account ID and Application Key, as well as a bucket name.

If you name a bucket that doesn’t exist yet, it’ll create one for you. Just understand the bucket naming restrictions:
When you create a bucket, you get to pick the name for the bucket. The name you pick must be a unique name that has not been used before, by you or by anybody else.
Bucket names can consist of upper-case letters, lower-case letters, numbers, and “-”. A bucket name must be at least 6 characters long, and can be at most 50 characters long. These are all allowed bucket names: myBucket, backblaze-images, and bucket-74358734. Bucket names that start with “b2-” are reserved for Backblaze use.
— Source
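The quoted rules are easy to check locally before handing a name to initremote. This is a minimal sketch based only on the rules quoted above; valid_bucket_name is a made-up helper, and it cannot check global uniqueness, which only B2 itself knows.

```sh
# Hypothetical helper: returns 0 if the name satisfies the quoted B2 rules
# (6-50 characters, only letters, digits and "-", not starting with "b2-").
valid_bucket_name() {
    name=$1
    len=${#name}
    [ "$len" -ge 6 ] && [ "$len" -le 50 ] || return 1
    case $name in
        b2-*) return 1 ;;               # reserved prefix
        *[!A-Za-z0-9-]*) return 1 ;;    # any character outside the allowed set
    esac
    return 0
}

# Usage:
#   valid_bucket_name bgw-music && echo ok
```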
For encryption, I used the “shared” option. This stores the encryption key in the repository, but that’s not an issue, because we’ll be encrypting the non-annex portions of the git repository too.
Git-annex has support for chunking, which allows for resumable uploads, and ensures things work smoothly if an excessively large file gets dropped in the repository. The documentation suggests a chunk size of 1 MiB as a starting point, but because of how the b2 backend works, I get much better throughput with larger chunks. I chose 25 MiB chunks, which seems like a good compromise.
$ git annex initremote b2 type=external externaltype=b2 accountid=xxxxxxxxxxxx \
appkey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
encryption=shared chunk=25MiB bucket=bgw-music
Your machine may hang here for a minute as git-annex waits for enough entropy to generate a key.
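To put the chunk size choice in numbers, here is a small sketch. It assumes each chunk is stored as its own object and so costs at least one upload request; chunk_count is a hypothetical helper, and the sizes are whole MiB for simplicity.

```sh
# Hypothetical helper: number of chunks needed for a file of SIZE MiB
# when split into CHUNK MiB pieces (ceiling division).
chunk_count() {
    size_mib=$1
    chunk_mib=$2
    echo $(( (size_mib + chunk_mib - 1) / chunk_mib ))
}

# Usage, for a 40 MiB flac file:
#   chunk_count 40 1     # prints 40
#   chunk_count 40 25    # prints 2
```

With files in the 20-60 MiB range, 25 MiB chunks mean one to three requests per file, while uploads can still resume partway through a larger file.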
Tidbit
As of writing, B2 won’t let you delete a bucket with anything in it, and the web interface won’t work with large buckets. This is annoying, but understandable as B2 is still in beta. To delete all the things in a bucket created by git-annex-remote-b2, you can install the B2 command line tool and run something like:
$ b2 ls --long bgw-music | awk '{print $6, $1}' | \
parallel -j 100 -C ' ' b2 delete_file_version {1} {2}
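Since that pipeline is destructive, a dry run is a reasonable first step. The sketch below prints the delete commands instead of running them. It assumes the same column layout the awk call above relies on (file ID in field 1, file name in field 6) and that names contain no whitespace; print_delete_commands is a made-up name.

```sh
# Hypothetical helper: reads `b2 ls --long` style lines on stdin and prints
# one delete command per line, without executing anything.
print_delete_commands() {
    awk '{print "b2 delete_file_version", $6, $1}'
}

# Usage:
#   b2 ls --long bgw-music | print_delete_commands | head
```

If the printed commands look right, the real pipeline above can be run as written.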
Syncing Metadata
Your data is ready to be backed up to the cloud, but it’s useless without metadata. If you only care about bit rot protection and syncing, and not backup, you can skip this section. If you fully trust the remote you’re pushing to, you can avoid using gpg and git-remote-gcrypt.
If you aren’t using gpg, you’ll need to set it up first. Of importance, you should make sure that you set a default-key and turn on use-agent in your ~/.gnupg/gpg.conf. If your desktop environment doesn’t already provide it, you’ll also need to run a gpg-agent.
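A sketch of one way to add those two options without duplicating them on repeated runs. ensure_line is a hypothetical helper, and 0xDEADBEEF is a placeholder, not a real key ID.

```sh
# Hypothetical helper: append LINE to FILE unless an identical line exists.
ensure_line() {
    file=$1
    line=$2
    mkdir -p "$(dirname "$file")"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (replace the placeholder with your own key ID):
#   ensure_line ~/.gnupg/gpg.conf 'default-key 0xDEADBEEF'
#   ensure_line ~/.gnupg/gpg.conf 'use-agent'
```

On GnuPG 2.1 and later the agent is always used, so use-agent is a harmless no-op there.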
I like AWS CodeCommit for personal private repositories, as it costs $1/mo and is easier than self-hosting. However, any git or ssh server should work.
$ git remote add codecommit \
gcrypt::ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/annex-music
$ git config gcrypt.publish-participants true # avoids having to try all keys when decrypting
$ git config remote.codecommit.annex-ignore true # metadata only
$ git annex sync codecommit
The remote.<name>.annex-ignore option ensures git-annex will only ever sync metadata to this remote. The man page states:
If set to true, prevents git-annex from storing file contents on this remote by default. (You can still request it be used by the --from and --to options.)
This is, for example, useful if the remote is located somewhere without git-annex-shell. (For example, if it’s on GitHub). Or, it could be used if the network connection between two repositories is too slow to be used normally.
This does not prevent git-annex sync (or the git-annex assistant) from syncing the git repository to the remote.
Bugs and Workarounds
I had a ton of issues getting git-remote-gcrypt, git-annex, and AWS CodeCommit to play nicely together. It turns out that CodeCommit doesn’t like ssh connection caching, but git-annex automatically turns it on for performance. git-remote-gcrypt makes multiple ssh connections within a short period of time, creating the perfect storm.
$ git config annex.sshcaching false
Presumably if you were using CodeCommit’s HTTPS clone url, this problem would also be avoided.
Configuring mpd and beets
Beets expects to be able to modify files, while git-annex uses symlinks by default to prevent this. This allows git-annex to keep track of how many copies of a file exist, and avoids duplicating files in the working directory and in the .git directory. Fortunately, we can turn this off in git-annex by using direct mode:
$ git annex direct
FYI: Git-annex 6.x will contain support for “unlocked” files, which deprecates direct mode. I’m using direct mode here, because that functionality is still experimental.
Since git-annex adds a .git directory, I needed to add a .mpdignore file:
$ echo '.git' > ~/music/.mpdignore
No further configuration is needed for beets, because it ignores hidden files by default.
Assistant
While I can manually run git annex sync every time I update my music library, I’d like something more automatic. Fortunately, git annex assistant can monitor repositories using inotify, and sync automatically.
The assistant will automatically monitor repositories it knows about. You can check ~/.config/git-annex/autostart for a list of repository paths.
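If a repository has been moved or deleted, a stale entry can linger in that file. The sketch below walks the list and reports which paths still exist. It assumes the file holds one repository path per line; check_autostart is a made-up helper name.

```sh
# Hypothetical helper: report whether each path listed in the autostart file
# still exists. The argument defaults to the usual per-user location.
check_autostart() {
    file=${1:-"$HOME/.config/git-annex/autostart"}
    [ -r "$file" ] || { echo "no autostart file at $file" >&2; return 1; }
    while IFS= read -r repo; do
        [ -n "$repo" ] || continue      # skip blank lines
        if [ -d "$repo" ]; then
            echo "ok      $repo"
        else
            echo "missing $repo"
        fi
    done < "$file"
}

# Usage:
#   check_autostart
```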
To start the assistant on all known repositories, simply run
$ git annex assistant --autostart
which should then output something like
git-annex autostart in /home/bgw/music
ok
For desktop sessions that support it, the Debian package adds a /etc/xdg/autostart/git-annex.desktop file that runs this on login.
Webapp
Git-annex also comes with a built-in webserver that displays the current repositories and sync status. It also includes some configuration tools for making new remotes and repositories, though it lacks support for git-annex-remote-b2 and git-remote-gcrypt.
You can run it with
$ git annex webapp

Or, for desktop sessions that support it, the Debian package adds an entry to your application menu that runs this.
Additional Configuration
We can use the webapp to configure a few more options.
Consistency Checks
In the “configuration” tab, we can choose to enable consistency checks. These use git annex schedule to run git annex fsck periodically.
You can see how this happens under the hood by running
$ git annex schedule .
which should give something like
fsck self 1h weeks divisible by 2 at 1 AM
Standard Groups
Git-annex has some built-in settings for “preferred content”. Preferred content settings tell git-annex what content to automatically sync or get.
I placed my local repository in the “client” group, and my B2 special remote in the “full backup” group.
Placing a repository in a group sets the “wanted” option to “standard”, and applies the group.
$ git annex wanted . # should output 'standard'
$ git annex wanted b2 # should output 'standard'
$ git annex group . # should output 'client'
$ git annex group b2 # should output 'backup'
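I made these assignments through the webapp, but the same git annex group and git annex wanted commands can also set the values when given an extra argument. The sketch below is how I’d expect the command line equivalent to look, not a record of what the webapp runs; assign_groups and the RUN variable are my own scaffolding so the commands can be previewed first.

```sh
# Set RUN=echo to print the commands instead of executing them.
RUN=${RUN:-}

# Hypothetical helper: put the local repository in "client" and the b2
# remote in "backup", with both using the standard preferred content.
assign_groups() {
    $RUN git annex group . client
    $RUN git annex wanted . standard
    $RUN git annex group b2 backup
    $RUN git annex wanted b2 standard
}

# Usage:
#   RUN=echo; assign_groups    # preview only
#   RUN=; assign_groups        # run inside the annex repository
```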
For more technical details, see the git annex help man pages for wanted, group, and preferred-content.
Restoring
You can’t trust a backup until you’ve successfully restored from it.
$ cd ~/tmp
$ git clone gcrypt::ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/annex-music
$ cd annex-music
$ git annex enableremote b2
$ git annex get --from b2 'adele/21/01 rolling in the deep.flac'
get adele/21/01 rolling in the deep.flac (from b2...)
(checksum...) ok
(recording state in git...)
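git-annex already verifies the checksum on download, but an independent byte-for-byte comparison against the original is a cheap extra check. A minimal sketch, where verify_restore is a hypothetical helper and the clone directory in the usage line is an assumption:

```sh
# Hypothetical helper: compare an original file with its restored copy.
# cmp follows symlinks, so it works on annexed files that are symlinks.
verify_restore() {
    original=$1
    restored=$2
    if cmp -s "$original" "$restored"; then
        echo "match: $restored"
    else
        echo "MISMATCH: $restored" >&2
        return 1
    fi
}

# Usage (paths are assumptions based on the commands above):
#   f='adele/21/01 rolling in the deep.flac'
#   verify_restore ~/music/"$f" ~/tmp/annex-music/"$f"
```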
What Does the Cloud See?
Backblaze gets some rough information on post-compression file size, somewhat obscured by chunking. Depending on use, they may also get some minor information on file access patterns.

AWS gets even less: encrypted git objects. This may reveal the rough size of the repository, and some usage patterns.

Conclusions
Git-annex is an impressive piece of software. However, its excellent customization comes at great cost. Non-standard configurations, like mine with B2 and git-remote-gcrypt, can add a lot of complexity. The extensive documentation helps here, but is hindered by occasional issues, like the AWS CodeCommit and ssh caching bug I faced.
The webapp and assistant make a lot of sense as a way to simplify the process: they satisfy the simpler, more straightforward use-cases and paper over some of the complexities underneath. Unfortunately, the moment you veer off that path, it feels like learning git all over again.
I’m interested to see how git-annex evolves.