This is a tutorial for the Weigel group’s difflines project. I use difflines analyses here so that our group has concrete examples to work with. Obviously none of this will work for you as-is if you’re not part of our group, but hopefully this is helpful by analogy.
Part 1: Background and repositories
I’ll be assuming you’re already familiar with git itself. If you’re not, or would like a refresher, I suggest either the git tutorial or the Software Carpentry git course. Also, please re-read my generic git-annex tutorial if you haven’t seen it recently. To briefly recap: git is a tool that tracks changes within some set of files, typically software source code. It is a distributed version control system, meaning that each copy of the repository contains the full history of every file. Because this doesn’t handle large files well, git-annex is a git extension that tracks changes to data files, and eases sharing only parts of a git repository between checkouts (e.g. servers or colleagues).
As with most projects using git, we have a shared, centralised copy of the repository, and we each have our own checkouts of the repo in which we do our work. When we have finished some feature, we commit the changes (including to any data files) and push/sync them to the centralised repository. This will be familiar to any of you who have contributed to an open source project: for example, I have contributed some small changes to the htslib library. To do so, I forked the htslib repo to my GitHub account, cloned my fork to my local computer (git clone https://github.com/kdm9/htslib), made my changes, committed, and pushed. I then opened a pull request, which the samtools authors merged.
We will use an analogous process for interacting with our shared code and data repository using
git annex. Our “upstream” repository will always be at
/ebio/abt6_projects7/diffLines_20/annex on the ebio storage. To avoid conflicting changes if multiple people work within this directory at once, we will each use a personal copy of the repository to run our analyses (ideally under
/tmp/global2/$USER). We each may have additional personal copies on our laptops. We can then make our changes, commit, and run
git annex sync to push these changes to our upstream repository. Thus, the latest copy of our combined code and data should always be in the “upstream” repository.
Some brief background on how git annex actually works: in essence, it tracks large files as pointers (symlinks) to checksummed content. That’s all that git itself sees. Separately, git annex can synchronise the target content of these pointers between checkouts. This can happen either automatically, according to some rules, or manually, and subsets of the total list of content can be copied. These pointers are immutable, but can be temporarily turned back into a normal file to allow editing.
Part 2: A worked example with my pangenome work
OK, so how do we actually use it? The git annex workflow is very similar to that of git. One makes changes, then stages, commits, and pushes them to a remote. What follows is an example of how I ran an analysis, and made the result available to Luisa via our git annex repository.
For background, I have been working on getting the NLR pangenome that difflines has created into a form usable by
gramtools. This involves taking Max’s genome assemblies and Luisa’s NLR annotations, extracting gene sequences, aligning them with MAFFT, and running
make_prg to create Pangenome Reference Graphs (PRGs). These can then be used as input for gramtools to genotype short reads against.
Step 1: A personal workspace
So that I can work in peace, and so that my changes are not intermingled with those of others, we should all work in separate directories. So, let’s create a clone of the repository on my temporary storage 1.
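The clone command itself isn’t reproduced in this copy of the tutorial, but it would be along these lines (the destination directory name is my choice; adjust to taste):

```shell
# clone the shared upstream repo into personal scratch space
mkdir -p /tmp/global2/$USER
git clone /ebio/abt6_projects7/diffLines_20/annex /tmp/global2/$USER/annex
cd /tmp/global2/$USER/annex
```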
Git clone only knows about files within git itself: either the actual files, in the case of code, or the pointers to large files, in the case of data. Therefore, in addition to cloning the repository, we need to tell git annex to
git annex get the data files we need for our workflow. I’ve already added the reference genomes and assemblies under
genes-and-pangenome, so we do
git annex get genes-and-pangenome/input. Under the hood this will rsync the content of the files in
genes-and-pangenome/input from our upstream repo to my new personal clone. I recommend you
git annex get the minimal set of files you need for your analysis, to save space.
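Concretely, assuming you are inside your clone, fetching the inputs looks something like this (the `whereis` check is optional):

```shell
git annex get genes-and-pangenome/input      # rsyncs the actual file contents
git annex whereis genes-and-pangenome/input  # optional: list where each file's content lives
```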
One important thing to configure is some rules for what git-annex should consider code, and what should be considered data. These rules are stored in the
.gitattributes file at the repository root. At writing, this looks like:
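The actual listing hasn’t survived in this copy of the tutorial, but based on the description below it would be something like the following (the exact patterns are an assumption):

```
* annex.largefiles=(largerthan=10kb)
*.sh annex.largefiles=nothing
*.txt annex.largefiles=nothing
*.R annex.largefiles=nothing
*.py annex.largefiles=nothing
**/data/** annex.largefiles=anything
```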
This means that by default, all .sh, .txt, .R, and .py files are considered code (‘small’ files), while any file below a directory named
data, or any file larger than 10KB, is considered data. This can be overridden with the
git annex add --force-small and
git annex add --force-large options.
Step 2: Making a change
Now we need code to actually do something. In my case, this will be a small Snakemake workflow, consisting of a few rules files and python scripts, and a Snakefile to coordinate everything and set up defaults. (N.B.: this was already in the repo as of writing, but this is what I did).
Step 3: Sharing a change
Once we have made our source code changes, we need to add and commit them. First, we should always pull any new changes others have made. This helps keep a nice linear git history, and ensures that we don’t end up with conflicting changes that need manual merging (let me know ASAP if this happens; it’s hard to explain but easy to fix, so I’ll just do it).
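In practice that pull is just a sync; a sketch, assuming these flags are available in your git-annex version:

```shell
# pull and merge upstream changes, without pushing or moving data yet
git annex sync --no-push --no-content
```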
We then add our changes to the index, the list of changes to be committed. Both
git add and
git annex add will look up whether files should be “large” (see the previous section about gitattributes), and then either commit their changes traditionally, or commit them as large files. There is one caveat:
git annex add (but not
git add) takes
--force-small and --force-large arguments to override the config and force adding files as small (traditional) or large files. This is useful if you have e.g. a very large data file named
data.txt, which with the above config would normally be considered a small file.
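For instance, to force such a file into the annex despite its “small” extension (the file name is hypothetical):

```shell
git annex add --force-large data.txt
```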
We then commit these changes. I would use
git commit -v, which will bring up
$EDITOR (vi by default, I think) so you can edit the commit message in a nice editor. These messages are for your collaborators, and your future self, so spend 30 seconds making them useful. Ideally, write a short paragraph describing why the change was made, if it’s not self-evident, and start the commit message with a word naming your bigger analysis, e.g. “pangenome” or “varcalls”. For example:
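The original example message isn’t preserved in this copy; something in this spirit is what’s meant (contents are illustrative):

```text
pangenome: align NLR genes with MAFFT

Extract each annotated NLR gene from the assemblies and align it,
so that make_prg can build a PRG per gene in the next step.
```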
We should then share the commit with our changes back to the upstream repo. We do
git annex sync to sync the git changes. This pulls any upstream changes, merges ours, and pushes the result back 2. Again, this is only the plain-git changes, not the file contents. If you have updated or added any large files, you also need to do
git annex copy --to upstream . or
git annex sync --content --push upstream to push the new large file contents.
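Putting step 3 together, a typical share therefore looks something like this sketch (paths are illustrative):

```shell
git annex add path/to/changed/files   # stage code and data changes
git commit -v                         # write a useful message
git annex sync                        # share the git-level changes
git annex copy --to upstream .        # push any new large file contents
```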
Part 3: getting someone else’s work
Enough waffling, let’s actually do something. We are all going to set up our respective workspaces, and then obtain the results of a recent analysis into our own workspace. I’d recommend we do this on our laptops. Let’s all do this part interactively.
Git annex setup
We need a more recent version of git annex than the OS packages currently provide. This is pretty annoying, but for now let’s install the pre-compiled standalone build. I’m assuming you’re all using Linux; if you’re not, you’ll need to find out how to install it on OS X or Windows. I think OS X has packages installable the same way; for Windows I have no idea. Maybe do it on taco/chimi instead.
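On Linux, installing the standalone build goes roughly like this (the download URL is the standalone bundle’s usual location at the time of writing; double-check it against the git-annex website):

```shell
mkdir -p ~/.local ~/.local/bin
wget https://downloads.kitenet.net/git-annex/linux/current/git-annex-standalone-amd64.tar.gz
tar xf git-annex-standalone-amd64.tar.gz -C ~/.local
# the bundle ships wrapper scripts for both git and git-annex
ln -sf ~/.local/git-annex.linux/git ~/.local/bin/git
ln -sf ~/.local/git-annex.linux/git-annex ~/.local/bin/git-annex
# make sure ~/.local/bin is on your $PATH
```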
Hopefully you should now see that
command -v shows your git and git-annex binaries are under
~/.local/bin. If not, sing out and I’ll see what went wrong.
Cloning the repo
So now we need to set up our local personal workspace. You do this by cloning the upstream repo with
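The clone command would be something like the following (run on the cluster, where the /ebio path is visible; from a laptop you’d use the equivalent ssh URL):

```shell
git clone /ebio/abt6_projects7/diffLines_20/annex /tmp/global2/$USER/annex
cd /tmp/global2/$USER/annex
```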
Now there are a couple of config flags we should set. Ask what these do if it’s not obvious.
By default, git annex sync will commit any changes before syncing, unlike git pull which would normally error and ask you to either stash or commit them 3. The following config disables this, and makes git annex behave like normal git regarding automatic commits.
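The footnote points at annex.autocommit; the setting meant here is presumably:

```shell
# inside your clone: stop `git annex sync` from auto-committing
git config annex.autocommit false
```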
Getting some large files
Now you’re set up, but you only have the content git itself knows of. Take a look in one of the directories: script, configuration, and metadata files will be there, but any data files will be broken symlinks to paths like
.git/annex/objects/GIBBERISH. This is how git annex works: git itself just tracks that a file exists, while git annex separately synchronises the contents.
So, now let’s get some large data files.
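Based on the directory mentioned below, the command would be along these lines:

```shell
git annex get genes-and-pangenome/output
```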
This should have
rsynced many files to your personal workspace. The symlink targets of the files in the
genes-and-pangenome/output directory should now be present, and you should have read-only access to these files.
Part 4: Some exercises
So hopefully you now know the basics you need to use git annex. Here are some exercises to practise them. Let me know if there are any issues.
First, set up a little test dataset and dummy script.
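Something like the following will do (file names and contents are arbitrary; the data file is deliberately over the 10KB “large file” threshold):

```shell
mkdir -p exercise/data
# a tiny "analysis" script -- well under 10KB, so git tracks it directly
cat > exercise/count_lines.sh <<'EOF'
#!/bin/sh
wc -l exercise/data/numbers.txt
EOF
chmod +x exercise/count_lines.sh
# a dummy "large" data file: ~24KB, so git-annex will treat it as data
seq 1 5000 > exercise/data/numbers.txt
```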
Now, add the script and data file to git. Please fill in the
...s with the correct thing.
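The original fill-in-the-blanks block isn’t preserved in this copy; a reconstruction in the same spirit (the ...s are for you to complete):

```shell
git annex add exercise/...
git commit -m "..."
git annex sync ...
git annex copy --to ... exercise/data/...
```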
Now each of you has made and shared a change. I’ll now do
git annex sync on the upstream repo to merge all these changes together.
Then, you should sync with the upstream repo again and try getting someone else’s large file.
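That is, roughly (the path depends on which colleague’s file you pick):

```shell
git annex sync                    # pull everyone's new commits
git annex get path/to/their/file  # then fetch the actual content
```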
Any questions, let me know.
Some notes: this should be on /tmp for a few reasons. First, to improve performance: /tmp is on SSD-backed storage, while /ebio is on spinning iron. Second, it keeps the size of things on /ebio down. If you are worried about potentially losing work, don’t be: /tmp is reasonably stable, and in any case you should be committing and pushing to the upstream repository (and thus protecting your work from
/tmp failure) about daily, which is the backup frequency of
/ebio anyway. If you have just finished some particularly important analysis code, but it will run for a while and so you can’t commit its output yet, then just commit and sync your code, and separately add your data later once it has finished. ↩︎
git annex sync like this won’t actually change the files as they are on disk in the upstream repo. This is because git refuses to update a checked-out branch when pushed to. To actually update the checked-out files, one needs to do
git annex sync also within the upstream repo. Don’t worry though, as all the history and a copy of the files does exist deep in git’s internals. ↩︎
See the
git annex sync --commit and
--no-commit CLI options, and
git config annex.autocommit, to alter this behaviour. ↩︎