This is a tutorial for the Weigel group’s difflines project. I use difflines analyses here so that our group has concrete examples to work with. Obviously none of this will work for you as-is if you’re not part of our group, but hopefully this is helpful by analogy.
Part 1: Background and repositories
I’ll be assuming you’re already familiar with git itself. If you’re not, or would like a refresher, I suggest either the git tutorial or the Software Carpentry git course. Also, please re-read my generic git-annex tutorial if you haven’t seen it recently. To briefly recap: git is a tool that tracks changes within some set of files, typically software source code. It is a distributed version control system, meaning that each copy of the repository contains the full history of every file. Because this doesn’t handle large files well, git-annex is a git extension that tracks changes to data files, and eases sharing only parts of a git repository between checkouts (e.g. servers or colleagues).
As with most projects using git, we have a shared, centralised copy of the repository, and we each have our own checkouts of the repo in which we do our work. When we have finished some feature, we commit the changes (including to any data files) and push/sync them to the centralised repository. This will be familiar to any of you who have contributed to an open source project: for example, I have contributed some small changes to the htslib library. To do so, I forked the htslib repo to my GitHub account, cloned my fork to my local computer (git clone https://github.com/kdm9/htslib), made my changes, committed, and pushed. I then opened a pull request, which the samtools authors merged.
We will use an analogous process for interacting with our shared code and data repository using
git annex. Our “upstream” repository will always be at
/ebio/abt6_projects7/diffLines_20/annex on the ebio storage. To avoid conflicting changes if multiple people work within this directory at once, we will each use a personal copy of the repository to run our analyses (ideally under
/tmp/global2/$USER). We each may have additional personal copies on our laptops. We can then make our changes, commit, and run
git annex sync to push these changes to our upstream repository. Thus, the latest copy of our combined code and data should always be in the “upstream” repository.
Some brief background on how git annex actually works: in essence, it tracks large files as pointers (symlinks) to checksummed content. That’s all that git itself sees. Separately, git annex can synchronise the target content of these pointers between checkouts. This can happen either automatically, according to some rules, or manually, and subsets of the total list of content can be copied. These pointers are immutable, but can be temporarily turned back into a normal file to allow editing.
Part 2: A worked example with my pangenome work
OK, so how do we actually use it? The git annex workflow is very similar to that of git. One makes changes, then stages, commits, and pushes them to a remote. What follows is an example of how I ran an analysis, and made the result available to Luisa via our git annex repository.
For background, I have been working on getting the NLR pangenome that difflines has created into a form usable by
gramtools. This involves taking Max’s genome assemblies and Luisa’s NLR annotations, extracting gene sequences, aligning them with MAFFT, and running
make_prg to create Pangenome Reference Graphs (PRGs). These can then be used as input for gramtools to genotype short reads against.
Step 1: A personal workspace
So that I can work in peace, and so that my changes are not intermingled with those of others, we should all work in separate directories. So, let’s create a clone of the repository on my temporary storage 1.
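The clone command itself isn’t reproduced in this copy of the tutorial, but it would be along these lines (the destination directory name is my choice; adjust to taste):

```shell
# clone the shared upstream repo into personal scratch space
mkdir -p /tmp/global2/$USER
git clone /ebio/abt6_projects7/diffLines_20/annex /tmp/global2/$USER/annex
cd /tmp/global2/$USER/annex
```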
Git clone only knows about files within git itself: either the actual files, in the case of code, or the pointers to large files, in the case of data. Therefore, in addition to cloning the repository, we need to tell git annex to
git annex get the data files we need for our workflow. I’ve already added the reference genomes and assemblies under
genes-and-pangenome, so we do
git annex get genes-and-pangenome/input. Under the hood this will rsync the content of the files in
genes-and-pangenome/input from our upstream repo to my new personal clone. I recommend you
git annex get the minimal set of files you need for your analysis, to save space.
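Concretely, assuming you are inside your clone, fetching the inputs looks something like this (the `whereis` check is optional):

```shell
git annex get genes-and-pangenome/input      # rsyncs the actual file contents
git annex whereis genes-and-pangenome/input  # optional: list where each file's content lives
```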
One important thing to configure is some rules for what git-annex should consider code, and what should be considered data. These rules are stored in the
.gitattributes file at the repository root. At writing, this looks like:
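The actual listing hasn’t survived in this copy of the tutorial, but based on the description below it would be something like the following (the exact patterns are an assumption):

```
* annex.largefiles=(largerthan=10kb)
*.sh annex.largefiles=nothing
*.txt annex.largefiles=nothing
*.R annex.largefiles=nothing
*.py annex.largefiles=nothing
**/data/** annex.largefiles=anything
```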
This means that by default, all .sh, .txt, .R, and .py files are considered code (‘small’ files), while any file below a directory named
data, or any file larger than 10KB, is considered data. This can be overridden with the
git annex add --force-small and
git annex add --force-large options.
Step 2: Making a change
Now we need code to actually do something. In my case, this will be a small Snakemake workflow, consisting of a few rules files and python scripts, and a Snakefile to coordinate everything and set up defaults. (N.B.: this was already in the repo as of writing, but this is what I did).
Step 3: Sharing a change
Once we have made our source code changes, we need to add and commit them. First, we should always pull any new changes others have made. This helps keep a nice linear git history, and ensures that we don’t end up with conflicting changes that need manual merging (let me know ASAP if this happens; it’s hard to explain but easy to fix, so I’ll just do it).
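In practice that pull is just a sync; a sketch, assuming these flags are available in your git-annex version:

```shell
# pull and merge upstream changes, without pushing or moving data yet
git annex sync --no-push --no-content
```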
We then add our changes to the index, the list of changes to be committed. Both
git add and
git annex add will look up whether files should be “large” (see the previous section about gitattributes), and then either commit their changes traditionally, or commit them as large files. There is one caveat:
git annex add (but not
git add) takes
--force-small and --force-large arguments to override the config and force adding files as small (traditional) or large files. This is useful if you have e.g. a very large data file named
data.txt, which with the above config would normally be considered a small file.
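For instance, to force such a file into the annex despite its “small” extension (the file name is hypothetical):

```shell
git annex add --force-large data.txt
```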
We then commit these changes. I would use
git commit -v, which will bring up
$EDITOR (vi by default, I think) so you can edit the commit message in a nice editor. These messages are for your collaborators, and your future self, so spend 30 seconds making them useful. Ideally, write a short paragraph describing why the change was made, if it’s not self-evident, and start the commit message with a word naming your bigger analysis, e.g. “pangenome” or “varcalls”. For example:
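The original example message isn’t preserved in this copy; something in this spirit is what’s meant (contents are illustrative):

```text
pangenome: align NLR genes with MAFFT

Extract each annotated NLR gene from the assemblies and align it,
so that make_prg can build a PRG per gene in the next step.
```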
We should then share the commit with our changes back to the upstream repo. We do
git annex sync to sync the git changes. This pulls any upstream changes, merges ours, and pushes the result back 2. Again, this is only the plain-git changes, not the file contents. If you have updated or added any large files, you also need to do
git annex copy --to upstream . or
git annex sync --content --push upstream to push the new large file contents.
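Putting step 3 together, a typical share therefore looks something like this sketch (paths are illustrative):

```shell
git annex add path/to/changed/files   # stage code and data changes
git commit -v                         # write a useful message
git annex sync                        # share the git-level changes
git annex copy --to upstream .        # push any new large file contents
```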
Part 3: getting someone else’s work
Enough waffling, let’s actually do something. We are all going to set up our respective workspaces, and then obtain the results of a recent analysis into our own workspace. I’d recommend we do this on our laptops. Let’s all do this part interactively.
Git annex setup
We need a more recent version of git annex than the OS packages currently provide. This is pretty annoying, but for now let’s install the pre-compiled standalone build. I’m assuming you’re all using Linux; if you’re not, you’ll need to find out how to install it on OS X or Windows. I think OS X has packages installable the same way; for Windows I have no idea. Maybe do it on taco/chimi instead.
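On Linux, installing the standalone build goes roughly like this (the download URL is the standalone bundle’s usual location at the time of writing; double-check it against the git-annex website):

```shell
mkdir -p ~/.local ~/.local/bin
wget https://downloads.kitenet.net/git-annex/linux/current/git-annex-standalone-amd64.tar.gz
tar xf git-annex-standalone-amd64.tar.gz -C ~/.local
# the bundle ships wrapper scripts for both git and git-annex
ln -sf ~/.local/git-annex.linux/git ~/.local/bin/git
ln -sf ~/.local/git-annex.linux/git-annex ~/.local/bin/git-annex
# make sure ~/.local/bin is on your $PATH
```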
Hopefully you should now see that
command -v shows your git and git-annex binaries are under
~/.local/bin. If not, sing out and I’ll see what went wrong.
Cloning the repo
So now we need to set up our local personal workspace. You do this by cloning the upstream repo with
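The clone command would be something like the following (run on the cluster, where the /ebio path is visible; from a laptop you’d use the equivalent ssh URL):

```shell
git clone /ebio/abt6_projects7/diffLines_20/annex /tmp/global2/$USER/annex
cd /tmp/global2/$USER/annex
```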
Now there are a couple of config flags we should set. Ask what these do if it’s not obvious.
By default, git annex sync will commit any changes before syncing, unlike git pull which would normally error and ask you to either stash or commit them 3. The following config disables this, and makes git annex behave like normal git regarding automatic commits.
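The footnote points at annex.autocommit; the setting meant here is presumably:

```shell
# inside your clone: stop `git annex sync` from auto-committing
git config annex.autocommit false
```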
Getting some large files
Now you’re set up, but you only have the content git itself knows of. Take a look in one of the directories: script, configuration, and metadata files will be there, but any data files will be broken symlinks to paths like
.git/annex/objects/GIBBERISH. This is how git annex works: git itself just tracks that a file exists, while git annex separately synchronises the contents.
So, now let’s get some large data files.
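Based on the directory mentioned below, the command would be along these lines:

```shell
git annex get genes-and-pangenome/output
```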
This should have
rsynced many files to your personal workspace. The symlink targets of the files in the
genes-and-pangenome/output directory should now be present, and you should have read-only access to these files.
Part 4: Some exercises
So hopefully you now know the basics you need to use git annex. Here are some exercises to practise them. Let me know if there are any issues.
First, set up a little test dataset and dummy script.
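Something like the following will do (file names and contents are arbitrary; the data file is deliberately over the 10KB “large file” threshold):

```shell
mkdir -p exercise/data
# a tiny "analysis" script -- well under 10KB, so git tracks it directly
cat > exercise/count_lines.sh <<'EOF'
#!/bin/sh
wc -l exercise/data/numbers.txt
EOF
chmod +x exercise/count_lines.sh
# a dummy "large" data file: ~24KB, so git-annex will treat it as data
seq 1 5000 > exercise/data/numbers.txt
```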
Now, add the script and data file to git. Please fill in the
...s with the correct thing.
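The original fill-in-the-blanks block isn’t preserved in this copy; a reconstruction in the same spirit (the ...s are for you to complete):

```shell
git annex add exercise/...
git commit -m "..."
git annex sync ...
git annex copy --to ... exercise/data/...
```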
Now each of you has made and shared a change. I’ll now do
git annex sync on the upstream repo to merge all these changes together.
Then, you should sync with the upstream repo again and try getting someone else’s large file.
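That is, roughly (the path depends on which colleague’s file you pick):

```shell
git annex sync                    # pull everyone's new commits
git annex get path/to/their/file  # then fetch the actual content
```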
Any questions, let me know.
Some notes: this should be on /tmp for a few reasons. First, to improve performance: /tmp is on SSD-backed storage, while /ebio is on spinning iron. Second, it keeps the size of things on /ebio down. If you are worried about potentially losing work, don’t be: /tmp is reasonably stable, and in any case you should be committing and pushing to the upstream repository (and thus protecting your work from
/tmp failure) about daily, which is the backup frequency of
/ebio anyway. If you have just finished some particularly important analysis code, but it will run for a while and so you can’t commit its output yet, then just commit and sync your code, and separately add your data later once it has finished. ↩︎
git annex sync like this won’t actually change the files as they are on disk in the upstream repo. This is because git refuses to update a checked-out branch when pushed to. To actually update the checked-out files, one needs to do
git annex sync also within the upstream repo. Don’t worry though, as all the history and a copy of the files does exist deep in git’s internals. ↩︎
See the
git annex sync --commit and
--no-commit CLI options, and
git config annex.autocommit, to alter this behaviour. ↩︎