Version control systems

Mutagen is designed to work in tandem with version control systems (VCSs), allowing you to, for example, clone and edit a project while mirroring it to a remote system and testing it as you make edits. This spares you a push/pull cycle every time you make a change that you want to test.

When using Mutagen with a VCS repository, there are a few “best practices” of which you should be aware.

Ignoring VCS directories

Caution: In addition to the reasons listed below, the most important reason that VCS directories should not be synced is that they often contain scripts that are executed when certain VCS commands are run (for example, Git hooks). These hooks are generally designed to help with VCS workflows, but they’re also a potential attack vector for remote code execution if you synchronize a VCS directory with an untrusted remote.

VCS directories (for example, .git, .svn, and .hg) can be synchronized by Mutagen like any other directory, but they shouldn’t be, for a number of reasons. The reasons are more or less the same for each VCS, but we’ll cover the common case of a .git directory. These reasons are also not specific to Mutagen: they apply to any file synchronization tool or service.

The first reason is that the Git index data structure (which resides in .git) records inode numbers, device IDs, and modification times that are specific to the filesystem on which it resides. If you move it to another system, then the next time you run git status (or any command relying on similar Git infrastructure), Git will have to do a full re-hash of the working tree and will then write a new copy of the index with the inode numbers, device IDs, and modification times for the working tree on which it was just run. This is fine if you just want to move a Git repository once, since you’ll only incur a small penalty the first time you run git status, but it won’t play well with constant synchronization. The Git index can also be fairly large (up to tens of MB for very large working trees) and is rewritten every time you run certain Git commands (for example, git status), so you’d be constantly resynchronizing it.

A second reason is that Git’s object store is neither homogeneous nor immutable. Some objects are stored as loose objects and some are stored in pack files, and which is which depends entirely on the history of a particular copy of a Git repository. Objects can also be pruned or relocated into pack files at any point by Git’s garbage collection. This will not play well with synchronization for a variety of reasons that are too numerous to go into here, and it will be more than a performance nuisance like the Git index: it may actually cause Git to complain about duplicate objects, or cause weird behavior when Git does its garbage collection. Again, this doesn’t matter when you’re just copying a Git repository one time, since in that case you’re not continuing to synchronize against it.

The third reason is that Git isn’t expecting concurrent modifications of its .git directory. In fact, it has an index lock that Git processes have to hold specifically for this reason.

There are a number of other reasons, but it basically comes down to the fact that only Git is in a position to be in control of what’s in its .git directory (at least when it comes to the index and object stores).

Recommended workflow

The recommended workflow for using Mutagen with VCS repositories is to ignore VCS directories, keeping a copy of the VCS directory on only one side of the synchronization session and synchronizing only the working tree around it. You can think of the side with the VCS directory as the “master” and the side without as the “slave”, even though the synchronization is bidirectional. You can even have multiple “slaves” with a hub-spoke model. With this model, you can invoke VCS commands on the “master” side (usually your actual workstation) and, if any changes are made to the working tree, those changes will be synchronized out to the “slaves”.
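
As a rough sketch of the hub-and-spoke arrangement described above, the following configuration defines two sessions that share the same local working tree as alpha, each ignoring VCS directories so that the VCS directory exists only on the workstation. The session names (spoke-a, spoke-b) and the <remote-a> and <remote-b> placeholders are hypothetical and chosen only for illustration; the alpha, beta, and ignore keys follow the configuration examples used elsewhere in this document.

```yaml
sync:
    # First spoke: mirrors the local working tree to one remote.
    # "spoke-a" and "<remote-a>" are illustrative placeholders.
    spoke-a:
        alpha: "path/to/code"
        beta: "<remote-a>/path/to/remote/code"
        ignore:
            # Keep the VCS directory on the workstation side only.
            vcs: true
    # Second spoke: same local working tree, different remote.
    # "spoke-b" and "<remote-b>" are illustrative placeholders.
    spoke-b:
        alpha: "path/to/code"
        beta: "<remote-b>/path/to/remote/code"
        ignore:
            vcs: true
```

With a layout like this, VCS commands are run only against path/to/code on the workstation, and any resulting working tree changes should propagate to both remotes through their respective sessions.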

Supporting build tools

Some build systems expect their working directory to be part of a VCS working tree so that they can extract metadata about the repository (for example, the current commit) to use as part of the build. This can require that at least a subset of the VCS directory be present on the remote. Typically this requirement only includes configuration data, so a useful workaround for these cases is to create a secondary session that unidirectionally mirrors a subset of the VCS directory. The configuration for a setup like this using Git might look something like:

sync:
    code:
        alpha: "path/to/code"
        beta: "<remote>/path/to/remote/code"
        ignore:
            vcs: true
    vcs:
        alpha: "path/to/code/.git"
        beta: "<remote>/path/to/remote/code/.git"
        mode: "one-way-replica"
        ignore:
            paths:
                - "index"

In this case, we’ve excluded the Git index file from synchronization since it’s large, modified frequently, and unnecessary to synchronize. Depending on your build tools’ exact needs, you could potentially synchronize just the Git configuration file, for example:

sync:
    code:
        alpha: "path/to/code"
        beta: "<remote>/path/to/remote/code"
        ignore:
            vcs: true
    vcs:
        alpha: "path/to/code/.git"
        beta: "<remote>/path/to/remote/code/.git"
        mode: "one-way-replica"
        ignore:
            paths:
                - "*"
                - "!config"

Workflows to avoid

In addition to avoiding direct synchronization of VCS directories, there are other setups that are also probably best avoided.

For example, you could imagine a setup with a Git repository where you synchronize two copies of the repository, each with a .git directory, but exclude the .git directories from synchronization. At first this will appear to work, because both will show you the same result for git status and git diff. However, as soon as you do a git commit operation on one side, the other side will still be on the previous commit and see modified files while the committed side will show a clean working tree (even though both repositories have the exact same files in their working trees). The only way to propagate the commit to the other side would be to do a push/pull cycle, but to pull the commit you’d need to stash your working tree changes. Stashing would revert that working directory back to the previous commit, and in the meantime that reversion would be synced back to the committed side, which would then show modifications, at least until you pulled down the commit on the other side, and so on. This is theoretically safe, but it is very clunky and likely to cause confusing behavior.