Martin Pool's blog

What's wrong with Arch?

[comments welcome]

I gave a talk about new version-control systems the other week at our LUG. Tridge challenged me with "ok, so what's wrong with Arch?" I think it's important to see the bad points in whatever you're advocating. Distributed version control is pretty new, and even the stable systems are themselves experiments. The differences between competing systems are not just accidents of implementation, but also fundamentally different ideas about what software version control means, and how it should be done.

So: what's wrong with Arch? I like it quite a lot, but I'm going to put that aside and, just for this article, look for problems.

There is an elegant underlying simplicity to Arch, but it is expressed in a complex way: there are many commands printed by tla --help and that can confuse the novice user. It's actually possible to get by with a reasonably small subset, but the tutorial does not make that very clear.

Many of the commands expose lower-level operations that might be useful in writing scripts or fixing problems. For example, tla sync-tree lets you tell arch "pretend I've merged these patches", without actually merging the text, a little cheat which can be useful in resolving some merges. Exposing atomic operations is an admirable goal; more programs should do it. But perhaps splitting them out into separate programs would make the main tool easier to understand. I think to some extent this is driven by Tom's expressed desire for Arch to become a platform for consulting work, rather than primarily something people can just install and use. (Perhaps the project is moving back from that position now.)

Perhaps this will make Arch a more desirable option for larger projects which want to do more complex operations.

It seems bizarre that despite all these commands there are some glaring gaps. For example, there is no single command to revert a file to its previous state. It is suggested instead that one get the diff and apply it through patch --reverse, or that one copy it from the pristine previous version. Both of these work, certainly, and they can be scripted, but it's puzzling that they're not built in.
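The second workaround, copying the file back from a pristine tree, is easy enough to script. Here is a minimal Python sketch; the function name and both directory arguments are my own assumptions, and it deliberately says nothing about where Arch keeps its pristine copies: you pass that location in.

```python
import shutil
from pathlib import Path


def revert_file(working_root, pristine_root, relpath):
    """Overwrite working_root/relpath with the copy under pristine_root.

    Both roots are supplied by the caller; this sketch does not know
    or guess where a real Arch tree stores its pristine files.
    """
    src = Path(pristine_root) / relpath
    dst = Path(working_root) / relpath
    if not src.is_file():
        raise FileNotFoundError(f"no pristine copy of {relpath}")
    # Recreate the directory in case it was deleted in the working tree.
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return dst
```

Unlike tla undo, this throws the local changes away rather than saving them, so it is only a stand-in for a real revert command.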

Another gaping hole is that there is no command to find just the changesets that touched a particular file. Accommodating renames makes this slightly harder than in CVS, but only very slightly, since the file has a persistent ID. I often do svn log CPU.cpp, but on Arch I have to do without. Darcs can do this too, with darcs changes CPU.cpp.
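To show why the persistent ID makes this only slightly harder, here is a toy Python sketch of a per-file log that follows a file across a rename. The Changeset structure, the ids and the patch names are all hypothetical; this is not Arch's actual changeset format.

```python
from dataclasses import dataclass


@dataclass
class Changeset:
    name: str
    # Maps persistent file id -> path of that file as of this changeset.
    touched: dict


def log_for_file(changesets, file_id):
    """Return (changeset name, path at the time) for each changeset touching file_id."""
    return [
        (cs.name, cs.touched[file_id])
        for cs in changesets
        if file_id in cs.touched
    ]


history = [
    Changeset("patch-1", {"id-cpu": "cpu.cpp", "id-main": "main.cpp"}),
    Changeset("patch-2", {"id-main": "main.cpp"}),
    Changeset("patch-3", {"id-cpu": "CPU.cpp"}),  # the file was renamed here
]

for name, path in log_for_file(history, "id-cpu"):
    print(name, path)
```

Because the lookup is keyed on the id rather than the path, the rename in patch-3 needs no special handling.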

On the other hand there is an excellent multi-level tla undo, which saves the removed changes so that they can be put back with tla redo if you change your mind.

In general Arch is prone to "there's more than one way to do it", which can be both good and bad. For example it handles renamed files very well, by associating a file id that remains constant for the life of the file even if it is renamed. This allows Arch to correctly merge changes across renames, something notably lacking (last I looked) from Subversion. File ids are a fine and elegant design. However, the implementation is complex and confusing: the id can be stored in an external file, can be derived from the name, or can be stored in the file in either of two different syntaxes. You can mix these methods within a single tree, and can customize to some extent the rules on which one is used. I guess you can make a case for any particular method being useful, but the end result is complex and hard to understand. Choosing only one method might not have hurt too much, and might have simplified the system.

Another area where Arch can be criticized for too much choice is in handling non-versioned files. Most version-control systems have to accommodate files which exist in the source directory but that should not be versioned. The classic example is *.o files. CVS handles this fairly well with a list of patterns in .cvsignore. Fine.

Arch allows you to classify files using regexps into Source, Junk, Precious and Backup. Each class is treated slightly differently, but personally I am never sure if my .o files are more accurately Junk or Precious. I suppose there are cases where the distinction would be useful, but again I wonder if it would not have been simpler just to follow CVS in saying *.o is ignored. Leave it up to the user to decide which files ought to be automatically deleted and when. Being able to customize it to have simple behaviour is not as good as just being simple.
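The contrast between the two approaches can be sketched in a few lines of Python. The regexps below are made up for illustration and are not Arch's real defaults; the point is only that one scheme needs an ordered set of rules per class, while the other is a flat list of glob patterns.

```python
import fnmatch
import re

# Hypothetical rules, checked in order; these are NOT Arch's real defaults.
RULES = [
    ("backup", re.compile(r".*(~|\.bak|\.orig)$")),
    ("junk", re.compile(r"^,.*")),
    ("precious", re.compile(r".*\.o$")),
    ("source", re.compile(r"^[A-Za-z0-9_][\w.-]*$")),
]


def classify(filename):
    """Arch-style: return the first class whose regexp matches the name."""
    for category, pattern in RULES:
        if pattern.match(filename):
            return category
    return "unrecognized"


def cvs_style_ignored(filename, patterns=("*.o",)):
    """CVS-style: a file is simply ignored if any glob pattern matches."""
    return any(fnmatch.fnmatch(filename, p) for p in patterns)


for name in ["main.c", "main.o", "main.c~", ",scratch"]:
    print(name, classify(name), cvs_style_ignored(name))
```

In the first scheme the order of the rules matters and I had to pick a class for *.o; in the second there is nothing to decide beyond the pattern itself.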

Some people think it uses too much disk: in some configurations you will have four inodes per source file. (Source, its id, pristine source and pristine source id.) This is pretty much constrained to people working on very large trees on very old hardware, and I don't think it is a general argument against arch. In arch's favour, it can intelligently manage hardlinked trees so that additional working copies are very cheap.

To share your source, Arch depends on having a read-only web server. This is an enormous advance over CVS, which requires a special cvs pserver. On the other hand, it is substantially harder than the current standard method of mailing a patch. I asked a while ago if sending changes by mail could be added, and despite some confusion about how it would be done it looks like it might go in eventually. Darcs has this already, which I count as a major feature.

Florian Weimer collected a long list of design issues, which caused a lively discussion.

Arch has a bit of a fetish about long names: one regularly has to type identifiers like mbp@sourcefrog.net--2004/librsync--callback--0.11. This would be less painful if it were possible to use relative names more often: if I type an incomplete name it could be interpreted relative to wherever I'm standing at the moment. Unfortunately common operations like merging between a local and remote archive require giving a full name. (OK, it's not all that bad if you can copy&paste, or go back through command history. But it's a bit gross that it is necessary.)
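Here is a minimal Python sketch of what I mean by relative names. It is one possible interpretation of the idea, not something tla does: the resolve function and its rule (fill in the missing leading components from the current name) are assumptions, and it only understands names of the form archive/category--branch--version.

```python
def resolve(partial, current):
    """Fill in the missing leading parts of `partial` from `current`.

    `current` must be a full name: archive/category--branch--version.
    """
    if "/" in partial:
        return partial  # already fully qualified
    archive, rest = current.split("/", 1)
    context = rest.split("--")  # [category, branch, version]
    given = partial.split("--")
    if len(given) > len(context):
        raise ValueError(f"too many components in {partial!r}")
    # Keep as many leading components of the context as `partial` omits.
    merged = context[: len(context) - len(given)] + given
    return archive + "/" + "--".join(merged)


here = "mbp@sourcefrog.net--2004/librsync--callback--0.11"
print(resolve("0.12", here))
print(resolve("mainline--0.1", here))
```

A real design would also need a rule for naming another archive relative to the current one, which is exactly the local-versus-remote merge case that hurts today.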

By contrast, Darcs has barely any naming at all: branches are filesystem directories, and identified by their directory name (and hostname, if remote.) You can arrange directories in whatever organization most makes sense to you, and of course give commands like darcs pull ../upstream to move between them.

Finally I have one issue which I think has not been mentioned before, which is a kind of meaning/mechanism mismatch in the way distributed operation works. Arch has excellent support for maintaining and merging between multiple branches. It also has good disconnected support: I can take my laptop to a desert island for a month, hack away, and come back and import all my changes, along with their history. Importantly I can also integrate those changes with whatever has happened while I've been away. So far, so wonderful.

The way I set up to do work on my laptop is to create a new branch, stored in an archive on my laptop. Suppose the main branch is mbp@foo.org--2004/foo--main--0, and on my laptop jolly I have mbp@foo.org--jolly/foo--main--0. This is pretty clean: I can commit to the branch stored on my laptop when I'm offline, and I can merge back into the main branch when I'm online.

The problem is that this mixes mechanism with meaning. I don't want changes done on my laptop to look any different from those done online. I want to create different branches only for different streams of development, not for changes that happen to occur on disconnected machines.

Once the changes have been merged upstream you can still see what was done, but only indirectly: all the individual commits get wrapped up in a single change called something like "merge from jolly", unless I manually go through and commit each one separately.

I think this is a bit of a problem. I like the ability to zip up changes from a downstream branch when applying them as a single unit to an upstream branch. But I want to be able to do disconnected work completely orthogonal to which branch I'm working on, and without needing to create new branches.

Darcs never wraps up commits into larger commits, as far as I can tell. All of my commits, once merged upstream, appear as part of the same branch because it doesn't really remember which branch a change was originally made on. That solves the immediate problem. But it does seem like in some projects you really would want to remember the way patches got bundled up...
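The two behaviours can be contrasted with a toy Python model. Histories here are just lists of strings, and both function names are hypothetical; neither system actually stores history this way.

```python
def merge_wrapped(upstream, branch_name, commits):
    """Arch-like: the whole branch lands upstream as one summary change."""
    return upstream + [f"merge from {branch_name} ({len(commits)} commits)"]


def merge_flat(upstream, commits):
    """Darcs-like: every commit lands individually; the origin is not recorded."""
    return upstream + list(commits)


upstream = ["initial import"]
laptop = ["fix typo", "add callback API"]

print(merge_wrapped(upstream, "jolly", laptop))
print(merge_flat(upstream, laptop))
```

The wrapped history remembers that the work came from jolly but hides the individual commits; the flat history shows every commit but forgets where they came from. What I'd like is to be able to choose per merge.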

I don't know if there is a perfect solution. Maybe either of them is good enough. What do you think?
