What's wrong with Arch?
[comments welcome]
I gave a talk about new version-control systems the other week at our
LUG. Tridge challenged me
with "ok, so what's wrong with Arch?"
I think it's important to see the bad points in whatever you're
advocating. Distributed version control is pretty new; even the
stable systems are themselves experiments. The differences between
competing systems are not just accidents of implementation, but also
fundamentally different ideas about what software version control
means, and how it should be done.
So: what's wrong with Arch? I like it quite a lot, but I'm going to put that aside and, just for this article, look for problems.
There is an elegant underlying simplicity to Arch, but it is expressed in a complex way: there are many commands printed by tla --help and that can confuse the novice user. It's actually possible to get by with a reasonably small subset, but the tutorial does not make that very clear.
Many of the commands expose lower-level operations that might be
useful in writing scripts or fixing problems. For example,
tla sync-tree lets you tell Arch "pretend I've merged these patches, without actually merging the text",
a little cheat which can be useful in resolving some merges. Exposing atomic
operations is an admirable goal; more programs should do it. But perhaps splitting them
out into separate programs would make it easier to understand.
I think to some extent this is driven by Tom's expressed desire for
Arch to become a platform for consulting work, rather than
primarily something people can just install and use. (Perhaps
the project is moving back from that position now.)
Perhaps this will make Arch a more desirable option for larger projects which want to do more complex operations.
It seems bizarre that despite all these commands there are some glaring gaps. For example, there is no single command to revert a file to its previous state. It is suggested instead that one get the diff and apply it through patch --reverse, or that one copy it from the pristine previous version. Both of these work, certainly, and they can be scripted, but it's puzzling that they're not built in.
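The second workaround, copying the file back from a pristine previous version, is easy enough to script. Here is a minimal sketch in Python, assuming you already know where a pristine copy of the tree lives. The function name revert_file and both directory arguments are hypothetical, and nothing here calls tla itself.

```python
import shutil
from pathlib import Path


def revert_file(working_root, pristine_root, relpath):
    """Overwrite one working file with its copy from a pristine tree.

    Both roots are assumptions supplied by the caller; this sketch does
    not ask Arch where its pristine trees are kept.
    """
    src = Path(pristine_root) / relpath
    dst = Path(working_root) / relpath
    if not src.is_file():
        raise FileNotFoundError(f"no pristine copy of {relpath}")
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copy2 also preserves mode and timestamps
    return dst
```

Unlike tla undo, a script like this keeps no record of what it threw away, which is one reason a built-in command would be nicer.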
Another gaping hole is that there is no command to find just the
changesets that touched a particular file. Accommodating renames makes
this slightly harder than in CVS, but only very slightly, since the
file has a persistent ID. I often do svn log CPU.cpp, but on
Arch I have to do without. Darcs can do this too, with darcs
changes CPU.cpp.
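To illustrate why a persistent ID makes this only slightly harder than in CVS, here is a toy model in Python. The Changeset record and the changes_touching function are invented for the example and are not Arch's data structures. The point is only that filtering on the file's ID, rather than its name, follows the file across renames.

```python
from dataclasses import dataclass, field


@dataclass
class Changeset:
    name: str
    # file id -> path of that file after this changeset (hypothetical layout)
    touched: dict = field(default_factory=dict)


def changes_touching(history, path):
    """Return names of changesets that touched the file now called `path`.

    `history` is ordered oldest first.
    """
    # Step 1: find the id of the file from the newest changeset that
    # mentions it under this path.
    file_id = None
    for cs in reversed(history):
        for fid, p in cs.touched.items():
            if p == path:
                file_id = fid
                break
        if file_id is not None:
            break
    if file_id is None:
        return []
    # Step 2: select by id, so earlier names of the file still match.
    return [cs.name for cs in history if file_id in cs.touched]
```

With a history in which the file was first committed as cpu.cpp and later renamed to CPU.cpp, asking for CPU.cpp returns both changesets, because the lookup goes through the ID.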
On the other hand there is an excellent multi-level tla undo, which saves the removed changes so that they can be put back with tla redo if you change your mind.
In general Arch is prone to "there's more than one way to do it", which can be both good and bad. For example it handles renamed files very well, by giving each file an id that remains constant for the life of the file even if it is renamed. This allows Arch to correctly merge changes across renames, something notably lacking (last I looked) from Subversion. File ids are a fine and elegant design. However, the implementation is complex and confusing: the id can be stored in an external file, can be derived from the name, or can be stored in the file in either of two different syntaxes. You can mix these methods within a single tree, and can customize to some extent the rules on which one is used. I guess you can make a case for any particular method being useful, but the end result is complex and hard to understand. Choosing only one method might not have hurt too much, and might have simplified the system.
Another area where Arch can be criticized for too much choice is in handling non-versioned files. Most version control systems have to accommodate files which exist in the source directory but should not be versioned. The classic example is *.o files. CVS handles this fairly simply with a list of patterns in .cvsignore. Fine.
Arch allows you to classify files using regexps into Source, Junk,
Precious and Backup. Each class is treated slightly differently, but
personally I am never sure if my .o files are more accurately
Junk or Precious. I suppose there are cases where the distinction
would be useful, but again I wonder if it would not have been simpler
to follow CVS in saying *.o is ignored. Leave it up to the user to decide which files ought to be automatically
deleted and when. Being able to customize it to have simple behaviour is not as good as just being simple.
Some people think it uses too much disk: in some configurations you will have four inodes per source file (the source, its id, the pristine source and the pristine source's id). This mostly matters to people working on very large trees on very old hardware, and I don't think it is a general argument against Arch. In Arch's favour, it can intelligently manage hardlinked trees so that additional working copies are very cheap.
To share your source, Arch depends on having a read-only web server. This is an enormous advance over CVS, which requires a special cvs pserver. On the other hand, it is substantially harder than the current standard method of mailing a patch. I asked a while ago if submitting changes by mail could be added, and despite some confusion about how it would be done it looks like it might go in eventually. Darcs has this already, which I count as a major feature.
Florian Weimer collected a long list of design issues, which caused a lively discussion.
Arch has a bit of a fetish about long names:
one regularly has to type identifiers like
mbp@sourcefrog.net--2004/librsync--callback--0.11.
This would be less painful if it were possible to use relative
names more often: if I type an incomplete name it could be interpreted relative to
wherever I'm standing at the moment. Unfortunately common operations like
merging between a local and remote archive require giving a full name.
(OK, it's not all that bad if you can copy and paste,
or go back through command history.
But it's a bit gross that it is necessary.)
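Here is a rough sketch of what I mean by relative names. The rule is my own assumption, not something Arch does: a partial name supplies the trailing components, and whatever is missing on the left is taken from the name of the tree I'm currently in. The function resolve_name is hypothetical.

```python
def resolve_name(partial, current):
    """Complete `partial` using the fully qualified name `current`.

    Assumed rule (not Arch behaviour): the partial name gives the
    trailing components; missing leading ones come from `current`.
    """
    cur_archive, cur_rest = current.split("/", 1)
    cur_parts = cur_rest.split("--")  # category, branch, version

    if "/" in partial:
        archive, rest = partial.split("/", 1)
    else:
        archive, rest = cur_archive, partial

    parts = rest.split("--")
    if len(parts) > len(cur_parts):
        raise ValueError(f"too many components in {partial!r}")

    # Keep as many leading components of the current name as are missing.
    merged = cur_parts[: len(cur_parts) - len(parts)] + parts
    return archive + "/" + "--".join(merged)
```

Standing in mbp@sourcefrog.net--2004/librsync--callback--0.11, typing just main--0.11 would then mean the main branch of the same category in the same archive.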
By contrast, Darcs has barely any naming at all: branches are
filesystem directories, and identified by their directory name
(and hostname, if remote.) You can arrange directories in whatever
organization most makes sense to you, and of course give
commands like darcs pull ../upstream to move
between them.
Finally I have one issue which I think has not been mentioned before, which is a kind of meaning/mechanism mismatch in the way distributed operation works. Arch has excellent support for maintaining and merging between multiple branches. It also has good disconnected support: I can take my laptop to a desert island for a month, hack away, and come back and import all my changes, along with their history. Importantly I can also integrate those changes with whatever has happened while I've been away. So far, so wonderful.
The way I set up to do work on my laptop is to create a new branch, stored in an archive on my laptop. Suppose the main branch is mbp@foo.org--2004/foo--main--0, and on my laptop jolly I have mbp@foo.org--jolly/foo--main--0. This is pretty clean: I can commit to the branch stored on my laptop when I'm offline, and I can merge back into the main branch when I'm online.
The problem is that this mixes mechanism with meaning. I don't want changes done on my laptop to look any different from those done online. I only want to create different branches for different streams of development, not for changes that happen to occur on disconnected machines.
Once the changes have been merged upstream you can still see what was
done, but only indirectly: all the individual commits get wrapped up
in a single change called something like "merge from jolly",
unless I manually go through and commit them individually.
I think this is a bit of a problem. I like the ability to zip up changes from a downstream branch when applying them as a single unit to an upstream branch. But I want to be able to do disconnected work completely orthogonal to which branch I'm working on, and without needing to create new branches.
Darcs never wraps up commits into larger commits, as far as I
can tell. All of my commits, once merged upstream, appear as part of
the same branch because it doesn't really remember which branch a
change was originally made on. That solves the immediate problem.
But it does seem like in some projects you really would want to
remember the way patches got bundled up...
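The difference between the two behaviours can be shown with a toy model in Python. Neither tool stores history as a list of dictionaries; merge_wrapped, merge_flat and top_level_log are hypothetical names, and the sketch only captures what the upstream log ends up looking like.

```python
def merge_wrapped(upstream, branch_name, commits):
    """Arch-like in this sketch: one upstream entry that records the merge."""
    entry = {"summary": f"merge from {branch_name}", "includes": list(commits)}
    return upstream + [entry]


def merge_flat(upstream, commits):
    """Darcs-like in this sketch: each commit lands upstream as itself."""
    return upstream + [{"summary": c, "includes": []} for c in commits]


def top_level_log(history):
    """What you see when you list the upstream branch without digging."""
    return [entry["summary"] for entry in history]
```

With two laptop commits, the wrapped history shows a single "merge from jolly" at the top level and keeps the originals one level down, while the flat history shows both commits directly and has forgotten that they ever travelled together.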
I don't know if there is a perfect solution. Maybe either of them is good enough. What do you think?
posted Mon 21 Jun 2004 in /software/vc/arch
Copyright (C) 1999-2007 Martin Pool.