Martin Pool's blog

Integrals and derivatives

The fundamental difference between Subversion and Arch/Darcs is that Subversion keeps snapshots of a tree, and the distributed systems collect changesets.

You can think of a tree at any point in time as the sum, or integral, of the changesets to date. Conversely, the changesets are just the differences between successive states of the tree, like a derivative in calculus.

changesets = d/dt(snapshots)

snapshots = integral(changesets)
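To make the analogy concrete, here is a minimal sketch in Python. It assumes a tree is just a dict mapping paths to file contents and a changeset is a dict mapping each changed path to a (before, after) pair, with None meaning the file is absent. The function names are invented for illustration and do not correspond to anything in Subversion, Arch or Darcs.

```python
def diff(old, new):
    """Changeset that turns tree `old` into tree `new`: path -> (before, after)."""
    paths = old.keys() | new.keys()
    return {p: (old.get(p), new.get(p)) for p in paths if old.get(p) != new.get(p)}


def apply_changeset(tree, changeset):
    """Return a new tree with `changeset` applied; refuse if it does not fit."""
    result = dict(tree)
    for path, (before, after) in changeset.items():
        if result.get(path) != before:
            raise ValueError(f"changeset does not apply cleanly at {path!r}")
        if after is None:
            result.pop(path, None)   # file deleted
        else:
            result[path] = after     # file added or modified
    return result


def differentiate(snapshots):
    """changesets = d/dt(snapshots): differences between successive trees."""
    return [diff(a, b) for a, b in zip(snapshots, snapshots[1:])]


def integrate(base, changesets):
    """snapshots = integral(changesets): running sum starting from `base`."""
    trees = [base]
    for changeset in changesets:
        trees.append(apply_changeset(trees[-1], changeset))
    return trees


snapshots = [
    {},
    {"README": "hello"},
    {"README": "hello", "main.c": "int main(){}"},
    {"README": "hello world", "main.c": "int main(){}"},
    {"README": "hello world"},
]
changesets = differentiate(snapshots)
rebuilt = integrate(snapshots[0], changesets)
assert rebuilt == snapshots  # the round trip loses nothing
```

As with integration in calculus, the changesets alone are not enough: you also need the starting tree, which plays the role of the constant of integration.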

Obviously they form a kind of duality: we can transform from one to the other. We can, in fact, choose whichever one is most useful for solving a particular problem. In engineering we might choose to work in either the time or the frequency domain, and sometimes switching will make a previously intractable problem seem easy. A good notation can be more than half the solution.

The user model presented by Subversion is primarily one of snapshots of trees: I make some changes to r42, commit, and now I have r43. Darcs and Arch seem to work primarily in the changeset domain: I make some changes, commit, and now I have a new changeset in my history.

I'm speaking here only about the model presented to the user: internally, it is more complex. Subversion stores deltas to save disk space; Arch caches snapshots at various points to speed retrieval. I don't think my description here would be new to any of the people working on these systems.
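The trade-off between stored deltas and cached snapshots can be sketched in a few lines of Python. This is an assumed toy design, not the storage format of Subversion or Arch: the Archive class, the cache_every parameter and the simplified changeset shape (path mapped to new content, or None for deletion) are all hypothetical.

```python
def apply_changeset(tree, changeset):
    """Apply a simplified changeset: path -> new content, or None to delete."""
    result = dict(tree)
    for path, content in changeset.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


class Archive:
    """Stores only changesets, plus a full tree every `cache_every` revisions."""

    def __init__(self, cache_every=3):
        self.cache_every = cache_every
        self.changesets = []   # changesets[i] turns revision i into revision i + 1
        self.cache = {0: {}}   # revision number -> full tree; revision 0 is empty

    def commit(self, changeset):
        tree = apply_changeset(self.checkout(len(self.changesets)), changeset)
        self.changesets.append(changeset)
        rev = len(self.changesets)
        if rev % self.cache_every == 0:
            self.cache[rev] = tree   # periodic snapshot to speed up retrieval
        return rev

    def checkout(self, rev):
        # Start from the nearest cached snapshot at or before `rev`,
        # then replay only the changesets after it.
        base = max(r for r in self.cache if r <= rev)
        tree = dict(self.cache[base])
        for changeset in self.changesets[base:rev]:
            tree = apply_changeset(tree, changeset)
        return tree


archive = Archive(cache_every=2)
archive.commit({"README": "v1"})   # revision 1
archive.commit({"main.c": "x"})    # revision 2, cached
archive.commit({"README": None})   # revision 3
```

Either way the user sees the same thing: ask for revision 3 and get a tree. Whether that tree was stored whole or rebuilt from deltas is invisible from outside.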

Snapshots might be easier for novice users to understand because they correspond directly to what is in the working directory. It is easy to point at a directory and say it is equal to r23. It's also similar to the model used by CVS.

Working in terms of changesets, or at least having the option to do so, allows more powerful operations.

For example, consider repeated merges among a related set of trees. Arch and Darcs handle this well, because they can easily remember which changesets have already come across. Subversion and CVS tend to handle it poorly, because merely tracking which version from the other tree has been merged doesn't really capture the right information.
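A toy Python model shows why the bookkeeping matters. It is a deliberately simplified picture, not how any of these tools actually work: changesets are reduced to opaque ids, and the Branch class and naive_merge function are invented for illustration. The first approach remembers the set of changesets each tree already has; the second remembers only how far into each source tree it has merged.

```python
class Branch:
    """Tree that remembers the identity of every changeset it contains."""

    def __init__(self, name):
        self.name = name
        self.log = []   # changeset ids, in the order they were applied

    def commit(self, changeset_id):
        self.log.append(changeset_id)

    def merge(self, other):
        have = set(self.log)
        new = [c for c in other.log if c not in have]   # skip what already came across
        self.log.extend(new)
        return new


def naive_merge(target_log, marks, source_name, source_log):
    """Remember only 'how far into this source have I merged'."""
    start = marks.get(source_name, 0)
    new = source_log[start:]
    target_log.extend(new)
    marks[source_name] = len(source_log)
    return new


# Changeset-aware: a1 reaches trunk via b, so merging a later brings only a2.
a, b, trunk = Branch("a"), Branch("b"), Branch("trunk")
a.commit("a1")
b.merge(a)
b.commit("b1")
first = trunk.merge(b)    # a1 and b1
a.commit("a2")
second = trunk.merge(a)   # only a2

# Position-only tracking: trunk has no record that a1 already arrived via b.
a_log, b_log, naive_trunk = ["a1"], [], []
b_marks, trunk_marks = {}, {}
naive_merge(b_log, b_marks, "a", a_log)
b_log.append("b1")
naive_merge(naive_trunk, trunk_marks, "b", b_log)
a_log.append("a2")
naive_merge(naive_trunk, trunk_marks, "a", a_log)   # applies a1 a second time
```

In the second scenario trunk ends up with a1 applied twice, which in a real tree would typically surface as a spurious conflict or a duplicated change.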

I'm not sure what the consequences of this are. I think it may mean Subversion is going to be a bit limited until it gets a more developed and natural notion of changesets.

Tom Lord said this in Diagnosing Subversion:

Suppose you have the same intuition that Walter expressed a while back, which I'll paraphrase as: "The first and most fundamental task of a revision control system is to take snapshots of working directories."

If you don't believe that that's a seductive (even though wrong) intuition, go back and look at how I replied. It took many, quite abstract paragraphs. What revision control is really about (archival, access, and manipulation of changesets) is subtle and _non_-intuitive. (Anecdotally, a few years before arch, I made an earlier attempt at revision control based on, guess what: snapshotting.) What's worse is that a set of working tree snapshots combined with a little meta-data is a kind of dual space to the kinds of records rev ctl is really about (they're logically interconvertible representations). Anything you say to a snapshotting fan about what you want to do with a changeset-librarian orientation they can reply to with "Yeah, but we could do that, too." So it's not even that the snapshot intuition is completely wrong: it's just putting an emphasis on the wrong details.

Now the transactional filesystem DB takes snapshots handily. It's ideal for that. So if you have the snapshot-intuition, and the transactional fs hammer -- you're apt to leap to a wrong conclusion: you've solved the problem!

Greg Hudson replies in Undiagnosing Subversion. I think the two mails together go a long way towards defining the different ideas of VC, though they require a bit of background.

[A] transactional FS, with a few annotations, is exactly the right hammer for version control as conceived of by Subversion ("taking snapshots of trees"). As much as one might fervently believe that this is the wrong conception of version control, it's a workable and very intuitive conception...

This is a subtle and important point, one which divides the centralized or tree-oriented version control systems (Perforce, Clearcase, CVS, Subversion) from the changeset-oriented ones (Bitkeeper, Arch). A full treatment of this issue could fill multiple journal articles, but one should recognize that it is an issue with two sides:

* Changeset-oriented version control is more powerful, but it is power which is largely unnecessary in all but the most chaotic of development projects.

* Changeset-oriented version control is harder to learn. In many environments, a shallow learning curve is the most important feature of a version control system.

* Changeset-oriented version control is hard to get right. Perhaps the best support for this statement can be found in a March 2003 note from Larry McVoy to the linux-kernel list:

http://www.ussg.iu.edu/hypermail/linux/kernel/0303.1/0130.html

* Changeset-oriented version control can be built on top of a tree-oriented foundation, although it will have all the disadvantages listed above. As Tom himself notes, tree-oriented storage is a dual to changeset-oriented storage. svk (http://svk.elixus.org/) serves as a working prototype of changeset-oriented version control implemented on top of Subversion.

I agree with Greg that changesets are harder to understand, much as calculus requires a certain a-ha moment to ascend beyond algebra. I think that suggests that all version control systems, if they want to be easy to learn, should let you do basic update-commit operations while pretending you're only working on snapshots.

I don't think you need a very complex project to get value out of changesets, though the more complex your project gets, the more you will want them. As soon as the project grows a second branch, the developers are going to want to merge between branches, and to keep maintaining both over time. Changesets inherently handle this better than snapshots.

If the branch is stubby (one bugfix to a release), or infrequently merged, or there are few developers, then you can probably cope. People have coped in CVS for years, though with some swearing and sweating and some bugs that accidentally came in through incorrect merges. The whole point of switching systems is to get something that will make development more efficient and pleasant. I think the threshold at which changesets start to pay off is pretty low.
