Martin Pool's blog

Bazaar gedit integration

Javier Derderian is working on Bazaar integration for gedit, the GNOME standard text editor, so that you can very easily record changes, push them to a server, and so on. Bazaar's model, in which a branch is just a directory with extra metadata, fits pretty well here. He just made another exciting release (or should that be "excited"? :-)

Bundle Buggy, for list-based code review

My favourite thing about Bazaar is actually a tool built on top of it: Bundle Buggy, which manages contributions and reviews.

When you send a patch or a Bazaar bundle to the developer mailing list, Bundle Buggy gets a copy of it, sees the proposed merge, creates a web page about it, linked back to the mail thread and to a bug number if there is one.

By catching all patches to the list it makes sure that things are not accidentally dropped. The important thing though is the getting-things-done workflow and the integration with Bazaar. Seeing a long list of mails can make it hard to decide where to begin, which is an invitation to procrastinate. BB tries to sort out patches which are ready to merge, which have been reviewed by others, which you ought to review yourself, which need other work and other states. It knows about our quasi-voting process, where we try to get at least one core contributor to review all patches, and you can do the reviews entirely over mail.

Bundle Buggy is (just about?) all Aaron Bentley's work, and credits Jeremy Kerr's Patchwork tool for inspiration.

We could do a lot more here, I'm sure: a nicer review interface than just showing the diff, perhaps automatically firing off a merge if the patch is approved, telling the bug and its subscribers if we believe a fix has been merged.

Version control that doesn't make your eyes bleed

Nice positive comments on the benefits of Bazaar's simple and usable interface from the Java-GNOME maintainer.

Loggerhead: new bzr branch browser

Robey wrote a new and better web bzr branch browser, Loggerhead.

Launchpad as a directory for bzr

Launchpad is the Ubuntu bug tracker, and more than that. It handles bugs, translations, support requests and so on, and can relate them to the Ubuntu packages and the original upstream source. jamesh writes of his recent work to use Launchpad as a directory for Bazaar branches.

Also, the Jokosher audio editor is switching to the Launchpad bug tracker, and they have some interesting comments.

Introduction to dvcs and Bazaar for students

Presentation by Matthieu Moy for an introduction to version control for students, using Bazaar.

Commit from bzr into Subversion

Jelmer Vernooij has been hacking on Bazaar-Subversion integration and produced some sexy results.

Basically what this is getting to is being able to make a checkout from Subversion using Bazaar, use all Bazaar's distributed features (make branches, commit, diff, etc, all offline) and then eventually merge back into Subversion. And do it all over again, remembering what was already merged, and without requiring any particular setup on the svn server.

bazaar-ng 0.0.7; into Mandriva Cooker; plans for next release

We released bzr 0.0.7, which has several nice improvements, particularly to merge handling.

Michael Sherer put builds of bazaar-ng into Mandriva Cooker, their unstable development system.

Aaron started on a tool to display bazaar-ng branch histories as graphs, using dot. Still early days but could be fun.

A few more people have come across from working on the Arch-based Bazaar code. Robert Collins has a great intuitive grasp of OOP patterns and test-driven development and I'm very happy to have his help. James Blackwell, who was previously the GNU Arch release manager, is working on the wiki.

In our next release we plan to change to weave-based storage, which will be much more compact and also allow smarter merges. We've had code for this for a while but not turned it on because of working on the UI and framework, but the time has come.

bzr in brazil

I haven't posted here for a long time, which is bad. It's funny how many people go through phases of just not wanting or getting around to blogging.

bzr (aka bazaar-ng) is coming along pretty well. We just passed 1000 commits onto the mainline since it started self-hosting in March of this year, and we're regularly pulling in changes from contributors' branches using bzr.

The emphasis has changed a lot since I started. At the time it was meant to be a research prototype to explore ideas to move back into systems based on GNU Arch. Since then, bitkeeper has exploded, and many people have started picking up the pieces, and the original schedule looks rather leisurely in retrospect.

One important consequence is that we're trying to get bzr to a finished state in its own right.

The core versioning/branching functionality is there and working: branch, push, pull, merge, status, diff, add, commit etc. Some are rather nicely polished; some less so.

I'm trying to define some specific goals for the next few months, and focus on getting them done. Near the top of the list are more compact storage and smarter mesh merging.

One outcome of the research side is modules to do the fundamental vc algorithms of 3-way merge and weave in pure Python. A few people have suggested that merge3 in particular might usefully be contributed back into the Python standard library; after a bit more maturation I'll look into that. These have practical use in allowing bzr to be installed and run without any extra dependencies like GNU diff; this is moderately important on Windows.
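To give a feel for what a pure-Python 3-way merge involves, here is a minimal line-based sketch built on the standard library's difflib. It is not bzr's merge3 module; the function names `sync_regions` and `merge3` and the conflict marker text are assumptions made for illustration. The idea is to find runs of lines that the base and both descendants share, then decide what to do with each gap between those runs.

```python
from difflib import SequenceMatcher


def sync_regions(base, a, b):
    """Return (base_start, base_end, a_start, b_start) for each run of
    lines that base, a and b all share, plus a zero-length end sentinel."""
    a_blocks = SequenceMatcher(None, base, a, autojunk=False).get_matching_blocks()
    b_blocks = SequenceMatcher(None, base, b, autojunk=False).get_matching_blocks()
    regions = []
    ia = ib = 0
    while ia < len(a_blocks) and ib < len(b_blocks):
        a_base, a_pos, a_len = a_blocks[ia]
        b_base, b_pos, b_len = b_blocks[ib]
        # Overlap of the two matches, measured in base line numbers.
        start = max(a_base, b_base)
        end = min(a_base + a_len, b_base + b_len)
        if start < end:
            regions.append((start, end,
                            a_pos + start - a_base,
                            b_pos + start - b_base))
        # Advance whichever match finishes first in base.
        if a_base + a_len < b_base + b_len:
            ia += 1
        else:
            ib += 1
    regions.append((len(base), len(base), len(a), len(b)))
    return regions


def merge3(base, a, b):
    """Merge line lists a and b, which both derive from base.
    Returns (merged_lines, had_conflict)."""
    merged, conflict = [], False
    base_i = a_i = b_i = 0
    for base_start, base_end, a_start, b_start in sync_regions(base, a, b):
        base_part = base[base_i:base_start]
        a_part = a[a_i:a_start]
        b_part = b[b_i:b_start]
        if a_part == b_part:          # same on both sides (or untouched)
            merged += a_part
        elif a_part == base_part:     # only b changed this gap
            merged += b_part
        elif b_part == base_part:     # only a changed this gap
            merged += a_part
        else:                         # both changed it differently
            conflict = True
            merged += ["<<<<<<< a"] + a_part + ["======="] + b_part + [">>>>>>> b"]
        merged += base[base_start:base_end]
        n = base_end - base_start
        base_i, a_i, b_i = base_end, a_start + n, b_start + n
    return merged, conflict


print(merge3(["a", "b", "c"], ["a", "B", "c"], ["a", "b", "c", "d"]))
# (['a', 'B', 'c', 'd'], False)
```

Nothing here needs an external diff program, which is the practical point made above.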

I've been in Brazil for the last couple of weeks with other folks from Canonical.com, mostly doing architecture-ish work on Launchpad, our application set for building and supporting a free software distribution. Not so much bzr hacking for now.

Perhaps the most immediately interesting of these is Rosetta, a web-based software translation system that is already used by a thousand translators. (If you consider the pool of people who are fluent in at least two languages and work on translating free software, then a thousand is a pretty large number.)

There are some other good concepts in there — a bug tracker that really deeply understands the way code (and bugs) spread across different free software distributions and packages. They're going to need some UI love before it's easy to understand what's happening inside, but I think that will happen.

The other very cool thing is getting to spend time with Mark and other bright people here.

Bugzilla and Bazaar-NG

I was talking to Luis Villa at LCA and had an idea of an interesting application for the kind of distributed version control enabled by bzr.

Essentially every Bugzilla installation is a fork of the project: people install it once, tend to customize it to fit their own use, and only intermittently install updates from the maintainers. In a previous job, I was assigned the task of removing all references to "bugs" from the installation, on the grounds that our product only had "issues", not "bugs". I understand Bugzilla is now getting to be more parameterized but there is still some customization and in any case the customization needs to be merged with the new upstream defaults.

If bugzilla was distributed with a small amount of bzr metadata it might be much easier for people to either integrate their changes into new releases, or to submit improvements for upstream integration.

I stepped in to do a talk on bzr when another speaker dropped out. It was received pretty well, and sparked a lot of interesting conversations.

Patch queues in bzr

The response to bazaar-ng so far has been very positive. Many people are contributing patches to fix platform problems or to add features they urgently want. It's a very pleasant experience.

One consequence is that, as for many other projects, there are patches submitted that I haven't tested and merged yet. One thing to do would be to put them in an issue tracking system of some kind, but not every project wants that. So I was thinking about what bzr can do to help within this source management problem.

Very shortly we plan to allow submission of changesets by email, as regular diffs with extra annotations. When those come in, they can be stored in a directory on my laptop or on the webserver. Because the changesets have universally unique IDs and descriptive metadata we can then ask interesting questions about the patches:

This queue could be filled by a robot that looks for patches on a mailing list, or in the public trees of past contributors. Although I call it a queue it's not necessarily a FIFO queue; the integrator (me) can pull things out in whatever order. As well as the branches that are normally captured by scm systems, bazaar-ng also helps you manage a bag of patches that don't necessarily fit together yet.

In general: human integrators have a valuable role but a tool can help make them more productive.

You can imagine extending this towards something like Rusty's trivial patch monkey, or perhaps towards filtering out security critical patches as candidates for merging into a stable series.

No more bitkeeper for kernel developers

KernelTrap, Slashdot.

The KernelTrap story seems to pretty much just print a Bitmover press release. I suppose the other side(s) of the story will come out soon.

Much of what I would say about it is already said by other people in the kerneltrap or slashdot comments. In particular, this comment is spot on:

If BitMover stated up front that all licenses would be withdrawn from all Linux developers in the event that any single Linux developer tried to reverse engineer BitKeeper, then Linus was a total idiot for agreeing to that license.

If BitMover did not state those conditions up front, then they are being evil and manipulative in yanking licenses from unrelated parties in a fit of pique over what one person is doing in his own time.

To head off any FUD at the pass: there is copious documentation of where the bazaar-ng design ideas came from. Almost all are from places other than bitkeeper; where bitkeeper has had an influence it is through things people have said about it e.g. in the kernel BK usage docs.

I have heard from some kernel maintainers that even before this they weren't using bk except to sync from Linus.

I'd really like bazaar-ng to be a great kernel development system, and I think the performance will be good, but it's going to take a while to get enough features in place for even early adopters, maybe 2-3 months.

I suspect Larry understands that if he cut them off in 2006, no one would care so much, because the open systems would be good enough. Perhaps that's why it's been done now.

It's a great demonstration of the risks of putting critical data in a system whose licence can be revoked at any time.

I was rather amused by Larry's notion that it's unacceptable for a platform vendor to compete with their application vendors. Microsoft or Apple would never do that, oh no.

The fact is that any successful product is going to attract competition. It's a predictable outcome of a free market, and not any kind of moral failing of the open source community. There is no point whining about it.

If you've been left out in the cold by Bitkeeper I encourage you to check out bazaar-ng.

file renames in bazaar-ng

Renames are completely(?) working in bzr now, after the last couple of days' work. There are two commands: move moves one or more files into a different directory, and rename renames a file (and optionally also moves it).

(I think this is one case where the unix unification of both these things into mv really causes much more trouble than it's worth.)

There is also a renames command that shows which files have moved relative to the last revision.

Demo:

demo% ls -a
./  ../
demo% bzr init
demo% echo hello > one
demo% bzr rename one two
bzr: error: can't rename: old name 'one' is not versioned
demo% bzr add one
demo% bzr rename oen two
bzr: error: can't rename: old working file 'oen' does not exist
demo% bzr rename one two
one => two
demo% bzr status
A       two
demo% bzr commit -m 'add file'
demo% bzr rename two three
two => three
demo% bzr status
R       two => three
demo% bzr renames
two => three
demo% bzr commit -m 'renamed'
demo% mkdir subdir subdir/d2
demo% bzr add subdir subdir/d2
demo% bzr status
A       subdir/
A       subdir/d2/
demo% bzr move three subdir/d2/
three => subdir/d2/three
demo% cd subdir/d2
d2% bzr rename three ../four
subdir/d2/three => subdir/four
d2% bzr renames
three => subdir/four
d2% bzr status
A       subdir/
A       subdir/d2/
R       three => subdir/four
d2% cd ../../
demo% bzr status
A       subdir/
A       subdir/d2/
R       three => subdir/four

The code is not published yet; it'll be up later this week.

bzr tutorial in Japanese

Tez Kamihara translated the bzr tutorial/mockup into Japanese

Tasteful and attractive composites

Havoc wrote something that describes very well how I am trying to run bazaar-ng:

One of the more annoying properties of the Internet is that no matter what you post to your blog (or mailing list, or chat) people add comments like: "that isn't new, the Amiga had it in 1987" or "that isn't new, we did that with punch cards in 1953" or "Longhorn has that already" or whatever. These comments are especially popular in places like osnews and slashdot.

I usually add a disclaimer to my posts specifically to head this off, but it never helps. (Shocking!)

Why post ideas? It's not to get credit for originality. It's because in this specific context, at this specific time, we should discuss and possibly implement those ideas.

Side point: there's much to be gained by simply doing something better than it's been done in the past. Apple's new Pages app, maps.google.com, there are countless examples. They aren't really "new ideas" per se, they are well-done and tasteful composites of many old ideas. And they were finished, and made available. Not a trivial thing in the modern software industry.

Some people say there is nothing really new in bzr but they like it anyhow; to me that counts as success.

Rocking the Bazaar

bzr is really coming along very well.

I have stepped up to give a seminar about it at linux.conf.au in place of a speaker who had to pull out. I didn't originally submit an abstract because I thought it wouldn't be sufficiently ready to talk about, but as it turns out it is.

If I can get remote operation working before the conference I'd like to put these commands in the abstract in the online program:

  bzr get http://bazaar-ng.org/bzr/sandbox
  cd sandbox
  vi hello.txt
  bzr send

One really important use case for a version control system is that of being told how to do basic operations by someone else. This is the first encounter most people have: wanting to try something out in the development head of some project kept in CVS. It's very important that this be as simple as possible, and that it work reliably. Doubly so because the person will typically not get their instructions from the program's manual, but rather second hand from whatever person or web page is helping them get started on that particular project.

Development has gone well in the last week or so: there is now a rudimentary mv command, and a nice ignored command that shows what is ignored and why.

Daylight saving finished in Australia, which was a handy test that bzr correctly records dates as GMT plus a timezone offset, somewhat like email:

....
----------------------------------------
revno: 100
committer: mbp@sourcefrog.net
timestamp: Sun 2005-03-27 00:41:53 +1100
message:
  - add test case for ignore files
----------------------------------------
revno: 101
committer: mbp@sourcefrog.net
timestamp: Sun 2005-03-27 19:14:45 +1000
message:
  change default ignore list
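The two log entries above can be reproduced with a small sketch. It assumes each revision stores a UTC timestamp in seconds together with the committer's offset from UTC in seconds; that pair is my guess at a reasonable representation, not a description of bzr's on-disk format, and `format_commit_time` and `utc` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def format_commit_time(utc_seconds, offset_seconds):
    """Render a stored (UTC seconds, offset) pair in the committer's local time."""
    tz = timezone(timedelta(seconds=offset_seconds))
    local = datetime.fromtimestamp(utc_seconds, tz)
    return local.strftime("%a %Y-%m-%d %H:%M:%S %z")


def utc(*fields):
    """Seconds since the epoch for a date given in UTC."""
    return datetime(*fields, tzinfo=timezone.utc).timestamp()


# The same two commits as in the log, before and after the clocks changed.
rev100 = (utc(2005, 3, 26, 13, 41, 53), 11 * 3600)
rev101 = (utc(2005, 3, 27, 9, 14, 45), 10 * 3600)

print(format_commit_time(*rev100))  # Sun 2005-03-27 00:41:53 +1100
print(format_commit_time(*rev101))  # Sun 2005-03-27 19:14:45 +1000
```

Because ordering is done on the UTC part, the revisions sort correctly even though the offset changed between them.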

jdub loves the info command, saying he'd run it all the time to see the numbers ticking over.

mbp@hope% bzr info
branch format: Bazaar-NG branch, format 0.0.4
 
in the working tree:
       101 unchanged
         0 modified
         0 added
         0 removed
         0 renamed
         2 unknown
       262 ignored
         4 versioned subdirectories
 
branch history:
       161 revisions
         2 committers
        23 days old
   first revision: Wed 2005-03-09 04:08:15 +0000
  latest revision: Fri 2005-04-01 18:27:01 +1000
 
text store:
       330 file texts
      2392 kB
 
revision store:
       161 revisions
        55 kB
 
inventory store:
       161 inventories
      3219 kB

Progress on bazaar-ng

(I should write more, but I've been busy.)

Bazaar-NG has been announced, and gained a cautiously positive reaction from the community, including in a slashdot thread about Bitkeeper. There have been some good suggestions but I think we're mostly on the right track.

The most contentious points seem to be:

  1. Choice of Python as a development language, and specifically whether the performance will be OK.
  2. Fused branches and directories. Two specific objections: separate repositories are good for backups, and shared branches are good for many development methods.

I've been using it every day for managing its own development, and it's coming along well, though there is still much to do.

I have a couple of busy weeks in April with linux.conf.au and Ubuntu Down Under. Before that starts I'd like to put in simple branch and merge commands and fix some other small things.

And there is a first snapshot release!

The next-generation Bazaar

We have a web site, bazaar-ng.org, for Canonical's prototype version-control tool. There are lots of docs, though I do have to warn that everything is still subject to change. There is not much point at the moment in trying out the code (though you are welcome to), but comments on the documents would be warmly welcomed.

My current code is in Python, and is written from scratch but takes many ideas from many other systems. So far it can do these commands to some extent: add, remove, commit, status, diff, log, help, export. I don't know if there will end up being any truly novel ideas, but perhaps the combination and presentation will appeal.

An amalgam of distributed version control

I've been hanging out with some of the GNU Arch hackers. I had started out trying to describe what GNU Arch ought to take from other systems — primarily simplicity in both model and interface. There are some good ideas in Arch, but also a lot of what is technically known as "crack". [*This somewhat offensive term seems to have acquired the meaning in GNOME of "unnecessary user-hostile complexity".] I have ended up thinking that Arch is not the right foundation for this.

A huge amount of good thinking has gone into various distributed systems over the last few years. I don't personally find any of them totally satisfying, but I think trying to pick the best ideas from each into a new codebase may work well.

There may be some code in a while.

The high points:

More later. I have a lot more documentation and the start of some code. I have shown a mockup to both Arch developers and Arch haters and both sets liked it.

Havoc has posted a list of desired features. I think my design can address most or all of them, some in quite entertaining ways. Some of the more workflow-related requirements I think I would keep out of the core tool.

Other comments from Tom and Colin.

Tom Lord interview, and related things

Interview with Tom Lord, designer of Arch. Slashdot, LWN coverage.

To be brief and a bit brutal: Arch is very clever in many ways. However, Tom is way too aggressive as an advocate. Arch might scale up to large projects, but it doesn't scale down very well to beginning users on small projects. It's complex to get started, and I'm worried by signs that work is going into adding more complex features rather than reducing the complexity. Although you can make it very fast, that's not the default.

Earlier versions were very much bound into projects being run the way Tom wanted them: weird file conventions, only committing from clean trees, and so on. It's fine to suggest them, but trying to force them on people at the same time as they learn a new system is not a good idea. Tool designers need to know where they want to force change, and where they want comfortable familiarity.

I hope these issues are fixed. Arch is probably the most promising large-project version control system at the moment, but it really needs to get over the usability hump to realize its full potential. I feel they have about a 75% chance of getting there in the next one or two years.

One remarkable thing about the LWN page is that Larry McVoy confirms that BitMover refused to sell a BitKeeper licence to the employer of a person involved with free version control products. It's his right to refuse to sell, or to revoke a revocable licence, but this is a risk that needs to be considered.

Fixing log messages in Subversion

It turns out there is a way to fix commit messages in Subversion. I just didn't know it. James very kindly points out:

In your weblog post you talked about committing a revision to a Subversion repository with an incomplete log message. You can actually correct this in an SVN repo without too much trouble.

The log message is stored as a revision property, so you can print it with the following command:

svn propget svn:log --revprop -r N

It is possible to change the property with svn propset provided that a "pre-revprop-change" hook exists for the repository (the default hook is to disallow all changes, because they aren't versioned and you might not want to allow users to change them anyway). So if the hook allows it, and you have the full log message in a file, you could run the following:

svn propset svn:log --revprop -r N -F log-message-file

Despite this, I still think darcs unrecord is far more friendly, and it handles other cases than just fixing the message. I don't think I would bother doing all that (or finding out how to do it) with Subversion if I just got the message a bit wrong, but I use unrecord moderately often.

I guess there is a fundamental difference here between centralized and distributed VC: if there is just One Tree, you have to be more careful. If everyone is allowed to make a branch on their workstation then you can give them more freedom to make mistakes, and just refuse to take the changes back.

Forgiveness in version control

I just typed half a commit message in Subversion and then accidentally committed. I didn't explain the change in as much detail as I really wanted to.

(My natural reaction would be to hit C-c after exiting vi in the hope of catching it before the commit really took place. But from past bitter experience this is likely to lock up Subversion's database, so I have tried to break the habit.)

I'm sure any programmers reading must have done this too, more than once. It's not the end of the world. In this case I'll just do without it. The other pattern people sometimes use is to make a following empty commit which has the rest of the message. It's OK, but it's a bit kludgy.

The darcs version control system has a very interesting fix for this: darcs unrecord.

Most version-control systems are strictly write-forward: once an action has been done, it cannot be undone. This is supposed to give people assurance that their work cannot be lost, but it has negative side-effects, such as being unable to fix incorrect commits. Getting the log wrong is only a relatively minor problem: committing something that should remain confidential, or mixing together two things that should be separate commits are worse. The write-forward model does not have the desirable UI property of forgiveness. Of course you can kludge it in many systems, but that's not forgiving and it tends to be dangerous.

To quote from the GNOME Human Interface Guidelines:

We all make mistakes. Whether we're exploring and learning how to use the system, or we're experts who just hit the wrong key, we are only human. Your application should therefore allow users to quickly undo the results of their actions.

I think darcs gets a reasonably good balance between allowing people to undo mistakes, and protecting them from accidentally losing work.

In darcs, you can remove commits from history — with some limitations. If the change has already been merged into a different tree, or is included in a tag, or is depended upon by something else, then you can't get rid of it without backing that out too. This is as it should be: by the time you ship it, or give the change to someone else, it's too late to discover the mistake. But if it's only in a single distributed repository, it's very friendly to allow it to be backed out. The basic simplifying insight is: if you want to not lose your work, make backups.

Darcs is currently my favourite tool for smallish projects. It's simple and powerful.

[More on darcs]

Liberal Media Bias

John Sequeira says that I write a lot about version control (which is true) and that I have a bias towards distributed version control, which is also true.

  1. Distributed development is the whole point of open source. People should be able to contribute without needing prior permission; and to work with you even if they're on another continent.
  2. I live in Australia, also known as the arse end of the earth. Roundtrips to the US are slow. I don't want to spend any more time waiting for CVS to open an SSH tunnel if I can possibly avoid it.
  3. Version control, properly conceived, ought to offer distribution at no added cost. Andrew Morton's quilt does it in a couple of hundred lines of shell. Given you can have it for free, why not? It might be useful. The challenge is to make it sufficiently simple and reliable, but I think some new systems come close.
  4. Cheap branches can be useful; distributed systems where they need never leave your workstation are a good way to get it.

I really should try Monotone (again) and Svk.

To be fair, here's what I like about Subversion:

  1. It is extremely easy to learn if you're used to CVS. If you work with people who see VC as a cost, rather than a benefit, then it may be the easiest switch.
  2. It fixes the most annoying parts of CVS: you can rename files, version tags, etc.
  3. There is a good book, a selection of GUIs, and it runs on many platforms.

On the downside, it is a bit prone to crashing and you get something only incrementally better than CVS.

Backlinks

OpenACS thread about changing VC systems. To answer some points they raise:

Arch vs tla

Google asked me: what is the difference between arch and tla?

The short answer is that they are two names for the same thing. The project was originally conceived of as arch: I suppose the idea of an arch connotes elegance, and it has an r-c sound to suggest revision control.

However arch is already in use as a command on Unix: it prints the machine architecture (e.g. i686). It's kind of a waste of a word, but nevertheless it exists and is depended upon by some scripts. So the program can't be actually called arch. For a while it was called larch, and there were forks with different command names. Some people say that Arch is the design and tla is the implementation.

Now it has settled on tla, which is either Tom Lord's Arch, three letter acronym, or doesn't stand for anything at all.

The short story is that Arch and tla are interchangeable when talking to people, but for computers you need to spell it tla.

Loss of a server in Arch and Darcs

I wrote a while ago on some things I think are less than perfect in Arch.

I think the one that bugs me most is that branches are bound to a particular location, rather than being purely distributed. (I use the word branch here for the comfort of a general audience; in Arch they would strictly be called versions which I think is a bit misleading.) I want to try to explain this a bit more.

The machine hosting sourcefrog.net crashed because of hardware problems the other week and was offline for a couple of days. I wanted to work on two projects which are hosted there, librsync and distcc. Because I am a version-control gourmet, distcc is in Arch and librsync is stored in Darcs.

Because sourcefrog is quite close to where I live, I normally work directly against its repository from Arch. I would have the choice of making downstream repositories on each machine I work on, but that would introduce a lot of "noise" merges every time I moved code from those machines onto sourcefrog. Since there's only one distcc branch, and I'm the only person who commits, I'd rather just work directly to that branch.

A consequence of this is that when sourcefrog is down, I can't commit or update at all. I am stuck.

Or almost stuck. In fact, I can cheat: make an archive on my laptop and a new branch in that archive, and commit from my working copy onto that branch. When the main machine is back up, I can merge from my branch back to sourcefrog.

This is pretty neat. I don't think I could easily do it in either Subversion or CVS. With those systems, I'd probably keep hacking and just make one big commit at the end. (Which is not really such a bad thing, but not ideal.) At best, I could keep snapshots of the tree at different points and commit each one by hand as a separate patch.

On the other hand, what I did is not documented, and I'm not sure it's entirely kosher. It does require a certain amount of understanding Arch internals and fiddling to get the merge to work back. It is a testament to the elegance and flexibility of the Arch design that it's possible to use it in this unintended way.

By contrast, in Darcs having your server go down makes no difference at all, except that you can't publish to that particular server. Because everything is always committed locally and then pushed up, the natural way of working means there's little dependency on anything but the local machine. All of this doesn't leave any major permanent record, because revision names don't depend on the machine to which they were originally committed. With the server offline you can make changes, record them, roll them back, and make branches. If the machine's going to be down for a while you can start committing to a different server, or email your changesets to someone else.

You can do this in Arch but it's more natural in Darcs.

I think at the moment I would compare them like this:

Arch has a lot of structure and metadata to let you see the history of every changeset and to organize large trees. That might be good for very large projects. It's good for small projects, though the sheer complexity can be a disincentive.

Darcs is much simpler. I think you can show someone all they need to know in ten minutes. It's naturally very distributed. I rarely or never need to wait for network traffic.

Integrals and derivatives

The fundamental difference between Subversion and Arch/Darcs is that Subversion keeps snapshots of a tree, and the distributed systems collect changesets.

You can think of a tree at any point in time as being the sum of changesets to date. It is a bit like the sum or integral of those changesets. Conversely, the changesets are just the differences between the different states of the trees, like a derivative in calculus.

changesets = d/dt(snapshots)

snapshots = integral(changesets)

Obviously they form a kind of duality: we can transform from one to the other. We can, in fact, choose which one is most useful for solving a particular problem. In engineering we might choose to work in either the time or the frequency domain, and sometimes switching will make a previously intractable problem seem easy. A good notation can be more than half the solution.
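To make the duality concrete, here is a minimal sketch in Python. It is a toy model of my own, not how Subversion, Arch or Darcs store anything: a snapshot is just a dict from path to contents, a changeset is a dict from path to an (old, new) pair, and the names diff, apply_changeset, integrate and differentiate are made up for the illustration.

```python
# Toy model (an assumption for illustration only):
#   snapshot  = dict mapping path -> file contents
#   changeset = dict mapping path -> (old contents, new contents),
#               where None means "the file does not exist".

def diff(old, new):
    """Derivative step: the changeset that turns snapshot `old` into `new`."""
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(set(old) | set(new))
        if old.get(path) != new.get(path)
    }

def apply_changeset(snapshot, changeset):
    """Integration step: apply one changeset to a snapshot."""
    result = dict(snapshot)
    for path, (before, after) in changeset.items():
        if result.get(path) != before:
            raise ValueError(f"changeset does not apply cleanly at {path}")
        if after is None:
            result.pop(path, None)   # the changeset deletes this file
        else:
            result[path] = after     # the changeset adds or modifies it
    return result

def differentiate(snapshots):
    """changesets = d/dt(snapshots)"""
    return [diff(a, b) for a, b in zip(snapshots, snapshots[1:])]

def integrate(changesets, start):
    """snapshots = integral(changesets), given a starting tree."""
    snapshots = [dict(start)]
    for changeset in changesets:
        snapshots.append(apply_changeset(snapshots[-1], changeset))
    return snapshots

history = [
    {},
    {"README": "hello"},
    {"README": "hello world", "main.c": "int main(void) { return 0; }"},
    {"main.c": "int main(void) { return 0; }"},
]

changesets = differentiate(history)
# Going to the changeset domain and back loses nothing.
assert integrate(changesets, history[0]) == history
```

As with a real integral, you need a constant of integration: the starting tree. Given that, either representation can be rebuilt from the other, which is why the choice is about which one is more convenient to think in.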

The user model presented by Subversion is primarily one of snapshots of trees: I make some changes to r42, commit, and now I have r43. Darcs and Arch seem to work primarily in the changeset domain: I make some changes, commit, and now I have a new changeset in my history.

I'm speaking here only about the model presented to the user: internally, it is more complex. Subversion stores deltas to save disk space; Arch caches snapshots at various points to speed retrieval. I don't think my description here would be new to any of the people working on these systems.

Snapshots might be easier for novice users to understand because they correspond directly to what is in the working directory. It is easy to point at a directory and say it is equal to r23. It's also similar to the model used by CVS.

Working in terms of changesets, or at least having the option to do so, allows more powerful operations.

For example, consider repeated merges among a related set of trees. Arch and Darcs handle this well, because they can easily remember which changesets have already come across. Subversion and CVS tend to handle it poorly, because merely tracking which version from the other tree has merged doesn't really capture the right information.
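Here is a rough Python sketch of the bookkeeping I mean. It is only a model under my own assumptions: a branch is reduced to an ordered list of changeset names, the Branch class and its methods are invented for the example, and it ignores conflicts and everything else that makes real merging hard.

```python
class Branch:
    """A branch modelled as nothing but the changesets it contains."""

    def __init__(self, name, changesets=()):
        self.name = name
        self.changesets = list(changesets)  # ordered changeset names

    def branch(self, name):
        """Start a new branch containing everything this one has so far."""
        return Branch(name, self.changesets)

    def commit(self, changeset_id):
        self.changesets.append(changeset_id)

    def missing_from(self, other):
        """Changesets that `other` has and this branch does not."""
        have = set(self.changesets)
        return [c for c in other.changesets if c not in have]

    def merge(self, other):
        """Pull across only what is new; return what actually came over."""
        new = self.missing_from(other)
        self.changesets.extend(new)
        return new


trunk = Branch("trunk")
trunk.commit("t1")
feature = trunk.branch("feature")

feature.commit("f1")
first = trunk.merge(feature)    # only f1 comes across

feature.commit("f2")
trunk.commit("t2")
second = trunk.merge(feature)   # f1 is remembered, so only f2 comes across
back = feature.merge(trunk)     # merging the other way brings just t2
```

Because each tree knows exactly which changesets it holds, merging repeatedly, in either direction or via a third tree, never tries to apply the same change twice. A single "last merged revision" number per branch can't express that once changes start arriving by more than one route.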

I'm not sure what the consequences of this are. I think it may mean Subversion is going to be a bit limited until it gets a more developed and natural notion of changesets.

Tom Lord said this in Diagnosing Subversion:

Suppose you have the same intuition that Walter expressed a while back, which I'll paraphrase as: "The first and most fundamental task of a revision control system is to take snapshots of working directories."

If you don't believe that that's a seductive (even though wrong) intuition, go back and look at how I replied. It took many, quite abstract paragraphs. What revision control is really about (archival, access, and manipulation of changesets) is subtle and _non_-intuitive. (Anecdotally, a few years before arch, I made an earlier attempt at revision control based on, guess what: snapshotting.) What's worse is that a set of working tree snapshots combined with a little meta-data is a kind of dual space to the kinds of records rev ctl is really about (they're logically interconvertible representations). Anything you say to a snapshotting fan about what you want to do with a changeset-librarian orientation they can reply to with "Yeah, but we could do that, too." So it's not even that the snapshot intuition is completely wrong: it's just putting an emphasis on the wrong details.

Now the transactional filesystem DB takes snapshots handily. It's ideal for that. So if you have the snapshot-intuition, and the transactional fs hammer -- you're apt to leap to a wrong conclusion: you've solved the problem!

Greg Hudson replies in Undiagnosing Subversion. I think the two mails together go a long way towards defining the different ideas of VC, though they require a bit of background.

[A] transactional FS, with a few annotations, is exactly the right hammer for version control as conceived of by Subversion ("taking snapshots of trees"). As much as one might fervently believe that this is the wrong conception of version control, it's a workable and very intuitive conception...

This is a subtle and important point, one which divides the centralized or tree-oriented version control systems (Perforce, Clearcase, CVS, Subversion) from the changeset-oriented ones (Bitkeeper, Arch). A full treatment of this issue could fill multiple journal articles, but one should recognize that it is an issue with two sides:

* Changeset-oriented version control is more powerful, but it is power which is largely unnecessary in all but the most chaotic of development projects.

* Changeset-oriented version control is harder to learn. In many environments, a shallow learning curve is the most important feature of a version control system.

* Changeset-oriented version control is hard to get right. Perhaps the best support for this statement can be found in a March 2003 note from Larry McVoy to the linux-kernel list:

http://www.ussg.iu.edu/hypermail/linux/kernel/0303.1/0130.html

* Changeset-oriented version control can be built on top of a tree-oriented foundation, although it will have all the disadvantages listed above. As Tom himself notes, tree-oriented storage is a dual to changeset-oriented storage. svk (http://svk.elixus.org/) serves as a working prototype of changeset-oriented version control implemented on top of Subversion.

I agree with Greg that changesets are harder to understand, much as calculus requires a certain a-ha moment to ascend beyond algebra. I think that suggests that all version control systems, if they want to be easy to learn, should let you do basic update-commit operations while pretending you're only working on snapshots.

I don't think you need a very complex project to get value out of changesets, though the more complex your project gets, the more you will want them. As soon as the project grows a second branch, the developers are going to want to merge between branches, and to keep maintaining both of them. Changesets inherently handle this better than snapshots.

If the branch is stubby (one bugfix to a release), or infrequently merged, or there are few developers, then you can probably cope. People have coped in CVS for years, though with some swearing and sweating and some bugs that accidentally came in through incorrect merges. The whole point of changing is to get something which will make development more efficient and pleasant. I think the point where changesets can start to pay off is pretty low.

Google's opinion of Subversion

A lot of people seem to end up at my version control scribblings when asking Google for help with Subversion database crashes.

I guess I should put some content here, because mentioning this will probably increase the effect. So...

The immediate fix is to run svnadmin recover /var/svn/fooproject, sigh, and get on with your life. It's gross that when Subversion is used as directed, people often need to run an operation which warns of the possibility of serious corruption of the repository.

The good news (well, less-bad news) is that although recovery is annoying, it rarely or never loses data. It just interrupts your work — possibly for a long time if you need a sysadmin's intervention.

A common cause of database crashes for svn+ssh is the permissions getting set incorrectly. (I don't understand why this isn't fixed in Subversion; it shouldn't be all that hard.) If the permissions on the repository are screwed up then you might need to reset them. Typically on Unix the repository should have its group set appropriately for the developers of the project, and should be g+w with setgid directories, so that new files inherit the group.

The Subversion FAQ suggests that you avoid permission problems by not using Subversion over SSH, but using the Apache module or svn-over-tcp instead. I much prefer the model of using only SSH for access, but if you're on a closed network and permissions are biting hard you might want to change.

In the absence of proper handling of permissions in Subversion, you need to make sure all your users have a umask of something like 002 on the Subversion server. You can set this either somewhere like /etc/profile or in a wrapper around svnserve.
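If it's not obvious why the umask matters, this small Python sketch shows what it does to newly created files and directories on a Unix-like system. The helper names are made up, and the wrapper at the end is only an outline: the path /usr/bin/svnserve is an assumption about where the real binary lives on your server.

```python
import os
import tempfile

def modes_created_under(umask):
    """Return (file_mode, dir_mode) for a file and a directory created under `umask`."""
    previous = os.umask(umask)
    try:
        with tempfile.TemporaryDirectory() as scratch:
            file_path = os.path.join(scratch, "some-file")
            dir_path = os.path.join(scratch, "some-dir")
            # Ask for the most permissive modes; the umask strips bits off.
            fd = os.open(file_path, os.O_CREAT | os.O_WRONLY, 0o666)
            os.close(fd)
            os.mkdir(dir_path, 0o777)
            # Keep only the rwx bits for user, group and other.
            return (os.stat(file_path).st_mode & 0o777,
                    os.stat(dir_path).st_mode & 0o777)
    finally:
        os.umask(previous)  # always restore the caller's umask

def exec_svnserve_with_umask(argv, real_svnserve="/usr/bin/svnserve"):
    """Hypothetical wrapper body: set a group-friendly umask, then become svnserve.

    `real_svnserve` is an assumed path; adjust it for your installation.
    """
    os.umask(0o002)
    os.execv(real_svnserve, ["svnserve", *argv])

for mask in (0o022, 0o002):
    file_mode, dir_mode = modes_created_under(mask)
    print(f"umask {mask:03o}: files {file_mode:03o}, directories {dir_mode:03o}")
```

With the common default of 022, anything one developer's svnserve process creates in the repository is not writable by the rest of the group, and the next person to commit hits a permission error partway through. With 002 the group write bit survives.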

If you're suffering Subversion database crashes, you're not alone! I've had it need recovery quite a number of times. Some people never get it though — I suppose it depends on how you share your repo, how often you interrupt operations and similar things.

It's good to make regular backups of your repository, just in case. We copy it several times a day, and make a complete dump of all repos every night. We also use the glorious rdiff-backup to keep snapshots of working directories going back three months.

The Subversion developers seem to acknowledge BDB crashes as a problem and are moving to a better storage system. Garrett tells me (30 seconds after I posted!)

The new non-berkeley db filesystem backend will actually be available in version 1.1 of Subversion (I use it now, and it's great), which should have its first beta release any time now. Expect an actual release supporting the non-bdb back end in a few weeks, depending on how the beta goes.

In my personal experience Arch and Darcs don't get screwed up as often as Subversion does. (I don't pretend that anyone else would necessarily have the same results.) I don't think the effect is so strong as to make you switch, but if you're thinking about moving from CVS to something else, you might keep that in mind.

CVS-style development with Darcs

Thread on CVS-style centralized development with Darcs, and strong and weak points when used in that mode.
