Copyright (C) 1998, 1999, 2000 by Martin Pool
New 2004-10-05: The ext3cow filesystem is an active project to add copy-on-write versioning features to ext3.
NOTE: This project is incomplete, abandoned and unsupported. No work has been done on it since 2000. It is presented here only for the information of people interested in implementing something similar.
Hi, I have looked over the old snapfs code, and I would like to ask what you think needs to be done to bring it into an acceptable state on Linux 2.6.
The only way to find out would be to try. I don't think the filesystem interface has changed a lot, so it should not be too hard. Of course the existing code is not complete.
Also, why did you stop developing it?
I ran out of time.
Also, would it not be a good idea to layer it on top of an existing logging filesystem?
That would probably be equally hard in other ways, since you might need to patch the filesystem to make sure that it makes consistency points. Perhaps you could layer it under the filesystem using a special block device.
At this time it is probably easier to just create snapshots using EVMS or LVM, with a filesystem that works properly with it.
In 1998 I started writing a filesystem for Linux that I called snapfs. The project is no longer active, but this web page serves as partial documentation for anyone who is interested.
I started writing snapfs after attending a seminar about NetApp fileserver appliances. These seem to be excellent devices in many different ways, but the feature that particularly stood out to me was their facility to save snapshots of the filesystem. I wasn't in the market to buy $10k+ network-attached storage, but I did want this feature on my Linux workstation.
I was generally interested in learning more about the way the kernel works, and so I thought I would try to achieve the same features under Linux.
The code is all written from scratch, although it draws quite heavily on the excellent ext2fs code, and on reading whitepapers from many sources.
NetApp filers are network-attached storage devices that talk to other machines over NFS, SMB, and similar protocols. I understand that there's some kind of BSD kernel inside the box, with a completely customized filesystem that they call WAFL, for Write Anywhere File Layout.
Apparently integration gives them a much better price/performance/size metric than other solutions -- more power to them if so.
The WAFL acronym indicates one strength of tight integration from filesystem down to the hardware: the filesystem knows the current location of the drive heads and so when it needs to write blocks out it can write to the free block nearest to the heads, rather than forcing a location decision.
The coolest feature, to my mind, was snapshots, which give the user the ability to see the filesystem as it was at some earlier point in time. Accidental deletion or overwriting of files, or even whole directories, becomes much less of a problem, as they can be trivially recovered from the snapshot taken five minutes, an hour, or a day ago. There is some kind of administrative command (ioctl?) that says "mark the current state as a snapshot", which typically would be issued from a cron job.
Snapshots are implemented by a kind of copy-on-write scheme, where blocks marked as belonging to any previous snapshot are treated as read-only. If the system wants to write to a data block, inode, or directory entry used by a previous snapshot, the filesystem transparently allocates a block elsewhere on disk. Everything that points to the old block has to be updated to point to the newly allocated one, including indirect blocks and inodes, and these in turn will have to be re-allocated. So just after a snapshot there will be a ripple of reallocation until the blocks containing the directories and inodes being accessed, and all their parents, have been copied. After that, things will quieten down, as the newly allocated blocks can be written multiple times until the next snapshot. (This makes the scheme much more viable in terms of disk space than simple WORM.) It's possible all this extra allocation would cause a lot of fragmentation; I decided not to worry about that for the moment.
When a snapshot is later deleted, its blocks become available for reuse. So the extra space consumed is proportional to the sum of the differences between the snapshots: if the disk is 70% full but the working set of files touched in a week is only 5%, then it should be fine.
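As a minimal sketch of the copy-on-write decision described above (hypothetical names and a toy allocator, not the actual snapfs code):

    /* Hedged sketch, not the real snapfs code: the copy-on-write decision
     * made when a block is about to be modified.  The "frozen" bitmap and
     * the trivial allocator are stand-ins for real on-disk structures. */

    #include <stdint.h>
    #include <stdbool.h>

    #define NBLOCKS 1024

    static bool     frozen[NBLOCKS];   /* block belongs to an earlier snapshot */
    static uint32_t next_free = 512;   /* toy stand-in for a real allocator */

    /* Return a block number that may safely be written.  If the block is
     * part of a snapshot, allocate a fresh one; the caller must then update
     * whatever pointed at 'blk' (inode or indirect block), which may in turn
     * trigger the same copy for its own parent. */
    static uint32_t cow_block(uint32_t blk)
    {
        if (!frozen[blk])
            return blk;                /* already private to the live filesystem */

        uint32_t copy = next_free++;
        frozen[copy] = false;
        /* real code would copy the old contents into the new block here */
        return copy;
    }

    /* Taking a snapshot is then just "freeze everything currently in use". */
    static void take_snapshot(void)
    {
        for (uint32_t b = 0; b < NBLOCKS; b++)
            frozen[b] = true;          /* in practice, only allocated blocks */
    }

    int main(void)
    {
        take_snapshot();
        uint32_t b = cow_block(7);     /* first write after the snapshot copies */
        (void)b;
        return 0;
    }

A real implementation would keep the "frozen" state in the on-disk block bitmaps and inode tables rather than an in-memory array.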
The other great advantage this gives is that if all the filesystem metadata is flushed before making the snapshot, then those blocks will be known to be consistent even if the system crashes. By definition nothing in them can have been changed. So, if the system crashes it can just roll back to the most recent fully-committed version and continue with no fscking delay.
I imagine in practice a fsck would still be necessary to preen the filesystem, as on journalling systems. But in the long term, if the code and hardware were well trusted, then perhaps it could be omitted on most startups.
AT&T Plan 9 has a similar kind of scheme, and it has more than a bit in common with HSM (hierarchical storage management) systems. It is quite distinct from journalled or log-structured filesystems: all of the ones I've seen are concerned only with allowing roll forward or back to some consistent point, and not with general time-travel. Perhaps that would be possible?
I never quite decided how to make the snapshot versions available: it was likely to be either magic changes to the filename (/home/mbp/~2~/hello.c) or mounting the filesystem again with special options (mount -o version=2 /dev/hda3 /home/v2).
For me, and at least in a prototype, the snapshot feature was so cool that it overrode considerations of disk space, performance, and so on. You can buy more disk. You can't buy back clobbered files.
Once one steps into filesystem development even in a very humble way there are a multitude of other issues begging to be addressed. I was not necessarily planning to try and fix them all in snapfs, but I did want to leave space for the solution and I learned a bit about them as I progressed. (More notes are in the source tarball.)
It would be nice if Linux supported ACLs, as other systems like WinNT and Solaris do. There is a POSIX draft (perhaps now a specification) that defines a quite clean way for them to integrate with traditional Unix mode bits, so that one can say "owned by mbp, r/w by staff, but also readable by wwwdata".
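Roughly, the draft models an ACL as a list of (tag, qualifier, permissions) entries layered over the mode bits. A purely illustrative sketch, with made-up constants and field names rather than the draft's actual interface:

    /* Illustrative only -- the tag names, field layout and permission values
     * are invented for this sketch, not taken from the POSIX draft. */

    #include <sys/types.h>

    enum acl_tag { ACL_USER_OBJ, ACL_USER, ACL_GROUP_OBJ, ACL_GROUP,
                   ACL_MASK, ACL_OTHER };

    struct acl_entry {
        enum acl_tag tag;     /* which class of principal this rule covers */
        uid_t        id;      /* uid or gid, only used for ACL_USER/ACL_GROUP */
        unsigned     perms;   /* read/write/execute bits, as in mode bits */
    };

    /* "owned by mbp, r/w by staff, readable by wwwdata" might become: */
    static const struct acl_entry example[] = {
        { ACL_USER_OBJ,  0, 06 },   /* the file's owner (mbp): rw- */
        { ACL_GROUP_OBJ, 0, 06 },   /* the owning group (staff): rw- */
        { ACL_USER,      0, 04 },   /* extra entry; 0 stands in for wwwdata's uid: r-- */
        { ACL_OTHER,     0, 00 },   /* everyone else: --- */
    };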
Not having ACLs is often less of a problem in practice than one might expect: on platforms that do have them, they seem to be rarely used. Anecdotally, on NT they are used in a way that is perversely less maintainable than owner/group/other permissions, because they are harder to summarize at a glance.
I think another reason why ACLs are not so critical is that most users of Unix machines never log into them: rather, their access is mediated by a web server, database server, or vertical application. In general each of these exists as a single userid under Unix, and imposes its own appropriate security scheme.
Nevertheless, it would be good if Linux offered ACLs to those who want them. There are some unofficial patches in circulation. Perhaps they will eventually be mainstream.
ext2fs stores directory entries in a linear structure, so scanning a directory for a particular file takes time proportional to the number of entries. The kernel hashes names in memory, but this only helps when there is plenty of RAM relative to the working set of disk files. In practice this means that directories start to become quite slow once they hold more than a couple of thousand files.
The typical solution is to store directory entries in some kind of tree or (perhaps) hash structure. There is stubbed support for this in the ext2fs on-disk format, and I think it's going into ext3fs.
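The gain is easy to see in a toy model: hash the name once and scan only one bucket rather than the whole directory. The structures below are hypothetical and not the ext2/ext3 on-disk format:

    /* Hedged sketch of the idea behind hashed directories: average lookup
     * cost drops from O(entries) to O(entries / DIR_BUCKETS). */

    #include <stdint.h>
    #include <string.h>

    #define DIR_BUCKETS 256

    struct dirent_node {
        struct dirent_node *next;
        uint32_t            ino;
        char                name[256];
    };

    struct hashed_dir {
        struct dirent_node *bucket[DIR_BUCKETS];
    };

    static uint32_t name_hash(const char *name)
    {
        uint32_t h = 5381;                     /* djb2-style string hash */
        while (*name)
            h = h * 33 + (unsigned char)*name++;
        return h;
    }

    /* Scan only the bucket the name hashes to; return 0 if not found. */
    static uint32_t dir_lookup(const struct hashed_dir *dir, const char *name)
    {
        const struct dirent_node *e = dir->bucket[name_hash(name) % DIR_BUCKETS];
        for (; e; e = e->next)
            if (strcmp(e->name, name) == 0)
                return e->ino;
        return 0;
    }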
IDE disks on PCs have caused massive annoyance by arbitrary limits at 508MB and 8GB. A filesystem being designed now ought to run all the way up to 2**64-byte files with no problems, even though that amount's not viable on typical Linux platforms today.
Using uint64s everywhere would waste space, so perhaps some kind of 32bit segment + 32bit offset would be appropriate. XFS does this.
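One way to picture the segment+offset idea (a hypothetical layout, not XFS's or snapfs's actual format): a 64-bit position splits into a 32-bit segment number and a 32-bit offset, presumably so that structures which only ever deal with one segment can carry a single 32-bit field.

    /* Hypothetical split of a 64-bit byte position into segment + offset. */

    #include <stdint.h>

    #define SEGMENT_SHIFT 32

    struct disk_offset {
        uint32_t segment;   /* which 4GB segment of the file */
        uint32_t offset;    /* byte offset within that segment */
    };

    static inline uint64_t to_bytes(struct disk_offset o)
    {
        return ((uint64_t)o.segment << SEGMENT_SHIFT) | o.offset;
    }

    static inline struct disk_offset from_bytes(uint64_t pos)
    {
        struct disk_offset o = { (uint32_t)(pos >> SEGMENT_SHIFT),
                                 (uint32_t)pos };
        return o;
    }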
Keeping track of single blocks in the inode and indirection blocks and in the free-block map becomes inefficient for large files. Better would be to allocate "extents" of blocks, and to keep them as close to contiguous as possible.
I didn't implement this, but it would probably be the way to go.
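A minimal sketch of what an extent record might look like, with illustrative field names rather than any real filesystem's format:

    /* One extent describes a contiguous run of blocks, instead of one
     * pointer per block as in classic ext2fs indirect blocks. */

    #include <stdint.h>

    struct extent {
        uint64_t logical;   /* first file block this extent covers */
        uint64_t physical;  /* first disk block of the contiguous run */
        uint32_t length;    /* number of blocks in the run */
    };

    /* Map a file block to a disk block; return 0 if the block is a hole. */
    static uint64_t extent_map(const struct extent *ex, unsigned n, uint64_t blk)
    {
        for (unsigned i = 0; i < n; i++)
            if (blk >= ex[i].logical && blk < ex[i].logical + ex[i].length)
                return ex[i].physical + (blk - ex[i].logical);
        return 0;
    }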
Hard links are perhaps the biggest problem with the traditional Unix filesystem: they're not often used except for representing parent directories, but they introduce complexity in several places, and they make application-level programs a little more complicated too.
The worst effect is that, unlike in most other filesystems, the metadata for a file (its struct stat, more or less) is not the same as its directory entry. This is reflected inside the Linux kernel in the twins struct inode and struct dentry, as they were in 2.0. In general the inode will not be close to the dentry on disk, so this costs us an extra IO if they're not cached, or additional cache usage if they are. It also imposes extra cost in the kernel through extra pointer indirections and memory allocations.
I would have liked to explore optimizing the common case by storing inodes directly in the directory. As a start, we could simply deny creation of hard links, as the VFAT system does -- I expect in many cases symbolic links would be almost as good.
QNX supports hard links but treats them as a special case: in most cases the inode and directory entry are combined. This makes things a little more complicated when the primary entry for an inode is deleted, but presumably it makes the common case of readdir+stat, and therefore listing a directory, much faster.
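A hedged sketch of what a combined entry might look like (an entirely hypothetical layout, not QNX's or snapfs's): the inode lives inside the primary directory entry, and hard links become a second entry type pointing back at it.

    #include <stdint.h>

    enum dentry_kind { DENT_PRIMARY, DENT_LINK };

    struct fat_dirent {
        enum dentry_kind kind;
        char             name[256];
        union {
            struct {                    /* DENT_PRIMARY: the inode is stored here */
                uint32_t mode, uid, gid;
                uint64_t size;
                uint64_t blocks[12];    /* direct block pointers */
            } inode;
            uint64_t primary;           /* DENT_LINK: disk location of the primary entry */
        } u;
    };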
ext2fs stores short symlinks within the inode itself. It would be possible to do the same even with tiny files, although that is a little hard under Linux -- bmap and the default operations don't understand reading just a small part of a block that contains inodes. reiserfs has a better solution for this.
Code running within the kernel naturally enough uses an entirely different programming interface from that of user-mode programs. This makes plenty of sense, given that kernel modules do IO through the buffer cache and interact directly with kernel data structures.
Still, testing and debugging kernel code is much harder than working in user mode: there's no gdb, only printk, and bugs can clobber the whole machine or worse. So it would be nice to run the code in user mode for testing, and then switch to kernel mode.
In addition, all filesystems need some code that runs in userspace, to do things like mkfs, fsck, and so on. I suppose it would be possible to have everything in the kernel and an ioctl to kick off one of these operations, but that would be unusual to say the least.
So, at the bottom level there is IO code which comes from the environment (either the kernel or libc). In the middle there is code to parse directories and inodes and do similar tasks common to both userspace and kernelspace. Further up there are tasks normally done only in one or the other: laying out initial structures for mkfs, or extending a file for the kernel. (At this level the kernel may in fact be a strict subset of userspace, as it's possible that debugfs or fsck tools will potentially want to use every possible operation.) And right at the top there is an interface to the caller: either the VFS layer or the command line.
The Linux NTFS driver (at least as of 1999) has a single codebase with an abstract IO layer that interfaces to the buffer cache when linked into the kernel, and to mmap when in userspace. ext2fs, by contrast, has completely separate code for the two uses, apart from shared header files, or at least it did when I last looked. In the ext2fs case, of course, the same structures and userspace programs will be used on several platforms: Linux and GRUB at least, and possibly others. Both NTFS and ext2fs seemed clean and maintainable in different ways, so I found no real guidance.
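The abstract-IO approach can be pictured as a small table of block operations that the shared mid-layer code calls, with the kernel build wiring it to the buffer cache and the userspace build to plain file IO. A hedged sketch with hypothetical names (the real kernel side would use bread() and friends rather than these stubs):

    #include <stddef.h>
    #include <stdint.h>

    struct blockdev_ops {
        int (*read_block)(void *ctx, uint64_t blkno, void *buf, size_t size);
        int (*write_block)(void *ctx, uint64_t blkno, const void *buf, size_t size);
    };

    /* Mid-layer code is written once against the ops table ... */
    static int read_superblock(const struct blockdev_ops *ops, void *ctx, void *sb)
    {
        return ops->read_block(ctx, 0, sb, 1024);
    }

    #ifndef __KERNEL__
    #include <unistd.h>

    /* ... and the userspace build backs it with pread() on an image file. */
    static int file_read_block(void *ctx, uint64_t blkno, void *buf, size_t size)
    {
        int fd = *(int *)ctx;
        return pread(fd, buf, size, (off_t)(blkno * size)) == (ssize_t)size ? 0 : -1;
    }

    static const struct blockdev_ops file_ops = { file_read_block, NULL };
    #endif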
I thought about using userfs to run my code in userspace but plugged into the kernel. At the time it seemed to be broken or unmaintained. Another alternative would be to write a user mode NFS server and mount it locally, although the NFS interface is not exactly the same as the local kernel VFS interface.
In any case, since this was my first significant kernel work I was pretty happy to see all the issues before trying to abstract them away. In the future I think I would use userfs or something similar.
Another smaller issue in this area is translation between the on-disk representation and the in-memory representation. On disk we have to use standard endianness, whereas in core we can use native integers and pointers to things in memory. It's almost a shame, really, because in other respects the buffer cache makes interaction with on-disk structures so transparent. This is not particularly a problem, but just something to be aware of.
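A minimal sketch of the usual pattern, assuming a made-up on-disk inode layout (the kernel's own helpers for this job are le32_to_cpu() and friends):

    #include <stdint.h>

    struct disk_inode {          /* exactly what is stored on disk */
        uint32_t size_le;        /* little-endian regardless of host */
        uint32_t blocks_le;
    };

    struct mem_inode {           /* what the rest of the code works with */
        uint32_t size;
        uint32_t blocks;
    };

    /* Interpret a raw 32-bit field as little-endian on any host. */
    static uint32_t le32_to_host(uint32_t v)
    {
        const unsigned char *p = (const unsigned char *)&v;
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    }

    static void inode_from_disk(struct mem_inode *m, const struct disk_inode *d)
    {
        m->size   = le32_to_host(d->size_le);
        m->blocks = le32_to_host(d->blocks_le);
    }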
I tried to implement this in the filesystem layer, but there are other possibilities.
One option is to do things as a layer on top of the filesystem that preserves individual versions of files. This would be fine, but probably wouldn't protect against damage to the whole directory tree, which is what I wanted to achieve.
Rather than building a new filesystem, one could write a block device with snapshot support and then run ext2fs on top. This would give less control: for example, I think the filesystem would have to be remounted read-only for a moment to take a consistent snapshot, and ext2fs's allocation optimizations would be confounded if the blocks were not actually allocated linearly. This could perhaps plug in through the loopback block device, the network block device, or some similar hook. I suspect it would be much less code than implementing a whole filesystem from scratch.
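A rough sketch of the block-device idea, with hypothetical names and a redirect-on-write table. LVM snapshots keep a similar exception table, though they copy the old data out to a spare area rather than redirecting new writes:

    #include <stdint.h>
    #include <stdbool.h>

    #define DEV_BLOCKS   4096
    #define SPARE_START  3072          /* region reserved for redirected blocks */

    static uint32_t remap[DEV_BLOCKS]; /* 0 = not written since the snapshot */
    static uint32_t spare_next = SPARE_START;
    static bool     snapshot_active;

    /* Where should a write to logical block 'blk' actually go? */
    static uint32_t map_write(uint32_t blk)
    {
        if (!snapshot_active)
            return blk;
        if (remap[blk] == 0)
            remap[blk] = spare_next++;  /* first write since the snapshot */
        return remap[blk];
    }

    /* Live reads follow the remapping; snapshot reads always see the
     * original location, which is never overwritten while frozen. */
    static uint32_t map_read(uint32_t blk, bool from_snapshot)
    {
        if (from_snapshot || remap[blk] == 0)
            return blk;
        return remap[blk];
    }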
At the time I was working entirely on my home machine, wistful.humbug.org.au. As was probably inevitable, buggy code eventually corrupted the buffer cache and whacked out my /home/ partition. Fortunately I had some backups, but the experience was scary enough that I decided to put the project aside until I had a sacrificial machine on which to test.
To some extent the project was a success: I learnt a lot about the internals of the filesystem area of the kernel, and had a lot of fun trying to design a decent filesystem.
So, the current code is written to the 2.0 VFS layer, and will probably need to be updated to plug into 2.2 or 2.4. It will, more or less, let you
Snapshots are partly implemented but they are not usable.
The code is licensed under the GNU GPL and available for download here: snapfs-19990401.tar.bz2. It is not usable as a filesystem at the moment, but you are welcome to do with it what you wish, within the licensing conditions. I'd be happy to hear comments.
I would still like to go back to it sometime, or perhaps implement the same feature inside some other filesystem or at a different level. Certainly there are many interesting development projects happening in Linux filesystems these days, with reiserfs, XFS, AFS, and ext3fs among the most promising.