Martin Pool's blog

Is swap space obsolete?

There was a thread on the CLUG list recently about whether it was still useful to have swap space, now that it's quite affordable to have a gigabyte or more of memory on a desktop machine. I think it is.

Some people have the idea that touching swap space at all is a sign that the machine is very overloaded, and so you ought to avoid it at all costs, by adding enough memory that the machine never needs to swap. This may have been true on Unix ten years ago, and may still be true on some systems for all I know but it's not true for Linux.

The meaning of the term swap has changed over time. It used to mean that entire tasks were moved out to disk, and they'd stay there until it was necessary to run them again. You can do this on machines without paged MMUs, and perhaps it was simpler to implement. However, these days almost all machines have MMUs, and so we use paging instead, where particular chunks of the program (typically 4kB) can move in or out independently. This gets more use out of the memory you have, because many programs run quite happily with only part of their virtual memory in RAM. Linux doesn't implement old-style whole-program swapping at all, and there does not seem to be any reason to add it.

I'll recapitulate the way VM works, and in particular the ways it is different on Linux from in your average computer science textbook. The basic idea is that we have a relatively small fast RAM, and a slower larger disk, and we want to get the best performance out of the combination. I will skip some details and special cases for simplicity.

All memory pages on Linux can be grouped into four classes. Firstly, there are kernel pages which are fixed in memory and never swapped. (Some other systems have pageable kernels, but at the moment the Linux developers consider it too risky.) Then there is program text: the contents of /bin/sh or /lib/ These are read-only in memory, and so are always exactly the same as the file on disk. There are file-backed pages, which might have changes in memory that haven't been written out yet. Finally there are memory pages that don't correspond to any file on disk: this includes the stack and heap variables of running tasks. When a program does a malloc(), it allocates memory of this type. Pages in this last category are called anonymous mappings, because they don't correspond to any file name.

There is no separate disk cache in Linux, like there is on old-Unix or on DOS. Instead, we try to keep the most useful parts of the disk in memory as cached pages. Linux usually doesn't directly modify the disk: rather, changes are made to the files in memory and then they're flushed out.

You'll notice that the free-memory measure on Linux machines is normally pretty low, even when the machine has plenty of memory for the tasks it's running. This is normal and intentional: the kernel tries to keep memory filled up with cached pages so that if those files are accessed again it won't have to go to disk. The free pool indicates just a few spare pages that are ready for immediate allocation. One time when the free memory will be large is shortly after bootup when the kernel just hasn't read in very much of the disk yet. Another time you'll have a lot of free memory is shortly after a large program exited: it had a lot of data pages in RAM, but those pages were deleted and so there's no useful information in them anymore.

We talk of pages as being clean when the in-memory version is the same as the one on disk, and dirty when they've been changed since being read in. Data pages need to get written back to disk eventually, and the kernel generally does this in the background. You can force all dirty pages to be written out using the sync system call.

The kernel can discard a clean page whenever it needs the memory for something else, because it knows it can always get the data back from disk. However, dirty pages need to be saved to disk before they can be reused. We call this eviction. So flushing pages in the background has two purposes: it helps protect data from sudden power cuts, and more importantly it means there are plenty of clean pages that can be reused when a process needs memory. So efficient is this flushing that at the moment my machine has only four dirty pages out of 256,000 (by grep nr_dirty /proc/vmstat).

As the kernel allocates memory, it firstly takes pages from the free pool. If that drops too low, it needs to free up more memory. Where does that come from? It needs to discard a clean page to make room. If there aren't any suitable clean pages then it needs to flush a dirty page to disk, then use it. This is very slow because the allocation can't continue until the disk write has finished, so the kernel tries very hard to avoid this by always having some clean pages around. (Remember the whole point of the VM algorithm is to avoid ever having to wait for a disk access to complete, by keeping pages in memory that are likely to be used again.)

File-backed pages can be flushed by writing them back to their file on disk. But anonymous mappings by definition don't have any backing file. Where can they be flushed to? Swap space, of course. Swap partitions or files on Linux hold pages that aren't backed by a file.

If you don't have swap space, then anonymous mappings can't be flushed. They have to stay in memory until they're deleted. The kernel can only obtain clean memory and free memory by flushing out file-backed pages: programs, libraries, and data files. Not having swap space constrains and unbalances the kernel's page allocation. However unlikely it is that the data pages will be used again — even if they're never used again — they still need to stay in memory sucking up precious RAM. That means the kernel has to do more work to write out file-backed pages, and to read them back in after they're discarded. The kernel needs to throw out relatively valuable file-backed pages, because it has nowhere to write relatively worthless anonymous pages.

Not only this, but flushing pages to swap is actually a bit easier and quicker than flushing them to disk: the code is much simpler, and there are no directory trees to update. The swap file/partition is just an array of pages. This is another reason to give the kernel the option of flushing to swap as well as to the filesystem.

As I write this, my 1024MB machine has 184MB of swap used out of 1506MB, and only 17MB of memory free. On old-Unix this would indicate a perilous situation: with numbers like this it would be grinding. But Linux is perfectly happy with these numbers: the disk is idle and it responds well.

The 184MB constitutes tasks that are running in the background, like the getty on the text console, or the gdm login manager. Neither of them has done anything much since I logged in days ago. From a certain overoptimizing point of view I ought to get rid of those tasks — although for the login manager it might be hard. But even then, there's probably lots of memory used for features of programs I am running that don't get invoked very often.

With swap, that memory is written to disk and costs very little. Without swap, it would be cluttering up RAM, as if I was down to only 840MB. Everything else would need to page a bit harder, but it wouldn't be obvious why.

Disk is cheap, so allocate a gigabyte or two for swap.

On BSD people used to advise allocating as much swap as memory, or maybe two or three times as much. Although the VM design is completely different, it's still a good rule of thumb. If anything, disk has gotten relatively cheaper over time: a typical developer machine now has 1GB of memory, but 200GB of disk. Spending one half or one percent of your disk on swap can probably improve performance.

If you are short on disk, as I am on my laptop, then use a swap file instead of a swap partition so that you can shrink or grow it more easily. (I think there is still a limit of 2GB per swap target, but you can create as many as you like.) Swap files might be slightly slower, but it's much better than not having it at all. If you ever see it get close to full, add some more.

Understanding the Linux Virtual Memory Manager has an enormous amount of detail on how this works in 2.4 (and I hope it doesn't contradict me!)

O'Reilly's System Performance Tuning approaches this from a sysadmin's point of view, but it mostly describes the way swap works under Solaris.

Archives 2008: Apr Feb 2007: Jul May Feb Jan 2006: Dec Nov Oct Sep Aug Jul Jun Jan 2005: Sep Aug Jul Jun May Apr Mar Feb Jan 2004: Dec Nov Oct Sep Aug Jul Jun May Apr Mar Feb Jan 2003: Dec Nov Oct Sep Aug Jul Jun May