Martin Pool's blog

Stability is a chimera

LWN has yet another thread on kernel stability. Some people would like to be able to install new drivers on old kernels. This is the mirror image of something also often requested: being able to use old (typically binary) drivers on new kernels.

Satisfying both at once requires not just backward but also forward compatibility: the interfaces between interchangeable parts of the kernel must not change *at all* within a stability window. This is very constraining to development, and since people would like to use years-old binaries the window would have to be very large. So in fact this is possible only if we define the stability window to be exactly one kernel release: anything that works against 2.4.20 will continue to work against 2.4.20 forever.

More generally: everyone would like a stable series that lets them get the bug fixes they care about, but without any unnecessary changes. The problem is that different people have different requirements for what is allowed into the stable kernel series. Some people really need support for some new hardware, but to everyone else that's unnecessary churn. Some people want nothing but security fixes; some people want security and bug fixes; some people want new drivers but no core kernel changes; some people need scheduler changes to run their workload effectively.

It is fundamentally impossible to reconcile all of these requirements; therefore the optimal approach is to develop the mainstream in the most efficient way possible, and let people fork stable branches as they want. If you want to build a branch which has only bug and security fixes and nothing more then you can.

Fixed again in 2.6.9

A different Minolta bug apparently caused breakage this time. There is a workaround by Matthew Dharm for 2.6.9-rc4. The necessary patch hunk is

diff -Nru a/drivers/usb/storage/transport.c b/drivers/usb/storage/transport.c
--- a/drivers/usb/storage/transport.c   2004-10-10 19:59:55 -07:00
+++ b/drivers/usb/storage/transport.c   2004-10-10 19:59:55 -07:00
@@ -908,6 +911,7 @@
   int result;
        /* issue the command */
+       us->iobuf[0] = 0;
        result = usb_stor_control_msg(us, us->recv_ctrl_pipe,
                                           USB_DIR_IN | USB_TYPE_CLASS | 
@@ -918,7 +922,7 @@
            result, us->iobuf[0]);
        /* if we have a successful request, return the result */
-       if (result == 1)
+       if (result >= 0)
           return us->iobuf[0];

This should apply against anything reasonably recent from 2.6.

I can't really recommend attaching this camera over USB though, because it runs down the NiMH batteries extremely quickly. It's better, if at all possible, to read directly from the card through a PCMCIA carrier or a printer or a USB attachment. It may be quicker too, if you have a better interface than the USB1 on the DiMAGE.

Greg K-H replies to Sun

A rebuttal of a single misinformed Sun developer.

Eric Schrock from Sun says

The main reason we can't just jump into Linux is because Linux doesn't align with our engineering principles, and no amount of patches will ever change that.

(That sounds more like a problem for Sun than for Linux.)

There is a big LWN thread about it.

My guess would be that OpenSolaris will be a kind of walled garden: licenced so that code cannot be merged into other projects, and in any case differences in architecture will make that hard. Releasing it will be useful to people working on Solaris, as read-only access always is: the final documentation, useful in tracking down bugs, and so on. So it'll be mildly positive for existing users and get Sun some good press, but it won't change the overall curve.

I can't imagine choosing a Sun machine just because I can get (some?) operating system source. What percentage of Macintosh buyers care about OpenDarwin? I like Macs, and I like open source, but I don't think OpenDarwin would be a compelling reason to buy one.

broken again

DiMAGE7i usb-storage support seems to be broken again as of I need to see if the patch is still in there, or needs to be updated.

Async disk access

Courtesy of slamb, pphaneuf.


No POSIX system supports non-blocking disk I/O through the O_NONBLOCK interface. Some support it through a separate asynchronous I/O interface but due to its complexity and non-portability, few programs actually use it. Also, it doesn't support an async open(2).

djb argues, like you do, that they should support this through a more normal mechanism.

Why is it so hard to just do O_NONBLOCK on disk files, as you can do on pipes or sockets? Because Linux disk IO is fundamentally different: it is tightly integrated with the virtual memory system.
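To make the contrast concrete, here is a minimal Python sketch. The function names and the temporary file are made up for illustration. It opens a regular file with O_NONBLOCK and reads from it, then does a non-blocking read on an empty pipe. The flag is accepted in both cases, but only the pipe ever reports that the read would block.

```python
import os
import tempfile


def read_regular_file_nonblocking(path, size=65536):
    """Open a regular file with O_NONBLOCK and read from it.

    The flag is accepted, but the read just completes (waiting on the
    disk if it has to); it never reports "would block".
    """
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        return os.read(fd, size)
    finally:
        os.close(fd)


def read_empty_pipe_nonblocking():
    """Contrast: a non-blocking read on an empty pipe fails immediately."""
    r, w = os.pipe()
    try:
        os.set_blocking(r, False)
        try:
            os.read(r, 1)
            return "data"
        except BlockingIOError:
            return "would block"
    finally:
        os.close(r)
        os.close(w)


def demo():
    """Run both reads against a small temporary file and an empty pipe."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"hello")
        path = f.name
    try:
        return read_regular_file_nonblocking(path), read_empty_pipe_nonblocking()
    finally:
        os.unlink(path)
```

A tiny file like this will be in the page cache anyway, so the sketch only shows that the flag makes no visible difference; it doesn't measure how long a cold read would stall.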

Related stuff from Christopher Baus.

Why doesn't free memory go down

I wrote a little while ago about interpreting swap usage on Linux. A related question is why Linux always seems to have so little free memory. Does this indicate some kind of problem in Linux or the application? No.

Someone at work asked (paraphrasing):

I have a process that uses a lot of memory while it's running, so the free memory (shown by free or top) goes right down to 60MB out of 8100MB. But when the process exits, the free memory doesn't go back up. Why isn't memory released when the process exits?

The short answer is that you should never worry about the amount of free memory on Linux. The kernel attempts to keep this slightly above zero by keeping the cache as large as possible. This is a feature not a bug.
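As a rough illustration, the sketch below parses text in the format of /proc/meminfo and reports free memory alongside the cache. The helper names and the sample numbers are made up (chosen to resemble the 60MB-free machine in the question). Treating Buffers plus Cached as reclaimable is a simplification, since some of it may be dirty or otherwise pinned.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Return free memory, cache, and their sum, all in kB."""
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    return {
        "free_kb": info["MemFree"],
        "cache_kb": cache,
        "free_plus_cache_kb": info["MemFree"] + cache,
    }


# Hypothetical sample, not taken from a real machine.
SAMPLE_MEMINFO = """\
MemTotal:        8294400 kB
MemFree:           61440 kB
Buffers:          204800 kB
Cached:          7168000 kB
"""
```

On a live system you would pass open("/proc/meminfo").read() instead of the sample. The point is that the small MemFree figure says little on its own; most of the rest is cache that the kernel can give back when it is needed.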

If you are concerned about VM performance then the most useful thing to watch is the page in/out rate, shown by the "bi" and "bo" columns in vmstat (blocks read from and written to disk; swap traffic alone is shown in "si" and "so"). Another useful measure (2.6 only) is the "wa" column, showing the percentage of CPU time spent waiting for IO. "wa" is probably the one you have to worry about most, because it shows CPU cycles that are essentially wasted because the VM is too slow.
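If you want to pull those columns out of vmstat output in a script, something like the following sketch would do. The function name and the sample output are hypothetical; it assumes the usual layout where a header row naming the columns is followed by rows of integers.

```python
def parse_vmstat(text):
    """Return the last data row of vmstat output as a {column: value} dict."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    for i, fields in enumerate(rows):
        # The column header is the row that names the bi and bo columns.
        if "bi" in fields and "bo" in fields:
            header, data = fields, rows[i + 1:]
            break
    else:
        raise ValueError("no vmstat column header found")
    if not data:
        raise ValueError("no data rows after the header")
    return dict(zip(header, (int(v) for v in data[-1])))


# Hypothetical sample output, for illustration only.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 188416  17408  20480 512000    0    0    12    34  250  400  5  2 92  1  0
"""
```

Remember that the first row vmstat prints is an average since boot, so when sampling at an interval it is the later rows that show the current rate.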

As you said, linux is keeping the free memory into buffer cache, but when there is no process running how come the buffer cache is having 4GB and how it is released 3GB to free memory.

Disk cache is maintained globally, not per-process. Files can remain in cache even after the process that was using them exited, because they might be used by another process. Freeing the cache would mean discarding cached data. There's no reason to do that until the data is obsolete (e.g. files are deleted) or the memory is needed for some other purpose.

After a while the free memory goes back up again.

Pages only become free when they're evicted to build up the free pool (see below), or when nothing useful can be stored in them. If there are gigabytes of free memory then the main cause is that the kernel doesn't have anything to cache in them.

This can happen when, for example, a file that was cached was deleted, or a filesystem is unmounted: there's no point keeping those pages cached because they can't be accessed. (Note that the kernel can still cache a file which is just unlinked, but still in use by applications.)

A similar case is that an application has allocated a lot of anonymous memory and then either exited or freed the memory. That data is discarded, so the pages are free.

Note that flushing the data to disk makes the pages clean, but not free. They can still be kept in memory in case they're read in the future. (Clean means the in-memory page is the same as the on-disk page.)

The guy in the second row asks:

So if Linux tries to keep the cache as large as possible, why is there 60MB free rather than zero? Wouldn't it be better to cache an additional 60MB?

Linux keeps a little bit of memory free so that it is ready as soon as it needs to allocate more memory. If the extra 60MB was used for cache too then when a new allocation was required, the kernel would have to go through the cache and work out what to evict. Possibly it would need to wait for a page to be written out. This would make allocation slower and more complex. So there is a tradeoff where the page cache is made slightly smaller so that allocation can be faster and simpler. The kernel keeps just a few free pages prepared in advance.

(If you have any questions mail me and I'll try to answer them here.)

Is swap space obsolete?

There was a thread on the CLUG list recently about whether it was still useful to have swap space, now that it's quite affordable to have a gigabyte or more of memory on a desktop machine. I think it is.

Some people have the idea that touching swap space at all is a sign that the machine is very overloaded, and so you ought to avoid it at all costs, by adding enough memory that the machine never needs to swap. This may have been true on Unix ten years ago, and may still be true on some systems for all I know, but it's not true for Linux.

The meaning of the term swap has changed over time. It used to mean that entire tasks were moved out to disk, and they'd stay there until it was necessary to run them again. You can do this on machines without paged MMUs, and perhaps it was simpler to implement. However, these days almost all machines have MMUs, and so we use paging instead, where particular chunks of the program (typically 4kB) can move in or out independently. This gets more use out of the memory you have, because many programs run quite happily with only part of their virtual memory in RAM. Linux doesn't implement old-style whole-program swapping at all, and there does not seem to be any reason to add it.
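To get a feel for the granularity, this small sketch works out how many pages a given amount of memory is. The helper names are made up; the 4kB figure is the typical size mentioned above, and the real value on a given machine can be queried with os.sysconf.

```python
import os


def pages_for(size_bytes, page_size):
    """Number of pages needed to hold size_bytes, rounding up."""
    return -(-size_bytes // page_size)


def system_page_size():
    """Page size of the running machine, in bytes."""
    return os.sysconf("SC_PAGE_SIZE")
```

For example, pages_for(1024 * 1024 * 1024, 4096) gives 262144, so a 1024MB machine with 4kB pages has roughly a quarter of a million pages that can each move in or out independently.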

I'll recapitulate the way VM works, and in particular the ways Linux differs from your average computer science textbook. The basic idea is that we have a relatively small fast RAM, and a slower larger disk, and we want to get the best performance out of the combination. I will skip some details and special cases for simplicity.

All memory pages on Linux can be grouped into four classes. Firstly, there are kernel pages which are fixed in memory and never swapped. (Some other systems have pageable kernels, but at the moment the Linux developers consider it too risky.) Then there is program text: the contents of /bin/sh or /lib/ These are read-only in memory, and so are always exactly the same as the file on disk. There are file-backed pages, which might have changes in memory that haven't been written out yet. Finally there are memory pages that don't correspond to any file on disk: this includes the stack and heap variables of running tasks. When a program does a malloc(), it allocates memory of this type. Pages in this last category are called anonymous mappings, because they don't correspond to any file name.
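You can see three of these classes for a running process in /proc/<pid>/maps (kernel pages don't appear there). The sketch below sorts a line from that file into one of them. The function name and sample lines are hypothetical, and the rule is deliberately crude: it takes a mapping with no backing file as anonymous and uses the write permission to separate read-only text from other file-backed pages, ignoring subtleties such as copy-on-write private mappings.

```python
def classify_mapping(line):
    """Roughly classify one line of /proc/<pid>/maps."""
    # Fields: address range, permissions, offset, device, inode, optional path.
    fields = line.split(None, 5)
    perms = fields[1]
    inode = int(fields[4])
    path = fields[5].strip() if len(fields) > 5 else ""
    if inode == 0 or not path.startswith("/"):
        # No backing file: heap, stack, and other malloc-style memory.
        return "anonymous"
    if "w" in perms:
        return "file-backed, writable"
    return "file-backed, read-only"


# Hypothetical sample lines, for illustration only.
SAMPLE_MAPS = [
    "00400000-0040b000 r-xp 00000000 08:01 1234 /bin/sh",
    "7f0000000000-7f0000100000 rw-s 00000000 08:01 5678 /home/user/data.db",
    "01a00000-01a21000 rw-p 00000000 00:00 0 [heap]",
    "7ffd00000000-7ffd00021000 rw-p 00000000 00:00 0",
]
```

Running it over /proc/self/maps for a real process gives a quick picture of how much of its address space the kernel could drop or write back to a file, and how much could only ever go to swap.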

There is no separate disk cache in Linux, like there is on old-Unix or on DOS. Instead, we try to keep the most useful parts of the disk in memory as cached pages. Linux usually doesn't directly modify the disk: rather, changes are made to the files in memory and then they're flushed out.

You'll notice that the free-memory measure on Linux machines is normally pretty low, even when the machine has plenty of memory for the tasks it's running. This is normal and intentional: the kernel tries to keep memory filled up with cached pages so that if those files are accessed again it won't have to go to disk. The free pool indicates just a few spare pages that are ready for immediate allocation. One time when the free memory will be large is shortly after bootup when the kernel just hasn't read in very much of the disk yet. Another time you'll have a lot of free memory is shortly after a large program exited: it had a lot of data pages in RAM, but those pages were deleted and so there's no useful information in them anymore.

We talk of pages as being clean when the in-memory version is the same as the one on disk, and dirty when they've been changed since being read in. Data pages need to get written back to disk eventually, and the kernel generally does this in the background. You can force all dirty pages to be written out using the sync system call.

The kernel can discard a clean page whenever it needs the memory for something else, because it knows it can always get the data back from disk. However, dirty pages need to be saved to disk before they can be reused. We call this eviction. So flushing pages in the background has two purposes: it helps protect data from sudden power cuts, and more importantly it means there are plenty of clean pages that can be reused when a process needs memory. So efficient is this flushing that at the moment my machine has only four dirty pages out of 256,000 (by grep nr_dirty /proc/vmstat).
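The same check can be scripted. This sketch reads a named counter out of text in the format of /proc/vmstat, which is one "name value" pair per line. The helper name and the sample text are made up for illustration.

```python
def read_counter(text, name):
    """Return the integer value of a counter from /proc/vmstat-style text."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == name:
            return int(fields[1])
    return None


# Hypothetical sample, for illustration only.
SAMPLE_PROC_VMSTAT = """\
nr_free_pages 4352
nr_dirty 4
nr_writeback 0
"""
```

On a live machine you would call read_counter(open("/proc/vmstat").read(), "nr_dirty"). Calling os.sync() first asks the kernel to write out dirty pages, after which the count should normally drop towards zero, although other processes may be dirtying pages at the same time.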

As the kernel allocates memory, it firstly takes pages from the free pool. If that drops too low, it needs to free up more memory. Where does that come from? It needs to discard a clean page to make room. If there aren't any suitable clean pages then it needs to flush a dirty page to disk, then use it. This is very slow because the allocation can't continue until the disk write has finished, so the kernel tries very hard to avoid this by always having some clean pages around. (Remember the whole point of the VM algorithm is to avoid ever having to wait for a disk access to complete, by keeping pages in memory that are likely to be used again.)
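The order of preference described above can be written down as a toy model. This is only a sketch of the decision order, not how the kernel implements it: the class and its counters are invented, and it waits until the free pool is completely empty, whereas the real kernel starts reclaiming once the pool drops below a threshold.

```python
from dataclasses import dataclass


@dataclass
class PagePool:
    """Toy model of where the next page for an allocation comes from."""
    free: int   # pages ready for immediate use
    clean: int  # cached pages identical to their on-disk copy
    dirty: int  # cached pages that must be written out before reuse

    def allocate(self):
        if self.free > 0:
            self.free -= 1
            return "free"             # cheapest: already prepared
        if self.clean > 0:
            self.clean -= 1
            return "discarded clean"  # cheap: data can be re-read later
        if self.dirty > 0:
            self.dirty -= 1
            return "flushed dirty"    # slow: must wait for a disk write
        raise MemoryError("no pages left to reclaim")
```

With PagePool(free=1, clean=1, dirty=1), three successive calls to allocate() return "free", "discarded clean" and "flushed dirty" in that order. Background flushing exists to keep the clean count high so the third case is rarely reached.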

File-backed pages can be flushed by writing them back to their file on disk. But anonymous mappings by definition don't have any backing file. Where can they be flushed to? Swap space, of course. Swap partitions or files on Linux hold pages that aren't backed by a file.

If you don't have swap space, then anonymous mappings can't be flushed. They have to stay in memory until they're deleted. The kernel can only obtain clean memory and free memory by flushing out file-backed pages: programs, libraries, and data files. Not having swap space constrains and unbalances the kernel's page allocation. However unlikely it is that the data pages will be used again — even if they're never used again — they still need to stay in memory sucking up precious RAM. That means the kernel has to do more work to write out file-backed pages, and to read them back in after they're discarded. The kernel needs to throw out relatively valuable file-backed pages, because it has nowhere to write relatively worthless anonymous pages.

Not only this, but flushing pages to swap is actually a bit easier and quicker than flushing them to a file: the code is much simpler, and there are no directory trees to update. The swap file/partition is just an array of pages. This is another reason to give the kernel the option of flushing to swap as well as to the filesystem.

As I write this, my 1024MB machine has 184MB of swap used out of 1506MB, and only 17MB of memory free. On old-Unix this would indicate a perilous situation: with numbers like this it would be grinding. But Linux is perfectly happy with these numbers: the disk is idle and it responds well.

The 184MB constitutes tasks that are running in the background, like the getty on the text console, or the gdm login manager. Neither of them has done anything much since I logged in days ago. From a certain overoptimizing point of view I ought to get rid of those tasks — although for the login manager it might be hard. But even then, there's probably lots of memory used for features of programs I am running that don't get invoked very often.

With swap, that memory is written to disk and costs very little. Without swap, it would be cluttering up RAM, as if I was down to only 840MB. Everything else would need to page a bit harder, but it wouldn't be obvious why.

Disk is cheap, so allocate a gigabyte or two for swap.

On BSD people used to advise allocating as much swap as memory, or maybe two or three times as much. Although the VM design is completely different, it's still a good rule of thumb. If anything, disk has gotten relatively cheaper over time: a typical developer machine now has 1GB of memory, but 200GB of disk. Spending half a percent or one percent of your disk on swap can probably improve performance.

If you are short on disk, as I am on my laptop, then use a swap file instead of a swap partition so that you can shrink or grow it more easily. (I think there is still a limit of 2GB per swap target, but you can create as many as you like.) Swap files might be slightly slower, but it's much better than not having it at all. If you ever see it get close to full, add some more.
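One way to keep an eye on how full swap is getting is to read /proc/swaps, which lists each swap file or partition with its size and usage in kB. The sketch below totals them up. The function names and sample text are made up, and the 90% threshold is an arbitrary assumption rather than a recommendation from anywhere.

```python
def swap_usage(text):
    """Return (total_kb, used_kb) summed over /proc/swaps-style text."""
    total = used = 0
    for line in text.splitlines()[1:]:  # skip the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        # Columns: Filename, Type, Size, Used, Priority.
        total += int(fields[2])
        used += int(fields[3])
    return total, used


def nearly_full(total_kb, used_kb, threshold=0.9):
    """True if swap usage has reached the (arbitrary) threshold."""
    return total_kb > 0 and used_kb / total_kb >= threshold


# Hypothetical sample: 1506MB of swap with 184MB in use.
SAMPLE_SWAPS = """\
Filename                                Type            Size    Used    Priority
/dev/sda2                               partition       1542144 188416  -1
"""
```

With the sample above, swap_usage returns (1542144, 188416), which is about 12% used, so nearly_full reports False and there is no need to add more yet.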

Understanding the Linux Virtual Memory Manager has an enormous amount of detail on how this works in 2.4 (and I hope it doesn't contradict me!)

O'Reilly's System Performance Tuning approaches this from a sysadmin's point of view, but it mostly describes the way swap works under Solaris.


Greg K-H says he took this patch, so it should be in the next 2.4 and 2.6 kernels.

Linux Minolta DiMAGE 7i patch

Patch to make Minolta DiMAGE 7, 7i, 7Hi cameras work on Linux. This might help with the DiMAGE A1 as well, which is the successor to the 7. Let me know!

--- linux-2.4.22/drivers/usb/storage/unusual_devs.h.~1~	2003-09-08 21:23:50.000000000 +1000
+++ linux-2.4.22/drivers/usb/storage/unusual_devs.h	2003-11-12 13:26:49.000000000 +1100
@@ -388,6 +388,28 @@
+/* Following three Minolta cameras reported by Martin Pool
+ * .  Originally discovered by Kedar Petankar,
+ * Matthew Geier, Mikael Lofj"ard, Marcel de Boer.
+ */
+UNUSUAL_DEV( 0x0686, 0x4006, 0x0001, 0x0001,
+             "Minolta",
+             "DiMAGE 7",
+             US_SC_SCSI, US_PR_DEVICE, NULL,
+             0 ),
+UNUSUAL_DEV( 0x0686, 0x400b, 0x0001, 0x0001,
+             "Minolta",
+             "DiMAGE 7i",
+             US_SC_SCSI, US_PR_DEVICE, NULL,
+             0 ),
+UNUSUAL_DEV( 0x0686, 0x400f, 0x0001, 0x0001,
+             "Minolta",
+             "DiMAGE 7Hi",
+             US_SC_SCSI, US_PR_DEVICE, NULL,
+             0 ),
 UNUSUAL_DEV(  0x0693, 0x0002, 0x0100, 0x0100, 
 		"FlashGate SmartMedia",


SHFS: a Linux remote filesystem implemented by running shell commands over SSH, kind of like emacs Tramp mode.



As the Linux kernel matures, there is an ever increasing number of facilities and tools that need to relay large amounts of data from kernel space to user space. Up to this point, each of these has had its own mechanism for relaying data. To supersede the individual mechanisms, we introduce the "high-speed data relay filesystem" (relayfs). As such, things like LTT, printk, EVLog, etc. should all use relayfs to get their data to user-space. The use of relayfs would, for example, avoid lost printk's. It would also result in the standardization of the way in which large amounts of data are transferred from kernel space to user space.

The main idea behind the relayfs is that every data flow is put into a separate "channel" and each channel is a file. In practice, each channel is a separate memory buffer allocated from within kernel space upon channel instantiation. Software needing to relay data to user space would open a channel or a number of channels, depending on its needs, and would log data to that channel. All the buffering and locking mechanics are taken care of by the relayfs.

davem doesn't like it though...

So would you consider running printk on Netlink sockets? Do you think Netlink could accommodate something as intensive as tracing? etc.

Of course it can. Look, netlink is used on routers to transfer hundreds of thousands of routing table entries in one fell swoop between a user process and the kernel every time the next hop Cisco has a BGP routing flap.

If you must have "enterprise wide client server" performance, we can add mmap() support to netlink sockets just like AF_PACKET sockets support such a thing. But I _really_ doubt you need this and unlike netlink sockets relayfs has no queueing model, whereas not only does netlink have one it's been tested in real life.

You guys are really out of your mind if you don't just take the netlink printk thing I did months ago and just run with it. When someone first showed me this relayfs thing, I nearly passed out in disbelief that people are still even considering non-netlink solutions.

On the other hand Richard Moore says

In messaging terms relayfs is more about the collation of parts of a message rather than the sending of multiple messages to a session partner. There are three aspects in which relayfs radically differs from netlink:

1) it does not require a partnership -- a client and server, or session pair -- it is simply a buffering mechanism that allows data to be deposited. There is no expectation that the data will be consumed or that there is a listening partner. The reason for this design point comes from the origin of relayfs as a buffering mechanism that satisfies the needs of a low-level system trace. Data from a trace might never be consumed if the system, sub-system or component never fails.

2) data can be deposited from any context - interrupt time, task time, sysinit in particular.

3) the depositing of data with relayfs has to depend on a very simple interface and infrastructure in order to function under a severely damaged system. My impression is that netlink depends on a significant infrastructure.
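The first point, depositing data with no expectation of a reader, can be pictured with a toy buffer. This is purely an illustration in userspace Python, not the relayfs interface: the Channel class is invented, and discarding the oldest records when the buffer is full is an assumption made here for simplicity.

```python
from collections import deque


class Channel:
    """Toy fixed-size buffer that accepts records whether or not anyone reads them."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest record when full.
        self._buf = deque(maxlen=capacity)

    def deposit(self, record):
        """Store a record; never blocks and never needs a listener."""
        self._buf.append(record)

    def drain(self):
        """Return and clear whatever is currently buffered."""
        records = list(self._buf)
        self._buf.clear()
        return records
```

A tracer could keep calling deposit() indefinitely; if nothing ever fails, nobody calls drain() and the data is simply overwritten, which is the situation the quote describes.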

2.4.20 NFS bug

I think Gavrie Philipson and I found a bug in 2.4.20 NFS that can cause files to read back as being full of nuls. I'm just going to see if I can make a little test case.

How to debug kernel problems

Step-by-step instructions on how to debug kernel problems. Kind of patchy, and they assume a fair level of knowledge and confidence, but still a good start.


kfishd: an implementation of kfish in the kernel.

hp trivia

The "TOC" button on HP servers and workstations stands for "transfer of control", apparently in the sense of switching control of the machine from the regular OS to a debugger or monitor.

ia-64 Machine Check Architecture

MCA support in 64-bit Windows

MCA is "Machine Check Architecture", which is a way for the hardware to report problems to the operating system. These can be fatal problems such as a CPU internal error, or warnings such as a corrected memory error. (Also "Machine Check Abort", which means the fatal ones in particular.)

The OS can make them available for userspace analysis tools, which can do things like statistical analysis or interpretation of vendor-specific fields. The OS might also respond to errors itself, by panicking in severe cases or recovering in less serious ones. For example, the OS may be able to respond appropriately to failure of a particular memory line, or even failure of a processor in an SMP machine.

The implementation depends on the extensive on-board firmware including the SAL (System Abstraction Layer).

MCA information can survive reboots or power cycles so it can be recovered even if the machine suddenly aborts.

print backtrace from OS INIT handler

mca-recovery project at SourceForge, and draft patch.

Good document on Machine Check Abort (MCA) Handling

* Applications cannot continue with bad data: they are killed when they touch an error spot in the memory.

* Applications having error spots in their valid memory regions, should be let run until they touch an error spot. There is a reasonable chance that they can run to completion.

As most of the memory is used for the applications - several (tens of) Gbytes vs. 64 Mbytes of the kernel - almost all errors happen in user space. User mode errors are recoverable by killing the application affected. Kernel mode errors are fatal because the error detection and recovery paths in Linux are not elaborated...

Intel's Itanium Processor Family Error Handling Guide

PAL = Processor Abstraction Layer, SAL = System Abstraction Layer.

An MCA INIT can be generated from a hardware button on some HP machines, so that you can NMI a hung machine and hopefully get some debugging information.

Porting Drivers to HP ZX1.

The zx6000 has a TOC button too, but RH AS 2.1 seems to just hang rather than going into a debug mode when it's pressed... :-/



Console on IA-64

Although many IA-64 machines have some kind of front-end processor, it looks like that is not used for the console under Linux. Instead, Linux talks directly to the serial port or VGA card. The front-end processor can also talk to the serial port, but this is apparently done at an electronic level; it doesn't mediate access.

ia64 toolchain

How to build an ia64 cross toolchain

The Wonderful World of 2.6

Joe Pranevich has a good document about new features in the Linux 2.6 kernel.
