Martin Pool's blog

Stability is a chimera

LWN has yet another thread on kernel stability. Some people would like to be able to install new drivers on old kernels. This is the mirror image of something also often requested: being able to use old (typically binary) drivers on new kernels.

Satisfying both at once requires not just backward but also forward compatibility: the interfaces between interchangeable parts of the kernel must not change *at all* within a stability window. This is very constraining to development, and since people would like to use years-old binaries the window is very large. So in fact this is possible only if we define the stability window to be exactly one kernel release: anything that works against 2.4.20 will continue to work against 2.4.20 forever.

More generally: everyone would like a stable series that lets them get the bug fixes they care about, but without any unnecessary changes. The problem is that different people have different requirements for what is allowed into the stable kernel series. Some people really need support for some new hardware, but to everyone else that's unnecessary churn. Some people want nothing but security fixes; some people want security and bug fixes; some people want new drivers but no core kernel changes; some people need scheduler changes to run their workload effectively.

It is fundamentally impossible to reconcile all of these requirements; therefore the optimal approach is to develop the mainstream in the most efficient way possible, and let people fork stable branches as they want. If you want to build a branch which has only bug and security fixes and nothing more then you can.

RHEL filesystems

Trap for young players: RHEL supports only the ext2 and ext3 filesystems (not counting FAT etc.). If you happen to have a big-ass disk that's already set up in XFS this can be quite a pain.

(I wonder why? Lack of assurance in the others? Simplicity?)

The kernel-unsupported package contains reiserfs, jfs and some others but not XFS.

Fixed again in 2.6.9

A different Minolta bug apparently caused breakage this time. There is a workaround by Matthew Dharm for 2.6.9-rc4. The necessary patch hunk is

diff -Nru a/drivers/usb/storage/transport.c b/drivers/usb/storage/transport.c
--- a/drivers/usb/storage/transport.c   2004-10-10 19:59:55 -07:00
+++ b/drivers/usb/storage/transport.c   2004-10-10 19:59:55 -07:00
@@ -908,6 +911,7 @@
   int result;
 
        /* issue the command */
+       us->iobuf[0] = 0;
        result = usb_stor_control_msg(us, us->recv_ctrl_pipe,
                                           US_BULK_GET_MAX_LUN, 
                                           USB_DIR_IN | USB_TYPE_CLASS | 
@@ -918,7 +922,7 @@
            result, us->iobuf[0]);
 
        /* if we have a successful request, return the result */
-       if (result == 1)
+       if (result >= 0)
           return us->iobuf[0];
 
        /* 

This should apply against anything reasonably recent from 2.6.

I can't really recommend attaching this camera over USB though, because it runs down the NiMH batteries extremely quickly. It's better, if at all possible, to read directly from the card through a PCMCIA carrier or a printer or a USB attachment. It may be quicker too, if you have a better interface than the USB1 on the DiMAGE.

Greg K-H replies to Sun

A rebuttal of a single misinformed Sun developer.

Eric Schrock from Sun says

The main reason we can't just jump into Linux is because Linux doesn't align with our engineering principles, and no amount of patches will ever change that.

(That sounds more like a problem for Sun than for Linux.)

There is a big LWN thread about it.

My guess would be that OpenSolaris will be a kind of walled garden: licenced so that code cannot be merged into other projects, and in any case differences in architecture will make that hard. Releasing it will be useful to people working on Solaris, as read-only access always is: the final documentation, useful in tracking down bugs, and so on. So it'll be mildly positive for existing users and get Sun some good press, but it won't change the overall curve.

I can't imagine choosing a Sun machine just because I can get (some?) operating system source. What percentage of Macintosh buyers care about OpenDarwin? I like Macs, and I like open source, but I don't think OpenDarwin would be a compelling reason to buy one.

broken again

DiMAGE7i usb-storage support seems to be broken again as of 2.6.8.1. I need to see if the patch is still in there, or needs to be updated.

The poor man's profiler

Suppose you have a program that's using a lot of CPU. What's it doing in there? What is so slow, dammit?

gdb /proc/8337/exe 8337

Observe which function it was in when you interrupted.

gdb> continue

Let it keep running. Hit C-c to interrupt it later; see where it is; rinse, repeat.

This is only useful if the program has function names in the executable. If you built it yourself, that's probably true. If you got it from a distribution it may not. (I think that's lame.)

Most programs spend most of their time in one routine. Attaching from gdb at random times will tell you which routine this is.

gdb is no substitute for a proper profiler or for kcachegrind. It won't tell you about system-call hotspots. But if your program has a single userspace hot spot it will tell you where that is, and that can be useful information. This requires no special preparation, can be done on almost any machine, and can be applied to a program while it's running with little disruption.
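The manual attach-and-look loop above can be automated. Here is a minimal sketch, assuming gdb is installed and that its backtrace lines look like the two formats shown in the comments; the helper names (gdb_command, top_function, tally, sample) are made up for illustration and not part of gdb.

```python
import re
import subprocess
import time
from collections import Counter

# Matches the innermost frame of a gdb backtrace, for example:
#   "#0  0x00007f12 in memcpy () from /lib/libc.so.6"
#   "#0  parse_line (buf=0x1) at parse.c:42"
FRAME0 = re.compile(r"^#0\s+(?:0x[0-9a-fA-F]+ in )?(\S+) \(", re.MULTILINE)


def gdb_command(pid):
    """Build a gdb invocation that attaches, prints one backtrace and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "bt"]


def top_function(backtrace):
    """Return the function named in frame #0, or None if there is none."""
    match = FRAME0.search(backtrace)
    return match.group(1) if match else None


def tally(backtraces):
    """Count how often each function was the innermost frame."""
    names = (top_function(text) for text in backtraces)
    return Counter(name for name in names if name)


def sample(pid, count=20, interval=0.5):
    """Attach to pid `count` times, `interval` seconds apart, and tally."""
    backtraces = []
    for _ in range(count):
        result = subprocess.run(gdb_command(pid), capture_output=True, text=True)
        backtraces.append(result.stdout)
        time.sleep(interval)
    return tally(backtraces)
```

Something like sample(8337).most_common(3) should then name the hot routine, subject to the same caveat as before: the executable needs function names in it.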

Async disk access

Courtesy of slamb, pphaneuf.

richdawe:

No POSIX system supports non-blocking disk I/O through the O_NONBLOCK interface. Some support it through a separate asynchronous I/O interface but due to its complexity and non-portability, few programs actually use it. Also, it doesn't support an async open(2).

djb argues, like you do, that they should support this through a more normal mechanism.

Why is it so hard to just do O_NONBLOCK on disk files, as you can do on pipes or sockets? Because Linux disk IO is fundamentally different: it is tightly integrated with the virtual memory system.
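The difference is easy to see from userspace. A small sketch, assuming a Linux-like system: a non-blocking read on an empty pipe fails immediately, while the same flag on a regular file is accepted but has no visible effect, so the read simply returns the data (and would have waited for the disk if it needed to). The function names are mine.

```python
import os
import tempfile


def nonblocking_pipe_read():
    """Read from an empty pipe in non-blocking mode and report what happens."""
    read_end, write_end = os.pipe()
    try:
        os.set_blocking(read_end, False)
        try:
            os.read(read_end, 1)
        except BlockingIOError:
            # EAGAIN: the kernel refuses to wait for data on our behalf.
            return "would block"
        return "returned"
    finally:
        os.close(read_end)
        os.close(write_end)


def nonblocking_file_read():
    """Open a regular file with O_NONBLOCK and read it anyway."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello")
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY | os.O_NONBLOCK)
        try:
            # The flag is accepted, but the read never reports "would block".
            return os.read(fd, 5)
        finally:
            os.close(fd)
```

This only shows that the flag is ignored for regular files; it does not measure how long an uncached read would actually stall.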

Related stuff from Christopher Baus.

hp msa1000 on Linux

I've been setting up an hp msa1000 Fibre Channel disk array with one of our Linux machines.

Here are some entirely unofficial notes. If you're actually setting such a unit up, please don't rely on these at all but rather go and read the careful and detailed hp instruction manuals. This is just an overview. (My first hint is that the manuals are mostly not in printed form, but rather on a CD in the little box.)

The msa1000 is basically a rack-mount box with controllers, ginormous power supply/blower units, and about 14 SCSI disks. It can be configured to have no single point of failure with redundant power supplies and lines, redundant disks, mirrored controllers, and mirrored data paths. Or you can simply attach it directly to a single machine, as I am doing for testing.

There are two SCSI ports on the back which you can use to attach additional disk enclosures. You cannot use these to attach it to a computer. With a firmware upgrade you can also attach SATA enclosures which give you slower, possibly less reliable but much cheaper storage. This sounds like a good deal to me for some applications: RAID is meant to be a redundant array of inexpensive disks, and the msa1000 lets you do the redundancy without too much trouble.

The msa1000 attaches to a computer through the Fibre Channel port on the back of the controller. You can either connect this straight through to a FC Host Bus Adapter PCI card on the computer, or you can go through an FC switch, allowing several machines to share the storage.

From the point of view of the host, the msa is basically a big block device, like a SCSI disk. The computer can't directly see the 14-odd disks inside the array though: it sees a RAID abstraction of those disks presented by the controller. This logical disk is called a unit.

Before you can do anything useful from the host, you need to configure the msa and make a storage unit. There are two ways to do this: either by running some hp software (Insight Manager or Array Configuration Utility) on the host, or by connecting to the serial console on the front. (The serial console needs a special RJ11 cable which comes in the box.) I used the serial console.

The main operation you need to do on the serial console is an ADD UNIT command, which allocates some of the disks for storage. I made a single unit out of all of the disks, with one RAID-5 redundant disk and one hot spare. This is exposed to the Linux host as a SCSI LUN.

You may be able to ask Linux to rescan and hot-add that device. That didn't seem to work with the old kernel on my machine so I just rebooted, and it discovered the disk as sda. It looks like there is no partition table for these units.

At this point I suppose you have a choice of doing Linux LVM or simply creating a filesystem directly on the device. It may seem a bit redundant to run Linux LVM on top of hardware RAID, and perhaps it is.

On the other hand LVM is more flexible than the RAID system done in hardware: for example LVM can reduce the size of a logical volume, but the msa firmware cannot.

I'm going to try XFS on this disk for testing; apparently it performs well on big arrays. I may have some results later.

This kind of array can also be simultaneously accessed by multiple machines running something like GFS, Lustre or ClusterFS but I'm not trying that now.

p.s. sneakums asks:

I'm not familiar with these units, but can one not just fdisk it like any other disk? Even if I were only going to use a single partition, I'd be inclined to partition anyway for consistency's sake.

I kind of agree about consistency, but on the other hand a partition table is one more thing you can get wrong if you try to expand the array later.

In a brief look I could not see any advice from hp on whether you should make a partition table on the logical unit or not.

I forget if it matters to LVM itself, but it's typical for LVM PVs to be partitions of type 0x8e. If you were to use MD to stripe across a bunch of these units, you would have to partition for array autostart to work, since MD will only consider partitions of type 0xfd.

Right, but since you cannot (?) boot off these devices I don't think autostart matters too much, and without autostart the partition types don't seem to matter.

Trying to create a GPT table (common on ia64) fails with an IO error, but creating a DOS partition table works. Micah Parrish tells me

This is due to a longstanding 2.4 kernel bug where it is impossible to read or write to the last block on devices with odd numbers of blocks, such as an MSA1000 logical unit. There is a kernel patch floating around which adds a couple of ioctls allowing you to access these blocks directly, but it isn't in the kernel.org sources. One rather old version of it is at this url.

This should be fixed more elegantly in 2.6 kernels. Parted has some special code to use the ioctls on 2.4 kernels. I believe that Red Hat and SuSE include the patch in their 2.4 kernels, but debian may not.

devlabel

devlabel [manual] looks really elegantly simple: get hotplug events in userspace, find a unique identifier for the device and use it to make a symlink /dev/work -> /dev/hdb2. They can use various methods for finding the true name of the device including looking for filesystem UUIDs or labels. (I have only read about it, not used it, so I might be entirely wrong.)

If this works, it might solve the icky problem I had yesterday: adding a new SCSI array to a system made all the other device names change, so they needed to be manually reassigned.
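To make the idea concrete, here is a rough sketch of the symlink step only. It is not devlabel's code: the function names, the label table and the identifier strings are all invented, and a real tool would obtain the identifier-to-device mapping from hotplug events and the devices themselves.

```python
import os
import tempfile


def update_label_link(link_path, device_path):
    """Point link_path at device_path, replacing any earlier link atomically."""
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(device_path, tmp_path)
    os.replace(tmp_path, link_path)  # rename over the old symlink


def relabel(labels, devices_by_id, link_dir):
    """labels: {link name: unique id}; devices_by_id: {unique id: device node}."""
    made = {}
    for name, unique_id in labels.items():
        device = devices_by_id.get(unique_id)
        if device is None:
            continue  # device absent right now; leave any old link alone
        update_label_link(os.path.join(link_dir, name), device)
        made[name] = device
    return made


def demo():
    """Show the link following a device whose kernel name changes."""
    with tempfile.TemporaryDirectory() as link_dir:
        labels = {"work": "uuid-1"}  # hypothetical identifier
        relabel(labels, {"uuid-1": "/dev/hdb2"}, link_dir)
        before = os.readlink(os.path.join(link_dir, "work"))
        # After adding hardware, the same filesystem shows up under a new name.
        relabel(labels, {"uuid-1": "/dev/sdc2"}, link_dir)
        after = os.readlink(os.path.join(link_dir, "work"))
        return before, after
```

The point is that everything else refers to the stable link name, so a renumbering of the underlying devices only changes where the link points.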

A related problem is that the order of probing for some SCSI drivers seems to have changed between 2.4 and 2.6. So on 2.4 the drives attached to say Symbios get detected as sda and sdb, but on 2.6 the QLogic gets detected first.

Let's hope devlabel gets into Debian soon.

Like everything else, Linux sucks but it can be fixed: truly a creature of growth and capable of sweetness, to ooze juicily at the last round the bearded lips of God [Anthony Burgess].

I suppose the one thing this doesn't fix is finding the right root partition. Maybe it might work with a cunningly crafted initrd.

Something similar has been done before in things like vold but maybe now it can be broadly adopted. Unlike vold, devlabel doesn't seem to use a polling setup.

Why doesn't free memory go down

I wrote a little while ago about interpreting swap usage on Linux. A related question is why Linux always seems to have so little free memory. Does this indicate some kind of problem in Linux or the application? No.

Someone at work asked (paraphrasing):

I have a process that uses a lot of memory while it's running, so the free memory (shown by free or top) goes right down to 60MB out of 8100MB. But when the process exits, the free memory doesn't go back up. Why isn't memory released when the process exits?

The short answer is that you should never worry about the amount of free memory on Linux. The kernel attempts to keep this slightly above zero by keeping the cache as large as possible. This is a feature not a bug.
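If you want a number that is less alarming than "free", add the cache back in. A minimal sketch that parses the MemTotal, MemFree, Buffers and Cached fields of /proc/meminfo; the sample text is made up, loosely modelled on the 8100MB machine in the question, and the function names are mine.

```python
# Made-up sample in the /proc/meminfo format; the values are illustrative only.
SAMPLE_MEMINFO = """\
MemTotal:      8294400 kB
MemFree:         61440 kB
Buffers:        204800 kB
Cached:        7168000 kB
"""


def parse_meminfo(text):
    """Turn 'Name:   12345 kB' lines into a {name: kilobytes} dictionary."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def summarize(info):
    """Report total, free and cache memory in whole megabytes."""
    cache_kb = info.get("Buffers", 0) + info.get("Cached", 0)
    return {
        "total_mb": info["MemTotal"] // 1024,
        "free_mb": info["MemFree"] // 1024,
        "cache_mb": cache_kb // 1024,
    }


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file on a Linux machine."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

On the sample, only 60MB is free but 7200MB is cache that can be handed back when something needs it.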

If you are concerned about VM performance then the most useful thing to watch is the page in/out rate, shown by the "bi" and "bo" columns in vmstat. Another useful measure (2.6 only) is the "wa" column, showing the amount of CPU time spent waiting for IO. "wa" is probably the one you have to worry about most, because it shows CPU cycles that are essentially wasted because VM is too slow.

As you said, linux is keeping the free memory into buffer cache, but when there is no process running how come the buffer cache is having 4GB and how it is released 3GB to free memory.

Disk cache is maintained globally, not per-process. Files can remain in cache even after the process that was using them exited, because they might be used by another process. Freeing the cache would mean discarding cached data. There's no reason to do that until the data is obsolete (e.g. files are deleted) or the memory is needed for some other purpose.

After a while the free memory goes back up again.

Pages only become free when they're evicted to build up the free pool (see below), or when nothing useful can be stored in them. If there are gigabytes of free memory then the main cause is that the kernel doesn't have anything to cache in them.

This can happen when, for example, a file that was cached was deleted, or a filesystem is unmounted: there's no point keeping those pages cached because they can't be accessed. (Note that the kernel can still cache a file which is just unlinked, but still in use by applications.)

A similar case is that an application has allocated a lot of anonymous memory and then either exited or freed the memory. That data is discarded, so the pages are free.

Note that flushing the data to disk makes the pages clean, but not free. They can still be kept in memory in case they're read in the future. (Clean means the in-memory page is the same as the on-disk page.)

The guy in the second row asks:

So if Linux tries to keep the cache as large as possible, why is there 60MB free rather than zero? Wouldn't it be better to cache an additional 60MB?

Linux keeps a little bit of memory free so that it is ready as soon as it needs to allocate more memory. If the extra 60MB was used for cache too then when a new allocation was required, the kernel would have to go through the cache and work out what to evict. Possibly it would need to wait for a page to be written out. This would make allocation slower and more complex. So there is a tradeoff where the page cache is made slightly smaller so that allocation can be faster and simpler. The kernel keeps just a few free pages prepared in advance.

(If you have any questions mail me and I'll try to answer them here.)

Which Linux distribution is the most powerful?

Mr Bad asks:

So, after seeing the umpteenth Debian package description mentioning what a powerful throbbing ur-package is barely contained within the bulging envelope of this particular .deb, I started wondering: how much of the software in Debian is actually POWERFUL? Like, so notably powerful that that's how you'd describe the software; it impresses its powerful powerness on the maintainer that much that they can't help mentioning its power.

A quick look at sid:

evan unicorn:~% apt-cache search powerful | wc -l
586

Almost 600 powerful packages is a lot of power, and I'd be afraid to install even one-tenth of those packages to my computer. Sixty mighty and powerful packages jockeying for full domination of my little ThinkPad? No way. I can't handle that kind of power. [more]

Is swap space obsolete?

There was a thread on the CLUG list recently about whether it was still useful to have swap space, now that it's quite affordable to have a gigabyte or more of memory on a desktop machine. I think it is.

Some people have the idea that touching swap space at all is a sign that the machine is very overloaded, and so you ought to avoid it at all costs, by adding enough memory that the machine never needs to swap. This may have been true on Unix ten years ago, and may still be true on some systems for all I know but it's not true for Linux.

The meaning of the term swap has changed over time. It used to mean that entire tasks were moved out to disk, and they'd stay there until it was necessary to run them again. You can do this on machines without paged MMUs, and perhaps it was simpler to implement. However, these days almost all machines have MMUs, and so we use paging instead, where particular chunks of the program (typically 4kB) can move in or out independently. This gets more use out of the memory you have, because many programs run quite happily with only part of their virtual memory in RAM. Linux doesn't implement old-style whole-program swapping at all, and there does not seem to be any reason to add it.

I'll recapitulate the way VM works, and in particular the ways it is different on Linux from in your average computer science textbook. The basic idea is that we have a relatively small fast RAM, and a slower larger disk, and we want to get the best performance out of the combination. I will skip some details and special cases for simplicity.

All memory pages on Linux can be grouped into four classes. Firstly, there are kernel pages which are fixed in memory and never swapped. (Some other systems have pageable kernels, but at the moment the Linux developers consider it too risky.) Then there is program text: the contents of /bin/sh or /lib/libdl-2.3.2.so. These are read-only in memory, and so are always exactly the same as the file on disk. There are file-backed pages, which might have changes in memory that haven't been written out yet. Finally there are memory pages that don't correspond to any file on disk: this includes the stack and heap variables of running tasks. When a program does a malloc(), it allocates memory of this type. Pages in this last category are called anonymous mappings, because they don't correspond to any file name.

There is no separate disk cache in Linux, like there is on old-Unix or on DOS. Instead, we try to keep the most useful parts of the disk in memory as cached pages. Linux usually doesn't directly modify the disk: rather, changes are made to the files in memory and then they're flushed out.

You'll notice that the free-memory measure on Linux machines is normally pretty low, even when the machine has plenty of memory for the tasks it's running. This is normal and intentional: the kernel tries to keep memory filled up with cached pages so that if those files are accessed again it won't have to go to disk. The free pool indicates just a few spare pages that are ready for immediate allocation. One time when the free memory will be large is shortly after bootup when the kernel just hasn't read in very much of the disk yet. Another time you'll have a lot of free memory is shortly after a large program exited: it had a lot of data pages in RAM, but those pages were deleted and so there's no useful information in them anymore.

We talk of pages as being clean when the in-memory version is the same as the one on disk, and dirty when they've been changed since being read in. Data pages need to get written back to disk eventually, and the kernel generally does this in the background. You can force all dirty pages to be written out using the sync system call.

The kernel can discard a clean page whenever it needs the memory for something else, because it knows it can always get the data back from disk. However, dirty pages need to be saved to disk before they can be reused. We call this eviction. So flushing pages in the background has two purposes: it helps protect data from sudden power cuts, and more importantly it means there are plenty of clean pages that can be reused when a process needs memory. So efficient is this flushing that at the moment my machine has only four dirty pages out of 256,000 (by grep nr_dirty /proc/vmstat).
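The grep above is easy to turn into a ratio. A small sketch, assuming the usual "name value" line format of /proc/vmstat; the sample text and the function names are made up, and the total page count has to be supplied by the caller.

```python
# Made-up sample in the /proc/vmstat format; the values are illustrative only.
SAMPLE_VMSTAT = "nr_dirty 4\nnr_writeback 0\n"


def parse_vmstat(text):
    """Turn 'name value' lines into a {name: integer} dictionary."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def dirty_share(counters, total_pages):
    """Fraction of all pages that are currently dirty."""
    return counters.get("nr_dirty", 0) / total_pages


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live file on a Linux machine."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

With four dirty pages out of 256,000, dirty_share comes out well under a hundredth of a percent.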

As the kernel allocates memory, it firstly takes pages from the free pool. If that drops too low, it needs to free up more memory. Where does that come from? It needs to discard a clean page to make room. If there aren't any suitable clean pages then it needs to flush a dirty page to disk, then use it. This is very slow because the allocation can't continue until the disk write has finished, so the kernel tries very hard to avoid this by always having some clean pages around. (Remember the whole point of the VM algorithm is to avoid ever having to wait for a disk access to complete, by keeping pages in memory that are likely to be used again.)
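The order of preference described above can be written down as a toy model. This is only a sketch of the decision order, not how the kernel is implemented: the three page states and the function names are simplifications of mine, and the real kernel also weighs how recently each page was used.

```python
from collections import Counter


def allocate_page(pool):
    """Take one page from the pool, trying the cheapest source first.

    pool maps a page state ("free", "clean", "dirty") to a page count.
    """
    if pool["free"] > 0:
        pool["free"] -= 1
        return "took a free page"
    if pool["clean"] > 0:
        pool["clean"] -= 1
        return "discarded a clean page"
    if pool["dirty"] > 0:
        # The slow path: the allocation waits for the write to finish.
        pool["dirty"] -= 1
        return "flushed a dirty page"
    raise MemoryError("nothing left to reclaim")


def allocate_many(pool, count):
    """Allocate `count` pages and tally which path each one took."""
    return Counter(allocate_page(pool) for _ in range(count))
```

For example, allocate_many({"free": 2, "clean": 3, "dirty": 1}, 6) uses up the free pages first, then the clean ones, and only touches the dirty page for the last allocation.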

File-backed pages can be flushed by writing them back to their file on disk. But anonymous mappings by definition don't have any backing file. Where can they be flushed to? Swap space, of course. Swap partitions or files on Linux hold pages that aren't backed by a file.

If you don't have swap space, then anonymous mappings can't be flushed. They have to stay in memory until they're deleted. The kernel can only obtain clean memory and free memory by flushing out file-backed pages: programs, libraries, and data files. Not having swap space constrains and unbalances the kernel's page allocation. However unlikely it is that the data pages will be used again — even if they're never used again — they still need to stay in memory sucking up precious RAM. That means the kernel has to do more work to write out file-backed pages, and to read them back in after they're discarded. The kernel needs to throw out relatively valuable file-backed pages, because it has nowhere to write relatively worthless anonymous pages.

Not only this, but flushing pages to swap is actually a bit easier and quicker than flushing them to files in a filesystem: the code is much simpler, and there are no directory trees to update. The swap file/partition is just an array of pages. This is another reason to give the kernel the option of flushing to swap as well as to the filesystem.

As I write this, my 1024MB machine has 184MB of swap used out of 1506MB, and only 17MB of memory free. On old-Unix this would indicate a perilous situation: with numbers like this it would be grinding. But Linux is perfectly happy with these numbers: the disk is idle and it responds well.

The 184MB constitutes tasks that are running in the background, like the getty on the text console, or the gdm login manager. Neither of them has done anything much since I logged in days ago. From a certain overoptimizing point of view I ought to get rid of those tasks — although for the login manager it might be hard. But even then, there's probably lots of memory used for features of programs I am running that don't get invoked very often.

With swap, that memory is written to disk and costs very little. Without swap, it would be cluttering up RAM, as if I was down to only 840MB. Everything else would need to page a bit harder, but it wouldn't be obvious why.

Disk is cheap, so allocate a gigabyte or two for swap.

On BSD people used to advise allocating as much swap as memory, or maybe two or three times as much. Although the VM design is completely different, it's still a good rule of thumb. If anything, disk has gotten relatively cheaper over time: a typical developer machine now has 1GB of memory, but 200GB of disk. Spending one half or one percent of your disk on swap can probably improve performance.

If you are short on disk, as I am on my laptop, then use a swap file instead of a swap partition so that you can shrink or grow it more easily. (I think there is still a limit of 2GB per swap target, but you can create as many as you like.) Swap files might be slightly slower, but it's much better than not having it at all. If you ever see it get close to full, add some more.

Understanding the Linux Virtual Memory Manager has an enormous amount of detail on how this works in 2.4 (and I hope it doesn't contradict me!)

O'Reilly's System Performance Tuning approaches this from a sysadmin's point of view, but it mostly describes the way swap works under Solaris.

The Economist: Unix's Founding Fathers

The Economist has a nice bit on Dennis Ritchie and the history of Unix.

The later history of Unix is convoluted, and indeed has again become mired in court battles. Following its origins at Bell Labs, a competing version sprang up at the University of California, Berkeley, which first released its version of Unix in 1977, under the leadership of a graduate student named Bill Joy, who later went on to found Sun Microsystems. Ideological battles raged between adherents of the two versions of Unix through much of the 1980s.

To an extent, this rivalry was stripped of relevance by an unexpected entrant. In 1991, an obscure university student in Finland, Linus Torvalds, announced a project to write a new, open-source clone of Unix from scratch — what has come to be known as Linux. That someone would seek to do this is a testament to the high regard in which programmers hold the achievement of the Bell Labs group. Dr Ritchie, in return, expresses a high regard for Linux, attributing its success to the fact that it was a unified effort, at a time when other competing versions of Unix were mired in legal battles.

Linux is also the true heir of the Unix tradition in the sense that its development process is collaborative. Dr Pike says that the thing he misses most from the 1970s at Bell Labs was the terminal room. Because computers were rare at the time, people did not have them on their desks, but rather went to the room, one side of which was covered with whiteboards, and sat down at a random computer to work. The technical hub of the system became the social hub.

It is that interplay between the technical and the social that gives both C and Unix their legendary status. Programmers love them because they are powerful, and they are powerful because programmers love them. David Gelernter, a computer scientist at Yale, perhaps put it best when he said, "Beauty is more important in computing than anywhere else in technology because software is so complicated. Beauty is the ultimate defence against complexity." Dr Ritchie's creations are indeed beautiful examples of that most modern of art forms.

Ken Brown and Darl McBride should read it and save themselves further ridicule.

The c10k document

Every Linux network programmer ought to read Dan Kegel's great page the C10k problem:

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

And computers are big, too. You can buy a 1000MHz machine with 2 gigabytes of RAM and an 1000Mbit/sec Ethernet card for $1200 or so. Let's see - at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of twenty thousand clients. (That works out to $0.08 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck. [....]
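The per-client figures in the quote are just the machine's resources divided by the number of clients. A quick sketch of that arithmetic, using decimal units throughout (which is my assumption, as are the function and key names):

```python
def per_client(clients=20000, cpu_hz=1000e6, ram_bytes=2e9,
               net_bits_per_sec=1000e6, price_dollars=1200):
    """Divide one machine's resources evenly between its clients."""
    return {
        "cpu_khz": cpu_hz / clients / 1e3,
        "ram_kbytes": ram_bytes / clients / 1e3,
        "net_kbits_per_sec": net_bits_per_sec / clients / 1e3,
        "dollars": price_dollars / clients,
    }
```

This reproduces the 50KHz, 100Kbytes and 50Kbits/sec figures. The price comes out at about $0.06 per client for a $1200 machine, so the quoted $0.08 presumably assumes a somewhat higher price; either way it is a long way from $100 per client.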

(fwd) Awesome story

Scott writes:

I just thought I would share a little experience of mine that you might like.

Last night I was flying back from Boise on Alaska Air. I had my laptop (Compaq Evo N800w) up and running RH9 and had just started to play Tux Racer for the first time ever after doing a little filesystem housekeeping when one of the flight attendants, a woman who had to be at least 55 if not 60 years old, stopped, commented that the game looked cute, and asked me where I got it. I told her I was running linux and it came with the OS. She responded that she had tried to put linux on her laptop but had problems getting the display working right. Needless to say, my jaw dropped. She goes on to tell me how she's running Win98 but using Opera instead of IE and she is just getting tired of MS telling her what to run and she's sick of their security holes that you can drive a truck through. After chatting a bit longer I asked her when she had tried to install linux on her laptop and she told me it was probably four or five years ago! She went back to Win98 because she didn't feel she was up to building a new kernel which was what she thought she might have to do. Keep in mind this flight attendant looked like she had to be at least 55...

Hmmm. Maybe my Mom is ready for a linux desktop after all. ;)

merged

Greg K-H says he took this patch, so it should be in the next 2.4 and 2.6 kernels.

Linux Minolta DiMAGE 7i patch

Patch to make Minolta DiMAGE 7, 7i, 7Hi cameras work on Linux. This might help with the DiMAGE A1 as well, which is the successor to the 7. Let me know!

--- linux-2.4.22/drivers/usb/storage/unusual_devs.h.~1~	2003-09-08 21:23:50.000000000 +1000
+++ linux-2.4.22/drivers/usb/storage/unusual_devs.h	2003-11-12 13:26:49.000000000 +1100
@@ -388,6 +388,28 @@
 		US_FL_SINGLE_LUN ),
 #endif
 
+/* Following three Minolta cameras reported by Martin Pool
+ * .  Originally discovered by Kedar Petankar,
+ * Matthew Geier, Mikael Lofj"ard, Marcel de Boer.
+ */
+UNUSUAL_DEV( 0x0686, 0x4006, 0x0001, 0x0001,
+             "Minolta",
+             "DiMAGE 7",
+             US_SC_SCSI, US_PR_DEVICE, NULL,
+             0 ),
+
+UNUSUAL_DEV( 0x0686, 0x400b, 0x0001, 0x0001,
+             "Minolta",
+             "DiMAGE 7i",
+             US_SC_SCSI, US_PR_DEVICE, NULL,
+             0 ),
+
+UNUSUAL_DEV( 0x0686, 0x400f, 0x0001, 0x0001,
+             "Minolta",
+             "DiMAGE 7Hi",
+             US_SC_SCSI, US_PR_DEVICE, NULL,
+             0 ),
+
 UNUSUAL_DEV(  0x0693, 0x0002, 0x0100, 0x0100, 
 		"Hagiwara",
 		"FlashGate SmartMedia",

SHFS

SHFS: a Linux remote filesystem implemented by running shell commands over SSH, kind of like emacs Tramp mode.

relayfs

relayfs:

As the Linux kernel matures, there is an ever increasing number of facilities and tools that need to relay large amounts of data from kernel space to user space. Up to this point, each of these has had its own mechanism for relaying data. To supersede the individual mechanisms, we introduce the "high-speed data relay filesystem" (relayfs). As such, things like LTT, printk, EVLog, etc. should all use relayfs to get their data to user-space. The use of relayfs would, for example, avoid lost printk's. It would also result in the standardization of the way in which large amounts of data are transferred from kernel space to user space.

The main idea behind the relayfs is that every data flow is put into a separate "channel" and each channel is a file. In practice, each channel is a separate memory buffer allocated from within kernel space upon channel instantiation. Software needing to relay data to user space would open a channel or a number of channels, depending on its needs, and would log data to that channel. All the buffering and locking mechanics are taken care of by the relayfs.
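As a mental model only, a channel can be pictured as a fixed-size buffer that accepts records whether or not anyone is reading. The sketch below is a userspace toy, not the relayfs interface: the class and method names are invented, and dropping the oldest record when the buffer is full is my assumption about one possible policy.

```python
from collections import deque


class Channel:
    """A toy stand-in for one relay channel: a bounded in-memory buffer."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def write(self, record):
        """Deposit a record; never waits for, or requires, a reader."""
        self.buffer.append(record)

    def read_all(self):
        """Hand over everything currently buffered and empty the channel."""
        records = list(self.buffer)
        self.buffer.clear()
        return records


def demo():
    """Write more records than fit, then read back what survived."""
    trace = Channel(capacity=3)
    for number in range(5):
        trace.write("event %d" % number)
    return trace.read_all()
```

Here demo() returns only the last three events, which matches the tracing use case where data may never be consumed at all.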

davem doesn't like it though...

So would you consider running printk on Netlink sockets? Do you think Netlink could accommodate something as intensive as tracing? etc.

Of course it can. Look, netlink is used on routers to transfer hundreds of thousands of routing table entries in one fell swoop between a user process and the kernel every time the next hop Cisco has a BGP routing flap.

If you must have "enterprise wide client server" performance, we can add mmap() support to netlink sockets just like AF_PACKET sockets support such a thing. But I _really_ doubt you need this and unlike netlink sockets relayfs has no queueing model, whereas not only does netlink have one it's been tested in real life.

You guys are really out of your mind if you don't just take the netlink printk thing I did months ago and just run with it. When someone first showed me this relayfs thing, I nearly passed out in disbelief that people are still even considering non-netlink solutions.

On the other hand Richard Moore says

In messaging terms relayfs is more about the collation of parts of a message rather than the sending of multiple messages to a session partner. There are three aspects in which relayfs radically differs from netlink:

1) it does not require a partnership -- a client and server, or session pair -- it is simply a buffering mechanism that allows data to be deposited. There is no expectation that the data will be consumed or that there is a listening partner. The reason for this design point comes from the origin of relayfs as a buffering mechanism that satisfies the needs of a low-level system trace. Data from a trace might never be consumed if the system, sub-system or component never fails.

2) data can be deposited from any context - interrupt time, task time, sysinit in particular.

3) the depositing of data with relayfs has to depend on a very simple interface and infrastructure in order to function under a severely damaged system. My impression is that netlink depends on a significant infrastructure.
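
To make the "buffer with no listening partner" idea concrete, here is a minimal userspace sketch of a channel as a fixed-size buffer that can always be written to. Everything in it is an assumption made for illustration: the names, the tiny buffer size, and in particular the policy of overwriting the oldest data when the buffer fills. It is not the relayfs API, and it has no locking, so it says nothing about how relayfs stays safe at interrupt time.

```c
#include <stddef.h>

#define CHANNEL_SIZE 8  /* tiny on purpose; a real buffer would be far larger */

/* Hypothetical channel: a fixed buffer plus a count of bytes ever written. */
struct channel {
    char buf[CHANNEL_SIZE];
    size_t written;     /* total bytes deposited since creation */
};

void channel_init(struct channel *ch)
{
    ch->written = 0;
}

/* Deposit data. Never blocks, never fails and needs no reader: once the
 * buffer is full, the oldest bytes are overwritten (assumed policy). */
void channel_write(struct channel *ch, const char *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        ch->buf[ch->written % CHANNEL_SIZE] = data[i];
        ch->written++;
    }
}

/* Copy the newest bytes, oldest first, into out, which must hold
 * CHANNEL_SIZE bytes. Returns how many bytes were copied. */
size_t channel_snapshot(const struct channel *ch, char *out)
{
    size_t n = ch->written < CHANNEL_SIZE ? ch->written : CHANNEL_SIZE;
    size_t start = ch->written - n;

    for (size_t i = 0; i < n; i++)
        out[i] = ch->buf[(start + i) % CHANNEL_SIZE];
    return n;
}
```

The writer only ever touches the buffer and a counter, which is the kind of simplicity point 3 asks for. A consumer, if one ever turns up, takes a snapshot after the fact, for instance once something has already failed.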

2.4.20 NFS bug

I think Gavrie Philipson and I found a bug in 2.4.20 NFS that can cause files to read back as being full of nuls. I'm just going to see if I can make a little test case.

Microsoft relies on Linux to survive worm

In InformationWeek (no public link apparent), Mitch Wagner writes:

Microsoft says Linux isn't fit for the enterprise, but it's using Linux to help protect its servers from denial-of-service attacks. Web requests aimed at www.microsoft.com no longer go to machines on Microsoft's network. Instead, they're handled by the Akamai Technologies Inc. caching system, which runs Linux. A Microsoft spokeswoman says the goal is to "help ensure customers can get to the Blaster worm patch to protect their computers. Microsoft is using Akamai's extensive worldwide network to distribute the massive traffic that is illegally being directed at Microsoft by hackers."

Web servers? We have one we use, and one we sell. If our customers don't like it, we tell them to see figure one.

How to debug kernel problems

Step-by-step instructions on how to debug kernel problems. Kind of patchy, and they assume a fair level of knowledge and confidence, but still a good start.

kfishd

kfishd: an implementation of kfish in the kernel.

hp trivia

The "TOC" button on HP servers and workstations stands for "transfer of control", apparently in the sense of switching control of the machine from the regular OS to a debugger or monitor.

ia-64 Machine Check Architecture

MCA support in 64-bit Windows

MCA is "Machine Check Architecture", which is a way for the hardware to report problems to the operating system. These can be fatal problems such as a CPU internal error, or warnings such as a corrected memory error. (Also "Machine Check Abort", which means the fatal ones in particular.)

The OS can make them available for userspace analysis tools, which can do things like statistical analysis or interpretation of vendor-specific fields. The OS might also respond to errors by panicking in severe cases, or recovering from them in less serious cases. For example, the OS may be able to respond appropriately to failure of a particular memory line, or even failure of a processor in an SMP machine.

The implementation depends on the extensive on-board firmware including the SAL (System Abstraction Layer).

MCA information can survive reboots or power cycles so it can be recovered even if the machine suddenly aborts.

print backtrace from OS INIT handler

mca-recovery project at SourceForge, and draft patch.

Good document on Machine Check Abort (MCA) Handling

* Applications cannot continue with bad data: they are killed when they touch an error spot in the memory.

* Applications with error spots in their valid memory regions should be allowed to run until they touch an error spot. There is a reasonable chance that they can run to completion.

As most of the memory is used for the applications - several (tens of) Gbytes vs. 64 Mbytes of the kernel - almost all errors happen in user space. User mode errors are recoverable by killing the application affected. Kernel mode errors are fatal because the error detection and recovery paths in Linux are not elaborated...
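
The policy described above boils down to a small decision table. Here is a sketch of it, assuming just two inputs: whether the hardware corrected the error, and whether the bad data was touched in user or kernel mode. All the names are hypothetical and are not taken from the mca-recovery patch.

```c
/* Hypothetical names; a sketch of the policy, not the patch's code. */
enum mca_severity { MCA_CORRECTED, MCA_UNCORRECTED };
enum mca_context  { MCA_USER, MCA_KERNEL };
enum mca_action   { MCA_LOG_ONLY, MCA_KILL_PROCESS, MCA_PANIC };

enum mca_action mca_decide(enum mca_severity sev, enum mca_context ctx)
{
    /* Corrected errors are only warnings: record them and carry on. */
    if (sev == MCA_CORRECTED)
        return MCA_LOG_ONLY;

    /* An application that touched bad memory cannot continue,
     * but the rest of the system can: kill just that process. */
    if (ctx == MCA_USER)
        return MCA_KILL_PROCESS;

    /* Uncorrected errors in kernel mode are treated as fatal. */
    return MCA_PANIC;
}
```

Since nearly all memory belongs to applications, the MCA_KILL_PROCESS branch would be the common case for uncorrected errors, which is the whole argument for bothering with recovery.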

Intel's Itanium Processor Family Error Handling Guide

PAL = Processor Abstraction Layer, SAL = System Abstraction Layer.

An MCA INIT can be generated from a hardware button on some HP machines, so that you can NMI a hung machine and hopefully get some debugging information.

Porting Drivers to HP ZX1.

The zx6000 has a TOC button too, but RH AS 2.1 seems to just hang rather than going into a debug mode when it's pressed... :-/

kgdb

kgdb

Console on IA-64

Although many IA-64 machines have some kind of front-end processor, it looks like that is not used for the console under Linux. Instead, Linux talks directly to the serial port or VGA card. The front-end processor can also talk to the serial port, but this is apparently done at an electronic level; it doesn't mediate access.

ia64 toolchain

How to build an ia64 cross toolchain

The Wonderful World of 2.6

Joe Pranevich has a good document about new features in the Linux 2.6 kernel.

Microsoft: Linux developers are paid more

According to a Microsoft-funded study reported in the Register, Linux developers are paid more than embedded Windows developers. (If I were a CS undergraduate again, I know what I'd play with on the weekends. :-) Thanks, Bill!)

John Lettice's meta-analysis is pretty interesting: Microsoft's study would like to show that Linux is more expensive and risky to choose as a platform, but it seems more likely that people choose Linux for projects that are inherently more risky (and have a higher potential return):

Windows, in all its many and varied forms, is about commoditisation. Microsoft offers tightly defined and controlled platforms together with a wealth of standard tools, Ts & Cs and support packages for developers to work with, so it's fairly cheap and easy to produce products that are pretty similar to other people's products. Microsoft also, from way back in the 80s, has pushed the industrialisation of the development process. The result as far as hardware is concerned has been that differentiation and price have been eroded, and if you want to compete - with, say, Dell and HP - in the commodity handheld computer market you need to keep your team size down and your ambitions modest.

Avoiding commoditisation while sticking in the Windows arena is however hard, and quite probably a suicide mission, so if you don't want to go on it you go somewhere else instead, check? You don't necessarily go for Linux (Krasner's Linux versus Windows focus is in this sense artificial), but whatever you go for you're likely to be spending more on development than you would by taking the Windows commoditisation route. And your developers are more likely to be more expensive, goof-off prone geeks than cheaper, downtrodden code-monkeys.[...]

High risk projects that are possibly aimed at new categories we think may exist are more expensive and fail more often, but produce higher rewards when they hit the spot. Commoditised categories result in low risk projects, but the rewards are also lowered, and there's a case for saying that, where there is an overall objective of commoditisation, innovation dies.

This fits really well with my experience of embedded Linux. Linux is strong in networking. It's reasonably easy to build commodity firewall, webserver or storage appliances on Linux, and indeed many companies do just that, with correspondingly low margins in general. Conversely, if you wanted to produce a mass-market PDA with little engineering effort, the sane course at the moment would be to license PalmOS or WinCE. Of course (in both cases) just slapping some existing software on a box and calling it an appliance is not generally a path to riches, unless you can produce or distribute far cheaper than everyone else.

Personally I'd rather work on projects that are innovative and exciting, even if they're more risky.
