Martin Pool's blog

IA-64 Machine Check Architecture

MCA support in 64-bit Windows

MCA is "Machine Check Architecture", which is a way for the hardware to report problems to the operating system. These can be fatal problems such as a CPU internal error, or warnings such as a corrected memory error. (Also "Machine Check Abort", which means the fatal ones in particular.)

The OS can make these reports available to userspace analysis tools, which can do things like statistical analysis or interpretation of vendor-specific fields. The OS might also act on errors itself: panicking in severe cases, or recovering in less serious ones. For example, it may be able to respond appropriately to the failure of a particular memory line, or even the failure of a processor in an SMP machine.
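To make that policy concrete, here is a rough sketch in Python. The severity levels, the function name and the action strings are all made up for illustration; they are not taken from any real kernel or SAL interface.

```python
from enum import Enum


class Severity(Enum):
    # Hypothetical severity classes, not names from any real specification.
    CORRECTED = "corrected"      # hardware already fixed it, e.g. a corrected memory error
    RECOVERABLE = "recoverable"  # not corrected, but confined to one resource
    FATAL = "fatal"              # e.g. a CPU internal error


def handle_machine_check(severity, resource):
    """Return a description of what an OS might do with one error report."""
    if severity is Severity.CORRECTED:
        # Nothing to repair: just keep the record for userspace tools.
        return f"log {resource} for userspace analysis"
    if severity is Severity.RECOVERABLE:
        # Stop using the failing memory line or processor and carry on.
        return f"take {resource} out of service and continue"
    # Anything else is treated as severe.
    return "panic"
```

Calling `handle_machine_check(Severity.RECOVERABLE, "cpu3")` gives back the "take out of service" action, while any fatal report ends in a panic regardless of the resource.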

The implementation depends on the extensive on-board firmware including the SAL (System Abstraction Layer).

MCA information can survive reboots or power cycles, so it can be recovered even if the machine suddenly aborts.

print backtrace from OS INIT handler

mca-recovery project at SourceForge, and draft patch.

Good document on Machine Check Abort (MCA) Handling

* Applications cannot continue with bad data: they are killed when they touch an error spot in the memory.

* Applications having error spots in their valid memory regions, should be let run until they touch an error spot. There is a reasonable chance that they can run to completion.

As most of the memory is used for the applications - several (tens of) Gbytes vs. 64 Mbytes of the kernel - almost all errors happen in user space. User mode errors are recoverable by killing the application affected. Kernel mode errors are fatal because the error detection and recovery paths in Linux are not elaborated...
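The recovery rule quoted above (let applications run until they actually touch an error spot, kill them when they do, and treat kernel errors as fatal) might be modelled like this. It is only a toy simulation: the page numbers, the `replay` function and the `KernelPanic` exception are invented here, and a real kernel works on hardware error records rather than a list of accesses.

```python
class KernelPanic(Exception):
    """Raised when the simulated error lands in kernel memory."""


def replay(accesses, bad_pages, kernel_pages):
    """Replay (pid, page) memory accesses and return the set of killed pids.

    A process is killed only when it touches a bad page, not merely
    because a bad page is mapped into its address space.
    """
    killed = set()
    for pid, page in accesses:
        if pid in killed:
            continue  # already dead, it makes no further accesses
        if page not in bad_pages:
            continue  # clean page, nothing to do
        if page in kernel_pages:
            # No recovery path for kernel memory in this model.
            raise KernelPanic(f"error in kernel page {page}")
        killed.add(pid)
    return killed
```

For example, with `bad_pages={11}` and two processes where only pid 1 ever reads page 11, `replay` returns `{1}` and pid 2 runs to completion.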

Intel's Itanium Processor Family Error Handling Guide

PAL = Processor Abstraction Layer, SAL = System Abstraction Layer.

An MCA INIT can be generated from a hardware button on some HP machines, so that you can NMI a hung machine and hopefully get some debugging information.

Porting Drivers to HP ZX1.

The zx6000 has a TOC button too, but RH AS 2.1 seems to just hang rather than going into a debug mode when it's pressed... :-/
