You may be surprised to learn that not all crashes are the same.
A disk failure takes out a different part of the system than a
network card failure does, which in turn is very different from a
CPU that gets wedged.
Choices (pick as many as you want :)
Finding the cause of the problem:
Software
Cross Platform
A script like pstrees may be helpful in
determining what processes were running at the time of a crash, if
you don't have decent crashdumps or something, or simply feel more
comfortable running a script than analyzing crashdumps.
Accounting
You could try enabling accounting, to see what, if
any, commands were run, and what, if any, users were
logged in, at the time of a crash.
Linux specifics
Enabling this is a matter of something like:
touch /var/log/pacct
chmod 600 /var/log/pacct
accton /var/log/pacct
See my dissect-pacct program
for a way of dumping V3-format accounting data.
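To give a rough idea of what dumping V3-format accounting data involves, here is a minimal Python sketch that unpacks fixed-size records. It is not the dissect-pacct program. The 64-byte record layout is my reading of struct acct_v3 in linux/acct.h, and the names ACCT_V3 and parse_acct_v3 are made up for this example, so check the layout against your own kernel headers before trusting the output.

```python
import struct

# Assumed layout of struct acct_v3 (linux/acct.h), native byte order, no padding:
# flag, version, tty, exitcode, uid, gid, pid, ppid, btime, etime (float),
# eight 16-bit comp_t counters, then a 16-byte command name. 64 bytes total.
ACCT_V3 = struct.Struct("=BBHIIIIIIf8H16s")


def parse_acct_v3(data):
    """Yield one dict per complete record found in data (a bytes object)."""
    last_start = len(data) - ACCT_V3.size
    for offset in range(0, last_start + 1, ACCT_V3.size):
        fields = ACCT_V3.unpack_from(data, offset)
        yield {
            # Assumption: the top bit of the version byte marks byte order.
            "version": fields[1] & 0x7F,
            "uid": fields[4],
            "gid": fields[5],
            "pid": fields[6],
            "ppid": fields[7],
            "start_time": fields[8],  # seconds since the epoch
            "command": fields[-1].split(b"\0", 1)[0].decode("ascii", "replace"),
        }

# Possible use (needs read access to the accounting file):
#   with open("/var/log/pacct", "rb") as f:
#       for rec in parse_acct_v3(f.read()):
#           print(rec["start_time"], rec["uid"], rec["pid"], rec["command"])
```

A trailing partial record (the kind you might find after a crash) is simply ignored rather than raising an error.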
Syslog
If it's a problem where accessing a disk/filesystem
becomes troublesome, disabling syslog to disk
and enabling remote syslog may help get useful
messages.
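Before depending on remote syslog, it helps to confirm that the log host actually receives messages. The following is a small sketch using Python's standard logging.handlers.SysLogHandler to send one test message over UDP. The function name send_test_message, the logger name, and the default port of 514 are assumptions for illustration; how you configure your syslog daemon itself depends on which daemon you run.

```python
import logging
import logging.handlers
import socket


def send_test_message(host, port=514, text="remote syslog test"):
    """Send a single warning-level message to a syslog server over UDP."""
    handler = logging.handlers.SysLogHandler(
        address=(host, port), socktype=socket.SOCK_DGRAM
    )
    logger = logging.getLogger("crash-debug-test")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the test message out of other handlers
    logger.addHandler(handler)
    try:
        logger.warning(text)
    finally:
        logger.removeHandler(handler)
        handler.close()
```

Since this uses UDP, a successful return only means the packet was sent; check the log host to see whether it arrived.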
netdump is similar to savecore on
Solaris, but it works over the net, and it initiates a reboot (sometimes).
It originated with RHEL, but it's getting ported around to other
versions of Linux.
Get it enabled, and verify that it's enabled.
Then, when a node on which sysrq had been enabled
crashes, we can try to interrogate the system using
magic sysrq. To enable it, we run "sysctl -w
kernel.sysrq=1" and/or edit /etc/sysctl.conf. The
sysctl command should take effect right away, and
sysctl.conf should set it on reboots. If it won't
enable with sysctl, then we may have to rebuild the
kernel with "Magic SysRq key (CONFIG_MAGIC_SYSRQ)" turned on.
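To verify the setting, you can read /proc/sys/kernel/sysrq. The sketch below interprets the number found there. On newer kernels the value can be a bitmask rather than just 0 or 1; the bit meanings listed are taken from the kernel's sysrq documentation as I understand it, and the names SYSRQ_BITS, describe_sysrq, and read_sysrq are invented for this example.

```python
# Bit meanings per the kernel's sysrq documentation (an assumption to verify
# against the documentation shipped with your kernel version).
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def describe_sysrq(value):
    """Return a list of human-readable descriptions for a kernel.sysrq value."""
    if value == 0:
        return ["disabled"]
    if value == 1:
        return ["all functions enabled"]
    return [name for bit, name in sorted(SYSRQ_BITS.items()) if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current kernel.sysrq value from procfs."""
    with open(path) as f:
        return int(f.read().strip())
```

For example, describe_sysrq(read_sysrq()) on a box set up with "sysctl -w kernel.sysrq=1" should report that all functions are enabled.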
Alt-sysrq-? gives a terse help message. Basically, if it outputs
anything, that's informative, because it means the system/kernel
isn't 100% wedged.
You may have to increase the log level of the kernel on the
console before sysrq outputs much of interest. You can do this with
something like alt-sysrq-5 to go to log level 5.
alt-sysrq-t should give a list of tasks known to the kernel.
alt-sysrq-m may be useful too for a view of what's in memory.
Some things that sysrq will output will generate more (perhaps far
more :) than a screenful of output. In this case, you can see more with
shift-pageup.
Some keyboards don't have a key labeled sysrq. In that case,
alt-printscreen is probably what's needed.
IPMI (if you have it or can get it) gives you access to lots of interesting
sensor data, including temperatures and voltages.
SMART.
Here's
an interesting paper by Google about hard disks and
SMART.
Solaris
This page is pretty good. It covers adb, iscda, and
more.
Specifics
echo '$c' | adb -k unix.0 vmcore.0
iscda unix.0 vmcore.0
Hardware
Try running some diagnostics:
Ultimate Boot
CD is full of hardware diagnostics. Here's a page that presents a
UBCD for a USB Thumb drive, and a script that derives same
from the CD image (loopback mounted).
Although included in UBCD, memtest86+ deserves special mention, and
is very nice for finding memory problems, which can cause crashes. It's a
lot better than the rudimentary memory test most PCs do when you power them
on. BTW, don't run it for 5 minutes and consider it tested; you probably
should run it for hours, if not a day or two.
memtester
is not, to my knowledge, included in UBCD, and it's worth
mentioning. It's like memtest86+, but it runs under most
any UNIX or Linux, and tests more effectively than
memtst, for example. It should be more convenient in some
cases than memtest86+, because it doesn't require a
reboot or console access.
Try swapping parts. This merits an article itself, but the
basics are to try swapping known-good parts in for suspect
parts (one at a time, unless you've swapped them all one at a
time without eliminating the problem - in which case either
it's really a software problem, or you have a problem with
the combination of two or more parts!), including but not limited to:
CPU
RAM
Power Supply
Motherboard
Disk drives
PCI cards (or cards that go into some other sort of
bus). This will commonly include things like video cards,
sound cards, disk controllers, etc.
rICMP - a Linux
kernel patch that allows you to do a reboot with only an ICMP
packet.
NMI Watchdog
On RHEL 3 and FC2, and probably other linuxes as well, there's a
software NMI Watchdog that resets the system after it's been
wedged for a while.
You can see
/usr/src/linux-*/Documentation/nmi_watchdog.txt for information
about how to enable it, but it's basically just a small change to
your grub.conf or lilo.conf.
This is unlikely to help with a
wedged CPU, but if the IRQ subsystem is messed up, this still has
a chance of being able to recover with a reboot.
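One way to check that the NMI watchdog is actually running, as I recall nmi_watchdog.txt suggesting, is to look at the NMI line in /proc/interrupts and see whether the counts are nonzero and increasing. Here is a minimal sketch of that check. The function names are made up for this example, and on some kernels and watchdog modes the counts may only advance while the CPU is busy, so treat a negative result with caution.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in the text of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "NMI:":
            counts = []
            for token in parts[1:]:
                if not token.isdigit():
                    break  # reached the trailing description, if any
                counts.append(int(token))
            return counts
    return []


def nmi_is_ticking(before_text, after_text):
    """True if any CPU's NMI count grew between two /proc/interrupts samples."""
    before = nmi_counts(before_text)
    after = nmi_counts(after_text)
    if not after or len(before) != len(after):
        return False
    return any(a > b for a, b in zip(after, before))
```

To use it, read /proc/interrupts twice, some seconds apart, and pass both samples to nmi_is_ticking.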
fallback-reboot
fallback-reboot is for the situation
where you have a machine that stays pingable, and gives banners on
resident programs (like smtp on a mail hub), but fails to give
banners on {x,}inetd-launched programs - i.e., fork+exec is broken.
It does an mlockall() to
try to keep itself from being paged out, does no other
form of disk I/O (after initialization), and does not
fork or exec. It also doesn't sync, so it's
definitely a last resort.
Hardware
I gather there are powerstrips that have their own IP address,
which you can connect to and reboot systems with.
Get an IPMI (daughter) card, or an IPMI-enabled motherboard.
It lets you remotely reboot a totally hung (unpingable) system.
It lets you access the console remotely via Serial Over LAN (SOL),
but only starting in IPMI 2.0; IPMI 1.5 didn't yet have this feature.
It includes an extra CPU accessible via a distinct IP address
on the same NIC (always or just sometimes?)
Hardware watchdog
You can purchase a hardware watchdog for PCs.
Sun SPARC hardware has a hardware watchdog you can enable
in the firmware.
CPS Inc. has a
nice variety of fairly inexpensive means of rebooting a crashy
system. Some allow for an admin to initiate a reboot, while
others automatically reboot hardware that has crashed. Their
products are separated into "Base units" (which actually do the
reboot via, for example, cycling AC power, using the motherboard
reset, or using motherboard power switch) and controllers
(which accept connections via serial, http, cell phone, etc.
and then pass the request off to a base unit). You can avoid
buying a controller if you use an old computer to act as your
controller. They have a bunch of free Windows software to
control them, or there is a program based on termios that can
control some of their products on *ix systems. (Not all
*ix's support termios, but most should; IIRC, the competing
interfaces are termio and sgtty.) See also this URL on combining CPS'
products and my
try-copying-up-to-n-times
script.