Some things to try on crashy unix/linux machines

Table of Contents

Theory
Fixing
Resetting

Light theory

You may be surprised to learn that not all crashes are the same. A disk failure takes out a different part of the system than a network card failure does, and both are very different from a CPU that gets wedged.

Choices (pick as many as you want :)

Finding the cause of the problem:
- Software
  1. Cross Platform
    - A script like pstrees may help determine which processes were running at the time of a crash, if you don't have decent crashdumps, or simply feel more comfortable running a script than analyzing crashdumps.
    - Accounting
      - You could try enabling accounting, to see what, if any, commands were run, and what, if any, users were logged in, at the time of a crash.
      - Linux specifics
        
        Enabling this is a matter of something like:
        
        touch /var/log/pacct
        chmod 600 /var/log/pacct
        accton /var/log/pacct
        
        See my dissect-pacct program for a way of dumping V3-format accounting data.
    - Syslog
      - If it's a problem where accessing a disk/filesystem becomes troublesome, disabling syslog to disk and enabling remote syslog may help get useful messages.
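For the remote syslog suggestion above, here is a minimal sketch. It assumes a syslog daemon that accepts the traditional "selector @host" forwarding syntax (classic syslogd and rsyslog both do). The helper name remote_syslog_rule and the host loghost.example.com are placeholders made up for illustration.

```shell
#!/bin/sh
# Print a forwarding rule in the traditional syslog.conf syntax.
# A single "@" in front of the host means "forward over UDP".
# Assumption: the local syslog daemon understands this syntax.
remote_syslog_rule() {
    loghost=$1
    printf '*.*\t@%s\n' "$loghost"
}

# Example (as root; restart the syslog daemon afterwards):
#   remote_syslog_rule loghost.example.com >> /etc/syslog.conf
# To stop logging to the troublesome disk, comment out the rules in the
# same file that name local log files.
```

The receiving host has to be willing to accept remote messages; with classic syslogd that usually means starting it with -r.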
  2. Linux
    1. Crashdumps
      - Evaluation of a bunch of Linux crashdump facilities
      - netdumps: similar to savecore on Solaris, but the dump goes over the network, and it (sometimes) initiates a reboot. It originated with RHEL, but is being ported to other Linux distributions.
      - diskdump
      - LKCD
      - mkdump
      - kdump
    2. Magic SysRQ
      - Get it enabled, and verify that it's enabled. Then, when a node on which sysrq was enabled crashes, we can try to interrogate the system using magic sysrq. To enable it, run "sysctl -w kernel.sysrq=1" and/or edit /etc/sysctl.conf. The sysctl command should take effect right away, and the sysctl.conf entry should apply on subsequent reboots. If it won't enable with sysctl, then we may have to rebuild the kernel with "Magic SysRq key (CONFIG_MAGIC_SYSRQ)" turned on.
      - Alt-sysrq-? gives a terse help message. Basically, if it outputs anything, that's informative, because it means the system/kernel isn't 100% wedged.
      - You may have to increase the log level of the kernel on the console before sysrq outputs much of interest. You can do this with something like alt-sysrq-5 to go to log level 5.
      - alt-sysrq-t should give a list of tasks known to the kernel.
      - alt-sysrq-m may be useful too for a view of what's in memory.
      - Some sysrq commands generate more (perhaps far more :) than a screenful of output. In that case, you can scroll back to see more with shift-pageup.
      - Some keyboards don't have a key labeled sysrq. In that case, alt-printscreen is probably what's needed.
      - Forcing a crash on a Linux system - note that this may be harder on SuSE 9.3, due to a lack of the sysrq-c option
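The "verify that it's enabled" step in the Magic SysRq notes above can be sketched roughly as follows. The function name sysrq_status is hypothetical; the paths /proc/sys/kernel/sysrq and /proc/sysrq-trigger are the standard Linux ones, but the trigger file only helps while the machine can still run a shell.

```shell
#!/bin/sh
# Report whether magic SysRq is enabled, given the path to the kernel's
# sysrq control file (normally /proc/sys/kernel/sysrq).
# 0 means disabled and 1 means all functions enabled; any other value is
# a bitmask that enables only some functions.
sysrq_status() {
    ctl=${1:-/proc/sys/kernel/sysrq}
    if [ ! -r "$ctl" ]; then
        echo "unknown"
        return 1
    fi
    value=$(cat "$ctl")
    case $value in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "partial ($value)" ;;
    esac
}

# On a machine that is sick but still has a working root shell, the same
# functions the keyboard triggers can be requested through /proc:
#   echo t > /proc/sysrq-trigger    # task list, like alt-sysrq-t
#   echo m > /proc/sysrq-trigger    # memory info, like alt-sysrq-m
#   dmesg | tail -n 50              # the output lands in the kernel log
```

Running sysrq_status with no argument checks the live system; passing a path is only there to make the function easy to try out.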
    3. lm_sensors and a Linux Journal article about them.
    4. IPMI (if you have it or can get it) gives you access to lots of interesting sensor data, including temperatures and voltages
    5. SMART. Here's an interesting paper by Google about hard disks and SMART.
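The three sensor sources above (lm_sensors, IPMI, SMART) can be polled in one pass. This is only a sketch: it assumes the usual command-line tools (sensors, ipmitool, and smartctl from smartmontools), none of which the text above names explicitly, and /dev/sda is a placeholder disk. The function names are made up for illustration.

```shell
#!/bin/sh
# Run one health-check command if its tool is installed, otherwise say so.
# $1 is the tool name; the whole argument list is the command to run.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@"
    else
        echo "skip: $1 not installed"
    fi
}

# Gather temperatures, voltages and disk health in one pass.
# Most of these need root to return anything useful.
hardware_snapshot() {
    disk=${1:-/dev/sda}
    run_if_present sensors                 # lm_sensors readings
    run_if_present ipmitool sensor         # IPMI sensor readings
    run_if_present smartctl -H -A "$disk"  # SMART health and attributes
}
```

Saving the output of hardware_snapshot periodically (from cron, say) gives you something to compare against after the next crash.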
  3. Solaris
    1. This page is pretty good. It covers adb, iscda, and more.
    2. Specifics
      - echo '$c' | adb -k unix.0 vmcore.0
      - iscda unix.0 vmcore.0
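If savecore has left several dumps behind, the adb command above can be run over each of them. The sketch below assumes the dumps live in the usual savecore directory, /var/crash/`hostname` (check dumpadm if unsure), and sticks to backticks so that it should also run under an old Bourne /bin/sh. The function name list_dump_pairs is hypothetical.

```shell
#!/bin/sh
# List matching unix.N / vmcore.N pairs in a crash directory.
# Prints one "unix.N vmcore.N" pair per line and skips any vmcore file
# that has no matching unix file.
list_dump_pairs() {
    dir=$1
    for core in "$dir"/vmcore.*; do
        [ -f "$core" ] || continue
        n=`echo "$core" | sed 's/.*\.//'`
        [ -f "$dir/unix.$n" ] && echo "unix.$n vmcore.$n"
    done
    return 0
}

# Example: print a stack trace for every saved dump.
#   cd /var/crash/`hostname` &&
#   list_dump_pairs . | while read u v; do
#       echo "== $v =="
#       echo '$c' | adb -k "$u" "$v"
#   done
```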
- Hardware
  1. Try running some diagnostics:
    1. Ultimate Boot CD is full of hardware diagnostics. Here's a page that presents a UBCD for a USB Thumb drive, and a script that derives same from the CD image (loopback mounted).
    2. Although included in UBCD, memtest86+ deserves special mention, and is very nice for finding memory problems, which can cause crashes. It's a lot better than the rudimentary memory test most PCs do when you power them on. BTW, don't run it for 5 minutes and consider it tested; you probably should run it for hours, if not a day or two.
    3. memtester is not, to my knowledge, included in UBCD, and it's worth mentioning. It's like memtest86+, but it runs under most any UNIX or Linux, and tests more effectively than memtst, for example. It should be more convenient in some cases than memtest86+, because it doesn't require a reboot or console access.
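Since memtester runs on a live system, it can only test memory that is not otherwise in use. One rough way to pick a size on Linux is sketched below: take half of what /proc/meminfo reports as available. The "half" rule and the function name memtester_size are assumptions for illustration, not a recommendation from the memtester authors.

```shell
#!/bin/sh
# Suggest an amount of memory for memtester: half of what the kernel
# reports as available, in megabytes. Reads a /proc/meminfo-style file.
# Falls back to MemFree on older kernels that lack MemAvailable.
memtester_size() {
    meminfo=${1:-/proc/meminfo}
    awk '
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "MemFree:"      { free = $2 }
        END {
            kb = (avail != "") ? avail : free
            printf "%dM\n", kb / 2 / 1024
        }
    ' "$meminfo"
}

# Example (as root, so memtester can lock the memory it tests):
#   memtester "$(memtester_size)" 3     # three passes over that much RAM
```

As with memtest86+, a handful of passes is a smoke test at best; longer runs are more convincing.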
  2. Try swapping parts. This merits an article of its own, but the basic idea is to swap known-good parts in for suspect parts, one at a time. If you've swapped them all one at a time without eliminating the problem, then either it's really a software problem, or you have a problem with the combination of two or more parts! Parts to try include, but are not limited to:
    - CPU
    - RAM
    - Power Supply
    - Motherboard
    - Disk drives
    - PCI cards (or cards that go into some other sort of bus). This will commonly include things like video cards, sound cards, disk controllers, etc.
    - Cables
    - Monitor
Resetting the system after a crash:
- Software
  1. Use a hypervisor, e.g. VirtualBox
  2. rICMP - a Linux kernel patch that allows you to do a reboot with only an ICMP packet
  3. NMI Watchdog
  4. fallback-reboot
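For the NMI watchdog item above, it helps to know how the kernel is currently set up. The sketch below reads two Linux sysctls: kernel.nmi_watchdog, and kernel.panic (the number of seconds the kernel waits before rebooting after a panic, where 0 means it never reboots on its own). Pairing the two is my assumption about what is useful here, and the function name recovery_settings is hypothetical. On some kernels the NMI watchdog can only be turned on with the nmi_watchdog= boot parameter.

```shell
#!/bin/sh
# Report two kernel settings that affect unattended recovery.
# $1 is the sysctl directory, normally /proc/sys/kernel.
#   nmi_watchdog: 1 if the NMI watchdog is active
#   panic: seconds to wait before rebooting after a panic (0 = never)
recovery_settings() {
    dir=${1:-/proc/sys/kernel}
    for name in nmi_watchdog panic; do
        if [ -r "$dir/$name" ]; then
            echo "$name=$(cat "$dir/$name")"
        else
            echo "$name=unavailable"
        fi
    done
}

# To change them on a running system (as root), where supported:
#   sysctl -w kernel.nmi_watchdog=1
#   sysctl -w kernel.panic=30
```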
- Hardware
  1. I gather there are powerstrips that have their own IP address, which you can connect to and reboot systems with.
  2. Get an IPMI (daughter) card, or IPMI-enabled motherboard.
    - lets you remotely reboot a totally hung (unpingable) system.
    - lets you access the console remotely via Serial Over LAN (SOL), but only starting in IPMI 2.0; IPMI 1.5 didn't yet have this feature.
    - It includes an extra CPU accessible via a distinct IP address on the same NIC (always or just sometimes?)
  3. Hardware watchdog
    - You can purchase a hardware watchdog for PCs
    - Sun SPARC hardware has a hardware watchdog you can enable in the firmware
  4. CPS Inc. has a nice variety of fairly inexpensive means of rebooting a crashy system. Some allow an admin to initiate a reboot, while others automatically reboot hardware that has crashed. Their products are separated into "Base units" (which actually do the reboot by, for example, cycling AC power, using the motherboard reset, or using the motherboard power switch) and controllers (which accept connections via serial, HTTP, cell phone, etc. and then pass the request off to a base unit). You can avoid buying a controller if you use an old computer to act as your controller. They have a bunch of free Windows software to control them, or there is a program based on termios that can control some of their products on *ix systems (not all *ix's support termios, but most should. IIRC, competing interfaces are termio and sgtty). See also this URL on combining CPS' products and my try-copying-up-to-n-times script.
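For the IPMI option in the list above, the remote reboot and Serial Over LAN console can be driven from another machine with ipmitool. This is a sketch under assumptions: ipmitool is not mentioned above, the wrapper name bmc is made up, and bmc1.example.com and admin are placeholder values. The lanplus interface is the IPMI 2.0 one, which SOL needs.

```shell
#!/bin/sh
# Build (and, unless DRY_RUN is set, run) an ipmitool command against a BMC.
# $1 = BMC host name or address, $2 = user, remaining args = subcommand.
# -E makes ipmitool read the password from the IPMI_PASSWORD environment
# variable, so it does not appear on the command line.
bmc() {
    host=$1
    user=$2
    shift 2
    set -- ipmitool -I lanplus -H "$host" -U "$user" -E "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Examples (host and user are placeholders):
#   bmc bmc1.example.com admin chassis power status
#   bmc bmc1.example.com admin chassis power cycle   # reboot a hung box
#   bmc bmc1.example.com admin sol activate          # remote serial console
```

Setting DRY_RUN=1 prints the command instead of running it, which is a cheap way to check what would be sent before power-cycling anything.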

Timestamp: 2025-07-25 03:01:59 PDT

