• Symptoms:
    1. dcs.nac crashes when extracting a large zip file (like a Solaris 8 CD-ROM) that is read from NFS on dcs.nac and written back to NFS on dcs.nac
    2. dcs.nac gives many NFS timeouts with Solaris NFS clients, but not with an FC2 NFS client
    3. dcs.nac reports many errors in dmesg that look like IDE disk (DMA) problems
  • What we tried:
    1. I tried turning off DMA with "/sbin/hdparm -d0 /dev/hda". Judy indicated this helped, but there were still problems. I checked dmesg, and it was no longer reporting copious DMA errors, but it continued to report lost interrupts.
    2. I turned DMA back on (via a reboot), and added "noapic" to the kernel line in /etc/grub.conf (same reboot). This appears to have solved the problem.
    3. Turning off the APIC may hurt performance on SMP machines. ISTR that dcs.nac has slots for multiple CPUs, but it has only one CPU at this time. If we ever add a CPU to dcs.nac, we may want to benchmark "noapic" against "ideX=serialize".
    4. I suspect that part of the problem was that Evans and Saska were using the same make and model of network switch, one that has problems with Solaris NFS packets. Their switches were D-Link DSS-8+ units, while the switch I have been using without trouble is an SMC 108DT. There is precedent for such a thing:
      1. Many years ago, when I was a graduate student at the University of Cincinnati, a fellow grad student who knew a considerable amount about networking taught me that networking equipment can have problems with only a limited number of protocols.
      2. I went through 4 switches myself before finding one that didn't have problems with Solaris NFS: 3 were bargain basement, and the 4th, the SMC I currently have, was a little more expensive.
      3. I have a Linksys DSL router at home (or perhaps it's my el-cheapo switch) that has problems with a small number of HTTP submissions. However, if I remote display a Mozilla and access the exact same HTTP server with the same inputs, things work fine. I also once upgraded my Linksys firmware to eliminate a similar problem with ssh.
      4. In other words, the fact that an RHEL machine has no problem through the same switch does not necessarily mean the switch is working correctly.
      5. That said, I want to stress again that this is only a hypothesis; it needs to be examined more closely before we can say anything more definite.
    5. Update, Tue Oct 5 11:24:24 PDT 2004: Last night, we had a small number of IDE problems. I've e-mailed the maintainer of the AMD-8111 IDE code in Linus' kernel (and RHEL 3's kernel) to see if he has any helpful information for us.
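
To judge whether a change such as "noapic" really reduced the errors, it may help to count the relevant dmesg lines before and after the change rather than eyeballing them. The following is a minimal Python sketch and was not run on dcs.nac. The function name tally_ide_messages and both regular expressions are assumptions: these notes do not quote the exact kernel messages, so the patterns should be adjusted to match what dmesg actually prints.

```python
import re

# Assumed patterns: the notes do not quote the exact kernel text, so these
# are guesses at what an IDE DMA error and a lost interrupt look like.
PATTERNS = {
    "dma_error": re.compile(r"\bhd[a-z]: .*dma", re.IGNORECASE),
    "lost_interrupt": re.compile(r"\bhd[a-z]: lost interrupt", re.IGNORECASE),
}


def tally_ide_messages(lines):
    """Return a dict mapping each pattern name to the number of matching lines."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

One way to use it would be to save the dmesg output to a file before and after a change, pass each file's lines to tally_ide_messages, and compare the two results. A drop in "dma_error" with no change in "lost_interrupt" would match what was seen after "hdparm -d0".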