• Autoinst has been having network trouble on and off for years.
  • Here's a bit of a brainstorming session about initial things to try in resolving this
  • The first few (post-brainstorming) steps toward resolving this issue of late have been:
    1. Getting SNMP-based network statistics. SNMP data is organized in a hierarchy, and snmpwalk starts from the point in the hierarchy you specify and traverses everything below that point. So I've got a cron job that just samples all the interface-related statistics and appends them to a file. It's as simple as what's below: the script is hung off of cron via crontab, >>'ing the output to the file. Later it may be useful to graph some of these numbers over time, or to graph a quotient of two of them, or similar.
        #!/bin/bash
        
        # network-stats-collector-1-strombrg> type -all timesecs
        # timesecs is /dcslib/allsys/etc/timesecs
        # Tue Feb 07 10:47:44
        
        # network-stats-collector-1-strombrg> type -all snmpwalk
        # snmpwalk is /usr/bin/snmpwalk
        # Tue Feb 07 10:47:49
        
        # network-stats-collector-1-strombrg> type -all sed
        # sed is /bin/sed
        # Tue Feb 07 10:47:53
        
        PATH=$PATH:/dcslib/allsys/etc:/usr/bin:/bin
        export PATH
        
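        # Timestamp for this sample: the first field of timesecs' output
        Time="$(timesecs | awk ' { print $1 }')"
        # Walk the interfaces table (ifEntry) and prefix every line with the timestamp
        snmpwalk -v 2c -c community-string autoinst.nacs.uci.edu 1.3.6.1.2.1.2.2.1 2>&1 | sed "s/^/$Time /"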
        
    2. Doing a periodic bandwidth test, run from a while loop. So far, it appears that the problem is only manifesting a couple of times a day, and then only pretty briefly, but looking over the statistics after more have been collected may show a different result.
    3. Writing a script to empirically determine which of 10/100, full duplex/half duplex is working best, and running it from cron once a day
    4. Fri Feb 10 12:41:24 PST 2006: Autoinst was just really slow a bit ago. Ran top, saw twagent was a bit busy. Ran: ...and suddenly the machine was much more responsive. Ultimately, slowdown paused on 790,196 I/O related system calls, so clearly:
      1. twagent is busy during the week sometimes
      2. autoinst has at least two distinct performance problems of late, maybe more: tripwire messing up autoinst's buffer cache and/or prefetch, and the duplex setting changing.
      Ostensibly, tripwire only gets a system busy on Sundays, but we know better now.
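For the graphing mentioned in step 1, the appended samples first have to be turned into per-interval differences, since the interface counters only ever grow. Below is a minimal sketch of that step. It assumes each logged line looks like `<time> IF-MIB::ifInErrors.1 = Counter32: <value>`; the exact snmpwalk output format depends on the local MIB setup, and the function name counter_deltas is made up for illustration.

```shell
#!/bin/bash

# counter_deltas NAME < logfile
# Reads timestamped snmpwalk lines on stdin and prints "<time> <delta>"
# for each pair of consecutive samples of the counter called NAME.
# Assumed (not confirmed) line format:
#   <time> <name> = Counter32: <value>
counter_deltas() {
    awk -v name="$1" '
        $2 == name && $4 == "Counter32:" {
            if (seen) {
                delta = $5 - prev
                # A Counter32 wraps at 2^32, so undo the wraparound.
                if (delta < 0) delta += 4294967296
                print $1, delta
            }
            prev = $5
            seen = 1
        }
    '
}
```

For example, `counter_deltas IF-MIB::ifInErrors.1 < the-log-file` would give the number of new input errors per sample. Dividing those by the matching ifInUcastPkts deltas would give one candidate for the quotient mentioned in step 1.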
  • Fri Feb 10 12:58:57 PST 2006: I decided it was time to set up a tethereal with absolute timestamps, capturing snapped (truncated) packets, to see what was causing the brief problems. In the process of doing so, I decided to copy all of ethereal to a disk on autoinst, to reduce the network activity tethereal itself would cause due to demand paging, etc. The transfer is dog slow, so it would appear we have a way of replicating the problem on demand. The command that "caused" this was: ...so I'm now going to try the same thing again, via a different protocol: ...and it's still dog slow. Inspecting the network parameters (10/100, full/half): ...which makes it pretty clear that the parameters are changing. The reason the ssh transfer from bingy was taking forever is that bingy was unpingable: But the switch in my office was having problems: And an additional problem pinging bingy from autoinst persists, but only when using DNS, not when using an IP address, which must mean that IPv6 is messed up:
  • Mon Feb 13 11:25:40 PST 2006:
  • Mon Feb 13 11:32:35 PST 2006
  • Wed Feb 22 16:10:34 PST 2006
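Since a slow transfer appears to reproduce the problem on demand, the periodic bandwidth test from step 2 could be as simple as timing one command over and over. This is only a sketch of that idea, not the loop actually used. The helper name timed_run, the scp command in the usage comment, and the log file name are all hypothetical.

```shell
#!/bin/bash

# timed_run CMD [ARGS...]
# Runs the command, discarding its output, and prints one line:
#   <start epoch seconds> <elapsed seconds> <exit status>
timed_run() {
    local start end status
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo "$start $((end - start)) $status"
}

# Hypothetical usage: time one transfer every 10 minutes and append to a log.
# while true; do
#     timed_run scp some-host:/some/test-file /dev/null >> bandwidth.log
#     sleep 600
# done
```

Lines with an unusually large second field would mark when the slowdowns happened, and those times can be lined up against the SNMP samples.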


    Timestamp: 2024-02-29 09:10:25 PST
