autoinst network trouble

Autoinst has been having network trouble on and off for years.

Here's a bit of a brainstorming session about initial things to try in resolving this.

The first few (post-brainstorming) steps toward resolving this issue of late have been:

Getting SNMP-based network statistics. SNMP data is organized in a hierarchy, and snmpwalk starts from the point in the hierarchy you specify and traverses everything below that point. So sampling all of the interface-related statistics is as simple as the script below. It is hung off of cron via crontab, with the crontab entry >>'ing the output to a file. Later it may be useful to graph some of these numbers over time, or to graph a quotient of two of them or similar.

#!/bin/bash

# network-stats-collector-1-strombrg> type -all timesecs
# timesecs is /dcslib/allsys/etc/timesecs
# Tue Feb 07 10:47:44

# network-stats-collector-1-strombrg> type -all snmpwalk
# snmpwalk is /usr/bin/snmpwalk
# Tue Feb 07 10:47:49

# network-stats-collector-1-strombrg> type -all sed
# sed is /bin/sed
# Tue Feb 07 10:47:53

PATH=$PATH:/dcslib/allsys/etc:/usr/bin:/bin
export PATH

Time="$(timesecs | awk ' { print $1 }')"
snmpwalk -v 2c -c community-string autoinst.nacs.uci.edu 1.3.6.1.2.1.2.2.1 2>&1 | sed "s/^/$Time /"
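Once a few days of samples have piled up, the raw counters need to be turned into per-sample deltas before they are worth graphing. Here is a rough sketch of that step, not something from the original setup. It assumes each log line looks like `<epoch-seconds> IF-MIB::ifInErrors.2 = Counter32: 17`; the exact snmpwalk output depends on the installed MIBs and output options, so the field positions are an assumption, and the function name `counter_deltas` is made up for illustration.

```shell
# counter_deltas COUNTER LOGFILE
# Print "<time> <delta>" for one counter (for example ifInErrors.2).
# Assumed line format (hypothetical, check it against the real log):
#   <epoch-seconds> IF-MIB::<counter> = Counter32: <value>
counter_deltas() {
    awk -v want="$1" '
        {
            # $2 is the object name; drop any "MIB::" prefix before comparing
            name = $2
            sub(/^.*::/, "", name)
        }
        name == want {
            value = $NF
            if (seen) {
                delta = value - prev
                # a negative delta means the counter wrapped or was reset; skip it
                if (delta >= 0) print $1, delta
            }
            prev = value
            seen = 1
        }
    ' "$2"
}
```

Dividing the deltas of an error counter by the deltas of the matching packet counter would give the sort of quotient mentioned above.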

Doing a periodic bandwidth test, run from a while loop. So far, it appears that the problem is only manifesting a couple of times a day, and then only pretty briefly, but looking over the statistics after more have been collected may show a different result.
Writing a script to empirically determine which of 10/100, full duplex/half duplex is working best, and running it from cron once a day.
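The periodic bandwidth test can be sketched roughly like this. It is only an illustration of the while-loop idea, not the actual test: the function names `timed_run` and `sample_loop` are invented, and `date +%s` is assumed to work (some older Solaris `date` commands lack it, in which case something like timesecs would have to stand in).

```shell
# timed_run CMD [ARGS...]
# Run a command, discard its output, and print
# "<start-epoch> <elapsed-seconds> <exit-status>".
timed_run() (
    start=$(date +%s)
    if "$@" >/dev/null 2>&1; then status=0; else status=$?; fi
    end=$(date +%s)
    echo "$start $((end - start)) $status"
)

# sample_loop COUNT INTERVAL CMD [ARGS...]
# Call timed_run COUNT times, sleeping INTERVAL seconds between samples.
sample_loop() (
    count=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        timed_run "$@"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    done
)

# Hypothetical use: push 1 meg through ssh once a minute for a day
# (the ssh target and the dd arguments are assumptions):
#   sample_loop 1440 60 sh -c \
#     'dd if=/dev/zero bs=1024k count=1 | ssh autoinst "cat > /dev/null"' >> bandwidth.log
```

Logging the start time alongside the elapsed time makes it possible to line slow samples up against the SNMP counters later.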
Fri Feb 10 12:41:24 PST 2006: Autoinst was just really slow a bit ago. Ran top, saw twagent was a bit busy. Ran:
- slowdown -v 0.01 $(Pidof -S twagent)
...and suddenly the machine was much more responsive. Ultimately, slowdown paused on 790,196 I/O related system calls, so clearly:
1. twagent is busy during the week sometimes
2. autoinst has at least two distinct performance problems of late; maybe more. Namely, tripwire messing up autoinst's buffer cache and/or prefetch, and the duplex setting changing.
Ostensibly, tripwire only gets a system busy on Sundays, but we know better now.

Fri Feb 10 12:58:57 PST 2006: I decided it was time to set up a tethereal with absolute timestamps to capture snapped packets to see what was causing the brief problems. In the process of doing so, I decided to copy all of ethereal to a disk on autoinst, to reduce the network activity tethereal would cause due to demand paging, etc. And the transfer is dog slow, so it would appear we have a way of replicating the problem on demand. The command that "caused" this was:

bash-2.05# (cd /dcs/packages/ethereal && tar cf - .) | tar xvfp -

...so I'm now going to try the same thing again, via a different protocol:

ssh bingy.nac.uci.edu 'cd /dcs/packages/ethereal && /dcs/bin/gtar cflS - .' | /dcs/bin/python /dcslib/allsys/bin/reblock -e 60172 16384 300 | /dcs/bin/gtar xvfp -

...and it's still dog slow. Inspecting the network parameters: 10/100, full/half:

autoinst-root> cat S20fulldup 
ndd -set /dev/eri instance 0
ndd -set /dev/eri adv_10hdx_cap 0
ndd -set /dev/eri adv_10fdx_cap 0
ndd -set /dev/eri adv_100hdx_cap 1
ndd -set /dev/eri adv_100fdx_cap 0
ndd -set /dev/eri adv_autoneg_cap 0

autoinst-root> cat S20fulldup  | sed -e 's/-set/-get' 

autoinst-root> ndd -set /dev/eri instance 0

autoinst-root> ndd -get /dev/eri adv_10hdx_cap
0

autoinst-root> ndd -get /dev/eri adv_10fdx_cap
0

autoinst-root> ndd -get /dev/eri adv_100fdx_cap
1

autoinst-root> ndd -get /dev/eri adv_100hdx_cap
0

autoinst-root> bash -x /etc/rc3.d/S20fulldup  
+ ndd -set /dev/eri instance 0
+ ndd -set /dev/eri adv_10hdx_cap 0
+ ndd -set /dev/eri adv_10fdx_cap 0
+ ndd -set /dev/eri adv_100hdx_cap 1
+ ndd -set /dev/eri adv_100fdx_cap 0
+ ndd -set /dev/eri adv_autoneg_cap 0
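The by-hand comparison above (what the rc script asks for versus what ndd reports) could be automated along these lines. This is a sketch under assumptions: `check_settings` and `ndd_get` are hypothetical names, the rc script is assumed to contain only lines of the form `ndd -set DEVICE PARAMETER VALUE`, and the instance is assumed to have been selected already with `ndd -set /dev/eri instance 0`.

```shell
# check_settings GETTER < rc-script
# For every "ndd -set DEVICE PARAMETER VALUE" line on stdin, call
# GETTER DEVICE PARAMETER and report parameters whose live value differs.
check_settings() (
    getter=$1
    while read -r cmd flag dev param want; do
        [ "$cmd" = ndd ] && [ "$flag" = -set ] || continue
        # "instance" selects the interface; it is not a readable setting here
        [ "$param" = instance ] && continue
        have=$("$getter" "$dev" "$param" </dev/null)
        [ "$have" = "$want" ] || echo "$param: wanted $want, have $have"
    done
)

# On the real machine the getter would be a thin wrapper around ndd:
ndd_get() { ndd -get "$1" "$2"; }
# Hypothetical use, as root:
#   ndd -set /dev/eri instance 0
#   check_settings ndd_get < /etc/rc3.d/S20fulldup
```

Passing the getter in as an argument keeps the comparison logic separate from ndd itself, so it can be tried out on a machine that has no eri interface.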

The rc script asks for 100/half, but ndd reports 100/full, which makes it pretty clear that the parameters are changing. The reason the ssh transfer from bingy was taking forever is that bingy was unpingable:

bash-2.05# ping bingy.nac.uci.edu
ICMPv6 Address Unreachable from gateway fe80::203:baff:fe44:f576
 for icmp6 from fe80::203:baff:fe44:f576 to fe80::a00:20ff:fe8f:5790
ICMPv6 Address Unreachable from gateway fe80::203:baff:fe44:f576
 for icmp6 from fe80::203:baff:fe44:f576 to fe80::a00:20ff:fe8f:5790
ICMPv6 Address Unreachable from gateway fe80::203:baff:fe44:f576
 for icmp6 from fe80::203:baff:fe44:f576 to fe80::a00:20ff:fe8f:5790
ICMPv6 Address Unreachable from gateway fe80::203:baff:fe44:f576
 for icmp6 from fe80::203:baff:fe44:f576 to fe80::a00:20ff:fe8f:5790
ICMPv6 Address Unreachable from gateway fe80::203:baff:fe44:f576
 for icmp6 from fe80::203:baff:fe44:f576 to fe80::a00:20ff:fe8f:5790
ICMPv6 Address Unreachable from gateway fe80::203:baff:fe44:f576
 for icmp6 from fe80::203:baff:fe44:f576 to fe80::a00:20ff:fe8f:5790

But the switch in my office was having problems:

seki-strombrg> ping bingy.nac
PING bingy.nac.uci.edu (128.200.34.36) 56(84) bytes of data.
From seki.nac.uci.edu (128.200.34.70) icmp_seq=1 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=2 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=3 Destination Host Unreachable

--- bingy.nac.uci.edu ping statistics ---
9 packets transmitted, 0 received, +6 errors, 100% packet loss, time 8000ms
, pipe 4
Fri Feb 10 13:19:24

seki-strombrg> ping sabaki.nac
PING sabaki.nac.uci.edu (128.200.34.80) 56(84) bytes of data.
From seki.nac.uci.edu (128.200.34.70) icmp_seq=0 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=1 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=2 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=3 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=4 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=5 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=7 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=8 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=9 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=11 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=12 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=13 Destination Host Unreachable
From seki.nac.uci.edu (128.200.34.70) icmp_seq=14 Destination Host Unreachable
[At this point I pulled and replugged the power on the little network switch in my office here.]
64 bytes from sabaki.nac.uci.edu (128.200.34.80): icmp_seq=27 ttl=64 time=1274 ms
64 bytes from sabaki.nac.uci.edu (128.200.34.80): icmp_seq=28 ttl=64 time=274 ms
64 bytes from sabaki.nac.uci.edu (128.200.34.80): icmp_seq=29 ttl=64 time=0.163 ms
64 bytes from sabaki.nac.uci.edu (128.200.34.80): icmp_seq=30 ttl=64 time=0.134 ms
64 bytes from sabaki.nac.uci.edu (128.200.34.80): icmp_seq=31 ttl=64 time=0.168 ms
64 bytes from sabaki.nac.uci.edu (128.200.34.80): icmp_seq=32 ttl=64 time=0.191 ms
64 bytes from sabaki.nac.uci.edu (128.200.34.80): icmp_seq=33 ttl=64 time=0.208 ms

--- sabaki.nac.uci.edu ping statistics ---
34 packets transmitted, 7 received, +24 errors, 79% packet loss, time 33098ms
rtt min/avg/max/mdev = 0.134/221.416/1274.442/440.201 ms, pipe 5
Fri Feb 10 13:20:01

seki-strombrg> ping bingy.nac
PING bingy.nac.uci.edu (128.200.34.36) 56(84) bytes of data.
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=0 ttl=255 time=0.261 ms
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=1 ttl=255 time=0.245 ms
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=2 ttl=255 time=0.234 ms
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=3 ttl=255 time=0.229 ms
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=4 ttl=255 time=0.231 ms
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=5 ttl=255 time=0.239 ms
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=6 ttl=255 time=0.225 ms
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=7 ttl=255 time=0.225 ms
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=8 ttl=255 time=0.261 ms
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=9 ttl=255 time=0.244 ms
64 bytes from bingy.nac.uci.edu (128.200.34.36): icmp_seq=10 ttl=255 time=0.267 ms

--- bingy.nac.uci.edu ping statistics ---
17 packets transmitted, 17 received, 0% packet loss, time 16029ms
rtt min/avg/max/mdev = 0.222/0.242/0.268/0.025 ms, pipe 2
Fri Feb 10 13:20:19

seki-strombrg>

But an additional problem pinging bingy from autoinst persists, and only when using the DNS name, not when using an IP address. Since the errors that come back are ICMPv6, that points at IPv6 being messed up:

bash-2.05# ping bingy.nac
ICMPv6 Address Unreachable from gateway fe80::203:baff:fe44:f576
 for icmp6 from fe80::203:baff:fe44:f576 to fe80::a00:20ff:fe8f:5790
ICMPv6 Address Unreachable from gateway fe80::203:baff:fe44:f576
 for icmp6 from fe80::203:baff:fe44:f576 to fe80::a00:20ff:fe8f:5790
ICMPv6 Address Unreachable from gateway fe80::203:baff:fe44:f576
 for icmp6 from fe80::203:baff:fe44:f576 to fe80::a00:20ff:fe8f:5790
^C
bash-2.05# ping 128.200.34.80
128.200.34.80 is alive
bash-2.05# ping 128.200.34.36
128.200.34.36 is alive
bash-2.05#

Mon Feb 13 11:25:40 PST 2006:

Autoinst was at 100/full again this morning at about 11 AM.
I've commented out my cron job that was benchmarking 10/100, full/half every night at midnight.
I'm building a local version of tcpdump.
It may be worthwhile to hardcode autoinst to 100/full in the switch, since autoinst seems to want to keep switching to 100/full.

Mon Feb 13 11:32:35 PST 2006:

I've got the following command running on autoinst via its serial cable now:

bash-2.05# ./tcpdump -n -v -c 1000000 -w /export/home/network-trouble/tcpdump-started-2006-02-13 not host nsc-1.nacs and not host seki.nac 
tcpdump: listening on eri0, link-type EN10MB (Ethernet), capture size 68 bytes
Got 299

I've excluded seki, because I log into autoinst from seki often, and I've excluded nsc-1, because that's the host initiating the bandwidth tests.
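If the list of hosts to ignore grows, typing the filter by hand gets error-prone. A small helper along these lines could build it instead; the name `exclude_hosts` is invented here, and the sketch relies only on tcpdump's ordinary `not host X and not host Y` filter syntax.

```shell
# exclude_hosts HOST...
# Build a tcpdump filter expression that drops traffic to or from
# each named host, joined with "and".
exclude_hosts() (
    filter=""
    for host in "$@"; do
        if [ -n "$filter" ]; then
            filter="$filter and "
        fi
        filter="${filter}not host $host"
    done
    printf '%s\n' "$filter"
)

# Hypothetical use (left unquoted on purpose so the words are passed
# to tcpdump as separate arguments, which it joins back together):
#   ./tcpdump -n -c 1000000 -w capture-file $(exclude_hosts nsc-1.nacs seki.nac)
```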

Wed Feb 22 16:10:34 PST 2006:

Judy was digging around today in the switch autoinst is on, and concluded that, because autoinst should be on port 8, autoinst must be set to 100/Full in the switch. So I ran some more tests. She'd put autoinst on port 8, but that didn't seem to be jibing with what we saw. Then she began to suspect it was on port 3, because that port was changing as I ran my test:

autoinst-root) ./inspect-network 
autonegotiation: 0
10/half: 0
10/full: 0
100/half: 1
100/full: 0

bash-2.05# ./test-network 
Turning off autonegotiation:

Wed Feb 22 16:02:03 PST 2006
10 hdx
No ping (yet?)
No ping (yet?)
Network is at least minimally usable
Starting performance test - transfer 1 meg by ssh
1+0 records in
1+0 records out

real        1.9
user        0.3
sys         0.0
Wed Feb 22 16:02:41 PST 2006
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\


Wed Feb 22 16:02:41 PST 2006
10 fdx
No ping (yet?)
No ping (yet?)
Network is at least minimally usable
Starting performance test - transfer 1 meg by ssh
1+0 records in
1+0 records out

real       35.6
user        0.3
sys         0.0
Wed Feb 22 16:03:53 PST 2006
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\


Wed Feb 22 16:03:53 PST 2006
100 hdx
No ping (yet?)
No ping (yet?)
Network is at least minimally usable
Starting performance test - transfer 1 meg by ssh
1+0 records in
1+0 records out

real        0.5
user        0.3
sys         0.0
Wed Feb 22 16:04:29 PST 2006
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\


Wed Feb 22 16:04:29 PST 2006
100 fdx
No ping (yet?)
No ping (yet?)
Network is at least minimally usable
Starting performance test - transfer 1 meg by ssh
1+0 records in
1+0 records out

real       21.5
user        0.3
sys         0.0
Wed Feb 22 16:05:27 PST 2006
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\


bash-2.05# sh -x /etc/rc3.d/S20
S20fulldup     S20newaliases  
bash-2.05# sh -x /etc/rc3.d/S20fulldup start
+ ndd -set /dev/eri instance 0 
+ ndd -set /dev/eri adv_10hdx_cap 0 
+ ndd -set /dev/eri adv_10fdx_cap 0 
+ ndd -set /dev/eri adv_100hdx_cap 1 
+ ndd -set /dev/eri adv_100fdx_cap 0 
+ ndd -set /dev/eri adv_autoneg_cap 0 
bash-2.05# ./inspect-network 
autonegotiation: 0
10/half: 0
10/full: 0
100/half: 1
100/full: 0
bash-2.05# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000 
eri0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 128.200.34.23 netmask ffffff00 broadcast 128.200.34.255
        ether 0:3:ba:44:f5:76 
lo0: flags=2000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6> mtu 8252 index 1
        inet6 ::1/128 
eri0: flags=2000841<UP,RUNNING,MULTICAST,IPv6> mtu 1500 index 2
        ether 0:3:ba:44:f5:76 
        inet6 fe80::203:baff:fe44:f576/10 
bash-2.05#
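The test-network and inspect-network scripts themselves aren't reproduced here. As a guess at the core of such a test, the sketch below generates the ndd commands for one speed/duplex pair, using the same adv_*_cap parameters that S20fulldup sets. The function name `ndd_mode_commands` is hypothetical, and it only prints the commands rather than running them.

```shell
# ndd_mode_commands SPEED DUPLEX
# Print the ndd commands that advertise exactly one speed/duplex pair
# on /dev/eri instance 0 with autonegotiation off.
# SPEED is 10 or 100; DUPLEX is hdx or fdx.
ndd_mode_commands() (
    want="${1}${2}"
    case $want in
        10hdx|10fdx|100hdx|100fdx) ;;
        *) echo "usage: ndd_mode_commands 10|100 hdx|fdx" >&2; exit 2 ;;
    esac
    echo "ndd -set /dev/eri instance 0"
    for mode in 10hdx 10fdx 100hdx 100fdx; do
        if [ "$mode" = "$want" ]; then value=1; else value=0; fi
        echo "ndd -set /dev/eri adv_${mode}_cap $value"
    done
    echo "ndd -set /dev/eri adv_autoneg_cap 0"
)

# Hypothetical use, as root on autoinst:
#   ndd_mode_commands 100 fdx | sh
```

Printing the commands first makes it easy to eyeball them, or to diff them against the rc script, before anything touches the interface.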

I suggested that we try to match up which port autoinst is really on, using autoinst's MAC address: 0:3:ba:44:f5:76
Judy confirmed via the MAC address that autoinst was on port 3. Port 3 was set to autonegotiate. Because the switch is Cisco brand, we've hardcoded the switch to 100/Full, and autoinst is now hardcoded to 100/Full as well. Change recorded in autoinstall as well.
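One wrinkle when matching MAC addresses: Solaris ifconfig drops leading zeros (0:3:ba:44:f5:76), while Cisco switches commonly list addresses as three dotted groups of four hex digits. A small conversion sketch, assuming the switch uses that dotted form; the function name `mac_to_cisco` is made up.

```shell
# mac_to_cisco MAC
# Convert a colon-separated MAC address, with or without leading zeros,
# to the dotted xxxx.xxxx.xxxx form. Fails if there are not six octets.
mac_to_cisco() {
    printf '%s\n' "$1" | awk -F: '
        NF != 6 { exit 1 }
        {
            out = ""
            for (i = 1; i <= 6; i++) {
                octet = tolower($i)
                # restore the leading zero that ifconfig leaves off
                if (length(octet) == 1) octet = "0" octet
                out = out octet
                if (i == 2 || i == 4) out = out "."
            }
            print out
        }
    '
}

# Example: mac_to_cisco 0:3:ba:44:f5:76
```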
