Feedback on where to get more or less detailed is desired.
Thanks!
2006-02-02 Autoinst network problem(s) resolution plan
Assessing the problem
Get omnicenter doing an -available- bandwidth test (i.e., a benchmark); if the result is too low, notify
Get a periodic sampling of available bandwidth going - something that can be graphed well.
Probably also in omnicenter, or at least rrdtool/cacti, or just gnuplot, I guess, like I've been doing for load on meter
Inspect logs in detail, focusing on the time of poor available bandwidth
Set up graphing of load factor and bandwidth utilization (utilization this time, not available bandwidth)
Make my test-network script a cron job on autoinst, and see if it keeps coming up with the same result consistently. Depending on how
serious we are about a speedy resolution, we could run only at night, or throughout the day; at fixed times, or at random times. It incurs brief
downtime on each run - on the order of 4 downtimes of 30 seconds each. Find out for sure if NFS and ssh sessions are preserved during these downtimes.
Get a packet sniff going with a short snap length (headers only), to see if any particular kind of traffic is correlated with the problems
Upgrading to Solaris 10 would allow us to use DTrace to assess the problem. Solaris 10 also has a unique
identifier associated with each individual detected problem. These identifiers might be useful in understanding and
communicating about the problem.
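As a rough illustration of the available-bandwidth sampling and "if too low, notify" items above, here is a minimal Python sketch. It is not how omnicenter does anything; it assumes a hypothetical sink on some other host that accepts a TCP connection and discards whatever it receives, and the function names and log format are made up for this example. Timing a send only approximates available bandwidth, since the local socket buffer absorbs part of the data, so the transfer should be large relative to that buffer.

```python
import socket
import time

CHUNK = 64 * 1024  # bytes handed to each sendall() call


def throughput_mbps(nbytes, seconds):
    """Convert a byte count and an elapsed time into megabits per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return nbytes * 8 / seconds / 1_000_000


def measure_send_rate(host, port, total_bytes=8 * 1024 * 1024, timeout=10.0):
    """Time a bulk TCP send and return the observed rate in Mbit/s.

    Assumption: something on host:port reads and discards the data.
    Both the host and the port are placeholders chosen by the caller.
    """
    payload = b"\0" * CHUNK
    sent = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        start = time.monotonic()
        while sent < total_bytes:
            sock.sendall(payload)
            sent += len(payload)
        elapsed = time.monotonic() - start
    return throughput_mbps(sent, elapsed)


def format_sample(epoch_seconds, mbps):
    """One 'timestamp value' line per sample, a layout gnuplot can plot."""
    return f"{int(epoch_seconds)} {mbps:.2f}"


def is_too_low(mbps, threshold_mbps):
    """True when a sample falls below the notification threshold."""
    return mbps < threshold_mbps
```

Run from cron, each sample would be appended to a log with format_sample(time.time(), rate), and is_too_low(rate, threshold) would decide whether to send a notification. The threshold itself would have to come from what the link normally delivers.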
Check all hardware components
Replace motherboard
Replace memory
Replace any cards in the machine
Replace CPU(s)
Replace switch
Replace network cable
If network is on motherboard, try adding a NIC
Replace anything else that isn't nailed down :) Just take a gander inside the case, and see what's left
Identify a short network path from autoinst to some other host, and replace all network components between autoinst and that host.
Alternatively, and less onerously, identify two network paths that have only a small portion of the network path in common, and see
if the problem manifests with one and not the other. I suspect that both machines will glitch similarly though.
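The two-path comparison above boils down to simple set reasoning, sketched here in Python. The hop names are hypothetical placeholders, and the result is only a list of likely suspects, not a diagnosis: if both paths glitch, the shared portion (including autoinst itself) is the first place to look, and if only one glitches, the hops unique to that path are.

```python
def shared_hops(path_a, path_b):
    """Hops that appear on both paths, kept in path_a order."""
    in_b = set(path_b)
    return [hop for hop in path_a if hop in in_b]


def suspects(path_a, glitch_a, path_b, glitch_b):
    """Narrow down suspect hops from which paths showed the problem."""
    common = shared_hops(path_a, path_b)
    if glitch_a and glitch_b:
        # Both paths misbehave: look at what they have in common first.
        return common
    if glitch_a:
        # Only path A misbehaves: look at hops that only A uses.
        return [hop for hop in path_a if hop not in common]
    if glitch_b:
        # Only path B misbehaves: look at hops that only B uses.
        return [hop for hop in path_b if hop not in common]
    return []


# Hypothetical hop lists; real ones would come from the network layout.
path_a = ["autoinst", "switch-1", "router-a", "host-a"]
path_b = ["autoinst", "switch-1", "router-b", "host-b"]
both = suspects(path_a, True, path_b, True)
only_a = suspects(path_a, True, path_b, False)
```

With these example paths, both paths glitching leaves autoinst and switch-1 as the suspects, which matches the hunch that both machines will glitch similarly.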
Check all software components, including patches
In this case, this mostly means searching sunsolve and the logs, and placing a call to the vendor. There isn't really anything
to truss, at least not initially, since this is happening at the kernel level
Upgrading to Solaris 10 would allow us to use DTrace to assess the problem. Solaris 10 also has a unique
identifier associated with each individual detected problem. These identifiers might be useful in understanding and
communicating about the problem.
People impacted: folks in DCS doing installs or making edits to the scripts on autoinst. Get their subjective impressions
If reductionistic analysis isn't leading to a solution as readily as one might desire, go holistic: replace large chunks of the equation at a time
Replace software - reinstall/upgrade OS. Upgrading to Solaris 10 would allow us to use DTrace to assess the problem.
Solaris 10 also has a unique identifier associated with each individual detected problem. These identifiers might be
useful in understanding and communicating about the problem.
Replace hardware with something similar - perhaps bingy, permanently or on a trial basis
Replace entire platform with something else, like a Linux machine on x86 or x86-64. This would complicate adding new releases of
Solaris to autoinst, but we used to use the same technique (adding via NFS from a Sun) back when autoinst was a sun3