Note: This web page was automatically created from a PalmOS "pedit32" memo.

Problem solving on unix/linux systems

This document covers generic problem solving approaches that have
proved useful on unix and/or linux systems. Some of it applies to other
operating systems as well.

If you see a method of solving problems on unix and/or linux systems that
isn't here, please let me know: strombrg at dcs dot nac dot uci dot edu.
I'll of course credit the source.

These are not, at this point, listed in any particular order, but they
may be someday. :)

1) Get the full text of any error messages. Take a guess at what they mean,
and try to address the problem based on that.

2) Get the full text of any error messages, and google for them.
Leave out anything very system-specific, like pid numbers or values
of pointers (other than the NULL pointer). Often someone will have
already solved the problem you're seeing, and there'll be an answer to
your question in some archive somewhere. Googling in both the web and
usenet is generally a good idea. You may or may not want to restrict
your usenet search to a particular usenet group - sometimes this can
increase the relevancy of the results, but of course it can also cut
down greatly on the number of hits you get.

3) Run df. A lot of problems can be quickly tracked down by just
checking if any filesystems are full, or any remote (EG, NFS) mounts
are having problems.
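
A minimal sketch of the "is anything full?" check, assuming a df that
supports the POSIX -P option and mount points without spaces in their
names. The function name full_filesystems and the 90% default are only
illustrative choices, not anything standard.

```shell
# Print mount points at or above a usage threshold (default 90%).
# Reads "df -P" style output on stdin, so it works on saved output too.
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                        # skip the header line
            use = $5
            sub(/%$/, "", use)          # "97%" -> "97"
            if (use + 0 >= limit)
                print $6, $5            # mount point, then usage
        }'
}

# Typical use (df itself may hang on a dead NFS mount):
#   df -P | full_filesystems 95
```

Because it reads df output on stdin rather than running df itself, it
can also be pointed at output saved from another machine.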

4) Try truss/strace/par/trace/&c. These programs can list system
calls being executed by a program. Often the content of the system call
trace, near the bottom, will give a fair indication of what is wrong.
If one of the last things is trying to do something with a file, and an
"Esomething" error status is returned, there's a good chance that's
the problem. Alternatively, if the last thing is successfully reading
a config file but shortly thereafter giving an error anyway (via write()
or whatever, or perhaps not giving an error at all!), then there's a good
chance that the error is in that config file. It's often worth trying
something like this on both the client and the server. If it's hard to
fire up a tracer against a client quickly enough, then "echo $$", and
truss -f -p the pid that yields from another window. This will truss
(or whatever) your shell, and its subprocesses. It's also sometimes
helpful to truss -f -p inetd's pid, xinetd's pid, or other daemon's pid
(like sshd). If tracing httpd, you may have to kill and restart httpd
under truss, or change httpd's config file to only spawn one child (for
example). Sometimes if you're on a busy system, you'll get flooded with
information doing this. In such a situation, you can sometimes move
to another representative system, or set up a tight while loop that
will initiate your truss of a relevant process as soon as possible
after it is exec'd, by ps | grep'ing again and again. See also https://stromberg.dnsalias.org/~strombrg/debugging-with-syscall-tracers.html
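
When a trace is long, it can help to look at just the failed calls near
the end. This is a rough sketch, assuming strace-style output where a
failure looks like "= -1 ENOENT"; the "Err#2" pattern for truss is my
assumption about its format and may need adjusting on your system.
The function name failed_syscalls is made up for this example.

```shell
# Show only the failed system calls from a saved trace (last 20 by default).
failed_syscalls() {
    grep -E '(= -1 E[A-Z0-9]+|Err#[0-9]+)' | tail -n "${1:-20}"
}

# Typical use on Linux:
#   strace -f -o /tmp/trace.out some_command
#   failed_syscalls < /tmp/trace.out
```

Not every failed call is a problem (programs routinely probe for files
that don't exist), so treat the output as a list of suspects.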

5) You can usually tell which NFS mount is having problems by one of
three methods:

5a) Run df &. Wait a long time. Eventually, df will probably tell
you which NFS server is down.

5b) Run df &. Note the last filesystem listed. It is probably the
-next- filesystem in the machine's filesystem list that has the problem.
You can often list these filesystems by inspecting /etc/mtab, /etc/mnttab,
or running the mount command with no arguments.

5c) Use a system call tracer on df &. This will most likely identify
which filesystem is having problems pretty quickly. I generally prefer
this method of the three.

6) If the problem you are troubleshooting is network related, fire up
a sniffer on the traffic. ethereal/tethereal, snoop and tcpdump -v are
pretty good at annotating network conversations with useful information.
Even if the traffic is encrypted, you can sometimes make an educated guess
about where the problem lies based on the last host to send anything as
part of the conversation. Also, sometimes you can give sniffers keys
that they can use to decrypt traffic.

7) truss and such will probably detect this to some extent, but check
if the user in question is at or over their hard quota, or has
exceeded their soft quota for more than the specified amount of time
(usually one week). This problem can often lead to other problems -
for example, X11 credential forwarding may mysteriously fail if the
homedir is not writeable.

8) Check for permissions problems. Again, truss and such will help you
pinpoint this fairly quickly, but it can still sometimes help to think
"If I were this program, what files would I need, and do I have the
needed access?"

9) Try to eliminate as many variables as you can. Compare across
machines. Do all machines of the same OS type have the same problem?
Consider entire platforms as well as increasingly minor releases of
the software. Also compare across users: Is the problem unique to a
specific user or group of users? If so, why?

10) Check if the program, or the components of the program, have been
modified recently. "ls -l `which chmod`", for example. Also, get a list
of libraries used by the program, and see if they've been updated. You
can usually do this with "ldd /bin/ls" or "odump -Dl /bin/ls" or
"dump -X 32 -Tv /bin/ls". Another alternative is to run strings on the
binary ("strings -a `which chmod` | grep / | less -sc"), and then check
each of the files and/or directories the program references.
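
One way to check the modification times of a program's libraries in a
single pass. This sketch assumes the Linux flavor of ldd output
("name => /path (address)"); other systems format it differently. The
function name library_paths is hypothetical.

```shell
# Pull the absolute library paths out of ldd output read on stdin.
library_paths() {
    awk '
        $2 == "=>" && $3 ~ /^\// { print $3; next }   # name => /path (addr)
        $1 ~ /^\//               { print $1 }         # /path (addr)
    '
}

# Typical use: list each library with its modification time,
# following symlinks (-L) to the real file:
#   ldd /bin/ls | library_paths | xargs ls -lL
```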

11) If one system is working, and another is not, compare the md5sums
of the files in step 10 on a working system and a nonworking system.
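
A sketch of that comparison, assuming you've run md5sum over the same
list of files on each machine and copied both listings to one place.
The function name differing_checksums is made up, and paths containing
spaces aren't handled.

```shell
# Compare two md5sum listings ("checksum  path" per line) and report
# paths whose checksums differ, or that appear in only one listing.
# Assumes the first listing is not empty.
differing_checksums() {
    awk '
        NR == FNR      { sum[$2] = $1; next }     # first file: remember sums
                       { seen[$2] = 1 }
        !($2 in sum)   { print "only-in-second", $2; next }
        sum[$2] != $1  { print "differs", $2 }
        END { for (p in sum) if (!(p in seen)) print "only-in-first", p }
    ' "$1" "$2"
}

# Typical use:
#   md5sum /bin/ls /lib/libc.so.6 > /tmp/md5.$(hostname)    # on each host
#   differing_checksums /tmp/md5.goodhost /tmp/md5.badhost
```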

12) If one user is working, and another user is not, there is a good
chance there's a permissions problem, which again, truss and co. can help
you identify. Another major class of problems comes from differences
in environment variables. To track down this kind of problem, "su
- okuser" followed by "env | sort > /tmp/env.okuser; exit" and
then "su - baduser" followed by "env | sort > /tmp/env.baduser".
You can then "diff -u /tmp/env.okuser /tmp/env.baduser" to determine
what differences the users have in their environments. If there are
a lot of differences, you can binary search on the differences, until
you pinpoint the one that matters. I've also sometimes replaced an
entire environment with that of another user, to see if there is any
variable leading to the trouble, or if it is really something else.
Please note that this sort+diff method isn't perfect, especially
if some environment variables contain newlines. See also https://stromberg.dnsalias.org/~strombrg/env-search.html
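
The two steps above (diff the environments, then try a command under
the other user's environment) might look roughly like this. Both
function names are hypothetical, and like the sort+diff method, this
breaks on values that contain newlines.

```shell
# Show only the lines that differ between two saved "env | sort" listings.
# "<" lines come from the first file, ">" lines from the second.
env_differences() {
    diff "$1" "$2" | grep '^[<>]'
}

# Run a command with its whole environment replaced by a saved listing.
run_with_env() {
    envfile="$1"; shift
    (
        IFS='
'                                  # split the listing on newlines only
        set -f                     # no filename globbing on the values
        exec env -i $(cat "$envfile") "$@"
    )
}

# Typical use:
#   env_differences /tmp/env.okuser /tmp/env.baduser
#   run_with_env /tmp/env.okuser some_command arg1 arg2
```

To binary search, delete half of the differing lines from a copy of the
listing and rerun the command under it with run_with_env.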

13) Sometimes it is helpful to set up a cron job or while loop, that will
save the status of a particular thing (like "ps axf", "hps", "netstat
-a", "uptime" and so on) in a series of files, named by date +%whatever.
Then when a system finally crashes, you can get some idea of what was
happening at the time, by looking at the last item(s) in your output.
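
A minimal sketch of such a loop. The snapshot function, the directory
and the one-minute interval are all just examples; pick whatever
commands and date format suit the problem.

```shell
# Run a command and save its output in a file named after the current
# time, e.g. /var/tmp/snapshots/uptime.20240131-235900
snapshot() {
    dir="$1"; label="$2"; shift 2
    mkdir -p "$dir" || return 1
    "$@" > "$dir/$label.$(date +%Y%m%d-%H%M%S)" 2>&1
}

# One sample a minute, until interrupted:
#   while :; do
#       snapshot /var/tmp/snapshots uptime uptime
#       snapshot /var/tmp/snapshots netstat netstat -a
#       sleep 60
#   done
```

Keep an eye on disk usage if you leave this running for long; a full
filesystem would add a second problem to the one you're chasing.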

14) Sometimes it is helpful to see if a particular kind of problem
is always happening at the same time every day. This tends to lead
to hypotheses like "is it a cron job?" or "Is it a user with regular
behavior?" Checking nagios can help with this.

15) If you're dealing with a network service, try to replicate
the problem (in a minimalist way) by telnet'ing to the port on
the host (optionally, from the client), or using the "ssl-connect"
program to connect to an openssl-encrypted service - see also https://stromberg.dnsalias.org/~strombrg/ssl-connect.html

16) If there is a technologically-enforced licensing scheme involved,
check if any license servers have died, or if any licenses have expired,
or if any license server configuration changes have been made (check
both the license manager(s)' input data, as well as its executable and
dependent libraries - see if any changes have been made recently).

17) Ask users when they first noticed the problem. This can lead to
recalling a change that was made around that time.

18) If you have one group of users with a problem, and another
group of users without a problem, you can binary search their
config file keywords, much like was mentioned above on environment
variable issues. You can also do a quick, rudimentary check of
users' config files using the "classify" program, or my "equivs"
program. classify has more flexible options, but my equivs
program is usually faster on large collections of input files. https://stromberg.dnsalias.org/~strombrg/software/

19) If you're on an AIX system, and you're seeing strange shared library
conflicts, study up on "loader domains". Question: Do any other *ix's
have "loader domains" or something similar to them?

20) Check any and all relevant logs! If you don't find anything, go
check any logs that have changed recently (works best on relatively
quiet systems). This is triply true if you see a truss (or similar)
writing to a log file, or opening a socket or door to syslog.

21) If you're having trouble finding stuff in your syslog files, consider
combining them into one big file. Also, a script that pulls out of your
syslog data anything you've had trouble with before is a really good
way to be proactive.
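
Such a script can be very small. This sketch assumes you keep a file
of known trouble signs, one extended regular expression per line; the
function name scan_logs and the file names are made up.

```shell
# Print log lines matching any pattern in a file of known trouble signs.
scan_logs() {
    patterns="$1"; shift
    grep -E -f "$patterns" "$@"
}

# Typical use, perhaps from cron:
#   scan_logs /usr/local/etc/known-trouble.txt /var/log/messages
```

Each time you solve a problem that left a mark in the logs, add its
pattern to the file.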

22) Don't rule out multi-variable problems or holistic situations
unnecessarily. While it's usually best to initially assume a
single-problem issue, and that reductionistic analysis will work,
eventually solution-resistant problems call for considering things like
"OK, are there two variables (or more), in specific combinations, that
give the failure, while other combinations of the same variables give
working results?" To sum this up in programmer/logician terms, in the
two variable case, sometimes "a and b" yields problems, but sometimes it's
"not a and b" or "a and not b" or "not a and not b".

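For the two variable case, trying all four combinations can be
automated. This is only a sketch: it assumes the thing under test can
be driven by two environment variables, here given the hypothetical
names A and B, each set to 0 or 1.

```shell
# Run a test command under every on/off combination of two settings and
# report which combinations fail (nonzero exit status).
try_combinations() {
    for a in 0 1; do
        for b in 0 1; do
            if A="$a" B="$b" "$@" >/dev/null 2>&1; then
                echo "a=$a b=$b ok"
            else
                echo "a=$a b=$b FAIL"
            fi
        done
    done
}

# Typical use, with a hypothetical test script that reads $A and $B:
#   try_combinations ./reproduce-problem.sh
```

A single FAIL line on something other than "a=1 b=1" is the sign of
one of the less obvious combinations described above.
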
23) Try getting a backtrace. This may help you, or it may help the
people you request help from. Usually you can do this with "gdb program
[core]" followed by "run -a arg1 arg2 arg3 ... argn" followed by "bt".
Newer gdb's don't seem to want the -a anymore.

24) Try other forms of debugging - whatever's available. If you're a
programmer, you may want to try ddd or similar on C/C++/whatever programs.
If you're troubleshooting an sh/ash/ksh/bash script, try throwing in
"set -x" (and optionally, "set +x") here and there, to put the error
in context. If you're troubleshooting a csh/tcsh script, try putting a
"-x" on the #! line (the first line).
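
A small illustration of bracketing just the suspect part of an sh
script with "set -x" and "set +x". The function copy_config is a
made-up stand-in for whatever part of your script is misbehaving.

```shell
# Trace only the part of a script that is under suspicion.
copy_config() {
    set -x                 # from here on, each command is echoed to stderr
    cp "$1" "$2"
    status=$?
    set +x                 # tracing off again
    return "$status"
}

# Typical use inside a larger script:
#   copy_config /etc/foo.conf /tmp/foo.conf.test
```

The trace lines show the commands after variable expansion, which is
often where the surprise is.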

25) If you're on a mixed wordsize (EG 32 bit and 64 bit) system, are you
getting a bad combination of 32 bit and 64 bit libraries at load time?
Or are you seeing libraries that are available for 32 bit systems,
but not for 64 bit systems (or vice-versa)?

26) If your OS has a "map the 0th page to something innocuous and
writable" option, go ahead and try it, but be sure to report the crash
to the developers/maintainers anyway. This can sometimes help make null
pointer dereferencing relatively toothless. Some OSes put a "bomb" at the
0th page, so that programmers can catch their errors early. Others don't.
On Solaris 8 (maybe earlier), we have /usr/lib/lib0@0.so.1 - which you
should sometimes be able to eliminate problems with through LD_PRELOAD.

27) Can you move the application to another machine, on which it -will- work?

28) Can you upgrade the operating system on the machine(s) that is/are
having problems?

29) Can you put a different operating system on the same hardware, that
will fix the problem? (EG, there are many *ix's that run on x86 hardware.
If you're having problems with NetBSD, maybe try Fedora. If you're
having problems with Fedora, maybe try DragonFlyBSD. If you're having
problems with DragonFly, maybe try SuSE. And so on.) When considering
this, keep in mind that in some environments, it's helpful to cut down
on the number of OSes in play. In others, you can choose whatever's best
for just the single job at hand. Bear in mind that a large number of
OSes means extra labor put into patching, as compared to a small number
of OSes. Some folks like to just compile their own binaries from the same
sources, and there can be a place for that, but don't underestimate the
value of a vendor or distributor doing quality testing on the programs
you're using, in the environment you're using them.

30) A tool like nagios, netreo or bigbrother can help you recognize
patterns in a problem. EG, does it happen at the same time of day,
5 days a week? Is it happening to all the Suns we support?

31) If you are trying to sort out trouble with an RPC service,
rpcinfo is your friend, in addition to some of the other methods.
If you "rpcinfo -p <hostname>", that should tell you what
RPC services the host in question has registered. You can then
"rpcinfo -u <hostname> <rpcservice>" to list the
readiness of the UDP versions of a service, and you can do the same
for TCP versions of a service with the "-t" option. See also https://stromberg.dnsalias.org/~strombrg/rpc-health.html

32) Try ping. :) If a machine isn't pingable, try traceroute or mtr.
traceroute and mtr will be more useful if you've saved a copy of what
they should normally look like in advance - that is, unless you have a
network small enough to know how it's supposed to look without that. :)
Be aware though, that if your network has redundant paths built into it,
sometimes what you saved won't correspond to the path you're seeing at
the time you investigate a problem.

33) Check if the problem is DNS-related. Try "dig hostname.uci.edu",
and "dig hostname.uci.edu mx" and "dig -x 128.200.34.1" and such.
Some weird network problems can be traced to slow DNS resolution, say,
because of a down DNS server timing out before a good DNS server answers.
Another common problem is for programs that verify that a host has a
good source address, to reverse resolve the client's IP address - and
some of these programs will reject requests from hosts that don't have
proper reverse resolution configured (ask your DNS people about "the PTR
record"). Make sure that your /etc/resolv.conf is set up correctly too.

Also, sometimes what -seems- to be a DNS problem can end up being a bad
entry in the NIS "hosts" map. I recommend that you keep your NIS hosts
map 0 length.

These are from Shane Chen on the OCLUG mailing list, on the subject of
tracking down DNS problems:

* Figure out the condition of your ns servers by pinging them.
Are they up? Is the latency bad? Is it dropping packets?

* Check their performance by manually resolving against them. Something
like `time host google.com ns_server.foo`. How long is it taking to
resolve something? How long does it take to resolve the same domain if
you try another name server (e.g. ns1.earthlink.net)?

* See if there's any difference between pinging a host by FQDN and by IP
(preferably some domain you haven't resolved by using your local name
server - `host foo.bar ns1.earthlink.net`, then ping the IP first,
followed by the domain).

34) Another class of problems can be tracked down to trouble in some form
of name service switch configuration. Some hosts put this information
in /etc/nsswitch.conf, /etc/svc.conf, or even /etc/resolv.conf.

35) If you're having (or suspect you're having) NIS problems, try
ypcat'ing the relevant maps, EG "ypcat passwd". Some weird NIS problems
can be traced back to a corrupted map, a map that some OSes require and
others don't (EG, for speeding up, through indexing, getpwuid lookups -
a good sniffer is your friend here). Other NIS problems can be traced
to an outdated NIS slave or master that hasn't been updated in a while -
"ypwhich" and "ypwhich -m" can be helpful. You can also get a list of
map aliases with "ypcat -x".

36) Try to get an easy way of replicating the problem. If it's a
complaint from only a single user, consider using x11vnc or similar
so you can see the problem "first hand" over the network. https://stromberg.dnsalias.org/~strombrg/vnc.html#addons

37) If you suspect a particular process on a system of causing
load problems, or other forms of problems, one way of testing that
hypothesis is to kill the process. But there's a more subtle way too:
kill -STOP <pid>, monitor how the system changes, and then kill
-CONT <pid> to make the process pick up where it left off.
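
The stop, watch, continue sequence can be wrapped up so you don't
forget the last step. A sketch only; the function name
pause_and_observe is made up, and the command you observe with is up
to you.

```shell
# Suspend a process, wait a bit, run an observation command, then
# resume the process where it left off.
pause_and_observe() {
    pid="$1"; seconds="$2"; shift 2
    kill -STOP "$pid" || return 1
    sleep "$seconds"
    "$@"                       # e.g. uptime
    kill -CONT "$pid"
}

# Typical use: stop pid 12345 for 30 seconds, then look at the load:
#   pause_and_observe 12345 30 uptime
```

A stopped process can't respond to anything, so be careful doing this
to a process that other things are waiting on.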

Less technical items:

1) Post to newsgroups or bulletin boards or mailing lists -relevant-
to the difficulty you're faced with. Seriously consider reading
any relevant FAQ's -first-! Schedule yourself times to check in
on the message thread you've created. Consider hanging around
on that forum a while longer to contribute a couple/few solutions
(or more) yourself, to repay the group for its help. Read this!
http://www.catb.org/~esr/faqs/smart-questions.html

2) Also, sometimes using some form of chat channel, like IRC or an Instant
Messaging service can be helpful for quick turnaround, but often will
not give your question exposure to the large number of eyes that a bbs,
mailing list or newsgroup will.

3) Contact the relevant vendor, vendors, author, authors, maintainer,
or maintainers, if any. If you "have no vendor", consider signing up
with one of the many consulting businesses that are springing up, which
specialize in support of other people's opensource software. You like
to be thanked for being helpful; so do the people you're asking for
help from. In the case of an opensource author or maintainer, be sure
to mention how valuable the software system is to you or your clients'
endeavors, if it is.

4) Setting user expectations: I've found that the single most useful
phrase in helping endusers understand the nature of IT jobs, is to say
"OK, that's one hurdle cleared. Now we have to check and see if there
are any others."

5) Smile and try to enjoy your work. This will often spread to your
users in the form of greater user satisfaction. EG, if you grimace on
the phone, sometimes people pick up on that.
