Note: This web page was automatically created from a PalmOS "pedit32" memo.

Problem solving on unix/linux systems


This document covers generic problem solving approaches that have
proved useful on unix and/or linux systems.  Some of it applies to other
operating systems as well.

If you see a method of solving problems on unix and/or linux systems that isn't here, please let me know: strombrg at dcs dot nac dot uci dot edu. I'll of course credit the source.
These are not, at this point, listed in any particular order, but they may be someday. :)
1) Get the full text of any error messages. Take a guess what they mean, and try to address the problem based on that.
2) Get the full text of any error messages, and google for them. Leave out anything very system-specific, like pid numbers or values of pointers (other than the NULL pointer). Often someone will have already solved the problem you're seeing, and there'll be an answer to your question in some archive somewhere. Googling in both the web and usenet is generally a good idea. You may or may not want to restrict your usenet search to a particular usenet group - sometimes this can increase the relevancy of the results, but of course it can also cut down greatly on the number of hits you get.
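As a rough sketch of the trimming described in item 2, a small filter can blank out process IDs and pointer values before you paste a message into a search engine. The function name generalize_error and the placeholder strings are made up for illustration. The patterns only cover bracketed pids and 0x-style hex values, and they assume a sed that understands -E (GNU and BSD sed do).

```shell
# Replace system-specific details in error messages read from stdin.
# Hypothetical helper; the placeholders [PID] and 0xADDR are arbitrary.
generalize_error() {
    sed -E \
        -e 's/\[[0-9]+\]/[PID]/g' \
        -e 's/0x[0-9a-fA-F]*[1-9a-fA-F][0-9a-fA-F]*/0xADDR/g'
    # The second pattern requires at least one nonzero hex digit,
    # so NULL pointers such as 0x0 are left as they are.
}

# Example use:
#   dmesg | tail -n 5 | generalize_error
```

The output still needs a human eye: hostnames, usernames and paths can be just as system-specific as pids.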
3) Run df. A lot of problems can be quickly tracked down by just checking if any filesystems are full, or any remote (EG, NFS) mounts are having problems.
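For item 3, here is a minimal sketch of a filter that picks the nearly full filesystems out of df output. It assumes the POSIX "df -P" column layout (capacity in the next-to-last column, mount point in the last) and mount points without spaces. The name full_filesystems and the default limit of 90 percent are my own choices for illustration.

```shell
# Print "mountpoint capacity" for every filesystem at or above a limit.
# Reads "df -P" style output on stdin; $1 is the limit in percent.
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                      # skip the header line
            pct = $(NF - 1)           # e.g. "95%"
            sub(/%$/, "", pct)        # strip the percent sign from the copy
            if (pct + 0 >= limit + 0) print $NF, $(NF - 1)
        }'
}

# Example use:
#   df -P | full_filesystems 90
```

Keep in mind that df itself may hang on a dead NFS mount, in which case the filter never gets any input; see item 5.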
4) Try truss/strace/par/trace/&c. These programs can list the system calls being executed by a program. Often the content of the system call trace, near the bottom, will give a fair indication of what is wrong. If one of the last things is trying to do something with a file, and an "Esomething" error status is returned, there's a good chance that's the problem. Alternatively, if the last thing is successfully reading a config file but shortly thereafter giving an error anyway (via write() or whatever, or perhaps not giving an error at all!), then there's a good chance that the error is in that config file. It's often worth trying something like this on both the client and the server. If it's hard to fire up a tracer against a client quickly enough, then "echo $$" in the client's shell, and truss -f -p the pid that yields from another window. This will truss (or whatever) your shell and its subprocesses. It's also sometimes helpful to truss -f -p inetd's pid, xinetd's pid, or another daemon's pid (like sshd). If tracing httpd, you may have to kill and restart httpd under truss, or change httpd's config file to only spawn one child (for example). Sometimes if you're on a busy system, you'll get flooded with information doing this. In such a situation, you can sometimes move to another representative system, or set up a tight while loop that will initiate your truss of a relevant process as soon as possible after it is exec'd, by ps | grep'ing again and again. See also https://stromberg.dnsalias.org/~strombrg/debugging-with-syscall-tracers.html
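Item 4 suggests looking near the bottom of a trace for calls that returned an "Esomething" status. A hedged sketch of that step follows. It assumes strace's output format, where a failed call ends in " = -1 ENAME (description)"; truss and other tracers print errors differently, so the pattern would need adjusting there. The function name failed_syscalls is hypothetical.

```shell
# Show the last few failed system calls from an strace log on stdin.
# $1 is how many to show (default 5).
failed_syscalls() {
    grep -E ' = -1 E[A-Z0-9]+' | tail -n "${1:-5}"
}

# Example use, with "someprogram" standing in for the program under test:
#   strace -f -o /tmp/trace.out someprogram
#   failed_syscalls 5 < /tmp/trace.out
```

Not every failed call is a problem. Programs routinely probe for optional files and get ENOENT, so treat the output as a list of suspects, not a verdict.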
5) You can usually tell which NFS mount is having problems by one of three methods:
5a) Run df &. Wait a long time. Eventually, df will probably tell you which NFS server is down.
5b) Run df &. Note the last filesystem listed. It is probably the -next- filesystem in the machine's filesystem list that has the problem. You can often list these filesystems by inspecting /etc/mtab, /etc/mnttab, or running the mount command with no arguments.
5c) Use a system call tracer on df &. This will most likely identify which filesystem is having problems pretty quickly. I generally prefer this method of the three.
6) If the problem you are troubleshooting is network related, fire up a sniffer on the traffic. ethereal/tethereal, snoop and tcpdump -v are pretty good at annotating network conversations with useful information. Even if the traffic is encrypted, you can sometimes make an educated guess about where the problem lies based on the last host to send anything as part of the conversation. Also, sometimes you can give sniffers keys that they can use to decrypt traffic.
7) truss and such will probably detect this to some extent, but check if the user in question is at or over their hard quota, or has exceeded their soft quota for longer than the allowed amount of time (usually one week). This problem can often lead to other problems - for example, X11 credential forwarding may mysteriously fail if the homedir is not writeable.
8) Check for permissions problems. Again, truss and such will help you pinpoint this fairly quickly, but it can still sometimes help to think "If I were this program, what files would I need, and do I have the needed access?"
9) Try to eliminate as many variables as you can. Compare across machines. Do all machines of the same OS type have the same problem? Consider entire platforms as well as increasingly minor releases of the software. Also compare across users: Is the problem unique to a specific user or group of users? If so, why?
10) Check if the program, or the components of the program, have been modified recently. ls -l `which chmod`, for example. Also, get a list of libraries used by the program, and see if they've been updated. You can usually do this with "ldd /bin/ls" or "odump -Dl /bin/ls" or "dump -X 32 -Tv /bin/ls". Another alternative is to strings the binary ("strings -a `which chmod` | grep / | less -sc"), and then checking each of the files and/or directories the program references.
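To go from item 10's library list to something you can feed to ls -l, the paths have to be pulled out of the ldd output. The sketch below assumes the Linux-style ldd format ("name => /path (address)", or a bare "/path (address)" for the dynamic loader); odump and dump print different layouts. The name ldd_paths is made up for this example.

```shell
# Extract absolute library paths from Linux-style ldd output on stdin.
ldd_paths() {
    awk '
        $2 == "=>" && $3 ~ /^\// { print $3; next }   # name => /path (addr)
        $1 ~ /^\//               { print $1 }         # /path (addr)'
}

# Example use: show modification times of everything /bin/ls links against.
#   ldd /bin/ls | ldd_paths | xargs ls -lL
```

The -L makes ls follow symlinks, since library names are often links to a versioned file, and it is the file's date that matters.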
11) If one system is working, and another is not, compare the md5sums of the files from step 10 on a working system and a nonworking system.
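One way the comparison in item 11 might be scripted: save md5sum output on each host, copy both files to one place, and list the paths whose checksums disagree. This is only a sketch. It assumes the usual "checksum  path" md5sum output and paths without spaces, and the function name changed_checksums is hypothetical.

```shell
# Print paths present in both md5sum listings but with different sums.
# $1: listing from the working host, $2: listing from the failing host.
changed_checksums() {
    awk '
        FNR == NR { good[$2] = $1; next }             # first file: remember sums
        ($2 in good) && good[$2] != $1 { print $2 }   # second file: compare
    ' "$1" "$2"
}

# Example use:
#   on each host:  md5sum /bin/ls /lib/libc.so.6 > /tmp/sums.$(hostname)
#   then:          changed_checksums /tmp/sums.goodhost /tmp/sums.badhost
```

Files that exist on only one of the two hosts are not reported here; a plain diff of the two listings will show those.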
12) If one user is working, and another user is not, there is a good chance there's a permissions problem, which again, truss and co. can help you identify. Another major class of problems comes from differences in environment variables. To track down this kind of problem, "su - okuser" followed by "env | sort > /tmp/env.okuser; exit" and then "su - baduser" followed by "env | sort > /tmp/env.baduser". You can then "diff -u /tmp/env.okuser /tmp/env.baduser" to determine what differences the users have in their environments. If there are a lot of differences, you can binary search on the differences, until you pinpoint the one that matters. I've also sometimes replaced an entire environment with that of another user, to see if there is any variable leading to the trouble, or if it is really something else. Please note that this sort+diff method isn't perfect, especially if some environment variables contain newlines. See also https://stromberg.dnsalias.org/~strombrg/env-search.html
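When the diff in item 12 is long, it can help to start from just the names of the variables that differ, and binary search over those. A small sketch, assuming the two files were produced by "env | sort" under the same locale (comm needs consistently sorted input). The name env_diff_names is invented for this example, and like the sort+diff method it is confused by values that contain newlines.

```shell
# List the names of variables that are missing from, or differ between,
# two sorted "env" dumps.
env_diff_names() {
    comm -3 "$1" "$2" |   # keep only lines unique to one file or the other
        tr -d '\t' |      # drop the tab comm uses to mark the second column
        cut -d= -f1 |     # keep the variable name
        sort -u
}

# Example use:
#   env_diff_names /tmp/env.okuser /tmp/env.baduser
```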
13) Sometimes it is helpful to set up a cron job or while loop, that will save the status of a particular thing (like "ps axf", "hps", "netstat -a", "uptime" and so on) in a series of files, named by date +%whatever. Then when a system finally crashes, you can get some idea of what was happening at the time, by looking at the last item(s) in your output.
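A minimal sketch of the snapshot idea in item 13. The directory, the date format and the function name snapshot are all assumptions for illustration; pick whatever "date +%whatever" format sorts well for you.

```shell
# Run a command and save its output in a file named after the command
# and the current time, e.g. /var/tmp/snap/uptime.20240131-235959
snapshot() {
    dir=$1
    shift
    stamp=$(date +%Y%m%d-%H%M%S)
    "$@" > "$dir/$(basename "$1").$stamp" 2>&1
}

# Example use, from a while loop (a cron job would work as well):
#   mkdir -p /var/tmp/snap
#   while :; do
#       snapshot /var/tmp/snap ps axf
#       snapshot /var/tmp/snap uptime
#       sleep 60
#   done
```

Left running for long, this will fill a filesystem (see item 3), so prune old files or write somewhere with room to spare.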
14) Sometimes it is helpful to see if a particular kind of problem is always happening at the same time every day. This tends to lead to hypotheses like "Is it a cron job?" or "Is it a user with regular behavior?" Checking nagios can help with this.
15) If you're dealing with a network service, try to replicate the problem (in a minimalist way) by telnet'ing to the port on the host (optionally, from the client), or using the "ssl-connect" program to connect to an openssl-encrypted service - see also https://stromberg.dnsalias.org/~strombrg/ssl-connect.html
16) If there is a technologically-enforced licensing scheme involved, check if any license servers have died, or if any licenses have expired, or if any license server configuration changes have been made (check both the license manager(s)' input data, as well as its executable and dependent libraries - see if any changes have been made recently).
17) Ask users when they first noticed the problem. This can lead to recalling a change that was made around that time.
18) If you have one group of users with a problem, and another group of users without a problem, you can binary search their config file keywords, much as described above for environment variables. You can also do a quick, rudimentary check of users' config files using the "classify" program, or my "equivs" program. classify has more flexible options, but my equivs program is usually faster on large collections of input files. https://stromberg.dnsalias.org/~strombrg/software/
19) If you're on an AIX system, and you're seeing strange shared library conflicts, study up on "loader domains". Question: Do any other *ix's have "loader domains" or something similar to them?
20) Check any and all relevant logs! If you don't find anything, go check any logs that have changed recently (works best on relatively quiet systems). This is triply true if you see a truss (or similar) writing to a log file, or opening a socket or door to syslog.
21) If you're having trouble finding stuff in your syslog files, consider combining them into one big file. Also, a script that pulls anything you've had trouble with before in your syslog data, is a really good way to be proactive.
22) Don't rule out multi-variable problems or holistic situations unnecessarily. While it's usually best to initially assume a single-problem issue, and that reductionistic analysis will work, eventually solution-resistant problems call for considering things like "OK, are there two (or more) variables that give the failure in specific combinations, while other combinations of the same variables give working results?" To sum this up in programmer/logician terms, in the two variable case, sometimes "a and b" yields problems, but sometimes it's "not a and b" or "a and not b" or "not a and not b".
23) Try getting a backtrace. This may help you, or it may help the people you request help from. Usually you can do this with "gdb program [core]" followed by "run -a arg1 arg2 arg3 ... argn" followed by "bt". Newer gdb's don't seem to want the -a anymore.
24) Try other forms of debugging - whatever's available. If you're a programmer, you may want to try ddd or similar on C/C++/whatever programs. If you're troubleshooting an sh/ash/ksh/bash script, try throwing in "set -x" (and optionally, "set +x") here and there, to put the error in context. If you're troubleshooting a csh/tcsh script, try putting a "-x" on the #! line (the first line).
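To illustrate the "set -x" / "set +x" bracketing from item 24, here is a toy function. The function name and the path /tmp/example.conf are placeholders, and the ":" command stands in for whatever step you actually suspect.

```shell
# Trace only the suspicious part of a script.
copy_config() {
    set -x                       # start echoing each command to stderr
    target=/tmp/example.conf
    : "would copy to $target"    # stand-in for the real, suspect command
    set +x                       # stop tracing again
}

# Example use:
#   copy_config 2> /tmp/trace.log
```

Each traced command is written to stderr with a "+" prefix, after variable expansion, which is often enough to spot an empty or mis-quoted variable.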
25) If you're on a mixed wordsize (EG 32 bit and 64 bit) system, are you getting a bad combination of 32 bit and 64 bit libraries at load time? Or are you seeing libraries that are available for 32 bit systems, but not for 64 bit systems (or vice-versa)?
26) If your OS has a "map the 0th page to something innocuous and writable" option, go ahead and try it, but be sure to report the crash to the developers/maintainers anyway. This can sometimes help make null pointer dereferencing relatively toothless. Some OSes put a "bomb" at the 0th page, so that programmers can catch their errors early. Others don't. On Solaris 8 (maybe earlier), we have /usr/lib/lib0@0.so.1 - loading it through LD_PRELOAD will sometimes make such problems go away.
27) Can you move the application to another machine, on which it -will- work?
28) Can you upgrade the operating system on the machine(s) that is/are having problems?
29) Can you put a different operating system on the same hardware, that will fix the problem? (EG, there are many *ix's that run on x86 hardware. If you're having problems with NetBSD, maybe try Fedora. If you're having problems with Fedora, maybe try DragonFlyBSD. If you're having problems with DragonFly, maybe try SuSE. And so on.) When considering this, keep in mind that in some environments, it's helpful to cut down on the number of OSes in play. In others, you can choose whatever's best for just the single job at hand. Bear in mind that a large number of OSes means extra labor put into patching, as compared to a small number of OSes. Some folks like to just compile their own binaries from the same sources, and there can be a place for that, but don't underestimate the value of a vendor or distributor doing quality testing on the programs you're using, in the environment you're using them.
30) A tool like nagios, netreo or bigbrother can help you recognize patterns in a problem. EG, does it happen at the same time of day, 5 days a week? Is it happening to all the Suns we support?
31) If you are trying to sort out trouble with an RPC service, rpcinfo is your friend, in addition to some of the other methods. If you "rpcinfo -p <hostname>", that should tell you what RPC services the host in question has registered. You can then "rpcinfo -u <hostname> <rpcservice>" to list the readiness of the UDP versions of a service, and you can do the same for TCP versions of a server with the "-t" option. See also https://stromberg.dnsalias.org/~strombrg/rpc-health.html
32) Try ping. :) If a machine isn't pingable, try traceroute or mtr. traceroute and mtr will be more useful if you've saved a copy of what they should normally look like in advance - that is, unless you have a network small enough to know how it's supposed to look without that. :) Be aware though, that if your network has redundant paths built into it, sometimes what you saved won't correspond to the path you're seeing at the time you investigate a problem.
33) Check if the problem is DNS-related. Try "dig hostname.uci.edu", and "dig hostname.uci.edu mx" and "dig -x 128.200.34.1" and such. Some weird network problems can be traced to slow DNS resolution, say, because of a down DNS server timing out before a good DNS server answers. Another common problem is for programs that verify that a host has a good source address, to reverse resolve the client's IP address - and some of these programs will reject requests from hosts that don't have proper reverse resolution configured (ask your DNS people about "the PTR record"). Make sure that your /etc/resolv.conf is set up correctly too. Also, sometimes what -seems- to be a DNS problem can end up being a bad entry in the NIS "hosts" map. I recommend that you keep your NIS hosts map 0 length.
These are from Shane Chen on the OCLUG mailing list, on the subject of tracking down DNS problems:
* Figure out the condition of your ns servers by pinging them. Are they up? Is the latency bad? Is it dropping packets?
* Check their performance by manually resolving against them. Something like `time host google.com ns_server.foo`. How long is it taking to resolve something? How long does it take to resolve the same domain if you try another name server (e.g. ns1.earthlink.net)?
* See if there's any difference between pinging a host by FQDN and by IP (preferably some domain you haven't resolved by using your local name server - `host foo.bar ns1.earthlink.net`, then ping the IP first, followed by the domain).
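To run the ping and timing checks above against every configured name server, you first need the list of servers. A small sketch, assuming the common resolv.conf layout of "nameserver <address>" lines; the function name list_nameservers is hypothetical.

```shell
# Print the name servers listed in a resolv.conf-style file.
# $1 is the file to read (default /etc/resolv.conf).
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Example use (ping's -c option is an assumption; some pings differ):
#   for ns in $(list_nameservers); do
#       ping -c 3 "$ns"
#       time host google.com "$ns"
#   done
```

If the first server in the list is the slow or dead one, every lookup pays its timeout before the next server is tried, which matches the "slow DNS resolution" symptom in item 33.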
34) Another class of problems can be tracked down to trouble in some form of name service switch configuration. Some hosts put this information in /etc/nsswitch.conf, /etc/svc.conf, or even /etc/resolv.conf.
35) If you're having (or suspect you're having) NIS problems, try ypcat'ing the relevant maps, EG "ypcat passwd". Some weird NIS problems can be traced back to a corrupted map, or to a missing map that some OSes require and others don't (EG, for speeding up, through indexing, getpwuid lookups - a good sniffer is your friend here). Other NIS problems can be traced to an outdated NIS slave or master that hasn't been updated in a while - "ypwhich" and "ypwhich -m" can be helpful. You can also get a list of map aliases with "ypcat -x".
36) Try to get an easy way of replicating the problem. If it's a complaint from only a single user, consider using x11vnc or similar so you can see the problem "first hand" over the network. https://stromberg.dnsalias.org/~strombrg/vnc.html#addons
37) If you suspect a particular process on a system of causing load problems, or other forms of problems, one way of testing that hypothesis is to kill the process. But there's a more subtle way too: kill -STOP <pid>, monitor how the system changes, and then kill -CONT <pid> to make the process pick up where it left off.
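The stop/continue trick in item 37 can be wrapped up so the process is always resumed. This is a sketch only: the function name pause_process and the 30 second default are arbitrary, and you need permission to signal the process.

```shell
# Suspend a process for a while, then let it continue.
# $1: pid, $2: seconds to keep it stopped (default 30).
pause_process() {
    pid=$1
    seconds=${2:-30}
    kill -STOP "$pid" || return 1   # give up if we cannot signal it
    sleep "$seconds"                # watch uptime, vmstat, etc. meanwhile
    kill -CONT "$pid"
}

# Example use, with 12345 standing in for the suspect's pid:
#   pause_process 12345 60
```

Be careful with processes that others are waiting on. Clients of a stopped daemon may time out, which can look like a new problem.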


Less technical items:
1) Post to newsgroups or bulletin boards or mailing lists -relevant- to the difficulty you're faced with. Seriously consider reading any relevant FAQs -first-! Schedule yourself times to check in on the message thread you've created. Consider hanging around on that forum a while longer to contribute a couple/few solutions (or more) yourself, to repay the group for its help. Read this: http://www.catb.org/~esr/faqs/smart-questions.html !
2) Also, sometimes using some form of chat channel, like IRC or an Instant Messaging service can be helpful for quick turnaround, but often will not give your question exposure to the large number of eyes that a bbs, mailing list or newsgroup will.
3) Contact the relevant vendor, vendors, author, authors, maintainer, or maintainers, if any. If you "have no vendor", consider signing up with one of the many consulting businesses that are springing up, which specialize in support of other people's opensource software. You like to be thanked for being helpful; so do the people you're asking for help from. In the case of an opensource author or maintainer, be sure to mention how valuable the software system is to you or your clients' endeavors, if it is.
4) Setting user expectations: I've found that the single most useful phrase in helping endusers understand the nature of IT jobs, is to say "OK, that's one hurdle cleared. Now we have to check and see if there are any others."
5) Smile and try to enjoy your work. This will often spread to your users in the form of greater user satisfaction. EG, if you grimace on the phone, sometimes people pick up on that.

