Note: This web page was automatically created from a PalmOS "pedit32" memo.

oacstats behind-the-scenes doc, 2004-07-28


The results yielded by the DCS-written portion of oacstats are relatively
well documented at this time.  I wrote up a previous explanation of how
oacstats works, but apparently no one thought to save a copy.

So here is a behind-the-scenes look at how the DCS-written portion of
oacstats works.

First off, oacstats is launched three times a week by the following
root cronjob on bounce2.nac.uci.edu:

# Sunday, Tuesday, Friday
0 0 * * 0,2,5 /usr/local/etc/daily-srsh

This script first sleeps for a random interval of between 0 and 24
hours.  This "jitters" the sampling, so that hosts which are always
turned off at the same time every day are not missed on every run.
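
The jitter step can be sketched as follows.  This is only an illustration
in Python, not the contents of daily-srsh; the function names and the
whole-second granularity are assumptions.

```python
import random
import time

SECONDS_PER_DAY = 24 * 60 * 60


def jitter_delay(max_seconds=SECONDS_PER_DAY, rng=random):
    """Return a random whole number of seconds in [0, max_seconds]."""
    return rng.randint(0, max_seconds)


def sleep_with_jitter(max_seconds=SECONDS_PER_DAY, sleep=time.sleep):
    """Sleep for a random delay, then return the delay that was used.

    The sleep function is a parameter so it can be swapped out when
    testing (hypothetical convenience, not part of the real script).
    """
    delay = jitter_delay(max_seconds)
    sleep(delay)
    return delay
```

With a midnight start and a delay of up to 24 hours, the collection can
begin at any time of day on each of the three scheduled days.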

Next, the script launches 3 "phases" of the DCS-oacstats data collection:

1) srsh-tied
2) dns-tied
3) subnet-tied

There is actually a fourth phase, which was only collected once:

4) p0f-tied

Now I'll go into some detail about each of these phases, and then wrap
up with a summary of how these phases are combined into a variety of
views of the data.

1) srsh-tied
This phase is run on bounce2.nac.uci.edu itself.  The host simply srsh's
to each host in the srsh database and runs /dcslib/allsys/etc/HostInfo,
which just outputs a bunch of information describing the machine it was
run on.  In practice, these hosts tend to be DCS-supported, or sometimes
(unfortunately) -formerly- DCS-supported.  This is the highest level
of detail we collect on hosts, and it is only collected for a relatively
small number of hosts.

2) dns-tied
In this phase we iterate over /dcslib/allsys/etc/hosts.uci, probing
each host in the list (except for a relatively small list of exception
hosts, built up from people complaining about oacstats probing their
machines).  We collect things like well known ports, registered RPC
services, some of the banners on well known ports, and Microsoft
networking information where enabled.  We also do some OS guessing
based on "active IP fingerprinting", meaning the guess is based on
attributes of an IP conversation initiated by the host doing the
collection, network-stats-collector-1.nacs.uci.edu.  The script that
collects all this "dns-tied" data, run once per ucinet host, is
~oacstats/bin/do-dns-tied .
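
A rough sketch of the outer loop of this phase, in Python, might look like
the following.  It is not do-dns-tied itself.  It assumes hosts.uci uses
an /etc/hosts-like layout (address, then name, with "#" comments), which
this memo does not confirm, and it only checks whether TCP ports accept
connections.  All function names are made up for illustration.

```python
import socket


def parse_hosts(text):
    """Yield (address, name) pairs from hosts-style text (assumed format)."""
    for line in text.splitlines():
        # Drop anything after a "#" comment marker, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            yield fields[0], fields[1]


def hosts_to_probe(text, exceptions):
    """Return the hosts to probe, leaving out anything on the exception list."""
    skip = {entry.lower() for entry in exceptions}
    return [(addr, name) for addr, name in parse_hosts(text)
            if addr not in skip and name.lower() not in skip]


def port_is_open(address, port, timeout=2.0):
    """Return True if a TCP connection to (address, port) succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def open_ports(address, ports, timeout=2.0):
    """Return the subset of ports that accept a TCP connection."""
    return [p for p in ports if port_is_open(address, p, timeout)]
```

A caller would run open_ports(address, [22, 25, 80]) (an arbitrary example
port list) for each pair returned by hosts_to_probe().  RPC, banner,
Microsoft networking and fingerprinting collection are left out of this
sketch.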

3) subnet-tied
In this phase, we query all the routers we can and examine their ethernet
address caches.  We then attempt to identify the make of the ethernet card
(and sometimes the computer vendor) by looking up the first three octets
of each ethernet address in a textual database of vendors.
Not all UCI routers allow DCS to collect this data.  This data is
collected via ~oacstats/subnet-tied-collection/ethers, which is a
semi-sophisticated wrapper around /dcs/packages/cmu-snmpd/bin/snmpwalk.
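
The vendor lookup on the first three octets can be sketched in Python as
below.  This is not the ethers script.  The database layout (a hex prefix,
whitespace, then the vendor name, one per line) and the vendor names in
the usage note are assumptions for illustration only.

```python
import re


def oui_prefix(mac):
    """Return the first three octets of a MAC address as 'AABBCC'.

    Accepts ':' or '-' separated octets, including octets printed
    without a leading zero (for example '0:0:c:7:ac:1').
    """
    octets = re.split(r"[:\-]", mac.strip())
    if len(octets) != 6:
        raise ValueError("expected six octets: %r" % mac)
    values = [int(octet, 16) for octet in octets]
    if any(value > 0xFF for value in values):
        raise ValueError("octet out of range: %r" % mac)
    return "".join("%02X" % value for value in values[:3])


def parse_vendor_db(text):
    """Build a prefix -> vendor dict from the assumed textual format."""
    vendors = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            key = re.sub(r"[^0-9A-Fa-f]", "", parts[0]).upper()
            vendors[key] = parts[1].strip()
    return vendors


def lookup_vendor(mac, vendors):
    """Return the vendor for a MAC address, or None if it is unknown."""
    return vendors.get(oui_prefix(mac))
```

For example, with a database containing the placeholder line
"00000C  Example Vendor A", lookup_vendor("0:0:c:7:ac:1", vendors) returns
"Example Vendor A", and an address with an unlisted prefix returns None.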

4) p0f-tied
This phase was only run once, and we've been making use of that same
data ever since.  As such, it is not especially trustworthy anymore,
but it gives us data that the other phases sometimes do not.  This phase
used "passive IP fingerprinting" to guess operating systems, meaning the
guessing is done based on IP conversations initiated by the host whose
OS we're trying to guess.  Passive IP fingerprinting can often be more
accurate than active IP fingerprinting, hence the interest in this data.
This phase was run on a Linux/x86 box at the UCI network border.


Finally, all the data resulting from these phases are
combined by running ~oacstats/bin/turn-over-data on
network-stats-collector-1.nacs.uci.edu via a single-host srsh
job from bounce2.nac.uci.edu, from the same cronjob mentioned
above, /usr/local/etc/daily-srsh.  turn-over-data is the script
that merges the above phases into the directories described in
nsc-1.nacs.uci.edu:~oacstats/00README and ~oacstats/01README .
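
To give a feel for what merging the phases means, here is a minimal
Python sketch.  It is not how turn-over-data is written.  The per-host
dictionaries, the attribute names, and the precedence order (more
detailed phases overriding less trustworthy ones) are all assumptions
made for illustration.

```python
# Least trusted first, most detailed last (assumed ordering).
PHASE_PRIORITY = ["p0f-tied", "subnet-tied", "dns-tied", "srsh-tied"]


def merge_phases(phase_data):
    """Merge {phase: {host: {attribute: value}}} into {host: {attribute: value}}.

    Phases later in PHASE_PRIORITY overwrite attributes set by earlier
    ones, so a host seen only by an earlier phase keeps that phase's data.
    """
    merged = {}
    for phase in PHASE_PRIORITY:
        for host, attrs in phase_data.get(phase, {}).items():
            merged.setdefault(host, {}).update(attrs)
    return merged
```

With this ordering, an OS guess from the p0f-tied phase is kept only for
hosts where no later phase reports an OS of its own.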

Please feel free to ask questions about how this works, or even about
the value of the data, in person, over the phone, or via e-mail.

Thanks.
 

