Previously on Twin Peaks
After some amount of work, I have very useful Solaris-specific Wavefront dashboards. Most of the metrics come from kstats, so they’re low overhead, and give deep, accurate introspection. For instance, here’s network traffic out of all zones on a single host, courtesy of my SunOS Network Collector.
But, kstats consider the kernel’s view of the system as a whole, and sometimes it’s handy to have a finer-grained view than that.
So, in this episode, I’m going to write something about how I use Solaris’s /proc filesystem to produce process-specific metrics.
Linux /proc is a mess. Who knows what’s in there. Solaris /proc isn’t. It’s very clean, consistent, and, of course, well documented. man -s4 proc will tell you all you need to know.
What the /proc?
You likely know that /proc contains one directory for each process in the system. The name of the directory is the PID of the process. Let’s have a look at one.
$ ls -l /proc/$$
-rw------- 1 rob sysadmin 5877760 Mar 3 14:15 as
-r-------- 1 rob sysadmin 336 Mar 3 14:15 auxv
dr-x------ 2 rob sysadmin 32 Mar 3 14:15 contracts
-r-------- 1 rob sysadmin 32 Mar 3 14:15 cred
--w------- 1 rob sysadmin 0 Mar 3 14:15 ctl
lr-x------ 1 rob sysadmin 0 Mar 3 14:15 cwd ->
dr-x------ 2 rob sysadmin 272 Mar 3 14:15 fd
-r-------- 1 rob sysadmin 0 Mar 3 14:15 ldt
-r--r--r-- 1 rob sysadmin 192 Mar 3 14:15 lpsinfo
-r-------- 1 rob sysadmin 1328 Mar 3 14:15 lstatus
-r--r--r-- 1 rob sysadmin 1072 Mar 3 14:15 lusage
dr-xr-xr-x 3 rob sysadmin 64 Mar 3 14:15 lwp
-r-------- 1 rob sysadmin 2600 Mar 3 14:15 map
dr-x------ 2 rob sysadmin 544 Mar 3 14:15 object
-r-------- 1 rob sysadmin 4176 Mar 3 14:15 pagedata
dr-x------ 2 rob sysadmin 816 Mar 3 14:15 path
-r-------- 1 rob sysadmin 72 Mar 3 14:15 priv
-r-------- 1 rob sysadmin 0 Mar 3 14:15 prune
-r--r--r-- 1 rob sysadmin 440 Mar 3 14:15 psinfo
-r-------- 1 rob sysadmin 2600 Mar 3 14:15 rmap
lr-x------ 1 rob sysadmin 0 Mar 3 14:15 root ->
-r-------- 1 rob sysadmin 2304 Mar 3 14:15 sigact
-r-------- 1 rob sysadmin 1680 Mar 3 14:15 status
-r--r--r-- 1 rob sysadmin 504 Mar 3 14:15 usage
-r-------- 1 rob sysadmin 0 Mar 3 14:15 watch
-r-------- 1 rob sysadmin 40824 Mar 3 14:15 xmap
Unlike Linux, Solaris’s /proc directory doesn’t present all that much from a simple ls. Beyond the user and group the process runs as, there’s not a lot of information at all. The timestamps on the files are the time the process was launched. Also unlike Linux, none of those files can be usefully accessed with simple tools like cat. Instead of just catting things, you’re expected to use tools like pmap or pstack, which are consumers of these binary structures. This is fine, and if we were writing instrumentation in C, as nature intended, it would be trivially easy to read and use them.
But, I’m writing a collector for Diamond, which means Python. And, as this article will surely show, I’m not much of a Python programmer.
One of the things I like about /proc is the careful use of file permissions. Look at the example above and consider:
$ ls -ld /proc/$$
dr-x--x--x 5 rob sysadmin 928 Mar 3 14:15 /proc/2208
Only the process owner can list the directory, and see the other, more sensitive, information. But the “other” execute bit is set on the directory, and some files are world-readable. So if we’re even a little bit clever, our unprivileged Diamond process should be able to see all system processes.
First, I needed to decide what information I wanted. /proc gives way more detail, particularly on LWPs, than I’m interested in. What I want, at least for now, is a prstat-style thing which reports the CPU and memory consumption of running processes. I’d like to be able to aggregate and filter that on process name, PID, and the zone in which the process runs.
After a bit of thought I decided to put the process name in the metric path, then to tag each point with the PID of the process and its zone. Tagging by PID forces Wavefront to separate multiple processes with the same executable name, but it’s trivial to wrap them in a sum(), or group them by metric path if need be. (There’s an argument for dumping the whole lot on the same path and having the exec name also be a tag, but this seemed more natural to me, and I was worried about having too many tags on the same path.) When I first started writing my Solaris collectors, I put the zone name in the metric path, but it never quite seemed right having to force global into things where it didn’t really belong. So, zone name is always a tag now. More tags. Tags are good. It’s possible that the PID tag could end up being of too high cardinality, but I don’t see much alternative. We’ll see how it goes.
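To make that concrete, here’s a rough illustration of the shape of a single point under this scheme. It’s purely hypothetical: the real metric names and tag keys are whatever the collector configuration says, not something fixed.

# Illustration only: the names and values here are made up.
point = {
    'path': 'proc.java.cpu.total',              # exec name lives in the metric path
    'value': 1234567890,                        # whatever we measured this interval
    'tags': {'pid': '4071', 'zone': 'shark'},   # PID and zone name travel as point tags
}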
The proc(4) man page gives you the layout of the /proc data structures, and they’re also in /usr/include/sys/procfs.h, with a little bit more annotation. From that file, here’s psinfo:
#define PRARGSZ 80 /* number of chars of arguments */
typedef struct psinfo {
int pr_flag; /* process flags (DEPRECATED; do not use) */
int pr_nlwp; /* number of active lwps in the process */
pid_t pr_pid; /* unique process id */
pid_t pr_ppid; /* process id of parent */
pid_t pr_pgid; /* pid of process group leader */
pid_t pr_sid; /* session id */
uid_t pr_uid; /* real user id */
uid_t pr_euid; /* effective user id */
gid_t pr_gid; /* real group id */
gid_t pr_egid; /* effective group id */
uintptr_t pr_addr; /* address of process */
size_t pr_size; /* size of process image in Kbytes */
size_t pr_rssize; /* resident set size in Kbytes */
size_t pr_pad1;
dev_t pr_ttydev; /* controlling tty device (or PRNODEV) */
/* The following percent numbers are 16-bit binary */
/* fractions [0 .. 1] with the binary point to the */
/* right of the high-order bit (1.0 == 0x8000) */
ushort_t pr_pctcpu; /* % of recent cpu time used by all lwps */
ushort_t pr_pctmem; /* % of system memory used by process */
timestruc_t pr_start; /* process start time, from the epoch */
timestruc_t pr_time; /* usr+sys cpu time for this process */
timestruc_t pr_ctime; /* usr+sys cpu time for reaped children */
char pr_fname[PRFNSZ]; /* name of execed file */
char pr_psargs[PRARGSZ]; /* initial characters of arg list */
int pr_wstat; /* if zombie, the wait() status */
int pr_argc; /* initial argument count */
uintptr_t pr_argv; /* address of initial argument vector */
uintptr_t pr_envp; /* address of initial environment vector */
char pr_dmodel; /* data model of the process */
char pr_pad2[3];
taskid_t pr_taskid; /* task id */
projid_t pr_projid; /* project id */
int pr_nzomb; /* number of zombie lwps in the process */
poolid_t pr_poolid; /* pool id */
zoneid_t pr_zoneid; /* zone id */
id_t pr_contract; /* process contract */
int pr_filler[1]; /* reserved for future use */
lwpsinfo_t pr_lwp; /* information for representative lwp */
} psinfo_t;
and here’s usage:
typedef struct prusage {
id_t pr_lwpid; /* lwp id. 0: process or defunct */
int pr_count; /* number of contributing lwps */
timestruc_t pr_tstamp; /* current time stamp */
timestruc_t pr_create; /* process/lwp creation time stamp */
timestruc_t pr_term; /* process/lwp termination time stamp */
timestruc_t pr_rtime; /* total lwp real (elapsed) time */
timestruc_t pr_utime; /* user level cpu time */
timestruc_t pr_stime; /* system call cpu time */
timestruc_t pr_ttime; /* other system trap cpu time */
timestruc_t pr_tftime; /* text page fault sleep time */
timestruc_t pr_dftime; /* data page fault sleep time */
timestruc_t pr_kftime; /* kernel page fault sleep time */
timestruc_t pr_ltime; /* user lock wait sleep time */
timestruc_t pr_slptime; /* all other sleep time */
timestruc_t pr_wtime; /* wait-cpu (latency) time */
timestruc_t pr_stoptime; /* stopped time */
timestruc_t filltime[6]; /* filler for future expansion */
ulong_t pr_minf; /* minor page faults */
ulong_t pr_majf; /* major page faults */
ulong_t pr_nswap; /* swaps */
ulong_t pr_inblk; /* input blocks */
ulong_t pr_oublk; /* output blocks */
ulong_t pr_msnd; /* messages sent */
ulong_t pr_mrcv; /* messages received */
ulong_t pr_sigs; /* signals received */
ulong_t pr_vctx; /* voluntary context switches */
ulong_t pr_ictx; /* involuntary context switches */
ulong_t pr_sysc; /* system calls */
ulong_t pr_ioch; /* chars read and written */
ulong_t filler[10]; /* filler for future expansion */
} prusage_t;
Between them, those two structures reveal everything I want. And, they’re both universally accessible. So how the heck do I read a binary struct in Python?
Ruddy Python
I’m not the biggest Python fan. I certainly don’t think there’s anything bad about it, but it’s never “clicked” with me in the way that, say, Ruby did. Writing Ruby, for me, is fun. Writing Python is work. Rather dry work. But, I plough on, hopeful that one day I’ll “get it” and enjoy Python too.
It’s straightforward to read a binary file into a variable, and once that’s done, you can use Python’s struct module to unpack the structure within. But, you need to know how to describe that structure.
In C we define data types as, say, char or int, and the compiler knows the sizes of those types. Python’s struct module requires you to define the incoming structures with characters, which are helpfully listed in a table.
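For example, struct.calcsize() makes it easy to check whether a guessed format character is the right width. This isn’t collector code, just the sort of quick experiment I mean:

import struct

# '=' means native byte order with standard sizes: 'i' is a 4-byte int,
# 'l' a 4-byte long, 'L' a 4-byte unsigned long, 'H' a 2-byte unsigned short.
print(struct.calcsize('=i'))     # 4
print(struct.calcsize('=lL'))    # 8: one timestruc_t (time_t + long) in the 32-bit layout
print(struct.calcsize('=H'))     # 2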
With basic types like ulong_t or int, it’s obvious what letter to use, but I had no idea what the underlying types of things like poolid_t were, so I spent a fair amount of time hunting through /usr/include/sys/types.h or running bits of C like:
#include <stdio.h>
#include <sys/types.h>

int main(void) {
        printf("%zu\n", sizeof(dev_t));
        return 0;
}
timestruc_t took a bit of tracking down, but it looks like this:
typedef struct timespec { /* definition per POSIX.4 */
time_t tv_sec; /* seconds */
long tv_nsec; /* and nanoseconds */
} timespec_t;
So after a little while, I had a format string identifying the whole of the usage struct. psinfo was more challenging, as it includes a complex structure called lwpsinfo_t. I started trying to decode this, but eventually realised I didn’t want any of the information in it, so I could discard it by telling my initial read() operation to only read the file up to the point just before lwpsinfo_t began.
I thought it would be nice to make a Python dict of each of the /proc files I was interested in, so I ended up writing a “key” dict to describe them:
proc_parser = {
    'usage': {
        'fmt': '=ii8s8s8s8s8s8s8s8s8s8s8s8s8s8s13L',
        'keys': ('pr_lwpid', 'pr_count', 'pr_tstamp', 'pr_create', 'pr_term',
                 'pr_rtime', 'pr_utime', 'pr_stime', 'pr_ttime', 'pr_tftime',
                 'pr_dftime', 'pr_kftime', 'pr_ltime', 'pr_slptime',
                 'pr_wtime', 'pr_stoptime', 'pr_minf', 'pr_majf', 'pr_nswap',
                 'pr_inblk', 'pr_oublk', 'pr_msnd', 'pr_mrcv', 'pr_sigs',
                 'pr_vctx', 'pr_ictx', 'pr_sysc', 'pr_ioch'),
        'size': 172,
        'ts_t': ('pr_tstamp', 'pr_create', 'pr_term', 'pr_rtime', 'pr_utime',
                 'pr_stime', 'pr_ttime', 'pr_tftime', 'pr_dftime',
                 'pr_kftime', 'pr_ltime', 'pr_slptime', 'pr_wtime',
                 'pr_stoptime')
    },
    'psinfo': {
        'fmt': '=iiiiiiIIIIlLLLlHH8s8s8s16s80siills3siiiiiii',
        'keys': ('pr_flag', 'pr_nlwp', 'pr_pid', 'pr_ppid', 'pr_pgid',
                 'pr_sid', 'pr_uid', 'pr_euid', 'pr_gid', 'pr_egid',
                 'pr_addr', 'pr_size', 'pr_rssize', 'pr_pad1', 'pr_ttydev',
                 'pr_pctcpu', 'pr_pctmem', 'pr_start', 'pr_time', 'pr_ctime',
                 'pr_fname', 'pr_psargs', 'pr_wstat', 'pr_argc', 'pr_argv',
                 'pr_envp', 'pr_dmodel', 'pr_pad2', 'pr_taskid', 'pr_projid',
                 'pr_nzomb', 'pr_poolid', 'pr_zoneid', 'pr_contract'),
        'size': 232,
        'ts_t': ('pr_start', 'pr_time', 'pr_ctime'),
    },
}
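Incidentally, those hard-coded size values are nothing magical: they’re just struct.calcsize() of the format strings, and in psinfo’s case that lands (in the 32-bit layout) exactly where lwpsinfo_t begins, which is where I stop reading. A quick sanity check, not part of the collector:

import struct

# Neither of these should raise: the declared sizes match the format strings.
for name, entry in proc_parser.items():
    assert struct.calcsize(entry['fmt']) == entry['size'], name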
The ts_t list is of all the timestruc_t fields. For my own convenience I decided to turn all those into simple Python floats.
Then a simple method uses that information to return a dict which lets me easily access, say, pr_zoneid. (I’ve removed all the error handling for brevity and clarity.)
def proc_info(p_file, pid):
    parser = proc_parser[p_file]
    p_path = path.join('/proc', str(pid), p_file)
    raw = file(p_path, 'rb').read(parser['size'])
    ret = dict(zip(parser['keys'], struct.unpack(parser['fmt'], raw)))

    for k in parser['ts_t']:
        (s, n) = struct.unpack('lL', ret[k])
        ret[k] = (s * 1e9) + n

    return ret
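Using it looks something like this (assuming os and struct are imported and the proc_parser dict from above is in scope; the field names come straight from procfs.h):

import os

# Inspect the current process. These are Python 2 strings, so strip the NUL padding.
psinfo = proc_info('psinfo', os.getpid())
usage = proc_info('usage', os.getpid())

print(psinfo['pr_fname'].rstrip('\0'))          # exec name, e.g. 'python'
print(psinfo['pr_zoneid'])                      # numeric zone ID
print(usage['pr_utime'] + usage['pr_stime'])    # cumulative on-CPU time, in ns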
I put all of this into a library, and went on to write the collector.
Collector
The collector loops over every process in /proc, reading both psinfo and usage and turning them into a single dict. The person configuring Diamond is able to select any number of keys from that dict and have them made into metrics under the process namespace. Simple!
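In outline, and heavily simplified, the loop looks something like the following sketch. It isn’t the real collector code: publish() stands in for Diamond’s own publishing mechanism, and wanted for the list of keys named in the config.

import os

def collect(wanted):
    for pid in os.listdir('/proc'):               # every entry under /proc is a PID
        info = proc_info('psinfo', pid)
        info.update(proc_info('usage', pid))      # one flat dict per process

        name = info['pr_fname'].rstrip('\0')

        for key in wanted:
            publish(name, key, info[key], pid=pid)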
Except, of course, nothing’s ever simple. A likely use for the new proc collector is to show the busiest processes on a host, and just how busy they are. Refer back to the usage structure, and you can see that processes do keep track of that, using all those timestruc_t pr_*times. But they’re cumulative, so the collector has to do some sums. Diamond gives you a class variable, last_value, to memo things between runs, and using that and our simple-to-work-with nanosecond times, it’s easy to work out the time each process spends on-CPU, both literally and as a percentage of available time. Here’s the system + kernel CPU usage for the Java processes on one of my hosts.
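The sum itself is just a rate calculation. Here’s a minimal sketch of the idea, using a plain dict as the memo rather than Diamond’s own machinery, and taking user plus system time as “on-CPU”:

import time

last = {}    # pid -> (wallclock seconds, cumulative on-CPU nanoseconds)

def cpu_pct(pid, usage):
    # Percentage of one CPU this process has used since we last looked.
    now = time.time()
    oncpu = usage['pr_utime'] + usage['pr_stime']

    then, last_oncpu = last.get(pid, (now, oncpu))
    last[pid] = (now, oncpu)

    elapsed_ns = (now - then) * 1e9

    if elapsed_ns == 0:
        return 0.0

    return 100.0 * (oncpu - last_oncpu) / elapsed_ns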
If you hover over that chart, you’ll see the zone tags. These are attached to every metric produced by the proc collector, using the pr_zoneid value from psinfo. This is the first field you see when you do zoneadm list -cv; it’s numeric, and it’s not that meaningful. (If you reboot a zone it will likely get a different ID.) At the beginning of each collector run I shell out to zoneadm and generate a map of zone ID to name. When it’s time to send the point, I look up the pr_zoneid value in the map, and tag with the name.
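That map is one shell-out per collector run. A sketch of the sort of thing, using zoneadm’s parseable output (zoneadm list -cp), whose first two colon-separated fields are the zone ID and zone name:

import subprocess

def zone_map():
    # Map numeric zone ID to zone name. Zones which are configured but not
    # running have no numeric ID ('-'), so skip them.
    zones = {}

    out = subprocess.check_output(['/usr/sbin/zoneadm', 'list', '-cp'])

    for line in out.splitlines():
        fields = line.split(':')
        if fields[0] == '-':
            continue
        zones[int(fields[0])] = fields[1]

    return zones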
Now on to memory, where I chose to follow prstat(1), and offer the RSS (resident set size) and SIZE columns. The latter is defined in the man pages as “the total virtual memory size of the process, including all mapped files and devices”. We can get the values from the psinfo struct’s pr_rssize and pr_size fields.
The prstat man page tells us that memory sizes are displayed in Kb. I prefer to send all my metrics as bytes, and let Wavefront handle the prefixes. I’ve taken a hard-line “always bytes” policy across all my collectors, even if the standard tooling uses a different unit. But, you know, that K can mean different things to different people. Checking the source, it seems that Sun chose to use “proper” K and M, not this Ki and Mi nonsense. So, we multiply the raw figure by 1024.
This seemed to work fine. I wrote and tested it on SmartOS, but I also run a couple of Solaris machines. When I dropped the code on to those, some of the processes showed zero memory usage. At first it seemed arbitrary: of two Java processes, one reported its memory correctly, the other didn’t. I couldn’t work it out and I started digging: I DTraced prstat to see exactly what it did and, so far as I could tell, it was doing the same as my code. I read through the prstat source. (The Illumos source is mostly very easy to follow, and the block comments are superb.) The more I looked, the more baffled I was. Everything was correct, I was certain, but for the unavoidable fact it didn’t work.
Eventually I gave up, and asked for help on the illumos-discuss mailing list. In next to no time, an Illumos kernel dev had pointed me at the code which zeros out the pr_size field if a 32-bit process tries to examine a 64-bit one. And sure enough, on SmartOS:
$ file -b `which python2.7`
ELF 64-bit LSB executable AMD64 Version 1, dynamically linked, not stripped, no
debugging information available
and on Solaris:
$ file -b `which python`
ELF 32-bit LSB executable 80386 Version 1 [SSE FXSR FPU], dynamically linked, not stripped
Mystery solved. Self kicked.
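If I’d wanted to, the collector could at least have noticed it was running under a 32-bit interpreter and said so, rather than silently reporting zeroes. A sketch, not what I actually did: the real fix, described next, was to bundle a 64-bit Python.

import sys

# A 32-bit interpreter can't see pr_size for 64-bit processes: the kernel
# zeros out the field instead.
if sys.maxsize <= 2 ** 32:
    sys.stderr.write('32-bit interpreter: memory figures for 64-bit '
                     'processes will be reported as zero\n')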
At this point I decided to write a script to make an “Omnibus” style package to deploy Diamond. This builds and bundles together a 64-bit Python, my own Wavefront-enabled fork of Diamond, all Diamond’s dependencies, and my SunOS collectors.
The Tenant Collector
So far I’d developed everything with a view to running in a global zone (easier done on Solaris than on SmartOS) and collecting metrics on the system and on individual zones all from one place. Doing as much as possible from the global zone is a lesson I learnt early in my zoning days (2007, I think). Back then I wrote acres of Korn shell to dynamically probe whatever appeared under /zones, and run zlogin loops over the output of zoneadm list, running Nagios-style check scripts. That looping survives in these collectors, though the zlogin stuff is much less necessary on SmartOS due to zone-aware extensions to the svc commands, and zone-level kstats. Solaris needs to catch up in these areas, assuming it continues to exist at all.
But I have stuff, like the site you’re reading, which runs in the Joyent Public Cloud. I’m in zones there, and you have a different view of the system from inside a zone. Some metrics are invisible, others are meaningless. So I got copying and pasting, and put together a collector tuned to run in a resource-capped SmartOS zone. The README explains the available metrics, so if you’re interested, read that.
Telegraf and the Future
Diamond is fine. It’s a reasonably active open-source project, and it feels like a step up from CollectD. But now we have Telegraf, and that is far more modern than Diamond.
When I first tried Telegraf, it wouldn’t build on Solaris. Now it will, but, as was the case with Diamond, none of the OS-related plugins work at all. So I’ve started porting my Diamond collectors to Telegraf. There are kstat bindings for Go, and I’m using those. Once I have everything ported over, I might look at trying to write a libscf binding (though I suspect I’m over-reaching myself there) so I can monitor SMF without having to shell out. I also haven’t really explored DTrace metric collection in the way I wanted to, and there are Go libusdt bindings just waiting for me. It’s a little slow going though, as I’m having to learn Go and its toolchain as I go.
Having a single binary is a big advantage when deploying to a SmartOS global zone. Getting a Python environment built and running was messy, and feels like a big fat hack.