To get right to the point, Wavefront is amazing, and you need it. You need it because it will let you see right into the heart of your system, however big and complicated that might be. You need it because you want to alert off meaningful telemetry generated by your whole estate, not off a shell script that exits 1, 2, or 3. You need it because, well, scaling Graphite.
Wavefront is a service into which you shovel time-series data. From
statsd, JMX, Dropwizard,
echo in a stupid shell
script, anything. As fast as you like. At whatever resolution you
like. Then, using an API,
or a very nice UI, you can perform arbitrary mathematical operations
on any number of those series. It scales seamlessly, it works all
the time, the support is great, it’s feature-complete, and it’s well
documented. It’s everything you always want, but never get.
My current client uses it in production on an Ubuntu estate, but I have an all-SunOS (Solaris and SmartOS) lab, and I thought it would be interesting to instrument that. I can imagine lots of exciting possibilities wiring DTrace, kstats, and all manner of other stuff in, and I’m planning to write up my progress as I go.
Note: This article has been updated, late November 2016. We were a closed-beta customer of Wavefront, and many things have been improved over the time we’ve used it.
Wavefront is presented to you as an endpoint and a web UI. As an administrative user you generate an access token in the UI, then configure a proxy which listens for incoming metrics, bundles them up, and uses the token to pass them securely to the endpoint. Anything can write to the proxy, so it’s up to you to limit access. In EC2 we do this with security groups and IAM roles, but my lab has a private network, so I can put the proxy there, and anything inside can send metrics.
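To make "anything can write to the proxy" concrete, here is a minimal sketch of a helper that formats a point in Wavefront’s native line format (metric, value, epoch timestamp, source=, then optional quoted point tags). The function name wf_point, the metric name, and the tag are all made up for illustration.

```shell
#!/bin/sh
# Format one point in Wavefront's native line format:
#   <metric> <value> <epoch-seconds> source=<source> [tag="value" ...]
# Usage: wf_point METRIC VALUE TIMESTAMP SOURCE [TAG=VALUE ...]
# wf_point is a hypothetical helper, not part of any Wavefront tooling.
wf_point() {
    metric=$1 value=$2 ts=$3 source=$4
    shift 4
    line="$metric $value $ts source=$source"
    for tag in "$@"; do
        # Split each key=value argument and quote the value.
        line="$line ${tag%%=*}=\"${tag#*=}\""
    done
    printf '%s\n' "$line"
}
```

Assuming the proxy is listening on port 2878 on a host called shark-wavefront, and that you have netcat installed, something like `wf_point lab.test.value 42 "$(date +%s)" "$(hostname)" env=lab | nc shark-wavefront 2878` should be enough to get a point in.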
I’m going to build a dedicated Solaris 11.3 zone to host the proxy. I have a golden zone which I clone from, so creation only takes a couple of seconds. Here’s my zone config.
# zonecfg -z shark-wavefront export
create -b
set brand=solaris
set zonepath=/zones/shark-wavefront
set autoboot=true
set autoshutdown=shutdown
set limitpriv=default,dtrace_proc,dtrace_user
set ip-type=exclusive
add anet
set linkname=net0
set lower-link=net0
set allowed-address=192.168.1.30/24
set configure-allowed-address=true
set defrouter=192.168.1.1
set link-protection=mac-nospoof
set mac-address=random
set maxbw=10M
end
add capped-memory
set physical=1G
end
add rctl
set name=zone.max-swap
add value (priv=privileged,limit=1073741824,action=deny)
end
add rctl
set name=zone.max-locked-memory
add value (priv=privileged,limit=209715200,action=deny)
end
add rctl
set name=zone.cpu-cap
add value (priv=privileged,limit=50,action=deny)
end
add dataset
set name=space/zone/shark-wavefront
end
To make this into a real, running thing, I only need to create a dataset to delegate, and clone my golden zone.
# zfs create space/zone/shark-wavefront
# zoneadm -z shark-wavefront clone shark-gold
The config shows you that I capped the zone’s memory usage at 1G – plenty for my low-traffic proxy – and limited the CPU usage to the equivalent of half a core. I also pinned the zone’s IP address and default router from the global zone. I usually do this, because it stops anyone or anything in the zone deliberately or accidentally changing the address and making the proxy disappear. I also capped the VNIC bandwidth at 10 megabit/s, which is pretty much my upstream capacity. There might not be a great deal of value in that but, hey, I can, so why not?
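The rctl limits in the exported config are raw byte counts, which are not easy to read at a glance. As a small sketch (mib_to_bytes is a name I have made up), this is how the two memory values fall out of sizes in mebibytes. The zone.cpu-cap value needs no conversion, as it is a percentage of a single CPU.

```shell
#!/bin/sh
# Convert a size in mebibytes to the byte count an rctl expects.
# mib_to_bytes is a hypothetical helper for illustration only.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 1024   # zone.max-swap: 1 GiB, prints 1073741824
mib_to_bytes 200    # zone.max-locked-memory: 200 MiB, prints 209715200
```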
If I were building a heavy-duty production proxy with hundreds of nodes writing to it (which I have done, many times), I’d set all these thresholds considerably higher, and build multiple, load-balanced zones on separate hosts.
The delegated dataset will be used for Wavefront’s logging and buffering. If the proxy can’t talk to your Wavefront cluster, it will buffer incoming metrics on disk until the endpoint comes back, when it will flush them all out. (We’ve seen some massive spikes in our outgoing metric rate after network issues, and the cluster absorbs them without flinching.)
With this in mind I put a quota on the dataset to stop a broken connection flooding the disk and affecting all the other zones on the box. Actually, this probably isn’t necessary any more, as new proxy versions seem to have acquired the ability to limit the size of the buffer. But I still think it’s smart to quota all your non-global datasets so one tenant can’t DoS the others. And again, why not, when ZFS makes it as simple as
# zfs set quota=300M space/zone/shark-wavefront
Unsurprisingly, Wavefront don’t supply packages for anything Solarish. But, they do make the source code available, so we can build one ourselves.
Compilation isn’t hard, but that didn’t stop me making a script to make it even easier.
That script works on Solaris 11 and SmartOS, and spits out a SYSV or
pkgin package. (Well, it does that unless you don’t have
fpm installed, in which case it
gives you a tarball.) It also has the ability (assuming you have the
privileges) to satisfy build dependencies: namely Java 8, Maven and
If you can’t be bothered with all of that, here’s a ready-made package.
The package bundles an SMF method and manifest, but you will have to create the user and make a couple of directories on that dataset we delegated earlier. From inside the zone that looks like a ZFS pool, and we can treat it as if it were.
# useradd -u 104 -g 12 -s /bin/false -c 'Wavefront Proxy' -d /var/tmp wavefront
# zfs create -o mountpoint=/var/wavefront/buffer shark-wavefront/buffer
# zfs create -o mountpoint=/var/log/wavefront shark-wavefront/log
# zfs create -o mountpoint=/config shark-wavefront/config
# chown wavefront /var/wavefront /var/log/wavefront
Hopefully of course, you’d do this properly, and automate it all
with the config-management software of your choice. I have some Puppet code
which you are welcome to use and extend, but it’s not exactly the
state-of-the-art. To use it, you must convert the datastream package
build_wf_proxy.sh creates into directory format.
$ pkgtrans SDEFwfproxy.pkg . SDEFwfproxy
As we’ve already seen, the proxy is a Java application, so you’ll need a JVM. The Puppet stuff takes care of this of course, but if you’re doing things by hand, remember to:
# pkg install java/jre8
Briefly returning to storage, if your delegated dataset didn’t
have the compression=on property, it’s definitely worth
setting it now. Looking at my existing proxy:
$ zfs get -Hovalue compressratio shark-wavefront/buffer
11.87x
I find that turning on compression gets me twelve times the buffering period for free! The Wavefront UI will tell you how long an outage the buffering will cover on each configured proxy. I habitually turn compression on, unless I know a dataset will only contain incompressible data. I haven’t properly benchmarked, but it seems to me that in most workloads performance improves on compressed datasets.
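The "twelve times" figure is just the quota multiplied by the compression ratio, because ZFS charges a quota against the space actually consumed on disk. As a rough sketch, assuming the whole 300M quota were available to the buffer (in reality the log and config datasets share it) and that the ratio holds steady:

```shell
#!/bin/sh
# Rough upper bound on buffered data: quota (in MiB) times compress ratio.
# effective_mib is a hypothetical helper for illustration only.
effective_mib() {
    awk -v quota="$1" -v ratio="$2" 'BEGIN { printf "%.0f\n", quota * ratio }'
}

effective_mib 300 11.87   # prints 3561, so roughly 3.5G of raw points
```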
So, assuming you’ve created the user, made the directories and installed the package, you’re almost ready to go. Depending on how busy you expect your proxy to be you might want to change the amount of memory allocated to the JVM. You can do that through SMF properties.
$ svcprop -p options wavefront/proxy
options/config_file astring /config/wavefront/wavefront.conf
options/heap_max astring 500m
options/heap_min astring 300m
You can see it sets a very small Java heap size, which so far seems to be fine for my modest lab requirements. Your mileage may vary, but it’s pretty easy to change.
# svccfg -s wavefront/proxy setprop options/heap_max=2048m
# svcadm restart wavefront/proxy
The proxies report back a lot of internal metrics, which make it very easy to monitor them. Relevant to the heap size discussion are JVM statistics, which let you see memory usage inside the JVM. This is one of a number of charts on my Wavefront “internal metrics” dashboard.
The proxy, obviously, needs a config file. It needs to know where to
talk to, how to authenticate, how to identify itself, and what ports
to listen on. I keep my application files in
/config, on my
delegated dataset. I started this habit years ago, before I learnt
config management. The idea is that you can easily rebuild a vanilla
zone, re-import the dataset, and the applications will work. If you
want to use
/etc or something, the config-file path is also an SMF property.
server=https://metrics.wavefront.com/api/
hostname=shark-wavefront
token=REDACTED
pushListenerPorts=2878
pushFlushMaxPoints=40000
pushFlushInterval=1000
pushBlockedSamples=5
pushLogLevel=SUMMARY
pushValidationLevel=NUMERIC_ONLY
customSourceTags=fqdn, hostname
idFile=/var/wavefront/.wavefront_id
retryThreads=4
I don’t currently do any metric whitelisting, blacklisting or pre-processing on this proxy. I use it almost entirely for experimenting and playing, so I want everything to go through, right or wrong.
In my client’s production environment we use metric whitelisting on all the proxies. By defining a single whitelist regular expression, we only accept metrics whose paths fit our agreed standards. This preserves the universal namespace which our tooling (and people) expect to see. When you have multiple proxy clusters, I think it’s also worth having them point-tag everything to help you identify where things came from.
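It is worth trying a whitelist expression against real metric paths before you hand it to the proxy. The sketch below does that with grep. The naming standard it encodes (environment, then service, then one or more further lower-case segments) is invented for illustration, and the proxy itself uses Java regular expressions, so keep to syntax that grep -E and Java agree on.

```shell
#!/bin/sh
# Hypothetical naming standard: <env>.<service>.<metric...>, lower-case and
# dot-separated, where env is one of dev, stage or prod.
WHITELIST='^(dev|stage|prod)\.[a-z0-9_-]+(\.[a-z0-9_-]+)+$'

# Succeeds (exit 0) if the metric path matches the whitelist.
# LC_ALL=C keeps [a-z] from matching upper-case letters in some locales.
metric_allowed() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq "$WHITELIST"
}
```

For example, `metric_allowed prod.web.requests.count` succeeds, while `metric_allowed Prod.web.requests` and `metric_allowed foo.bar.baz` both fail.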
I also only use the native Wavefront listener, disabling the
OpenTSDB and Graphite ports. My metrics all go in via a customized
which speaks native Wavefront. If you want to use, say,
you’ll have to use the Graphite listener. (Unless someone has
written a Wavefront plugin, which they might have by the time you
read this.) In SUMMARY mode, the proxy server is chatty. So, we need
logadm to keep it in check.
# echo '/var/log/wavefront/wavefront-proxy.log -N -A 30d -s 10m -z 1 -a \
  "/usr/sbin/svcadm restart wavefront/proxy"' \
  >/etc/logadm.d/wavefront-proxy.logadm.conf
# svcadm refresh logadm-upgrade
My lab setup has a centralized logging system built with
and Graylog2. I also use Fluentd for things which don’t fit
naturally with syslog, which is clearly the case here.
The logs are multi-line, with the first line having the timestamp, and the second the message. The message is not always consistent. So far I’ve found the following block of config satisfies my needs, but it may need refining.
Note: for formatting, I have folded the long regex with backslashes, but really it has to be one long line.
<source>
  @type tail
  path /var/log/wavefront/wavefront-proxy.log
  pos_file /var/run/td-agent/foo-bar.log.pos
  tag wavefront_proxy
  format multiline
  format_firstline /^(?<time>.*) (?<class>com.wavefront.agent[^ ]+) (?<method>.*)$/
  format1 /^(?<level>\w+): \[(?<port>\d+)\] \((?<type>\w+)\): \
  points attempted: (?<attempted>\d+); blocked: \
  (?<blocked>\d+)$|^(?<level>\w+): (?<message>.*)$/
  time_format %b %d, %Y %l:%M:%S %p
</source>

<match wavefront_proxy.**>
  type copy
  <store>
    type gelf
    host graylog.localnet
    port 12201
    flush_interval 5s
  </store>
</match>
When I first set this up I parsed out the
blocked counts, so I could alert on blocked messages. I set up a
stream in Graylog which tailed the proxy log and had an
output which used the metrics plugin to report back to Wavefront.
Whenever any invalid metrics were blocked,
the stream noticed, and sent a “blocked” point to
Wavefront, triggering an alert. So, I got Wavefront to
monitor things which don’t get to Wavefront!
I don’t need to do that now, because the proxy reports back counts
of points received, sent, and blocked, so I can tell Wavefront
rate(ts(~agent.points.2878.blocked)) and see a chart of blocked
points. Mine are from when I was working on the Ruby
CLI and kept sending duff points.
Wiring your logging into Wavefront is still a great idea though. At
my client site every Fluentd log stream generates metrics, so we can
have Wavefront alert off abnormal numbers of errors, or from
unexpectedly high or low log throughput. One of the first things we
did was to plot the number of
auth.error messages going through
syslog to show people trying to brute-force their way into our
perimeter SSH boxes. We now use it for far, far more things than that.
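The idea of turning log lines into metrics can be sketched without any of the Fluentd machinery. The function below reads log lines on stdin, counts those containing a fixed string, and prints the count as a single point in Wavefront’s native format. The function name, metric path, and source are all hypothetical, and a real pipeline would do this continuously rather than in one shot.

```shell
#!/bin/sh
# Count stdin lines containing PATTERN (a fixed string, not a regex) and
# print the count as one point in Wavefront's native line format.
# Usage: log_count_point PATTERN METRIC TIMESTAMP SOURCE
# log_count_point is a hypothetical helper for illustration only.
log_count_point() {
    # grep -c prints 0 and exits non-zero when nothing matches. Zero is
    # still a valid count, so ignore the exit status.
    count=$(grep -cF -e "$1" || true)
    printf '%s %s %s source=%s\n' "$2" "$count" "$3" "$4"
}
```

Feeding it a chunk of syslog, as in `log_count_point auth.error syslog.auth.error "$(date +%s)" "$(hostname)" <chunk.log`, gives you a line you can send straight to the proxy’s native listener.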
But, I digress. Again, there’s very rough-and-ready Puppet code to configure all of this logging stuff, with a separate manifest for Wavefront. You’re welcome to any of it.
Back to the “building a zone” part of the exercise, which I completed by making the proxy immutable.
# zonecfg -z shark-wavefront
zonecfg:shark-wavefront> set file-mac-profile = fixed-configuration
zonecfg:shark-wavefront> commit
zonecfg:shark-wavefront> ^D
# zoneadm -z shark-wavefront reboot
When the zone comes back up
/var is writeable, but everything else
is read-only. Again, this is something I’ve adopted as standard
practice. If you have a thing that you don’t expect to change, make
it so that it can’t change.
Next time I’ll talk a little bit about how I started getting useful metrics out of Solaris and into the proxy.