Lately, Wavefront has acquired log processing abilities. In short, you have a log shipper talk to your Wavefront proxy, which applies grok rules to match lines and extract data, generating metrics from what it finds.
After processing, the logs are discarded, so Wavefront doesn’t replace something like Splunk, ELK or Graylog, but I think there is a great deal of value in its approach to statistically analyzing flows of logs. (I actually feel retention of logs is as good as worthless, from a technical standpoint. There’s an article in there somewhere…)
On my current client’s site, we have had our logging generating
telemetry for some time. Logs are collected with
fluentd, and we use a
record_transformer
to increment a StatsD counter every time we see
a certain type of log message. This works well, and gives us lots of
useful information. We are very keen on monitoring and alerting off
the whole state of our systems and services, and bringing log volume
in as a metric lets us say, for instance, we don’t care about a
small number of errors if the service as a whole is healthy. On the
other hand, should the number of errors or warnings suddenly rise,
we’d want to know about it even if the service were not (yet) affected.
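As a rough sketch of the trick (the match tag, field names and counter name here are illustrative, not our production config, and it assumes the statsd-ruby gem is installed on the collector), a record_transformer filter with enable_ruby can bump a counter as records pass through:
<filter app.**>
  @type record_transformer
  enable_ruby true
  <record>
    # The ${} block is evaluated as Ruby for every record: bump a StatsD
    # counter when the line looks like an error, then store a simple flag.
    error_counted ${require "statsd"; Statsd.new("localhost", 8125).increment("app.log.errors") if record["message"].to_s.include?("ERROR"); "yes"}
  </record>
</filter>
Building a client per record is wasteful, but it shows the shape of the thing: the log keeps flowing, and a metric falls out of the side of it.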
Our record_transformer
method is particularly useful when we, the
SRE team, want to keep an eye on some aspect of an application which
the developers haven’t got round to instrumenting. What follows is a
not-particularly-contrived example of that.
The website you are reading runs on a small SmartOS instance in Joyent’s public cloud. It is one of a few small Sinatra applications mapped to virtual hosts in an nginx proxy.
Nginx writes access and error logs for each site, but up to now, I’ve never done anything with any of that data. Therefore I have no idea how much traffic this, or any other site, gets. Or how long page loads typically take.
I have a good OS-level overview of the zone, courtesy of a custom Diamond collector that I wrote. It lets me know how close I am to my memory resource caps and how much bandwidth I use, and I’m alerted if any SMF services fail, or if there’s unexpected paging. It would be great to add site usage and latency to that dashboard.
Telling nginx What to Log
First off I changed my nginx log format to produce the information I needed.
http {
  log_format vhost '$host $remote_addr $remote_user [$time_local] '
                   '"$request" $status $body_bytes_sent '
                   '$upstream_response_time $request_time';
}
The server name is now the first field of every log entry. (Nginx doesn’t really think in terms of “vhosts”, though I habitually call them that.) I also log the upstream response and total request times.
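With that format, an access-log line comes out shaped something like this (the values are invented for illustration):
www.example.com 203.0.113.10 - [06/Aug/2017:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 3421 0.012 0.014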
Telling Wavefront What to Expect and What to Do With It
I turned on Wavefront proxy log ingestion by adding two lines
to wavefront.conf
and restarting the service.
filebeatPort=5044
logsIngestionConfigFile=/config/wavefront/logsIngestion.yaml
The logsIngestionConfigFile is where you define your grok matchers, and tell Wavefront what to do with the data it extracts.
I want to count HTTP responses, grouped by code, and have
information about the response and upstream response times. I also
want to know to which vhost each metric applies. In Graphite, you’d
put the vhost name into the metric path, and you could do that in
Wavefront, but my vhost names contain dots, which Wavefront sees
as path separators. For that reason, I’m going to use point tags to
differentiate between vhosts. Here’s my logsIngestionConfigFile, with the pattern lines broken for the sake of formatting.
---
aggregationIntervalSeconds: 5
histograms:
  -
    pattern: '%{HOSTNAME:vhost} %{IPORHOST:clientip} %{USER:user}
              \[%{HTTPDATE:timestamp}\] (?:%{WORD:verb}
              %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|
              %{DATA:rawrequest}) %{NUMBER:response}
              (?:%{NUMBER:bytes}|-) %{NUMBER:urt} %{NUMBER:rt}'
    metricName: 'nginx.code.%{response}'
    tagKeys:
      - vhost
    tagValueLabels:
      - vhost
  -
    pattern: '%{HOSTNAME:vhost} %{IPORHOST:clientip} %{USER:user}
              \[%{HTTPDATE:timestamp}\] (?:%{WORD:verb}
              %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|
              %{DATA:rawrequest}) %{NUMBER:response}
              (?:%{NUMBER:bytes}|-) %{NUMBER:urt} %{NUMBER:rt}'
    metricName: 'nginx.time.upstream_response'
    tagKeys:
      - vhost
    tagValueLabels:
      - vhost
  -
    pattern: '%{HOSTNAME:vhost} %{IPORHOST:clientip} %{USER:user}
              \[%{HTTPDATE:timestamp}\] (?:%{WORD:verb}
              %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|
              %{DATA:rawrequest}) %{NUMBER:response}
              (?:%{NUMBER:bytes}|-) %{NUMBER:urt} %{NUMBER:rt}'
    metricName: 'nginx.time.response'
    tagKeys:
      - vhost
    tagValueLabels:
      - vhost
Each stanza defines a histogram; in the first, the final part of the metric path is generated dynamically. All the grok patterns use named matchers, whose names can be referenced in subsequent configuration. So, a successfully handled HTTP request will increment the nginx.code.200 counter. I plan to turn that counter into “hits per second” with Wavefront’s rate() function.
Point tags are defined by two arrays. The first is the tag key, called, obviously, tagKeys. The second is called tagValueLabels, which gives you a clue how to use it. The elements of this second array are the names of the matchers in your pattern. I create a single tag called vhost, and assign it the value matched by the %{HOSTNAME} at the start of the pattern.
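So, to chart hits per second for a single site, a query along these lines does the job (the vhost value is just an example):
rate(ts("nginx.code.200", vhost="www.example.com"))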
Every point automatically gets a source
tag just like any normal
metric, so we can easily look at single hosts or environments.
To collect my data I use histograms. Histograms have “buckets”, which accumulate data then flush it with a given periodicity (aggregationIntervalSeconds). Like StatsD, they produce a bundle of metrics, including things like median, count, percentiles, and the standard deviation.
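I won’t list every series the histogram produces, but assuming the proxy appends the usual Dropwizard-style suffixes to the metric name (an assumption worth verifying in the metrics browser), charting the median total request time for one site would look something like:
ts("nginx.time.response.median", vhost="www.example.com")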
The Wavefront engine has single-second resolution, so writing multiple points in the same second (assuming all tags are equal) leaves behind only the last one. Histograms, though, give a detailed breakdown of all the points in that second, making it trivial to find the outliers you’re probably interested in. For high-frequency data, this is what you want. The log ingestion also offers counters and gauges, but I think they’re less useful.
Getting Logs to Wavefront
We need a log shipper which can speak Lumberjack. Fluentd can’t easily do this, and we’ve been falling out of love with Fluentd anyway, because of the messy path from 0.12 (which doesn’t support much-needed sub-second timestamps) to 0.14 (which does). So, I’m going to go with Filebeat. As I said, I use SmartOS, so I had to build Filebeat myself. This was straightforward, and not worth recounting here. I also had to write an SMF manifest to start and stop the service.
From now on I’m going to explain how everything is done “by hand”. In real life I did all my configuration through Puppet, but it will be easier to follow this way.
I SSHed onto the box with the files, and made a user for Filebeat to run as
# useradd -s /bin/false -c log-shipper -u 989 -d /var/log/filebeat logs
# mkdir /var/log/filebeat /var/filebeat
# chown logs /var/log/filebeat /var/filebeat
Then I imported the SMF manifest. Because create_default_instance is true, the service starts automatically.
# svccfg import /opt/local/lib/svc/manifest/filebeat.xml
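For completeness, the manifest is roughly along the lines below. The binary and config paths, the FMRI and the dependencies are my choices rather than anything canonical, so treat it as a sketch.
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="manifest" name="filebeat">
  <service name="application/filebeat" type="service" version="1">
    <create_default_instance enabled="true"/>
    <single_instance/>
    <!-- Wait for the network and local filesystems before starting. -->
    <dependency name="network" grouping="require_all" restart_on="error" type="service">
      <service_fmri value="svc:/milestone/network:default"/>
    </dependency>
    <dependency name="filesystem" grouping="require_all" restart_on="error" type="service">
      <service_fmri value="svc:/system/filesystem/local:default"/>
    </dependency>
    <!-- Filebeat stays in the foreground, so have SMF supervise it directly,
         running as the unprivileged 'logs' user created earlier. -->
    <exec_method type="method" name="start"
      exec="/opt/local/bin/filebeat -c /opt/local/etc/filebeat/filebeat.yml"
      timeout_seconds="60">
      <method_context>
        <method_credential user="logs"/>
      </method_context>
    </exec_method>
    <exec_method type="method" name="stop" exec=":kill" timeout_seconds="60"/>
    <property_group name="startd" type="framework">
      <propval name="duration" type="astring" value="child"/>
    </property_group>
  </service>
</service_bundle>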
My Filebeat config is pretty simple. I don’t think anything needs explaining.
---
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/nginx/*.log
    tail_files: true
registry_file: /var/filebeat/registry
output.logstash:
  hosts: ["wavefront.localnet:5044"]
logging:
  level: info
  to_syslog: false
  to_files: true
  files:
    path: /var/log/filebeat
    name: filebeat.log
    rotateeverybytes: 10485760
    keepfiles: 7
path.data: /var/filebeat
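To check the shipper is actually running and talking to the proxy, the SMF service state and Filebeat’s own log (in the location configured above) are the places to look:
# svcs filebeat
# tail /var/log/filebeat/filebeat.log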
I was briefly caught out by Filebeat’s (in my opinion, plain wrong) insistence that the config file be owned by the user it runs as: in this case, logs.
Sexy Graph Time
With Filebeat running, I put four of my sites under Siege, using the -i option to simulate real traffic, and asked Wavefront to show me rate(ts("nginx.code.200")).
And here are the response times corresponding to the load generated above: max, min, and mean, averaged across all four sites.
The beauty of using Wavefront here, as opposed to Graylog, is that my metrics and alerting all stay in the same place. I only need one maintenance window, one Slack webhook, or one PagerDuty config.
I can also build new kinds of alerts, combining log metrics with
“normal” Diamond data, or synthetic Wavefront metrics. For example,
I might want to alert any time a 50x
error gets logged: that’s
simple, and Graylog will happily do it. But, I may also want to know
if the moving average of response times across my entire webserver
tier goes above 300ms. Or maybe only alert if mean upstream response
time breaches a certain threshold whilst the database tier’s average
CPU usage is below 30%, and we’re in core business hours. All of
these things are now trivially easy, because all these things are
now a single timeseries visible to Wavefront’s alerting engine.
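As a sketch of that moving-average condition (assuming a mean series is published under the histogram’s metric name, and remembering that nginx records request times in seconds), the alert condition might be something like:
mavg(10m, avg(ts("nginx.time.response.mean"))) > 0.3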
If I mark a new code deployment with a Wavefront event (which I do), I can clearly see if the new release causes any change to the nature of my logs. Perhaps I might see more error or warning lines, or increased latency, or some other abnormal pattern. Many of us work at scale now, handling such large flows of logging information that trends and patterns are far more valuable than the actual data contained in the log lines themselves.
Kibana can generate graphs and charts from log streams, but Wavefront is different. It’s made to process, analyze and correlate timeseries. It never throws data away, which centralised logging systems almost always do, so it’s much easier to understand “normal” usage, and to spot trends and anomalies.
Lately I’ve spent a lot of time working with streams of logs, and a lot of time working with Wavefront. Putting the two together seemed very natural, and though our initial StatsD approach was great, direct log ingestion by Wavefront is much cleaner and more powerful, and I feel it’s something I will see a lot of benefit from in the future.
Update
The proxies can now ingest raw logfiles, so log ingestion becomes as
simple as tail -f | netcat
. A quick illustration. Enable raw log
ingestion by adding this to the proxy config:
rawLogsPort=5055
Now we must set up some kind of a receiver. As a simple example, I’m
going to make a quick chart of the facility and level of
everything going into syslog
on a Linux VM. (Solaris uses multiple
system logs, which means this wouldn’t be a simple one-liner.)
So, my logsIngestion.yaml needs a new counter, with a definition of my syslog format. We use a modified syslog format, recording more information than Linux usually does. (Again, the grok pattern is broken for formatting.)
counters:
  -
    pattern: '%{TIMESTAMP_ISO8601:time} %{WORD:facility}.%{WORD:level} \
              %{HOSTNAME:source} %{SYSLOGPROG:program}:%{GREEDYDATA:message}'
    metricName: 'syslog.%{facility}.%{level}'
    tagKeys:
      - source
    tagValueLabels:
      - source
Then on the host, just do:
$ tail -F /var/log/syslog | nc wavefront 5055
and we see: