pkhamre.blog

thoughts, devops, tools and stuff.

Understanding StatsD and Graphite

After a short conversation with BryanWB_ on the #logstash channel at Freenode, I realized that I did not know how my data was sent to Graphite or how it was stored there. I knew that StatsD collects and aggregates my metrics and ships them off to Graphite, and that Graphite stores the time-series data and lets us render graphs based on it.

What I did not know was whether my http-access graphs displayed requests per second, average requests per retention period, or something else entirely.

It was time to research how these things worked in order to get a complete understanding.

StatsD

To get a full understanding of how StatsD works, I started to read the source code. I knew StatsD was a simple application, but I did not know it was this simple: just over 300 lines of code in the main script and around 150 lines in the Graphite backend code.

Concepts in StatsD

StatsD has a few concepts listed in the documentation that should be understood.

Buckets

Each stat is in its own “bucket”. They are not predefined anywhere. Buckets can be named anything that will translate to Graphite (periods make folders, etc)

Values

Each stat will have a value. How it is interpreted depends on modifiers. In general, values should be integers.
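As a rough illustration of how buckets, values and modifiers fit together, the StatsD line protocol joins them as bucket:value|type, where c marks a counter, ms a timer and g a gauge. The helper name formatMetric and the bucket names below are made up for this sketch; a real client library would also send the resulting string over UDP.

```javascript
// Hypothetical helper: build the payload for one StatsD metric.
// Type modifiers: 'c' = counter, 'ms' = timer, 'g' = gauge.
function formatMetric(bucket, value, type) {
  const allowed = ['c', 'ms', 'g'];
  if (!allowed.includes(type)) {
    throw new Error('unsupported metric type: ' + type);
  }
  if (typeof value !== 'number' || !Number.isFinite(value)) {
    throw new Error('value must be a finite number');
  }
  return bucket + ':' + value + '|' + type;
}

// Illustrative bucket names, one metric of each type.
console.log(formatMetric('http_request.hits', 1, 'c'));
console.log(formatMetric('http_request.elapsed_time', 320, 'ms'));
console.log(formatMetric('user_registrations', 583, 'g'));
```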

Flush interval

After the flush interval timeout (default 10 seconds), stats are aggregated and sent to an upstream backend service.

Metric types

Counters

Counters are simple. A counter adds each incoming value to its bucket, and the total stays in memory until the flush interval expires.

Let's take a look at the source code that generates the counter stats that get flushed to the backend.

for (key in counters) {
  var value = counters[key];
  var valuePerSecond = value / (flushInterval / 1000); // calculate "per second" rate

  statString += 'stats.'        + key + ' ' + valuePerSecond + ' ' + ts + "\n";
  statString += 'stats_counts.' + key + ' ' + value          + ' ' + ts + "\n";

  numStats += 1;
}

First, StatsD iterates over the counters it has received and assigns two variables for each: one holds the counter value and the other holds the per-second value. It then appends both values to statString and increments numStats.

With the default flush interval of 10 seconds, if you send StatsD 7 increments on a counter within one flush interval, the counter value is 7 and the per-second value is 0.7. No magic.
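The arithmetic can be checked with a small standalone sketch modelled on the loop quoted above. The function name flushCounters, the bucket name and the timestamp are placeholders for illustration; the flush interval and timestamp are passed in explicitly instead of being read from StatsD's own state.

```javascript
// Sketch of a counter flush: each bucket produces a per-second rate line
// and a raw count line, mirroring the StatsD loop.
function flushCounters(counters, flushIntervalMs, ts) {
  const lines = [];
  for (const key of Object.keys(counters)) {
    const value = counters[key];
    const valuePerSecond = value / (flushIntervalMs / 1000);
    lines.push('stats.' + key + ' ' + valuePerSecond + ' ' + ts);
    lines.push('stats_counts.' + key + ' ' + value + ' ' + ts);
  }
  return lines;
}

// Seven increments received during one default 10-second interval.
const counterLines = flushCounters({ 'http_request.hits': 7 }, 10000, 1343038130);
console.log(counterLines.join('\n'));
```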

Timers

Timers collect numbers. They do not necessarily need to contain a time value: you can collect bytes read, the number of objects in some storage, or anything else that is a number. A good thing about timers is that you get the mean, the sum, the count, and the upper and lower values for free. Feed StatsD a timer and these are calculated automatically before the data is flushed to Graphite. You also get the mean, sum and upper values calculated at the 90th percentile. You can configure StatsD with an array of percentile thresholds, which means you can get, for example, the 50th, 90th and 95th percentiles calculated for you.

The source code for timer stats is a bit more advanced than the code for the counters.

for (key in timers) {
  if (timers[key].length > 0) {
    var values = timers[key].sort(function (a,b) { return a-b; });
    var count = values.length;
    var min = values[0];
    var max = values[count - 1];

    var cumulativeValues = [min];
    for (var i = 1; i < count; i++) {
        cumulativeValues.push(values[i] + cumulativeValues[i-1]);
    }

    var sum = min;
    var mean = min;
    var maxAtThreshold = max;

    var message = "";

    var key2;

    for (key2 in pctThreshold) {
      var pct = pctThreshold[key2];
      if (count > 1) {
        var thresholdIndex = Math.round(((100 - pct) / 100) * count);
        var numInThreshold = count - thresholdIndex;

        maxAtThreshold = values[numInThreshold - 1];
        sum = cumulativeValues[numInThreshold - 1];
        mean = sum / numInThreshold;
      }

      var clean_pct = '' + pct;
      clean_pct.replace('.', '_');
      message += 'stats.timers.' + key + '.mean_'  + clean_pct + ' ' + mean           + ' ' + ts + "\n";
      message += 'stats.timers.' + key + '.upper_' + clean_pct + ' ' + maxAtThreshold + ' ' + ts + "\n";
      message += 'stats.timers.' + key + '.sum_' + clean_pct + ' ' + sum + ' ' + ts + "\n";
    }

    sum = cumulativeValues[count-1];
    mean = sum / count;

    message += 'stats.timers.' + key + '.upper ' + max   + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.lower ' + min   + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.count ' + count + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.sum ' + sum  + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.mean ' + mean + ' ' + ts + "\n";
    statString += message;

    numStats += 1;
  }
}

StatsD iterates over each timer and processes it only if at least one value was received. It sorts the array of values, counts them, and picks out the minimum and maximum. It then builds an array of cumulative sums and initializes a few variables before iterating over the percentile thresholds array; for each threshold it calculates the mean, upper and sum values and appends them to the message. When the percentile calculation is done, the overall sum and mean are calculated and the complete message is appended to statString.

If you send the following timer values to StatsD during the default flush interval

  • 450
  • 120
  • 553
  • 994
  • 334
  • 844
  • 675
  • 496

StatsD will calculate the following values

  • mean_90 496
  • upper_90 844
  • sum_90 3472
  • upper 994
  • lower 120
  • count 8
  • sum 4466
  • mean 558.25
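These numbers can be reproduced with a compact sketch of the same calculation for a single bucket and a single percentile threshold. The function name summarizeTimer and the returned object shape are choices made for this example, and the guard that always keeps at least one value is an addition that the quoted StatsD code does not have.

```javascript
// Sketch of the timer aggregation for one bucket and one percentile threshold.
function summarizeTimer(samples, pct) {
  if (samples.length === 0) {
    throw new Error('need at least one sample');
  }
  const values = samples.slice().sort((a, b) => a - b);
  const count = values.length;

  // cumulative[i] is the sum of the i + 1 smallest values.
  const cumulative = [];
  values.forEach((v, i) => cumulative.push(i === 0 ? v : v + cumulative[i - 1]));

  // Number of top values to discard for this threshold, as in the StatsD loop.
  const dropped = count > 1 ? Math.round(((100 - pct) / 100) * count) : 0;
  // Guard added for this sketch: always keep at least one value.
  const kept = Math.max(1, count - dropped);

  const result = {};
  result['mean_' + pct] = cumulative[kept - 1] / kept;
  result['upper_' + pct] = values[kept - 1];
  result['sum_' + pct] = cumulative[kept - 1];
  result.upper = values[count - 1];
  result.lower = values[0];
  result.count = count;
  result.sum = cumulative[count - 1];
  result.mean = result.sum / count;
  return result;
}

// The eight timer values from the example above, with the 90 threshold.
const timerSummary = summarizeTimer([450, 120, 553, 994, 334, 844, 675, 496], 90);
console.log(timerSummary);
```

With eight values and a threshold of 90, Math.round(0.8) discards the single largest value (994), so the _90 figures are computed over the seven smallest values.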

Gauges

A gauge simply indicates an arbitrary value at a point in time and is the simplest type in StatsD. It just takes any number and ships it to the backend.

The source code for gauge stats is just four lines.

for (key in gauges) {
  statString += 'stats.gauges.' + key + ' ' + gauges[key] + ' ' + ts + "\n";
  numStats += 1;
}

Feed StatsD a number and it sends it unprocessed to the backend. A thing to note is that only the last value of a gauge during a flush interval is flushed to the backend. That means that if you send the following gauge values to StatsD during a flush interval

  • 643
  • 754
  • 583

The only value that gets flushed to the backend is 583. The value of this gauge will be kept in memory in StatsD and be sent to the backend at the end of every flush interval.
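The last-value-wins behaviour can be sketched in a few lines. The names recordGauge and flushGauges and the timestamps are invented for this example; the point is only that each new value overwrites the previous one and that the stored value is reported again on every flush.

```javascript
// In-memory gauge store: one current value per bucket.
const gauges = {};

// Receiving a gauge simply overwrites the previous value.
function recordGauge(key, value) {
  gauges[key] = value;
}

// Flushing reports whatever value is currently stored, without resetting it.
function flushGauges(ts) {
  return Object.keys(gauges).map(
    (key) => 'stats.gauges.' + key + ' ' + gauges[key] + ' ' + ts
  );
}

[643, 754, 583].forEach((v) => recordGauge('user_registrations', v));

const firstFlush = flushGauges(1343038130);
// No new values arrive before the next flush, so 583 is sent again.
const secondFlush = flushGauges(1343038140);
console.log(firstFlush.concat(secondFlush).join('\n'));
```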

Graphite

Now that we know how our data is sent from StatsD, let's take a look at how it is stored and processed in Graphite.

Overview

In the Graphite documentation we can find the Graphite overview. It sums up Graphite with these two simple points.

  • Graphite stores numeric time-series data.
  • Graphite renders graphs of this data on demand.

Graphite consists of three parts.

  • carbon - a daemon that listens for time-series data.
  • whisper - a simple database library for storing time-series data.
  • webapp - a (Django) webapp that renders graphs on demand.

The format for time-series data in Graphite looks like this:

<key> <numeric value> <timestamp>
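As a small sketch of this format, the two helpers below build and parse one such line. The names formatLine and parseLine are hypothetical, and the sample key and values are only illustrative.

```javascript
// Build one line in the "<key> <numeric value> <timestamp>" format.
function formatLine(key, value, timestamp) {
  return key + ' ' + value + ' ' + timestamp + '\n';
}

// Parse one line back into its three fields.
function parseLine(line) {
  const parts = line.trim().split(/\s+/);
  if (parts.length !== 3) {
    throw new Error('expected "<key> <value> <timestamp>", got: ' + line);
  }
  return {
    key: parts[0],
    value: Number(parts[1]),
    timestamp: Number(parts[2])
  };
}

const sampleLine = formatLine('stats.gauges.user_registrations', 583, 1343038130);
console.log(JSON.stringify(sampleLine));
console.log(parseLine(sampleLine));
```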

Storage schemas

Graphite uses configurable storage schemas to define retention rates for storing data. It matches data paths against a pattern and determines the frequency and history with which our data is stored.

The following configuration example is taken from the StatsD documentation.

[stats]
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974

These retentions are used for every entry whose key matches the pattern. The retention format is frequency:history, where frequency is the number of seconds per data point and history is the number of data points to keep. So this configuration lets us store 10-second data for 6 hours, 1-minute data for 1 week, and 10-minute data for 5 years.
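The durations follow from multiplying the two numbers in each pair, which a short sketch can verify. The function name parseRetentions is made up here, and the sketch only handles the purely numeric seconds:points form used in this configuration.

```javascript
// Parse a retentions string of "secondsPerPoint:points" pairs and
// work out how much history each archive covers.
function parseRetentions(spec) {
  return spec.split(',').map((part) => {
    const [secondsPerPoint, points] = part.split(':').map(Number);
    return { secondsPerPoint, points, totalSeconds: secondsPerPoint * points };
  });
}

const archives = parseRetentions('10:2160,60:10080,600:262974');

console.log(archives[0].totalSeconds / 3600, 'hours of 10-second data');
console.log(archives[1].totalSeconds / 86400, 'days of 1-minute data');
console.log(archives[2].totalSeconds / (86400 * 365.25), 'years of 10-minute data');
```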

Visualizing a timer in Graphite

Knowing all this, we can now take a look at my simple Ruby script that collects timings for HTTP requests.

#!/usr/bin/env ruby

require 'rubygems' if RUBY_VERSION < '1.9.0'
require './statsdclient.rb'
require 'typhoeus'

Statsd.host = 'localhost'
Statsd.port = 8125

def to_ms time
  (1000 * time).to_i
end

while true
  start_time = Time.now.to_f

  resp = Typhoeus::Request.get 'http://www.example.org/system/information'

  end_time = Time.now.to_f

  elapsed_time = 1000 * (end_time - start_time) # wall-clock time in milliseconds, kept as a float
  response_time = to_ms resp.time
  start_transfer_time = to_ms resp.start_transfer_time
  app_connect_time = to_ms resp.app_connect_time
  pretransfer_time = to_ms resp.pretransfer_time
  connect_time = to_ms resp.connect_time
  name_lookup_time = to_ms resp.name_lookup_time

  Statsd.timing('http_request.elapsed_time', elapsed_time)
  Statsd.timing('http_request.response_time', response_time)
  Statsd.timing('http_request.start_transfer_time', start_transfer_time)
  Statsd.timing('http_request.app_connect_time', app_connect_time)
  Statsd.timing('http_request.pretransfer_time', pretransfer_time)
  Statsd.timing('http_request.connect_time', connect_time)
  Statsd.timing('http_request.name_lookup_time', name_lookup_time)

  sleep 10
end

Let's take a look at the Graphite render of this data. It covers the last 2 minutes of the elapsed_time target from the script above.

Image visualization

Render URL

Render URL used for the image below.

/render/?width=586&height=308&from=-2minutes&target=stats.timers.http_request.elapsed_time.sum

Rendered image from Graphite

Rendered image from Graphite, a simple graph visualizing elapsed_time for http requests over time.

JSON-data

Render URL

Render URL used for the JSON-data below.

/render/?width=586&height=308&from=-2minutes&target=stats.timers.http_request.elapsed_time.sum&format=json

JSON-output from Graphite

In the results below, we can see the raw data from Graphite. There are 12 data points, which corresponds to 2 minutes at the StatsD 10-second flush interval. It really is this simple: Graphite just visualizes its data.

The JSON-data is beautified with JSONLint for viewing purposes.

[
    {
        "target": "stats.timers.http_request.elapsed_time.sum",
        "datapoints": [
            [
                53.449951171875,
                1343038130
            ],
            [
                50.3916015625,
                1343038140
            ],
            [
                50.1357421875,
                1343038150
            ],
            [
                39.601806640625,
                1343038160
            ],
            [
                41.5263671875,
                1343038170
            ],
            [
                34.3974609375,
                1343038180
            ],
            [
                36.3818359375,
                1343038190
            ],
            [
                35.009033203125,
                1343038200
            ],
            [
                37.0087890625,
                1343038210
            ],
            [
                38.486572265625,
                1343038220
            ],
            [
                45.66064453125,
                1343038230
            ],
            [
                null,
                1343038240
            ]
        ]
    }
]
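Reading this output programmatically is straightforward: each data point is a [value, timestamp] pair, and intervals without stored data show up as null. The sketch below summarizes one series while skipping nulls. The function name summarizeSeries is hypothetical, and the sample keeps only the first two and the last data points from the output above.

```javascript
// Summarize one series from Graphite's JSON render output.
// Each datapoint is [value, timestamp]; value is null when nothing was stored.
function summarizeSeries(series) {
  const values = series.datapoints
    .map(([value]) => value)
    .filter((value) => value !== null);
  const sum = values.reduce((acc, v) => acc + v, 0);
  return {
    target: series.target,
    points: series.datapoints.length,
    nonNull: values.length,
    min: values.length ? Math.min(...values) : null,
    max: values.length ? Math.max(...values) : null,
    mean: values.length ? sum / values.length : null
  };
}

// Shortened copy of the render output shown above.
const rendered = JSON.parse(`[
  {
    "target": "stats.timers.http_request.elapsed_time.sum",
    "datapoints": [
      [53.449951171875, 1343038130],
      [50.3916015625, 1343038140],
      [null, 1343038240]
    ]
  }
]`);

const seriesSummary = summarizeSeries(rendered[0]);
console.log(seriesSummary);
```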

Visualizing a gauge in Graphite

The following simple script ships a gauge to StatsD, simulating a number of user registrations.

#!/usr/bin/env ruby

require './statsdclient.rb'

Statsd.host = 'localhost'
Statsd.port = 8125

user_registrations = 1

while true
  user_registrations += Random.rand 128

  Statsd.gauge('user_registrations', user_registrations)

  sleep 10
end

Image visualization - Number of user registrations

Render URL

Render URL used for the image below.

/render/?width=586&height=308&from=-20minutes&target=stats.gauges.user_registrations

Rendered image from Graphite

Another simple graph, just showing the total number of registrations.

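Image visualization - Number of new user registrations per interval

By using the derivative function in Graphite, we can plot the change between consecutive data points. With the 10-second data stored here, each point is the number of new registrations in that 10-second interval rather than a per-minute figure; a per-minute total would need the deltas summed over one-minute buckets, for example with Graphite's summarize function.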

Render URL

Render URL used for the image below.

/render/?width=586&height=308&from=-20minutes&target=derivative(stats.gauges.user_registrations)

Rendered image from Graphite

A graph based on the same data as above, but with the derivative function applied to show how much the gauge changed between consecutive data points.
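Graphite's derivative function reports the difference between each data point and the one before it, with no value for a point that has no usable predecessor. A minimal sketch of that behaviour, using made-up gauge readings, could look like this:

```javascript
// Difference between consecutive values. The result is null wherever the
// current or the previous value is missing, including the very first point.
function derivative(values) {
  let prev = null;
  return values.map((value) => {
    if (value === null || prev === null) {
      prev = value;
      return null;
    }
    const delta = value - prev;
    prev = value;
    return delta;
  });
}

// Made-up running totals of user registrations, one reading per flush.
const totals = [1, 50, 120, 130, null, 200, 260];
const deltas = derivative(totals);
console.log(deltas);
```

Because the gauge holds a running total, each delta is the number of registrations added since the previous data point.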

Conclusion

Knowing more about how StatsD and Graphite work makes it a lot easier to know what kind of data to ship to StatsD, how to ship it, and how to read the data back from Graphite.

Got any comments or questions? Let me know in the comment section below.
