Sunday 6 March 2016

Triaging with `pidstat`

This post will detail a few example uses of the `pidstat` tool while digging into performance of a Graphite instance.

Without taking any measurements, reading up a bit on Graphite's architecture can help shed some light on what is going on under the hood.  At the very minimum, we know the following about the application:
  • Receives metrics over a network interface
  • Stores metrics to disk
  • Generates graphs in response to requests over a network interface
The above functionality allows for the possibility of the following bottlenecks:
  • I/O-bound from disk I/O
  • I/O bound from network I/O
  • CPU-bound from I/O co-ordination
I'm expecting the system to be CPU-bound as previous tuning iteration with this system was in response to saturation of the CPU. This has led to the current state where there is a `carbon-cache` instance running for each core available to the system as well as a single carbon-relay running to load-balance traffic to each carbon-cache process.  Given this knowledge, a good place to start poking around would be the `carbon-relay` instance.

I'll begin with examining CPU utilization of the process.

pidstat -u -p 25430 1
Linux 3.13.0-43-generic (stats.tunnelbear.com)  16-02-10        _x86_64_        (4 CPU)

03:24:05 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
03:24:06 AM   999     25430   57.00    2.00    0.00   59.00     1  carbon-relay.py
03:24:07 AM   999     25430   49.00    2.00    0.00   51.00     2  carbon-relay.py
03:24:08 AM   999     25430   71.00    2.00    0.00   73.00     2  carbon-relay.py
03:24:09 AM   999     25430   50.00    2.00    0.00   52.00     0  carbon-relay.py
03:24:10 AM   999     25430   26.00    2.00    0.00   28.00     2  carbon-relay.py
03:24:11 AM   999     25430   38.00    1.00    0.00   39.00     2  carbon-relay.py
03:24:12 AM   999     25430   38.00    1.00    0.00   39.00     2  carbon-relay.py
03:24:13 AM   999     25430   44.00    2.00    0.00   46.00     2  carbon-relay.py
03:24:14 AM   999     25430   30.00    2.00    0.00   32.00     2  carbon-relay.py
03:24:15 AM   999     25430   35.00    0.00    0.00   35.00     2  carbon-relay.py
03:24:16 AM   999     25430   68.00    2.00    0.00   70.00     2  carbon-relay.py
03:24:17 AM   999     25430   76.00    1.00    0.00   77.00     2  carbon-relay.py
03:24:18 AM   999     25430   70.00    3.00    0.00   73.00     2  carbon-relay.py
03:24:19 AM   999     25430   41.00    4.00    0.00   45.00     2  carbon-relay.py
03:24:20 AM   999     25430   36.00    2.00    0.00   38.00     2  carbon-relay.py
03:24:21 AM   999     25430   32.00    2.00    0.00   34.00     2  carbon-relay.py
03:24:22 AM   999     25430   28.00    1.00    0.00   29.00     2  carbon-relay.py
03:24:23 AM   999     25430   38.00    2.00    0.00   40.00     2  carbon-relay.py
03:24:24 AM   999     25430   53.00    2.00    0.00   55.00     0  carbon-relay.py
03:24:25 AM   999     25430   71.00    1.00    0.00   72.00     2  carbon-relay.py
03:24:26 AM   999     25430   78.00    3.00    0.00   81.00     2  carbon-relay.py
^C
Average:      999     25430   49.00    1.86    0.00   50.86     -  carbon-relay.py
Here, I've used the `pidstat` utility while specifying the PID of the carbon-relay instance, setting the CPU statistics flag and requesting updates every 1 second.

Some interesting observations are:
  • The `carbon-relay` instance is typically serviced by CPU 2 but it is not strictly pinned to it.
  • The average total percentage of CPU time used by the `carbon-relay` instance is around 50%.
  • Around 96% of the time the `carbon-relay` instance is executing at the user level leaving a small percentage of associated work within kernel space.
How does this compare to a `carbon-cache` process?

pidstat -u -p 24569 1                                                                                                                                                           
Linux 3.13.0-43-generic (stats.tunnelbear.com)  16-02-10        _x86_64_        (4 CPU)

03:30:35 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
03:30:36 AM   999     24569   16.00    8.00    0.00   24.00     1  carbon-cache.py
03:30:37 AM   999     24569   16.00    6.00    0.00   22.00     2  carbon-cache.py
03:30:38 AM   999     24569   10.00    4.00    0.00   14.00     2  carbon-cache.py
03:30:39 AM   999     24569    7.00    2.00    0.00    9.00     2  carbon-cache.py
03:30:40 AM   999     24569    6.00    3.00    0.00    9.00     2  carbon-cache.py
03:30:41 AM   999     24569    8.00    4.00    0.00   12.00     2  carbon-cache.py
03:30:42 AM   999     24569    9.00    2.00    0.00   11.00     3  carbon-cache.py
03:30:43 AM   999     24569   18.00   10.00    0.00   28.00     0  carbon-cache.py
03:30:44 AM   999     24569   12.00    5.00    0.00   17.00     3  carbon-cache.py
03:30:45 AM   999     24569    8.00    3.00    0.00   11.00     2  carbon-cache.py
03:30:46 AM   999     24569   17.00    6.00    0.00   23.00     1  carbon-cache.py
03:30:47 AM   999     24569    4.00    3.00    0.00    7.00     3  carbon-cache.py
03:30:48 AM   999     24569    5.00    0.00    0.00    5.00     1  carbon-cache.py
03:30:49 AM   999     24569   11.00    4.00    0.00   15.00     1  carbon-cache.py
03:30:50 AM   999     24569   17.00    8.00    0.00   25.00     1  carbon-cache.py
03:30:51 AM   999     24569   10.00    8.00    0.00   18.00     1  carbon-cache.py
03:30:52 AM   999     24569    8.00    3.00    0.00   11.00     1  carbon-cache.py
03:30:53 AM   999     24569    5.00    3.00    0.00    8.00     0  carbon-cache.py
03:30:54 AM   999     24569   15.00    7.00    0.00   22.00     1  carbon-cache.py
03:30:55 AM   999     24569   11.00    6.00    0.00   17.00     1  carbon-cache.py
03:30:56 AM   999     24569    4.00    1.00    0.00    5.00     0  carbon-cache.py
03:30:57 AM   999     24569   12.00    8.00    0.00   20.00     0  carbon-cache.py
03:30:58 AM   999     24569   15.00    5.00    0.00   20.00     0  carbon-cache.py
03:30:59 AM   999     24569   19.00   10.00    0.00   29.00     0  carbon-cache.py
03:31:00 AM   999     24569    9.90    5.94    0.00   15.84     0  carbon-cache.py
03:31:01 AM   999     24569   12.00    6.00    0.00   18.00     1  carbon-cache.py
03:31:02 AM   999     24569    9.00    7.00    0.00   16.00     0  carbon-cache.py
03:31:03 AM   999     24569    3.00    1.00    0.00    4.00     2  carbon-cache.py
03:31:04 AM   999     24569   14.00    4.00    0.00   18.00     1  carbon-cache.py
03:31:05 AM   999     24569   10.00    3.00    0.00   13.00     1  carbon-cache.py
03:31:06 AM   999     24569   18.00    9.00    0.00   27.00     1  carbon-cache.py
^C
Average:      999     24569   10.93    5.00    0.00   15.93     -  carbon-cache.py
Using a similar `pidstat` call while targeting the PID of a carbon-cache instance yields the following comparative observations:
  • The `carbon-relay` instance taxes processing resources more than the previously examined `carbon-cache` instance.
  • The `carbon-relay` instance executes mainly in user space while it's counterpart operates at a 2:1 ratio between user and kernel space (most likely due to disk I/O).
`pidstat` can also report on context switching activity.  The following snippets are the summary outputs from a pair of `pidstat` calls targeting the relay and cache instances:

01:22:44 AM   UID       PID   cswch/s nvcswch/s  Command
Average:      999     12028     66.94      2.47  carbon-cache.py
Average:      999     12043     71.11    164.32  carbon-relay.py
The average number of voluntary context switches per second are similar while average non-voluntary context switches are much higher for the `carbon-relay` instance.

Voluntary context switches occur when a task blocks because it requires a resource that is unavailable.  Comparing these two processes in terms of voluntary context switches, it seems as if the values are quite close.  Other unrelated processes on the system have values closer to zero while also having CPU time quite low.  Although these processes are functionally related due to them being apart of the same application, they have completely different CPU workloads.  Their similarity in voluntary context switch rate could use some more investigation.

Non-voluntary context switches occur when a task's time slice runs to completion and it is forced away from processing in favour of the next task in line.  Looking at the averages above, the non-voluntary context switches per second that the `carbon-relay` instance is experiencing sticks out and could use some further investigation.

Digging deeper past this initial triage can be done with other tools such as SystemTap or perf to uncover the code paths contributing to this performance.