Counting CPU Events
This article discusses hardware performance counter measurements obtained for a character search benchmark running on Intel x86 and SPARC hardware. Those results help to explain the big runtime differences between a SPARC T5 system and Intel x86 ones (SPARC being much slower), as well as differences between the different implementations.
Current CPUs include hardware performance counters that are increased on events like instruction executed, branch mis-predicted or just on each clock-cycle. They are useful metrics to sample during profiling since they paint an accurate picture what effects the code under investigation has on real hardware.
A pipelined architecture usually comes with some machinery to predict the result of a conditional branch, such that the pipeline is always filled well. Without pipelining, a CPU needs several clock-cycles to process an instruction (e.g. at least 3 cycles for fetch + decode + execute - in a simple model). When the pipeline works well, i.e. when it is filled, the CPU may execute up to 1 instruction per cycle.
Modern CPUs aren't just pipelined they are usually also superscalar, i.e. they are able to fetch, decode and execute multiple instructions in parallel. Meaning that they exploit instruction level parallelism. This parallelism automatically happens on a core, i.e. the speedup happens for a single thread of execution.
Thus, profiling with hardware counters shows if the code actually makes good use of the CPU. For example, dividing the number of mispredicted branches by the number of all (conditional) branches gives the misprediction rate. If it is high it is a hint to optimize the code such that the number of branches is reduced, if possible.
Similarly, the number of instructions divided by the number of cycles shows if the code is well suited for superscalar execution or not. Perhaps it can be optimized to remove unnecessary dependencies, on an instruction level.
On Linux, the
perf stat command can be used to sample
the hardware performance counters of a CPU while running a
perf list lists all available counters, although
useful ones are selected, by default.
An example (executed on a 6th gen Intel i7-6600U CPU @ 2.60GHz):
$ perf stat ./find_unroll2 30000 in > /dev/null Performance counter stats for './find_unroll2 30000 in': 45458.422057 task-clock:u (msec) # 1.000 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 94 page-faults:u # 0.002 K/sec 147,362,896,070 cycles:u # 3.242 GHz 459,613,530,908 instructions:u # 3.12 insn per cycle 126,908,078,140 branches:u # 2791.740 M/sec 1,101,990,382 branch-misses:u # 0.87% of all branches 45.457945052 seconds time elapsed
It is perhaps a little bit surprising that it reports 3.242 GHz although the CPU is marked with 2.6 GHz. This is due to a turbo-boost feature, where the CPU can temporarily 'overclock' if just one core is busy (such that it doesn't overheat). The numbers are internally consistent, e.g. cycles divided by runtime:
147362896070 45458.422057 1000 / / 10 9 ^ / p => 3.241 GHz
Or mis-predicted branches divided by all branches:
.86001101990382 126908078140 / p => .0086 branch-mis-rate (i.e. 0.86 %)
For reading out the hardware counters on Solaris 10,
cputrack can be used as
replacement. It provides access to less counters than its Linux
counterpart and less than the Solaris 11 version, but some
essential ones are there.
Help output of cputrack:
Usage: cputrack [-T secs] [-N count] [-Defhnv] [-o file] -c events [command [args] | -p pid] -T secs seconds between samples, default 1 -N count number of samples, default unlimited -D enable debug mode -e follow exec(2), and execve(2) -f follow fork(2), fork1(2), and vfork(2) -h print extended usage information -n suppress titles -t include virtualized %tick register -v verbose mode -o file write cpu statistics to this file -c events specify processor events to be monitored -p pid pid of existing process to capture Use cpustat(1M) to monitor system-wide statistics.
$ cputrack -o foo.log -c Cycles_user,PAPI_tot_ins,PAPI_br_cn,PAPI_br_msp \ ./ss_find_find 30000 in > /dev/null time lwp event pic0 pic1 pic2 pic3 [..] 104.996 1 tick 3212642270 4080318300 1379858144 8494482 105.996 1 tick 3211969907 4078677601 1379286911 8491144 106.996 1 tick 3211183407 4078776284 1379331819 8491657 107.283 1 exit 344491651460 437368550542 147905759012 910664380
The results are gathered with the benchmark.py utility.
This tool abstracts away some of the differences between systems
and generates stats and plots for multiple runs. For example, the
perf stat CSV output on CentOS 7 misses some columns and uses a
different counter naming scheme. On Solaris, it calls
instead and normalizes its output.
Each program version is executed and profiled 10 times.
The used CPUs are:
- Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz (6th gen, turbo-boost up to 3.6 GHz)
- Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz (4th gen, turbo-boot up to 2.6 GHz)
- SPARC T5 @ 3.6 GHz
When running the benchmarks in a virtual machine (VM), the VM complicates accessing the hardware counters. For example, VirtualBox doesn't support any counter access and doesn't care. There is support in Linux KVM, though.
With Solaris 10 ldoms, access is possible but must be configured, otherwise cputrack fails with errors like:
123456: fini_lwp: lwp1: perf counter contents invalidated
The error occurs if the ldom perf-counters property isn't set. If
it is set to
htstrand it should work.
Of course, the best thing is to run the benchmarks outside of any virtualization, if possible.
The different measurements were done on a Intel Core i7, an Core i5 and a SPARC T5 CPU. The raw data is available in a Git repository.
In general, branches are predicted pretty good for all the different versions. The different results match previous observations.
find_uclibx86 version uses the very CISC instruction
REPNE SCSB which replaces much of the visible branching and
thus there isn't much to predict (and this instruction is
very slow on modern Intel CPUs).
find_avx2_overflow that to a
little bit more to avoid some final looping (and thus branching)
for the tail portion of the input also benefit from less
Also not unexpected, the versions
have the highest branch misprediction rate because their chunked
approach contains a lot of branching that is apparently not a
good match for the branch-prediction units.
The difference between
consistent with the previous runtime benchmark, where the first
version has the loop unrolled 4 times (instead of 3) and runs
slower. Unrolling more introduces more branches and more
branching require more space to store the predictions.
The results on an Intel Core i5 are quite similar to the i7 ones.
In comparison with x86, the result is more partitioned into 2
groups. Either 0.5 to 1 % of the branches are mispredicted (e.g.
find_naive) or the percentage rises to over 4 %. The lower
bound is similar to the results on x86, but the upper bound is
several times higher than on x86 and it is reached by many more
Compiling the versions with different compilers (Solaris Studio
12.3 vs. GCC 4.9) doesn't make a difference, as far as
branch-prediction is affected. Note that
ss_find_find also uses
a different STL.
Those results are also consistent with the measured runtimes.
Versions with the longest runtime also execute less instructions
per cycle. Again, the usage of the very CISC
find_uclibx86 yields the worst result. Probably
do to a micro-code implementation that blocks the usual
superscalar mechanism. Also again, the reduction of the unroll
factor from 4 to 3 (in
find_unroll_3) has a
positive effect. As expected, slow versions likefind_musl
andfind_uclibc` that do some branching and bit operations due to
chunking yield a lower number of instructions per second.
The results on the 2 generations older i5 are similar to the i7 ones. The difference is just that the i7 rate saturates between 3 and 3.5 while the i5 rate satures at around 2.5 for most versions.
Also, omitting the
vzeroupper instruction in the
find_avx2_nozero version (cf. previous article) only
impacts the rate on the i5 (CentOS 7) and not on the i7 (Fedora
On a SPARC T5 core there isn't much superscalar execution observable. Only up to 1.3 or so instruction per cycle are reached by some versions. Most versions execute at a rate of 1.
Different compilers can make a difference, e.g. for
(a simple loop) the code generated by GCC yields a slightly
On the Solaris 10 SPARC T5 system, the counted cycles yield a
rate of 3.2 GHz or so. This is less than the nominal 3.6 GHz,
because the hardware counter is only incremented on cycles
executed in user space. It is consistent with the output of
/usr/bin/time which reports up to 20 % or so time spend in the
kernel (sys-tem calls).
In comparison with Linux x86 systems, this value is also surprisingly high, i.e. on Linux x86 about 4 % or so sys-time is reported.
The input file for the benchmark isn't large, i.e. it is about 1 MiB. For the benchmarks it is stored on local storage. In each benchmark run it is read several times with an block size of 128 KiB such that most reads should come from the VM cache and the system call overhead should be small.
For the character searching benchmark, recent and not so recent mid-level Intel x86 CPUs outperform SPARC T5 ones because of better branch prediction and superscalar execution. On x86, up to 3 or more instructions are executed per cycle, where a SPARC T5 CPU basically effectively doesn't show much signs of superscalar execution.
SPARC T5 marketing material highlights the hyperthreading performance of the chip, i.e. a system can come with up to 8 sockets, where each chip has 16 cores and does 128 fold hyperthreading.
In realitiy, more hyperthreads on a chip aren't a good substitute for bad single core performance. Amdahl's Law shows that a perfect speedup can only be accomplished by a program that is 100 % parallelizable and on a system without communication or synchronisation overhead. And if a program is embarrassingly parallizable without much synchronisation needs, one doesn't necessarily need a shared memory system with many cores and lots of memory. Instead, a cluster of cheap x86 systems suffices.
As reported before, for this use-case, the SPARC T5 single core performance cannot compete even with a 2 generation old i5 desktop level x86 CPU. The program versions run twice as fast on the Intel i5, and more than three times as fast on an Intel i7. Even the execution on an AMD K10 CPU from 2007 is slightly faster.