Counting CPU Events

This article discusses hardware performance counter measurements obtained for a character search benchmark running on Intel x86 and SPARC hardware. These results help explain the large runtime differences between a SPARC T5 system and Intel x86 ones (SPARC being much slower), as well as the differences between the various implementations.

Current CPUs include hardware performance counters that are incremented on events like an executed instruction, a mispredicted branch or just each clock cycle. They are useful metrics to sample during profiling since they paint an accurate picture of the effects the code under investigation has on real hardware.

Motivation

A pipelined architecture usually comes with some machinery to predict the result of a conditional branch, such that the pipeline always stays well filled. Without pipelining, a CPU needs several clock cycles to process an instruction (e.g. at least 3 cycles for fetch + decode + execute, in a simple model). When the pipeline works well, i.e. when it is filled, the CPU may execute up to 1 instruction per cycle.

Modern CPUs aren't just pipelined, they are usually also superscalar, i.e. they are able to fetch, decode and execute multiple instructions in parallel, thereby exploiting instruction-level parallelism. This parallelism happens automatically within a core, i.e. the speedup benefits a single thread of execution.

Thus, profiling with hardware counters shows whether the code actually makes good use of the CPU. For example, dividing the number of mispredicted branches by the number of all (conditional) branches gives the misprediction rate. If it is high, that is a hint to optimize the code such that the number of branches is reduced, if possible.

Similarly, the number of instructions divided by the number of cycles shows whether the code is well suited for superscalar execution or not. Perhaps it can be optimized to remove unnecessary dependencies at the instruction level.
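
Both metrics are simple ratios of two counters. As a minimal sketch in Python (the function names are made up for illustration; the example values are taken from the perf stat run shown in the next section):

def branch_misprediction_rate(branch_misses, branches):
    # fraction of (conditional) branches that were mispredicted
    return branch_misses / branches

def instructions_per_cycle(instructions, cycles):
    # average number of instructions executed per clock cycle (IPC)
    return instructions / cycles

print(branch_misprediction_rate(1_101_990_382, 126_908_078_140))  # ~0.0087
print(instructions_per_cycle(459_613_530_908, 147_362_896_070))   # ~3.12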

Perf Example

On Linux, the perf stat command can be used to sample the hardware performance counters of a CPU while running a program. perf list lists all available counters; a useful subset is selected by default.

An example (executed on a 6th gen Intel i7-6600U CPU @ 2.60GHz):

$ perf stat ./find_unroll2 30000 in > /dev/null

Performance counter stats for './find_unroll2 30000 in':

     45458.422057      task-clock:u (msec)       #    1.000 CPUs utilized
                0      context-switches:u        #    0.000 K/sec
                0      cpu-migrations:u          #    0.000 K/sec
               94      page-faults:u             #    0.002 K/sec
  147,362,896,070      cycles:u                  #    3.242 GHz
  459,613,530,908      instructions:u            #    3.12  insn per cycle
  126,908,078,140      branches:u                # 2791.740 M/sec
    1,101,990,382      branch-misses:u           #    0.87% of all branches

   45.457945052 seconds time elapsed

It is perhaps a little bit surprising that it reports 3.242 GHz although the CPU is rated at 2.6 GHz. This is due to the turbo-boost feature, where the CPU can temporarily 'overclock' if just one core is busy (such that it doesn't overheat). The numbers are internally consistent, e.g. cycles divided by runtime:

147362896070 45458.422057 1000 / / 10 9 ^ / p
=> 3.241 GHz

Or mis-predicted branches divided by all branches:

1101990382 126908078140 / p
=> .0086 branch-mis-rate (i.e. 0.86 %)

Solaris/SPARC example

For reading out the hardware counters on Solaris 10, cputrack can be used as a perf stat replacement. It provides access to fewer counters than its Linux counterpart and fewer than the Solaris 11 version, but some essential ones are there.

Help output of cputrack:

Usage:
    cputrack [-T secs] [-N count] [-Defhnv] [-o file]
        -c events [command [args] | -p pid]

    -T secs   seconds between samples, default 1
    -N count  number of samples, default unlimited
    -D        enable debug mode
    -e        follow exec(2), and execve(2)
    -f        follow fork(2), fork1(2), and vfork(2)
    -h        print extended usage information
    -n        suppress titles
    -t        include virtualized %tick register
    -v        verbose mode
    -o file   write cpu statistics to this file
    -c events specify processor events to be monitored
    -p pid    pid of existing process to capture

    Use cpustat(1M) to monitor system-wide statistics.

Example call:

$ cputrack -o foo.log -c Cycles_user,PAPI_tot_ins,PAPI_br_cn,PAPI_br_msp \
  ./ss_find_find 30000 in > /dev/null
   time lwp  event      pic0      pic1  pic2      pic3
[..]
104.996   1   tick 3212642270 4080318300 1379858144   8494482
105.996   1   tick 3211969907 4078677601 1379286911   8491144
106.996   1   tick 3211183407 4078776284 1379331819   8491657
107.283   1   exit 344491651460 437368550542 147905759012 910664380
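
The pic0 to pic3 columns correspond to the events listed after -c, and the exit row contains the totals for the whole run. A minimal Python sketch of how such a log could be turned into the derived metrics (the parsing is simplified and assumes exactly this counter order):

def parse_cputrack_exit(path):
    # return the counter totals from the 'exit' row of a cputrack log,
    # assuming the order Cycles_user,PAPI_tot_ins,PAPI_br_cn,PAPI_br_msp
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 7 and fields[2] == 'exit':
                return [int(x) for x in fields[3:7]]
    raise ValueError('no exit record found')

cycles, instructions, branches, misses = parse_cputrack_exit('foo.log')
print(instructions / cycles)   # instructions per cycle, ~1.27 for the run above
print(misses / branches)       # branch misprediction rate, ~0.0062 (0.62 %)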

Methods

The results are gathered with the benchmark.py utility. This tool abstracts away some of the differences between the systems and generates statistics and plots for multiple runs. For example, the perf stat CSV output on CentOS 7 lacks some columns and uses a different counter naming scheme. On Solaris, the tool calls cputrack instead and normalizes its output.

Each program version is executed and profiled 10 times.
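
The aggregation step roughly boils down to the following (a minimal sketch, not the actual benchmark.py; the counter names and structure are assumptions):

import statistics

def aggregate(runs):
    # `runs` is a list of per-run counter dicts, e.g.
    # [{'cycles': ..., 'instructions': ..., 'branches': ..., 'branch-misses': ...}, ...]
    summary = {}
    for counter in runs[0]:
        values = [run[counter] for run in runs]
        summary[counter] = (statistics.mean(values), statistics.stdev(values))
    ipc = [run['instructions'] / run['cycles'] for run in runs]
    misp = [run['branch-misses'] / run['branches'] for run in runs]
    summary['ipc'] = (statistics.mean(ipc), statistics.stdev(ipc))
    summary['misprediction-rate'] = (statistics.mean(misp), statistics.stdev(misp))
    return summary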

The used CPUs are:

  • Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz (6th gen, turbo-boost up to 3.6 GHz)
  • Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz (4th gen, turbo-boost up to 2.6 GHz)
  • SPARC T5 @ 3.6 GHz

When running the benchmarks in a virtual machine (VM), the VM complicates accessing the hardware counters. For example, VirtualBox doesn't support any counter access at all. Linux KVM does support it, though.

With Solaris 10 ldoms, access is possible but must be configured; otherwise cputrack fails with errors like:

123456: fini_lwp: lwp1: perf counter contents invalidated

The error occurs if the ldom's perf-counters property isn't set. If it is set to htstrand, it should work.

Of course, the best thing is to run the benchmarks outside of any virtualization, if possible.

Results

The different measurements were done on an Intel Core i7, an Intel Core i5 and a SPARC T5 CPU. The raw data is available in a Git repository.

Branch Prediction

Core i7-6600U branch misprediction rate

In general, branches are predicted pretty well for all the different versions. The results match previous observations.

The find_uclibx86 version uses the very CISC instruction REPNE SCASB, which replaces much of the visible branching, thus there isn't much to predict (and this instruction is very slow on modern Intel CPUs).

Versions like find_avx_more or find_avx2_overflow that do a little bit more to avoid some final looping (and thus branching) for the tail portion of the input also benefit from fewer mispredictions.

Also not unexpectedly, the versions find_musl and find_uclibc have the highest branch misprediction rate because their chunked approach contains a lot of branching that is apparently not a good match for the branch-prediction units.
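
For illustration, this is the kind of word-at-a-time bit trick such chunked implementations rely on, transcribed to Python (the real versions are C and operate on machine words; the extra branching comes from the surrounding alignment handling, the word loop and the byte-wise tail search):

ONES  = 0x0101010101010101   # 0x01 in every byte of a 64-bit word
HIGHS = 0x8080808080808080   # 0x80 in every byte of a 64-bit word
M64   = (1 << 64) - 1        # emulate 64-bit wrap-around in Python

def word_contains_byte(word, c):
    # True if any byte of the 64-bit word equals the byte value c:
    # XOR turns matching bytes into zero bytes, then the classic
    # has-zero-byte test detects them
    x = word ^ (c * ONES)
    return (((x - ONES) & M64) & ~x & HIGHS) != 0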

The difference between find_unroll and find_unroll_3 is consistent with the previous runtime benchmark, where the first version has the loop unrolled 4 times (instead of 3) and runs slower. Unrolling more introduces more branches, and more branches require more space to store the predictions.

Core i5-4250U branch misprediction rate

The results on an Intel Core i5 are quite similar to the i7 ones.

SPARC T5 branch misprediction rate

In comparison with x86, the results are more clearly partitioned into 2 groups. Either 0.5 to 1 % of the branches are mispredicted (e.g. find_naive) or the percentage rises to over 4 %. The lower bound is similar to the results on x86, but the upper bound is several times higher than on x86 and it is reached by many more versions.

Compiling the versions with different compilers (Solaris Studio 12.3 vs. GCC 4.9) doesn't make a difference as far as branch prediction is concerned. Note that ss_find_find also uses a different STL.

Superscalarity

Core i7-6600U instructions per cycle

Those results are also consistent with the measured runtimes. Versions with the longest runtime also execute fewer instructions per cycle. Again, the usage of the very CISC REPNE SCASB instruction in find_uclibx86 yields the worst result, probably due to a micro-code implementation that blocks the usual superscalar mechanism. Also again, the reduction of the unroll factor from 4 to 3 (in find_unroll vs. find_unroll_3) has a positive effect. As expected, slow versions like find_musl and find_uclibc that do some branching and bit operations due to chunking yield a lower number of instructions per cycle.

Core i5-4250U instructions per cycle

The results on the 2 generations older i5 are similar to the i7 ones. The difference is just that the i7 rate saturates between 3 and 3.5, while the i5 rate saturates at around 2.5 for most versions.

Also, omitting the vzeroupper instruction in the find_avx2_nozero version (cf. previous article) only impacts the rate on the i5 (CentOS 7) and not on the i7 (Fedora 23) system.

SPARC T5 instructions per cycle

On a SPARC T5 core there isn't much superscalar execution observable. Only up to 1.3 or so instructions per cycle are reached by some versions. Most versions execute at a rate of 1.

Different compilers can make a difference, e.g. for find_naive (a simple loop) the code generated by GCC yields a slightly better rate.

Other factors

On the Solaris 10 SPARC T5 system, the counted cycles yield a rate of 3.2 GHz or so. This is less than the nominal 3.6 GHz because the hardware counter is only incremented for cycles executed in user space. It is consistent with the output of /usr/bin/time, which reports up to 20 % or so of the time spent in the kernel (system calls).

In comparison with the Linux x86 systems, this value is also surprisingly high, i.e. on Linux x86 only about 4 % or so sys-time is reported.

The input file for the benchmark isn't large, i.e. it is about 1 MiB, and it is stored on local storage. In each benchmark run it is read several times with a block size of 128 KiB, such that most reads should come from the VM cache and the system call overhead should be small.
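
The read pattern is roughly the following (a Python sketch for illustration; the actual benchmark programs are C):

def read_in_blocks(filename, block_size=128 * 1024):
    # yield the file contents in 128 KiB blocks
    with open(filename, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block

# e.g. search each block for a character:
total = sum(block.count(ord('a')) for block in read_in_blocks('in'))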

Conclusion

For the character searching benchmark, recent and not so recent mid-level Intel x86 CPUs outperform SPARC T5 ones because of better branch prediction and superscalar execution. On x86, up to 3 or more instructions are executed per cycle, whereas a SPARC T5 CPU effectively doesn't show many signs of superscalar execution.

SPARC T5 marketing material highlights the hyperthreading performance of the chip, i.e. a system can come with up to 8 sockets, where each chip has 16 cores and does 128-fold hyperthreading.

In reality, more hyperthreads on a chip aren't a good substitute for bad single-core performance. Amdahl's Law shows that a perfect speedup can only be accomplished by a program that is 100 % parallelizable, on a system without communication or synchronisation overhead. And if a program is embarrassingly parallelizable without much need for synchronisation, one doesn't necessarily need a shared-memory system with many cores and lots of memory. Instead, a cluster of cheap x86 systems suffices.
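
As a back-of-the-envelope illustration of Amdahl's Law (the 95 % parallel fraction is just an assumed example value, not a measurement from this benchmark):

def amdahl_speedup(parallel_fraction, n_threads):
    # upper bound on the speedup for a given parallel fraction
    return 1 / ((1 - parallel_fraction) + parallel_fraction / n_threads)

# even with 95 % of the work parallelizable, 128 hardware threads
# yield at most a ~17x speedup, far from a 128x speedup:
print(amdahl_speedup(0.95, 128))   # ~17.4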

As reported before, for this use-case, the SPARC T5 single-core performance cannot compete even with a 2 generations old desktop-level Intel i5 CPU. The program versions run twice as fast on the Intel i5, and more than three times as fast on an Intel i7. Even the execution on an AMD K10 CPU from 2007 is slightly faster.