SPARC and PPC find benchmark results
The article std::find() and memchr() Optimizations contains benchmark results for an Intel Core i5, an i7 and an older AMD system. This follow-up adds results for a relatively recent SPARC system and an older PPC one.
SPARC T5
First released in 2013, the SPARC T5 is a relatively common
mid-level SPARC system - if you use SPARC. It runs at 3.6 GHz.
The test system runs Solaris 10, with GCC 4.9.2 from OpenCSW and the
Solaris Studio 12.3 compilers available. Programs compiled with
Solaris Studio are prefixed with ss. The following compiler settings
are used:
GCC: -std=c++11 -O3 -m64 -mcpu=niagara4
SS : -xtarget=native64 -xO5
Note that the default STL (i.e. Sun STL) is used with Solaris Studio and that C++11 constructs in the test programs are replaced with equivalent pre-C++11 ones.
exe | min | median | mean | max | sdev | speedup |
---|---|---|---|---|---|---|
ss_find_uclibc | 98.90 | 99.05 | 99.19 | 99.80 | 0.30 | 1.27 |
find_memchr | 99.10 | 100.00 | 100.17 | 102.20 | 0.80 | 1.26 |
ss_find_memchr | 99.40 | 100.15 | 100.05 | 100.60 | 0.38 | 1.26 |
ss_find_musl | 99.9 | 100.7 | 100.5 | 100.8 | 0.35 | 1.25 |
find_find | 100.5 | 100.8 | 100.9 | 102.1 | 0.41 | 1.25 |
find_uclibc | 101.4 | 101.8 | 101.7 | 102.0 | 0.19 | 1.24 |
find_musl | 100.9 | 101.9 | 101.6 | 102.0 | 0.42 | 1.24 |
ss_find_naive | 105.2 | 105.5 | 105.5 | 105.7 | 0.13 | 1.19 |
ss_find_find | 106.9 | 107.1 | 107.1 | 107.3 | 0.13 | 1.18 |
find_naive | 125.6 | 126.0 | 126.1 | 126.5 | 0.25 | 1.00 |
The results show that the Solaris Studio compiler yields much better code for the naive loop version. That means that just compiling the naive loop with Solaris Studio gives a speedup of 1.19.
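The speedup column is consistent with dividing the median runtime of the baseline (find_naive) by the median runtime of each executable. This reading is inferred from the table values, not taken from the benchmark scripts:

```latex
\[
\text{speedup}(e) = \frac{\operatorname{median}(\texttt{find\_naive})}{\operatorname{median}(e)},
\qquad
\text{e.g.}\quad \frac{126.0}{105.5} \approx 1.19 \quad (\texttt{ss\_find\_naive}).
\]
```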
Especially noteworthy is that GNU STL's std::find() compiled with
GCC is 1.25 times faster than the naive implementation compiled with
GCC. It is thus 1.06 times faster than Sun's STL version compiled
with Solaris Studio, and also faster than the naive version compiled
with Solaris Studio.
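For reference, the kind of loop meant by "naive" can be sketched as follows. This is a minimal illustration, not the code from the benchmark repository; the function name naive_find is hypothetical.

```cpp
#include <cassert>

// Hypothetical illustration of a naive byte search: one comparison and
// one loop-condition check per element.
// Returns a pointer to the first c in [first, last), or last if absent.
const char* naive_find(const char* first, const char* last, char c) {
    assert(first <= last);
    while (first != last && *first != c)
        ++first;
    return first;
}
```

How well such a loop performs depends largely on what the compiler makes of it, which matches the gap between the GCC and Solaris Studio builds above.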
In contrast to x86, the non-SIMD chunked uclibc version is the
fastest one, followed by the version that just calls memchr()
from the Solaris libc, although the runtime difference is only a
second or so. Stepping through the Solaris libc memchr()
implementation shows that no SIMD instructions are used. This
hints that, although the CPU comes with some SIMD support, it
might not be as well suited for string processing as what is
available on x86.
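The "chunked" non-SIMD technique can be sketched roughly as below: read a machine word at a time and use a bit trick to test whether any byte in it matches. This is a simplified sketch under my own assumptions (64-bit words, memcpy for alignment-safe loads, the hypothetical name chunked_find); the actual uclibc and musl implementations differ in details such as alignment handling.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical word-at-a-time search, no SIMD instructions involved.
// Returns a pointer to the first c in [first, last), or last if absent.
const unsigned char* chunked_find(const unsigned char* first,
                                  const unsigned char* last,
                                  unsigned char c) {
    assert(first <= last);
    const std::uint64_t ones    = 0x0101010101010101ULL;
    const std::uint64_t highs   = 0x8080808080808080ULL;
    const std::uint64_t pattern = ones * c;  // c replicated into every byte

    while (last - first >= 8) {
        std::uint64_t word;
        std::memcpy(&word, first, sizeof word);  // alignment-safe load
        const std::uint64_t x = word ^ pattern;  // matching bytes become 0
        if ((x - ones) & ~x & highs)  // nonzero iff some byte of x is 0
            break;                    // pinpoint the match byte-wise below
        first += 8;
    }
    // Scan the word that contains a match, or the tail of up to 7 bytes.
    for (; first != last; ++first)
        if (*first == c)
            return first;
    return last;
}
```

The word test only answers "is there a match in these 8 bytes", so it does not depend on byte order; the final byte loop locates the exact position.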
Perhaps the most surprising result is that all runtimes are higher
than the ones reported on the x86 systems, some of them several
times higher. Even an AMD K10 desktop CPU from 2007 (with a lower
clock rate) is 1.07 times as fast as this SPARC CPU (when comparing
the memchr versions).
Looking at the Core i5 results, this low-cost and low-power desktop system has a speedup of 1.98 over the SPARC server. The Core i7 laptop CPU has a speedup of 3.16.
Because of the big differences to x86, I would like to verify the results on another SPARC system.
PPC G5
This test system is an Apple G5 with a 1.8 GHz PPC 970FX CPU (from
around 2004). It runs a ppc64 version of Debian 8, i.e. it comes
with GCC 4.9.2. Since GCC on PPC doesn't support -march=native
and creates 32 bit binaries by default, the following flags are
used: -O3 -mcpu=G5 -m64.
exe | min | median | mean | max | sdev | speedup |
---|---|---|---|---|---|---|
find_find | 306.4 | 307.5 | 307.9 | 311.8 | 1.34 | 1.23 |
find_musl | 308.1 | 309.2 | 309.2 | 311.3 | 0.78 | 1.22 |
find_uclibc | 311.8 | 312.8 | 313.3 | 317.3 | 1.50 | 1.21 |
find_memchr | 318.2 | 320.2 | 320.9 | 329.9 | 2.67 | 1.18 |
find_naive | 375.9 | 377.5 | 377.4 | 379.8 | 0.87 | 1.00 |
In contrast to all other systems, the libc memchr() version is
not ranked at the top. It is significantly slower than the
std::find() version that uses a simple loop unrolling scheme.
Thus, investigating the glibc memchr() implementation for PPC64 and
possibly replacing it with another one would be a useful task.
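A simple loop unrolling scheme of the kind mentioned above can be sketched like this. It is written in the spirit of the 4 times unrolled std::find() in GNU libstdc++, but it is my own simplified illustration (the name unrolled_find is hypothetical), not the library source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 4 times unrolled search: four comparisons per iteration,
// but only one loop-condition check.
// Returns a pointer to the first c in [first, last), or last if absent.
const char* unrolled_find(const char* first, const char* last, char c) {
    assert(first <= last);
    for (std::ptrdiff_t trips = (last - first) / 4; trips > 0; --trips) {
        if (first[0] == c) return first;
        if (first[1] == c) return first + 1;
        if (first[2] == c) return first + 2;
        if (first[3] == c) return first + 3;
        first += 4;
    }
    // Remaining 0 to 3 elements.
    for (; first != last; ++first)
        if (*first == c)
            return first;
    return last;
}
```

Unlike the chunked approach, this needs no bit tricks and no assumptions about word size or alignment.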
Conclusion
Looking at the benchmark results of the different machines, we can
see that the GNU libstdc++ based std::find() version (find_find)
is a good generic choice (for the use cases described previously).
On every tested system its runtime is lower than that of the naive
version and less than or equal to that of the chunked
implementations (e.g. uclibc/musl). Compared to the chunked
versions, the 4 times unrolled loop is also a much simpler and more
straightforward implementation.
On x86, using specialized SIMD extensions like SSE or AVX, if available, always improves the runtime considerably. It is thus beneficial to provide specialized versions that can be selected via a feature-dependent runtime dispatch.
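One way such a feature-dependent runtime dispatch might look is sketched below. It assumes GCC or Clang on x86, where the builtins __builtin_cpu_init() and __builtin_cpu_supports() are available, and falls back to a generic loop elsewhere. The names find_generic, find_sse2 and select_find are hypothetical, and SSE2 merely stands in for whatever extension a real implementation would target.

```cpp
#include <cassert>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

using find_fn = const char* (*)(const char*, const char*, char);

// Portable fallback, usable on every architecture.
const char* find_generic(const char* first, const char* last, char c) {
    assert(first <= last);
    for (; first != last; ++first)
        if (*first == c)
            return first;
    return last;
}

#ifdef HAVE_X86_DISPATCH
// SSE2 variant: compares 16 bytes per iteration.
__attribute__((target("sse2")))
const char* find_sse2(const char* first, const char* last, char c) {
    assert(first <= last);
    const __m128i needle = _mm_set1_epi8(c);
    while (last - first >= 16) {
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(first));
        // One mask bit per byte; a set bit marks a byte equal to c.
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask)
            return first + __builtin_ctz(mask);  // lowest set bit = first hit
        first += 16;
    }
    return find_generic(first, last, c);  // tail of fewer than 16 bytes
}
#endif

// Picks an implementation once, based on what the running CPU supports.
find_fn select_find() {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return find_sse2;
#endif
    return find_generic;
}
```

A caller would typically store the result of select_find() once, e.g. in a static function pointer, so that the feature check is not repeated on every call.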
Unless you are implementing a libc yourself, the system-provided
memchr() is most likely an efficient version that is on a par with,
or an improvement over, std::find() (with the exception of
glibc/PPC64).
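Using the system memchr() behind a std::find()-like interface takes only a few lines. The wrapper below is a sketch with a hypothetical name (memchr_find), not the code of the find_memchr benchmark program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical wrapper: delegates the search to the libc memchr().
// Returns a pointer to the first c in [first, last), or last if absent.
const char* memchr_find(const char* first, const char* last, char c) {
    assert(first <= last);
    const void* p = std::memchr(first, static_cast<unsigned char>(c),
                                static_cast<std::size_t>(last - first));
    return p ? static_cast<const char*>(p) : last;
}
```

Because memchr() returns a null pointer when nothing is found, the wrapper maps that case to last to match the iterator convention of std::find().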
As-is, the very slow runtimes under Solaris/SPARC are a reminder to do some benchmarks before buying this nowadays exotic architecture.
The results, test programs and benchmark scripts are available in the Git repository.
Update (2016-10-26): Measurements discussed in the follow-up article Counting CPU Events help to explain the causes for the huge runtime differences between the Linux/x86 and Solaris/SPARC systems.