Sep 16 2016 SPARC and PPC find benchmark results

The article std::find() and memchr() Optimizations contains benchmark results for an Intel Core i5, an i7 and an older AMD system. This follow-up adds results for a relatively recent SPARC system and an older PPC one.

SPARC T5

First released in 2013, the SPARC T5 is a relatively common mid-level SPARC system, if you use SPARC at all. It runs at 3.6 GHz. The test system runs Solaris 10 and has GCC 4.9.2 from OpenCSW and the Solaris Studio 12.3 compilers available. Programs compiled with Solaris Studio are prefixed with ss_. The following compiler settings are used:

GCC: -std=c++11 -O3 -m64 -mcpu=niagara4
SS : -xtarget=native64 -xO5

Note that the default STL (i.e. Sun STL) is used with Solaris Studio and that C++11 constructs in the test programs are replaced with equivalent pre-C++11 ones.

exe	min	median	mean	max	sdev	speedup
ss_find_uclibc	98.90	99.05	99.19	99.80	0.30	1.27
find_memchr	99.10	100.00	100.17	102.20	0.80	1.26
ss_find_memchr	99.40	100.15	100.05	100.60	0.38	1.26
ss_find_musl	99.9	100.7	100.5	100.8	0.35	1.25
find_find	100.5	100.8	100.9	102.1	0.41	1.25
find_uclibc	101.4	101.8	101.7	102.0	0.19	1.24
find_musl	100.9	101.9	101.6	102.0	0.42	1.24
ss_find_naive	105.2	105.5	105.5	105.7	0.13	1.19
ss_find_find	106.9	107.1	107.1	107.3	0.13	1.18
find_naive	125.6	126.0	126.1	126.5	0.25	1.00

The results show that the Solaris Studio compiler yields much better code for the naive loop version. That means that just compiling the naive loop with Solaris Studio gives a speedup of 1.19.
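The speedup column can be reproduced from the table. A minimal sketch, assuming that the speedup is the median runtime of the baseline (find_naive) divided by the median runtime of the program in question, which is consistent with the rows above. The function name is only illustrative:

```cpp
#include <cassert>

// Speedup relative to a baseline; values above 1 mean "faster than baseline".
// Assumption: both arguments are median runtimes in the same unit.
double speedup(double baseline_median, double median)
{
    assert(baseline_median > 0 && median > 0);
    return baseline_median / median;
}
```

For example, speedup(126.0, 105.5) is roughly 1.19 (ss_find_naive) and speedup(126.0, 99.05) is roughly 1.27 (ss_find_uclibc).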

Especially noteworthy is that GNU STL's std::find() compiled with GCC (find_find) is 1.25 times faster than the naive implementation. It is thus 1.06 times faster than Sun's STL version compiled with Solaris Studio (ss_find_find), and it is also faster than the naive loop compiled with Solaris Studio (ss_find_naive).

In contrast to x86, the non-SIMD chunked uclibc version is the fastest one, followed by the version that just calls memchr() from the Solaris libc, although the runtime difference is just a second or so. Stepping through the Solaris libc memchr() implementation shows that no SIMD instructions are used. This hints that, although the CPU comes with some SIMD support, it might not be as well suited for string processing as what is available on x86.
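To make the distinction concrete, here is a minimal sketch of a naive loop and of a non-SIMD chunked search that tests eight bytes per iteration with plain integer arithmetic. It is not the uclibc or musl code (those also take care of alignment and differ in detail). The function bodies, the 64-bit chunk size and the use of memcpy for the loads are assumptions made for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Naive version: compare one byte per iteration.
const char *find_naive(const char *first, const char *last, char c)
{
    for (; first != last; ++first)
        if (*first == c)
            return first;
    return last;
}

// Chunked version (illustrative): check 8 bytes at once without SIMD.
const char *find_chunked(const char *first, const char *last, char c)
{
    assert(first <= last);
    const std::uint64_t ones  = 0x0101010101010101ULL;
    const std::uint64_t highs = 0x8080808080808080ULL;
    // Replicate the searched byte into all 8 byte lanes.
    const std::uint64_t pattern = ones * static_cast<unsigned char>(c);

    while (last - first >= 8) {
        std::uint64_t w;
        std::memcpy(&w, first, sizeof w); // alignment-safe load
        const std::uint64_t x = w ^ pattern; // matching bytes become zero
        if ((x - ones) & ~x & highs)         // non-zero iff x has a zero byte
            break;                           // match is inside this chunk
        first += 8;
    }
    // Locate the exact position, or scan the remaining tail bytes.
    return find_naive(first, last, c);
}
```

Both functions follow the std::find() convention of returning last when the byte is not found.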

Perhaps the most surprising result is that all runtimes are several times higher than the ones reported on x86 systems. Even an AMD K10 desktop CPU from 2007 (with a lower clock rate) is 1.07 times as fast as this SPARC CPU (when comparing the memchr versions).

Looking at the Core i5 results, this low-cost and low-power desktop system has a speedup of 1.98 over the SPARC server. The Core i7 laptop CPU has a speedup of 3.16.

Because of the big differences compared to x86, I would like to verify the results on another SPARC system.

PPC G5

This test system is an Apple G5 with a 1.8 GHz PPC 970FX CPU (from around 2004). It runs a ppc64 version of Debian 8, i.e. it comes with GCC 4.9.2. Since GCC on PPC doesn't support -march=native and creates 32 bit binaries by default, the following flags are used: -O3 -mcpu=G5 -m64.

exe	min	median	mean	max	sdev	speedup
find_find	306.4	307.5	307.9	311.8	1.34	1.23
find_musl	308.1	309.2	309.2	311.3	0.78	1.22
find_uclibc	311.8	312.8	313.3	317.3	1.50	1.21
find_memchr	318.2	320.2	320.9	329.9	2.67	1.18
find_naive	375.9	377.5	377.4	379.8	0.87	1.00

In contrast to all other systems, the libc memchr() version is not ranked at the top. It is noticeably slower than the std::find() version, which uses a simple loop unrolling scheme. Thus, investigating the glibc memchr() implementation on PPC64 and possibly replacing it with another one would be a useful task.
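The loop unrolling scheme mentioned above can be sketched as follows. This is a simplified illustration in the spirit of what libstdc++ does for random access iterators, not its actual source (which, among other things, handles the tail differently). The name find_unrolled is hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// 4-times unrolled search: four comparisons per loop iteration,
// followed by a plain loop for the remaining 0 to 3 bytes.
const char *find_unrolled(const char *first, const char *last, char c)
{
    assert(first <= last);
    for (std::ptrdiff_t trips = (last - first) / 4; trips > 0; --trips) {
        if (first[0] == c) return first;
        if (first[1] == c) return first + 1;
        if (first[2] == c) return first + 2;
        if (first[3] == c) return first + 3;
        first += 4;
    }
    for (; first != last; ++first)
        if (*first == c)
            return first;
    return last; // not found, same convention as std::find()
}
```

Compared to a chunked implementation there is no bit manipulation involved, which is why this variant is easy to read and to verify.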

Conclusion

Looking at the benchmark results of the different machines, we can see that the GNU libstdc++ based std::find() version (find_find) is a good generic choice (for the use cases described previously). Its runtime is everywhere less than that of the naive version and less than or equal to that of the chunked implementations (e.g. uclibc/musl). In contrast to the chunked versions, the four-times unrolled loop is also a much simpler and more straightforward implementation.

On x86, using specialized SIMD extensions like SSE or AVX, if available, always improves the runtime considerably. It is thus beneficial to provide specialized versions that are possibly selected via a feature-dependent runtime dispatch.
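A feature-dependent runtime dispatch could look roughly like the following sketch. It assumes GCC or Clang on x86-64 and uses the compiler builtin __builtin_cpu_supports() together with SSE2 intrinsics; on other targets it falls back to std::find(). The names find_generic, find_sse2, select_find, find_dispatch and the macro HAVE_SSE2_PATH are hypothetical and not taken from the benchmark programs:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) && defined(__x86_64__)
#include <emmintrin.h> // SSE2 intrinsics
#define HAVE_SSE2_PATH 1
#endif

using find_fn = const char *(*)(const char *, const char *, char);

// Portable fallback.
const char *find_generic(const char *first, const char *last, char c)
{
    return std::find(first, last, c);
}

#ifdef HAVE_SSE2_PATH
// Compare 16 bytes per iteration using SSE2.
const char *find_sse2(const char *first, const char *last, char c)
{
    const __m128i needle = _mm_set1_epi8(c);
    for (; last - first >= 16; first += 16) {
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(first));
        // One bit per byte lane; set where the byte equals c.
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask)
            return first + __builtin_ctz(mask); // lowest set bit = first match
    }
    return std::find(first, last, c); // tail of fewer than 16 bytes
}
#endif

// Pick an implementation based on what the running CPU supports.
find_fn select_find()
{
#ifdef HAVE_SSE2_PATH
    if (__builtin_cpu_supports("sse2"))
        return find_sse2;
#endif
    return find_generic;
}

const char *find_dispatch(const char *first, const char *last, char c)
{
    assert(first <= last);
    static const find_fn impl = select_find(); // resolved once, on first call
    return impl(first, last, c);
}
```

Resolving the function pointer once keeps the per-call overhead at a single indirect call. A libc would typically do the selection at load time instead.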

Unless you are implementing a libc yourself, the system-provided memchr() is most likely an efficient version that is on a par with or an improvement over std::find() (with the exception of glibc/PPC64).
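Using the system memchr() behind a std::find() style interface takes only a few lines. A minimal sketch, where the wrapper shown is an assumption about how such a function might be written, not the code of the benchmarked find_memchr program:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Search [first, last) for c via the libc memchr().
// Returns last if c does not occur, like std::find().
const char *find_memchr(const char *first, const char *last, char c)
{
    assert(first <= last);
    const void *p = std::memchr(first, static_cast<unsigned char>(c),
                                static_cast<std::size_t>(last - first));
    return p ? static_cast<const char *>(p) : last;
}
```

The only real difference to calling memchr() directly is the mapping of the null pointer result to last.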

As it stands, the very slow runtimes under Solaris/SPARC are a reminder to run some benchmarks before buying into this nowadays exotic architecture.

The results, test programs and benchmark scripts are available in the Git repository.

Update (2016-10-26): Measurements discussed in the follow-up article Counting CPU Events help to explain the causes for the huge runtime differences between the Linux/x86 and Solaris/SPARC systems.