std::find() and memchr() Optimizations
Assume you need to parse a record-based format with flexible width and one-byte delimiters. When using C++, the std::find() STL algorithm is the obvious choice for efficiently locating the delimiters. The idiomatic C solution is to use memchr().
- Expectations
- Sample Code
- Methods
- How fast is std::find()?
- Does std::find() call memchr()?
- Who unrolled the loop?
- Doesn't a compiler optimize a naive loop on its own?
- Is std::find() faster than memchr()?
- Does SIMD really speed things up?
- How fast is just SSE?
- What about AVX-512?
- Glibc memchr()
- Controlled Buffer Overflow
- Do more faster?
- Alignment
- Inlining
- Other C libraries
- Measurements
- Conclusion
Expectations¶
Like with other STL algorithms, one expects that std::find() is highly optimized and specialized for different arguments. For example, the minimum requirement for the range is input iterators, but if we use - say - random access iterators for searching a range of char elements, we expect a specialized version. For example, one that uses SIMD (single instruction multiple data) vector instructions, where available.
Similarly, one expects that the libc implementation of memchr() does runtime dispatching based on the available CPU extensions (and perhaps the search range size) and uses a specialized version. Again, if available, one that uses SIMD instructions.
In fact, std::find() could even just call memchr() for a random access character range.
It is also possible that the compiler directly inlines memchr() calls with an optimized implementation.
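For illustration, a rough sketch of that kind of dispatch - a hypothetical find() wrapper (not part of any library) that forwards contiguous char ranges to memchr() and falls back to the generic algorithm otherwise:
#include <algorithm>
#include <cstring>

// hypothetical dispatching wrapper - not part of any library
template <typename Iterator, typename T>
Iterator find_c(Iterator b, Iterator e, const T &v)
{
    return std::find(b, e, v);   // generic fallback
}

inline const char *find_c(const char *b, const char *e, char c)
{
    // contiguous char range: delegate to the (hopefully optimized) libc
    const void *r = memchr(b, c, e - b);
    return r ? static_cast<const char*>(r) : e;
}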
Let's see which of those expectations are actually met.
Sample Code¶
This example is a minimal version of a real parser that does enough to see differences in implementation details of a character search function:
int main(int argc, char **argv)
{
    if (argc < 3)
        return 1;
    unsigned n = atoi(argv[1]);
    int fd = x::open(argv[2], O_RDONLY);
    enum { N = 128 * 1024 };
    char buffer[N+64];
    for (unsigned i = 0; i < n; ++i) {
        size_t off = 0;
        if (i)
            x::lseek(fd, 0, SEEK_SET);
        for (;;) {
            auto r = x::read(fd, buffer, N);
            const char *b = buffer;
            auto e = b + r;
            for (;;) {
                auto x = g::find(b, e, '\n');
                auto y = g::find(b, x, '|');
                off += y - b;
                if (y != x)
                    printf("%zu\n", off);
                if (x == e)
                    break;
                off = 0;
                b = x + 1;
            }
            if (r != N)
                break;
        }
    }
    return 0;
}
The x namespace contains wrappers for some POSIX functions that throw exceptions on error. The function g::find() wraps different versions of std::find().
A read size of 128 KiB is used because it is a multiple of 4 KiB and 16 KiB page sizes and it usually gives good throughput on modern systems (cf. e.g. the comments in the GNU coreutils implementation of cp).
The buffer is 64 bytes bigger than strictly necessary to allow for implementations that read over the right boundary (i.e. up to 64 bytes).
The outermost loop is there for benchmarking purposes, i.e. to simulate large files with many records.
In the example input format, records are delimited by a newline and an important prefix is delimited by a pipe character. The simplified purpose is to print the prefix length of each record.
Basically, its functionality is equivalent to:
for i in $(seq $n); do awk -F'|' '{print length($1)}' $input; done
Or this Python 3 program:
import sys
n = int(sys.argv[1])
for i in range(n):
    with open(sys.argv[2], 'r') as f:
        for line in f:
            print(line.find('|'))
The test program and other material are contained in a Git repository.
Methods¶
We generate some sample records by dumping a dictionary and reformatting it a little bit:
aspell -d en dump master | aspell -l en expand \
| paste '-d,,,,|,,' - - - - - - - - > in
On a Fedora 23 system, this yields a file with 14972 records (i.e. lines) where the average prefix and suffix sizes are 47 and 27 characters.
Unless noted otherwise:
- The test input is generated on a Fedora 23 system as described above and it is processed with 30000 iterations.
- Throughout the text the rounded median runtimes on an Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz (6th generation Core i7) are reported. This CPU notably comes with the AVX2 SIMD extension, but not with the AVX-512 extension. More details and runtime results for other systems are available in the Section Measurements.
- The main test system runs Fedora 23 (64 Bit).
- All examples are compiled with GCC 5.3.1 (Fedora 23) with the options -O3 -march=native.
To get an upper bound, the following GNU awk call finishes in 60 seconds (on the described system):
time for i in {1..3000}; do cat in ; done \
| awk -F'|' '{print length($1)}' > /dev/null
While the Python version finishes in 41 seconds:
time ./find.py 3000 in > /dev/null
Note that the following benchmarks use 10 times more iterations, because the test programs run much faster.
The test program, the used input, the different std::find and memchr() versions and the benchmark scripts are available in a Git repository.
How fast is std::find()?¶
The straightforward idiomatic C++ definition of g::find() is just to call std::find():
#include <algorithm>
namespace g {
    const char *find(const char *b, const char *e, char c)
    {
        return std::find(b, e, c);
    }
}
That version has a median runtime of 46 seconds with our 30000 times 14972 records test input. This is 9 times or so faster than the Awk and Python implementations.
Does std::find() call memchr()?¶
The easiest way to find out is to let GCC emit the assembler output like this:
g++ -O3 -march=native find_find_impl.cc -S -o - | grep -v '^'$'\t''\.'
Which produces:
g::find(char const*, char const*, char):
movq %rsi, %r9
movl %edx, %ecx
subq %rdi, %r9
movq %r9, %r8
sarq $2, %r8
testq %r8, %r8
jle .L14
cmpb (%rdi), %dl
je .L15
cmpb 1(%rdi), %dl
je .L31
cmpb 2(%rdi), %dl
je .L32
cmpb 3(%rdi), %dl
je .L33
leaq (%rdi,%r8,4), %r8
jmp .L9
.L10:
cmpb 4(%rdi), %cl
je .L29
cmpb 1(%rax), %cl
je .L34
cmpb 2(%rax), %cl
je .L35
cmpb 3(%rax), %cl
je .L36
movq %rax, %rdi
.L9:
leaq 4(%rdi), %rax
cmpq %r8, %rax
jne .L10
movq %rsi, %r9
subq %r8, %r9
.L2:
cmpq $2, %r9
je .L11
cmpq $3, %r9
je .L12
cmpq $1, %r9
je .L13
movq %rsi, %rax
.L29:
rep ret
.L34:
leaq 5(%rdi), %rax
ret
.L35:
leaq 6(%rdi), %rax
ret
.L36:
leaq 7(%rdi), %rax
ret
.L12:
cmpb (%rax), %dl
je .L29
addq $1, %rax
.L11:
cmpb (%rax), %dl
je .L29
addq $1, %rax
.L13:
cmpb %dl, (%rax)
cmovne %rsi, %rax
ret
.L15:
movq %rdi, %rax
ret
.L14:
movq %rdi, %rax
jmp .L2
.L33:
leaq 3(%rdi), %rax
ret
.L32:
leaq 2(%rdi), %rax
ret
.L31:
leaq 1(%rdi), %rax
ret
Alternatively, one can also copy and paste the C++ code into the GCC explorer and look at the assembler output. This is also convenient for comparing different compilers.
The code doesn't just call memchr().
This also shows:
- the function has 8 return instructions, which hints at loop unrolling
- the loop body starting at .L10 includes 4 similar-looking compare instructions - again, this looks like an unrolled loop
- no SIMD instructions are used
Who unrolled the loop?¶
A look at the GNU STL source code in /usr/include/c++/5.3.1/bits/stl_algo.h gives us the answer:
template<typename _InputIterator, typename _Tp>
  inline _InputIterator
  find(_InputIterator __first, _InputIterator __last,
       const _Tp& __val)
  {
    // concept requirements
    __glibcxx_function_requires(_InputIteratorConcept<_InputIterator>)
    __glibcxx_function_requires(_EqualOpConcept<
          typename iterator_traits<_InputIterator>::value_type, _Tp>)
    __glibcxx_requires_valid_range(__first, __last);
    return std::__find_if(__first, __last,
                          __gnu_cxx::__ops::__iter_equals_val(__val));
  }
Which calls a generic __find_if() implementation:
template<typename _Iterator, typename _Predicate>
  inline _Iterator
  __find_if(_Iterator __first, _Iterator __last, _Predicate __pred)
  {
    return __find_if(__first, __last, __pred,
                     std::__iterator_category(__first));
  }
Which calls a version specialized for random access iterators:
template<typename _RandomAccessIterator, typename _Predicate>
  _RandomAccessIterator
  __find_if(_RandomAccessIterator __first, _RandomAccessIterator __last,
            _Predicate __pred, random_access_iterator_tag)
  {
    typename iterator_traits<_RandomAccessIterator>::difference_type
      __trip_count = (__last - __first) >> 2;
    for (; __trip_count > 0; --__trip_count)
      {
        if (__pred(__first))
          return __first;
        ++__first;
        if (__pred(__first))
          return __first;
        ++__first;
        if (__pred(__first))
          return __first;
        ++__first;
        if (__pred(__first))
          return __first;
        ++__first;
      }
    switch (__last - __first)
      {
      case 3:
        if (__pred(__first))
          return __first;
        ++__first;
      case 2:
        if (__pred(__first))
          return __first;
        ++__first;
      case 1:
        if (__pred(__first))
          return __first;
        ++__first;
      case 0:
      default:
        return __last;
      }
  }
Indeed, the loop is manually unrolled. The 4 compare instructions are a direct result of having 4 if statements in the loop body. The 8 return statements are a consequence of that and the fall-through switch-case statement that deals with the last characters of the range.
This optimization is promising because on a super-scalar architecture, multiple statements might be executed (speculatively) in parallel.
Update (2016-09-20): Interestingly, the LLVM libc++ doesn't implement any specializations for std::find(). There is just one implementation that uses a simple loop, i.e. no specialization for random-access iterators.
Update (2016-09-28): When compiling the naive implementation with GCC and -funroll-loops (or via __attribute__((optimize("unroll-loops"))) - cf. find_unroll_impl.cc), GCC automatically unrolls the loop 8 times. This unroll factor turns out to be sub-optimal on the Intel i7, i.e. this version is 9 percent slower than the naive one.
Taking this result into account, we may question whether an unroll factor of 4 is really the best we can do. For this benchmark, on the Intel i7, the sweet spot is indeed an unroll factor of 3 (cf. find_unroll2_impl.cc). This version runs 1 second or so faster than the 4 times unrolled find_find version.
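A sketch of what a 3-times unrolled variant could look like (the actual find_unroll2_impl.cc in the repository may be organized differently):
// sketch of a 3-times unrolled find - the repository's find_unroll2_impl.cc
// may differ in detail
const char *find_unroll3(const char *b, const char *e, char c)
{
    const char *i = b;
    for (size_t n = (e - b) / 3; n; --n) {
        if (i[0] == c) return i;
        if (i[1] == c) return i + 1;
        if (i[2] == c) return i + 2;
        i += 3;
    }
    // at most 2 trailing characters remain
    for (; i != e; ++i)
        if (*i == c)
            return i;
    return e;
}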
Doesn't a compiler optimize a naive loop on its own?¶
Modern compilers implement sophisticated loop optimizations, including auto-vectorization; thus, perhaps a naive loop also yields good code.
Example implementation:
const char *find_naive(const char *b, const char *e, char c)
{
    for (const char *i = b; i != e; ++i)
        if (*i == c)
            return i;
    return e;
}
With that GCC yields:
find_naive(char const*, char const*, char):
cmpq %rsi, %rdi
movl %edx, %eax
jnb .L6
cmpb (%rdi), %dl
jne .L4
jmp .L8
.L5:
cmpb %al, (%rdi)
je .L8
.L4:
addq $1, %rdi
cmpq %rdi, %rsi
jne .L5
.L6:
movq %rsi, %rax
ret
.L8:
movq %rdi, %rax
ret
(See also GCC explorer.)
Which is a pretty straightforward translation of the loop. There is no auto-vectorization and no loop-unrolling going on.
Adding the compiler switch -fopt-info-vec-all yields diagnostics of the different auto-vectorization phases:
note: not vectorized: control flow in loop.
note: bad loop form.
[..]
note: not vectorized: not enough data-refs in basic block.
[..]
But how bad is such a naive loop?
Its median runtime is 50 seconds, just a few seconds more than std::find() (47 seconds), which implements loop unrolling.
Is std::find() faster than memchr()?¶
The following definition
#include <string.h>
const char *find_memchr(const char *b, const char *e, char c)
{
    const char *r = static_cast<const char*>(memchr(b, c, e-b));
    return r ? r : e;
}
yields a median runtime of 32 seconds. Thus, memchr is 1.6 times or so faster than the naive implementation. Clearly, its implementation is different from the std::find() one.
Looking at the generated code
find_memchr(char const*, char const*, char):
pushq %rbx
movq %rsi, %rbx
movl %edx, %esi
movq %rbx, %rdx
movsbl %sil, %esi
subq %rdi, %rdx
call memchr
testq %rax, %rax
cmove %rbx, %rax
popq %rbx
ret
(cf. GCC explorer)
we see that the compiler doesn't directly inline the memchr() call but just calls the version from glibc.
Does SIMD really speed things up?¶
The AVX2 extension is a relatively recent SIMD implementation Intel has to offer on many of its CPUs - e.g. also on a standard Skylake CPU. It features registers of 256 bit width, i.e. 32 characters can be packed into one register. Also, in contrast to some older SIMD extensions it has relaxed alignment requirements, which simplifies its usage and enables more use cases, in general. Although some instructions might perform better with aligned data (as documented), this isn't the general rule.
With GCC there are a few alternatives for how to use those SIMD instructions:
- Vector Extensions
- inline assembler
- external assembler
- builtins
- intrinsics
The idea of the vector extensions is to abstract away the different vector extensions on different architectures in order to write portable vector code. This approach sounds elegant, but as-is the GCC vector extension syntax seems to be mainly focused on arithmetic operations and some useful operations are lacking.
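For example, the element-wise comparison itself is easy to express with the vector extensions, but a portable counterpart to the movemask operation used by the intrinsics versions below is missing - a sketch, assuming GCC:
// sketch using GCC vector extensions (GCC-specific); the comparison is
// expressible, but there is no portable movemask, so the match position
// has to be extracted with a scalar loop
typedef char v32c __attribute__((vector_size(32)));

const char *find_vecext(const char *b, const char *e, char c)
{
    v32c q;
    for (unsigned k = 0; k < 32; ++k)
        q[k] = c;
    const char *i = b;
    for (; i + 32 < e; i += 32) {
        v32c x;
        __builtin_memcpy(&x, i, sizeof(x));
        auto m = (x == q);                  // 0 or -1 per element
        for (unsigned k = 0; k < 32; ++k)   // no movemask equivalent ...
            if (m[k])
                return i + k;
    }
    for (; i < e; ++i)
        if (*i == c)
            return i;
    return e;
}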
Using inline assembler or even an external assembler is OK - but interfacing with the rest of the program requires some care.
The intrinsics are available via a header that is standardized by Intel, thus those intrinsics are portable to other x86 compilers. GCC maps them to its builtins (which are one-to-one mappings to the underlying instructions). In addition, good documentation is available online, including a searchable reference and tutorials.
A quick consultation of the Intel Intrinsics Guide leads to the following AVX find() version:
#include <immintrin.h>
const char *find_avx2(const char *b, const char *e, char c)
{
    const char *i = b;
    __m256i q = _mm256_set1_epi8(c);
    for (; i+32 < e; i+=32) {
        __m256i x = _mm256_lddqu_si256(
                reinterpret_cast<const __m256i*>(i));
        __m256i r = _mm256_cmpeq_epi8(x, q);
        int z = _mm256_movemask_epi8(r);
        if (z)
            return i + __builtin_ffs(z) - 1;
    }
    for (; i < e; ++i)
        if (*i == c)
            return i;
    return e;
}
The __builtin_ffs function is a standard GCC builtin for the find-first-set bit operation.
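To make the off-by-one explicit - __builtin_ffs() returns the 1-based index of the least significant set bit and 0 if no bit is set, hence the - 1 above:
#include <cassert>

int main()
{
    // 1-based index of the least significant set bit, 0 if no bit is set
    assert(__builtin_ffs(0) == 0);
    assert(__builtin_ffs(1) == 1);   // bit 0 -> 1
    assert(__builtin_ffs(8) == 4);   // bit 3 -> 4
    // thus the match offset inside a 32 byte block is __builtin_ffs(z) - 1
    return 0;
}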
The median runtime of this AVX2 version is 30 seconds, which is a tiny bit faster than memchr() (by 2 seconds) and thus about 1.7 times faster than the naive implementation.
Looking at the assembler output
find_avx2(char const*, char const*, char):
leaq 8(%rsp), %r10
andq $-32, %rsp
movl %edx, %eax
pushq -8(%r10)
pushq %rbp
movq %rsp, %rbp
pushq %r10
movb %dl, -17(%rbp)
vpbroadcastb -17(%rbp), %ymm1
jmp .L5
.L2:
vlddqu (%rdi), %ymm0
vpcmpeqb %ymm0, %ymm1, %ymm0
vpmovmskb %ymm0, %ecx
testl %ecx, %ecx
jne .L17
movq %r8, %rdi
.L5:
leaq 32(%rdi), %r8
cmpq %rsi, %r8
jb .L2
cmpq %rsi, %rdi
jnb .L9
cmpb %dl, (%rdi)
jne .L7
jmp .L12
.L8:
cmpb (%rdi), %al
je .L12
.L7:
addq $1, %rdi
cmpq %rdi, %rsi
jne .L8
.L9:
movq %rsi, %rax
vzeroupper
popq %r10
popq %rbp
leaq -8(%r10), %rsp
ret
.L17:
tzcntl %ecx, %ecx
addl $1, %ecx
movslq %ecx, %rcx
leaq -1(%rdi,%rcx), %rax
vzeroupper
popq %r10
popq %rbp
leaq -8(%r10), %rsp
ret
.L12:
movq %rdi, %rax
vzeroupper
popq %r10
popq %rbp
leaq -8(%r10), %rsp
ret
(cf. GCC explorer)
we see how the intrinsics are directly translated to the vpbroadcastb, vlddqu, vpcmpeqb and vpmovmskb instructions.
Perhaps surprisingly, each ret is preceded by a vzeroupper instruction. Apparently, that instruction is emitted to avoid the slowdowns that occur when mixing AVX with SSE code while the upper register halves have not been zeroed after the last AVX instruction and before the next SSE one.
See also the Introduction to Intel Advanced Vector Extensions, Section Instruction Set Overview:
Another performance concern besides unaligned data issues is that mixing legacy XMM-only instructions and newer Intel AVX instructions causes delays, so minimize transitions between VEX-encoded instructions and legacy Intel SSE code. Said another way, do not mix VEX-prefixed instructions and non-VEX-prefixed instructions for optimal throughput. If you must do so, minimize transitions between the two by grouping instructions of the same VEX/non-VEX class. Alternatively, there is no transition penalty if the upper YMM bits are set to zero via VZEROUPPER or VZEROALL, which compilers should automatically insert. This insertion requires an extra instruction, so profiling is recommended.
This can be disabled with the -mno-vzeroupper option, and doing so improves the runtime a tiny bit on the Fedora 23 Core i7 system, but significantly increases it on a CentOS 7 Intel Core i5 system (i.e. the no-vzeroupper version is 1.6 times slower, cf. Section Measurements).
Also, the vzeroupper instructions aren't emitted for every architecture, e.g. they aren't when compiling with -march=knl and GCC 6.2.
Update (2016-09-28): When compiling the intrinsic SIMD version in a separate translation unit, without any link-time optimization (LTO), the reinterpret_cast shouldn't cause any strict-aliasing related issues, since the aliased pointer is only used for reading and not for writing. A look at the assembler output confirms this understanding.
Of course, the reinterpret_cast can be amended with a may_alias attribute (when compiling with GCC). It can even be eliminated by issuing a memcpy that GCC and clang are able to optimize away (cf. find_avx2_memcpy_impl.cc):
__m256i x;
memcpy(&x, i, sizeof(__m256i));
__m256i r = _mm256_cmpeq_epi8(x, q);
Which just yields:
vpcmpeqb -32(%r8), %ymm1, %ymm0
Similar to the previous version, but there is no vlddqu instruction; instead, a different addressing mode is used in the first operand.
Another way to eliminate the reinterpret_cast is to implement the function outside of C/C++ without any intrinsics, i.e. as a pure-assembler implementation.
How fast is just SSE?¶
In contrast to AVX, the registers of the SSE SIMD extension are 'only' 128 bits wide. Otherwise, the code doesn't change much:
#include <immintrin.h>
const char *find_sse(const char *b, const char *e, char c)
{
    const char *i = b;
    __m128i q = _mm_set1_epi8(c);
    for (; i+16 < e; i+=16) {
        __m128i x = _mm_lddqu_si128(
                reinterpret_cast<const __m128i*>(i));
        __m128i r = _mm_cmpeq_epi8(x, q);
        int z = _mm_movemask_epi8(r);
        if (z)
            return i + __builtin_ffs(z) - 1;
    }
    for (; i < e; ++i)
        if (*i == c)
            return i;
    return e;
}
Its assembler output is not much different from the previous one, basically the SIMD register names just change from ymmX to xmmX. And, since there is no possibility of mixing AVX with SSE, GCC doesn't emit any vzeroupper instructions.
The median runtime increases a little bit to 31 seconds, thus it is similar to the runtime of memchr().
What about AVX-512?¶
Unfortunately, normal Skylake CPUs don't include AVX-512; it is only available in server variants and later CPU generations. With AVX-512, not only is the vector register size doubled to 512 bits, but, perhaps more importantly, there are new instructions that compare and compute the mask at once:
#include <immintrin.h>
const char *find_avx512(const char *b, const char *e, char c)
{
    const char *i = b;
    __m256i q = _mm256_set1_epi8(c);
    for (; i+32 < e; i+=32) {
        __m256i x = _mm256_lddqu_si256(
                reinterpret_cast<const __m256i*>(i));
        __mmask32 z = _mm256_cmpeq_epi8_mask(x, q);
        if (z)
            return i + __builtin_ffs(z) - 1;
    }
    for (; i < e; ++i)
        if (*i == c)
            return i;
    return e;
}
(compile with e.g. -march=skylake-avx512 - GCC explorer)
Glibc memchr()¶
When compiling with GCC (g++) against glibc with optimizations turned on, the /usr/include/string.h header defines memchr() like this:
__extern_always_inline const void *
memchr (const void *__s, int __c, size_t __n) __THROW
{
  return __builtin_memchr (__s, __c, __n);
}
The __builtin_memchr() is a GCC builtin. This enables the compiler to apply memchr()-specific optimizations like folding. For example, if all arguments are known at compile time, the compiler could evaluate the function at compile time and just emit the result as a constant. Or some variants could be replaced with special code, e.g. if the length is a constant zero.
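For illustration, a small (made-up) example where GCC can fold the call because all arguments are compile-time constants:
#include <cstring>
#include <cstdio>

int main()
{
    static const char rec[] = "key|value";
    // all arguments are compile-time constants, so GCC can fold the
    // __builtin_memchr() call and emit the offset as a constant
    const void *p = std::memchr(rec, '|', sizeof(rec) - 1);
    std::printf("%td\n", static_cast<const char*>(p) - rec);  // prints 3
    return 0;
}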
On the Fedora 23 system, the __builtin_memchr() ultimately yields a shared library symbol lookup for memchr, which gets resolved to sysdeps/x86_64/memchr.S. This is an SSE version (i.e. one that works with 128 bit registers) implemented in ~300 lines of assembler. In the main loop, 4 comparisons are done in a row, i.e. each loop iteration processes 64 bytes in a chunked fashion. There is a prologue dealing with an unaligned prefix, whereas the main loop proceeds in aligned increments.
The main loop basically does 4 vector comparisons in a row and then combines the result vectors via byte-wise maximum operations:
movdqa (%rdi), %xmm0
movdqa 16(%rdi), %xmm2
movdqa 32(%rdi), %xmm3
movdqa 48(%rdi), %xmm4
pcmpeqb %xmm1, %xmm0
pcmpeqb %xmm1, %xmm2
pcmpeqb %xmm1, %xmm3
pcmpeqb %xmm1, %xmm4
pmaxub %xmm0, %xmm3
pmaxub %xmm2, %xmm4
pmaxub %xmm3, %xmm4
pmovmskb %xmm4, %eax
add $64, %rdi
test %eax, %eax
jz L(align64_loop)
sub $64, %rdi
# .. if there is a match further masking/comparing ..
Glibc also comes with various system-specific optimized versions for other architectures like i386, PowerPC, MIPS, SPARC, m68k, etc.
On Fedora, the source can be found under /usr/src/debug/glibc-2.22/sysdeps/x86_64/memchr.S if the glibc debug package was installed via e.g. dnf debuginfo-install glibc-... or in the glibc source package, e.g.:
dnf download --source glibc
rpm2cpio glibc-2.22-18.fc23.src.rpm | cpio --extract --make-directories
tar xf glibc-2.22.tar.gz
cd glibc-2.22
vim sysdeps/x86_64/memchr.S
To verify which implementation actually ends up being used, setting a breakpoint with gdb and stepping instructions (i.e. stepi, or si for short) is useful. Of course, this is more convenient if the glibc debug package is installed.
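For example, a session along the following lines (using the find_memchr binary from the benchmarks; details vary with the gdb and glibc versions) shows where the call ends up:
gdb ./find_memchr
(gdb) break memchr
(gdb) run 1 in
(gdb) stepi
(gdb) disassemble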
Controlled Buffer Overflow¶
The AVX find version is fast because it searches in 32 byte increments. Its weak point is the tail of the search range, i.e. when the range size is not a multiple of 32. In the worst case the 2nd loop has to iterate over 31 remaining bytes. If we could guarantee that the range size modulo 32 is always zero, then the 2nd loop could be eliminated. We can't - but the next best thing is to just read some additional bytes past the end and ignore any matches in them.
For that, the contract with the caller has to be changed such that the end of the range isn't also the end of the allocated buffer. Otherwise there is the risk of reading into a guard page. The change to our main function is minimal:
enum { N = 128 * 1024 };
char buffer[N+31];
size_t off = 0;
for (;;) {
    auto r = x::read(fd, buffer, N);
    const char *b = buffer;
    auto e = b + r;
    for (;;) {
        auto x = g::find(b, e, '\n');
        auto y = g::find(b, x, '|');
The minimal read buffer is just extended by 31 bytes.
The AVX find function can thus be simplified to:
#include <immintrin.h>
const char *find_avx2_overflow(const char *b, const char *e, char c)
{
    const char *i = b;
    __m256i q = _mm256_set1_epi8(c);
    for (; i < e; i+=32) {
        __m256i x = _mm256_lddqu_si256(
                reinterpret_cast<const __m256i*>(i));
        __m256i r = _mm256_cmpeq_epi8(x, q);
        int z = _mm256_movemask_epi8(r);
        if (z) {
            const char *r = i + __builtin_ffs(z) - 1;
            return r < e ? r : e;
        }
    }
    return e;
}
For the test data and the main test machine the change in median runtime is only marginal, less than 1 second or so.
Do more faster?¶
Since it isn't always practical to allocate a larger buffer, another promising variant is to eliminate the second loop by doing one overlapping SIMD comparison for the remaining bytes. On the other hand, in the worst case 31 bytes are compared twice.
Our example thus changes to:
const char *find_avx2_more(const char *b, const char *e, char c)
{
    const char *i = b;
    __m256i q = _mm256_set1_epi8(c);
    for (; i+32 < e; i+=32) {
        __m256i x = _mm256_lddqu_si256(
                reinterpret_cast<const __m256i*>(i));
        __m256i r = _mm256_cmpeq_epi8(x, q);
        int z = _mm256_movemask_epi8(r);
        if (z)
            return i + __builtin_ffs(z) - 1;
    }
    if (i < e) {
        if (e-b < 32) {
            for (; i < e; ++i)
                if (*i == c)
                    return i;
        } else {
            i = e-32;
            __m256i x = _mm256_lddqu_si256(
                    reinterpret_cast<const __m256i*>(i));
            __m256i r = _mm256_cmpeq_epi8(x, q);
            int z = _mm256_movemask_epi8(r);
            if (z)
                return i + __builtin_ffs(z) - 1;
        }
    }
    return e;
}
With that, the median runtime basically stays the same in comparison with the overflow version.
Alignment¶
The Intel documentation contains hints that although alignment requirements are relaxed with AVX2, some instructions might perform faster when accessing aligned data. For example, an indirect hint in the documentation of _mm256_lddqu_si256:
This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary.
A 32 byte datum can only cross a cache line boundary when it is not aligned.
And more specifically, the Introduction to Intel Advanced Vector Extensions (Section Instruction Set Overview):
Intel® AVX has relaxed some memory alignment requirements, so now Intel AVX by default allows unaligned access; however, this access may come at a performance slowdown, so the old rule of designing your data to be memory aligned is still good practice (16-byte aligned for 128-bit access and 32-byte aligned for 256-bit access). The main exceptions are the VEX-extended versions of the SSE instructions that explicitly required memory-aligned data: These instructions still require aligned data. Other specific instructions requiring aligned access are listed in Table 2.4 of the Intel® Advanced Vector Extensions Programming Reference (see "For More Information" for a link).
Looking at our example, the records are naturally unaligned - due to their variable length. However, we can do the following: access the first 32 bytes unaligned and then do an increment of less than 32 bytes such that all following loads are aligned. In the worst case this yields an overlap of 31 bytes - but it still might pay off.
Since we are using 256 bit registers in the AVX example, the address should be aligned to 32 bytes. An address is 32-byte aligned (i.e. it is a multiple of 32) if its 5 least significant bits are zero (since log2(32)=5).
Thus, our example changes to:
#include <immintrin.h>
#include <stdint.h>
const char *find_avx2_align(const char *b, const char *e, char c)
{
    const char *i = b;
    __m256i q = _mm256_set1_epi8(c);
    __m256i x = _mm256_lddqu_si256(
            reinterpret_cast<const __m256i*>(i));
    __m256i r = _mm256_cmpeq_epi8(x, q);
    int z = _mm256_movemask_epi8(r);
    if (z) {
        const char *r = i + __builtin_ffs(z) - 1;
        return r < e ? r : e;
    }
    const void *a = b + 32;
    uintptr_t ai = reinterpret_cast<uintptr_t>(a);
    ai &= ~uintptr_t(0b11111u);
    a = reinterpret_cast<const void*>(ai);
    i = static_cast<const char*>(a);
    for (; i < e; i+=32) {
        __m256i x = _mm256_lddqu_si256(
                reinterpret_cast<const __m256i*>(i));
        __m256i r = _mm256_cmpeq_epi8(x, q);
        int z = _mm256_movemask_epi8(r);
        if (z) {
            const char *r = i + __builtin_ffs(z) - 1;
            return r < e ? r : e;
        }
    }
    return e;
}
It turns out that this doesn't improve the runtime much (in comparison with the first AVX version).
With that version we can also replace _mm256_lddqu_si256 with _mm256_load_si256 in the main loop (which does require alignment), but - at least on this CPU - this doesn't change the runtime much.
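I.e. the loop body changes to something like:
// in the aligned main loop, _mm256_load_si256 (which requires 32 byte
// alignment) can replace the lddqu load
__m256i x = _mm256_load_si256(reinterpret_cast<const __m256i*>(i));
__m256i r = _mm256_cmpeq_epi8(x, q);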
Inlining¶
Putting an AVX find version into the same translation unit (or into a header) and adding static inline to its definition allows the compiler to inline it into the main read loop such that function call and return overhead is eliminated. With find_avx2_overflow there is no runtime improvement, though. In fact, it runs half a second or so longer.
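For reference, the setup is just something like the following in a header (the header name is made up) that is included by the translation unit with the read loop:
// find.hh (hypothetical name) - same body as find_avx2_overflow above,
// just marked static inline so the compiler can inline it at the call site
#include <immintrin.h>

static inline const char *find_avx2_overflow(const char *b, const char *e,
                                             char c)
{
    const char *i = b;
    __m256i q = _mm256_set1_epi8(c);
    for (; i < e; i += 32) {
        __m256i x = _mm256_lddqu_si256(reinterpret_cast<const __m256i*>(i));
        int z = _mm256_movemask_epi8(_mm256_cmpeq_epi8(x, q));
        if (z) {
            const char *r = i + __builtin_ffs(z) - 1;
            return r < e ? r : e;
        }
    }
    return e;
}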
Other C libraries¶
Dietlibc¶
The dietlibc (0.33-8), a libc optimized for static linking, also comes with some architecture-specific memchr() versions. The x86-64 version is written in assembler and uses SSE, i.e. 128 bit registers. In contrast to glibc, the main loop just contains one vector comparison instruction instead of four (dietlibc/x86_64/memchr.S):
pshufd $0,%xmm0,%xmm0 /* xmm0 = ch x 16 */
1:
movdqa (%rdi),%xmm1
pcmpeqb %xmm0,%xmm1
pmovmskb %xmm1,%ecx
and %eax,%ecx
jnz .Lfound /* found something */
lea 16(%rdi),%rdi
or $-1,%eax
cmp %rsi,%rdi
jb 1b
jmp .Lnull
.Lfound:
Its median runtime is similar to the glibc one, i.e. 32 seconds.
uClibc¶
uClibc (0.9.33.2), a C library optimized for embedded systems (i.e. for space), comes with some architecture-specific memchr implementations, but not for x86-64. Its generic version in libc/string/generic/memchr.c originates from glibc and implements chunked looping (without vector instructions). The chunks are of size sizeof(size_t) and this is also the alignment the prologue establishes. With comments, this implementation has about 170 lines. Its median runtime is a few seconds worse than the one of the naive implementation, i.e. 54 seconds vs. 50 seconds.
It also comes with an Intel 386 specific memchr implementation that uses the REPNE prefix with the SCASB instruction (libc/string/i386/memchr.c). The mnemonics stand for repeat-while-not-equal and scan-string-bytewise. This instruction pair basically replaces a whole loop, which is very CISC and seems to come from a time of very limited compilers. It is still available on x86-64 in 64 bit mode. A 64 bit version with REPNE SCASB has suboptimal runtime, though - its runtime is 71 seconds, i.e. 21 seconds longer than the naive implementation.
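A sketch of what such a 64 bit REPNE SCASB based find could look like with GCC extended inline assembler (the benchmarked find_uclibcx86 variant may be written differently):
// sketch of a REPNE SCASB based find with GCC extended inline assembler
const char *find_scasb(const char *b, const char *e, char c)
{
    size_t n = e - b;
    if (!n)
        return e;
    const char *p = b;
    asm("repne scasb"
        : "+D" (p), "+c" (n)   // rdi: cursor, rcx: remaining byte count
        : "a" (c)              // al: the byte to search for
        : "cc", "memory");
    // scasb advances the cursor past the examined byte, thus the byte before
    // the final cursor equals c exactly if a match was found
    return p[-1] == c ? p - 1 : e;
}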
The Intel 64 and IA-32 Architectures Optimization Reference
Manual (page 3-68, June 2016) also recommends against its use:
Using a REP prefix with string move instructions can provide high performance in the situations described above. However, using a REP prefix with string scan instructions (SCASB, SCASW, SCASD, SCASQ) or compare instructions (CMPSB, CMPSW, SMPSD, SMPSQ) is not recommended for high performance. Consider using SIMD instructions instead.
Apparently, it is just there for symmetry with the existing non-64 Bit instruction set.
Musl¶
Musl, a robust C library that strives for standard conformance,
also doesn't include a x86-64 specific memchr()
verison (as of
2016-09-10). But its generic version is more
concise:
#include <string.h>
#include <stdint.h>
#include <limits.h>
#define SS (sizeof(size_t))
#define ALIGN (sizeof(size_t)-1)
#define ONES ((size_t)-1/UCHAR_MAX)
#define HIGHS (ONES * (UCHAR_MAX/2+1))
#define HASZERO(x) ((x)-ONES & ~(x) & HIGHS)
void *memchr(const void *src, int c, size_t n)
{
    const unsigned char *s = src;
    c = (unsigned char)c;
    for (; ((uintptr_t)s & ALIGN) && n && *s != c; s++, n--);
    if (n && *s != c) {
        const size_t *w;
        size_t k = ONES * c;
        for (w = (const void *)s; n>=SS && !HASZERO(*w^k); w++, n-=SS);
        for (s = (const void *)w; n && *s != c; s++, n--);
    }
    return n ? (void *)s : 0;
}
Again, this version implements chunked looping with size_t chunks. Thus, there is a prologue that takes care of alignment issues and an epilogue that deals with trailing bytes. All the necessary extra bit operations and tests certainly look expensive. And indeed, the runtime is 4 seconds or so above the one of the naive implementation, i.e. 54 seconds in total.
Update (2016-09-20): related discussion on the musl mailing list
Measurements¶
The runtime results for the Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz machine (20 repetitions à 300000 records; Fedora 23, GCC 4.9), in seconds:
exe | min | median | mean | max | sdev | speedup |
---|---|---|---|---|---|---|
find_avx2_overflow_ext | 28.33 | 28.53 | 28.69 | 30.07 | 0.42 | 1.74 |
find_avx2_more | 28.63 | 28.79 | 29.07 | 30.88 | 0.55 | 1.72 |
find_avx2_loop | 28.82 | 29.05 | 29.23 | 30.42 | 0.47 | 1.71 |
find_avx2_align2 | 28.98 | 29.16 | 29.70 | 32.24 | 1.02 | 1.70 |
find_avx2_overflow | 29.08 | 29.18 | 29.36 | 30.39 | 0.35 | 1.70 |
find_avx2_align | 29.05 | 29.26 | 29.43 | 30.45 | 0.42 | 1.70 |
find_avx2_nozero | 28.93 | 29.30 | 29.53 | 30.73 | 0.58 | 1.69 |
find_avx2 | 29.08 | 29.45 | 29.56 | 30.69 | 0.54 | 1.69 |
find_avx2_memcpy | 29.27 | 29.52 | 29.65 | 30.39 | 0.34 | 1.68 |
find_sse | 30.56 | 30.86 | 31.13 | 32.43 | 0.59 | 1.61 |
find_diet | 31.66 | 31.81 | 31.99 | 33.35 | 0.47 | 1.56 |
find_memchr | 31.63 | 31.95 | 32.12 | 33.73 | 0.54 | 1.55 |
find_unroll2 | 45.20 | 45.68 | 45.80 | 46.75 | 0.43 | 1.09 |
find_find | 46.48 | 46.87 | 47.39 | 50.45 | 1.04 | 1.06 |
find_naive | 49.41 | 49.65 | 50.00 | 51.91 | 0.69 | 1.00 |
find_uclibc | 53.89 | 54.24 | 54.64 | 57.76 | 0.98 | 0.92 |
find_musl | 53.70 | 54.30 | 54.64 | 56.38 | 0.82 | 0.91 |
find_unroll | 53.76 | 54.50 | 54.74 | 56.64 | 0.75 | 0.91 |
find_uclibcx86 | 70.76 | 71.11 | 71.43 | 73.12 | 0.78 | 0.70 |
The runtime results for an Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz machine (20 repetitions à 300000 records; CentOS 7, GCC 4.8), in seconds:
exe | min | median | mean | max | sdev | speedup |
---|---|---|---|---|---|---|
find_avx2_overflow | 43.77 | 43.82 | 43.92 | 44.66 | 0.26 | 1.72 |
find_avx2_overflow_ext | 45.49 | 45.55 | 45.57 | 45.74 | 0.07 | 1.66 |
find_avx2_more | 46.50 | 46.62 | 46.65 | 47.06 | 0.13 | 1.62 |
find_avx2_align2 | 46.67 | 46.76 | 46.78 | 47.14 | 0.09 | 1.61 |
find_avx2_align | 46.70 | 46.76 | 46.76 | 46.89 | 0.04 | 1.61 |
find_avx2 | 47.03 | 47.09 | 47.11 | 47.25 | 0.06 | 1.60 |
find_avx2_loop | 48.20 | 48.33 | 48.37 | 48.69 | 0.11 | 1.56 |
find_sse | 49.21 | 49.52 | 49.50 | 49.85 | 0.16 | 1.52 |
find_memchr | 50.56 | 50.76 | 50.78 | 50.99 | 0.08 | 1.49 |
find_diet | 51.19 | 51.41 | 51.44 | 51.98 | 0.17 | 1.47 |
find_find | 69.03 | 69.21 | 69.35 | 70.32 | 0.36 | 1.09 |
find_avx2_nozero | 75.29 | 75.38 | 75.52 | 76.47 | 0.34 | 1.00 |
find_naive | 75.27 | 75.41 | 75.42 | 75.56 | 0.09 | 1.00 |
find_musl | 75.92 | 76.05 | 76.07 | 76.23 | 0.08 | 0.99 |
find_uclibc | 79.37 | 79.56 | 79.60 | 80.17 | 0.19 | 0.95 |
find_uclibcx86 | 101.6 | 102.3 | 102.3 | 102.5 | 0.16 | 0.74 |
The runtime results for an AMD Phenom(tm) 9750 Quad-Core Processor (@ 2.40 GHz, no AVX) machine (20 repetitions à 300000 records; CentOS 7, GCC 4.8), in seconds:
exe | min | median | mean | max | sdev | speedup |
---|---|---|---|---|---|---|
find_sse | 87.39 | 91.48 | 91.48 | 95.47 | 2.51 | 1.90 |
find_diet | 90.95 | 92.55 | 93.02 | 98.61 | 2.13 | 1.88 |
find_memchr | 89.31 | 92.80 | 92.92 | 101.22 | 2.87 | 1.88 |
find_find | 116.3 | 119.2 | 119.8 | 126.5 | 2.77 | 1.46 |
find_uclibc | 118.4 | 120.0 | 121.2 | 127.2 | 2.53 | 1.45 |
find_musl | 118.1 | 120.7 | 120.9 | 125.0 | 1.71 | 1.44 |
find_uclibcx86 | 125.7 | 129.0 | 129.0 | 136.4 | 2.46 | 1.35 |
find_naive | 170.9 | 174.2 | 174.2 | 178.2 | 2.01 | 1.00 |
Note that, in contrast to the more recent Intel CPUs above, this is a CPU where the chunked versions are faster than the naive one. Even the REPNE SCASB version is faster than the naive one, i.e. it doesn't have a runtime penalty on this CPU. Also, the more recent i5-4250U delivers considerably more performance at a roughly 1 GHz lower clock rate - the naive version runs 2.3 times or so as fast on that CPU - which makes it significantly more energy efficient.
Update (2016-09-18): results for SPARC and PPC
Conclusion¶
A good default choice for std::find() is to just call memchr(), since it is highly optimized on many systems (e.g. with Glibc on x86 and other architectures).
On x86, and in case the libc memchr() doesn't use AVX, it pays off to switch to an AVX version. Some tricks like using a buffer-overflowing version or inlining only bring marginal gains.
It makes sense to change the GNU STL and add specializations of std::find() for char and wchar_t. The specialization for char could directly call __builtin_memchr() (the builtin, to save an include) or even std::char_traits<char>::find(), because that specialization already calls __builtin_memchr().
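Such a char overload could look roughly like this (a sketch in the spirit of the suggested change, not actual libstdc++ code):
// sketch of a possible char overload - not actual libstdc++ code
#include <string>

inline const char *find(const char *first, const char *last, char val)
{
    // std::char_traits<char>::find() already forwards to __builtin_memchr()
    const char *r = std::char_traits<char>::find(first, last - first, val);
    return r ? r : last;
}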
Another thing to look into is which CPU-feature-dependent runtime dispatching mechanisms glibc provides such that an AVX2 or even AVX-512 version is automatically selected at runtime (if the CPU supports the extension).
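One mechanism in that direction is GCC's function multi-versioning via the target_clones attribute, which builds on glibc's IFUNC support (a sketch; it requires a newer GCC than the one used above, roughly GCC 6 or later):
// sketch: one clone per listed target is compiled and the dynamic linker
// selects a version at startup based on the CPU's capabilities
__attribute__((target_clones("avx2", "sse4.2", "default")))
const char *find_mv(const char *b, const char *e, char c)
{
    for (const char *i = b; i != e; ++i)
        if (*i == c)
            return i;
    return e;
}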
When dealing with an AVX-512 capable CPU, it makes sense to not only test with 512 bit registers but also to test the new compare instructions that directly return the mask (e.g. _mm256_cmpeq_epi8_mask()), thus saving one vector instruction in comparison to previous Intel SIMD variants.
Modern CPUs are even relatively forgiving of a naive implementation, while classic manual unrolling as in the GNU STL's std::find doesn't help much and some chunking techniques may even yield worse runtimes.
As always, do your own performance tests: CPU architectures change, and techniques that might have been beneficial on one architecture generation (e.g. aligned accesses) may not be anymore, or may even be counterproductive (e.g. excessive chunking). Of course, all this also depends on the range size and the character distribution of the search input, i.e. on how many characters an implementation has to search through to find a match, on average.