std::find() and memchr() Optimizations

Assume you need to parse a record-based format with flexible width and one-byte delimiters. When using C++, the std::find() STL algorithm is the obvious choice for efficiently locating the delimiters. The idiomatic C solution is to use memchr().

Expectations

As with other STL algorithms, one expects that std::find() is highly optimized and specialized for different arguments. For example, the minimum requirement for the range is input iterators, but if we use - say - random access iterators for searching a range of char elements, we expect a specialized version - for example, one that uses SIMD (single instruction, multiple data) vector instructions where available.

Similarly, one expects that the libc implementation of memchr() does runtime dispatching based on the available CPU extensions (and perhaps the search range size) and uses a specialized version - again, one that uses SIMD instructions, if available.

In fact, std::find() could even just call memchr() for a random access character range.

It is also possible that the compiler directly inlines memchr() calls with an optimized implementation.

Let's see which of those expectations are actually met.

Sample Code

This example is a minimal version of a real parser that does enough to see differences in implementation details of a character search function:

int main(int argc, char **argv)
{
  if (argc < 3)
    return 1;
  unsigned n = atoi(argv[1]);
  int fd = x::open(argv[2], O_RDONLY);
  enum { N = 128 * 1024 };
  char buffer[N+64];
  for (unsigned i = 0; i < n; ++i) {
    size_t off = 0;
    if (i)
      x::lseek(fd, 0, SEEK_SET);
    for (;;) {
      auto r = x::read(fd, buffer, N);
      const char *b = buffer;
      auto e = b + r;
      for (;;) {
        auto x = g::find(b, e, '\n');
        auto y = g::find(b, x, '|');
        off += y - b;
        if (y != x)
          printf("%zu\n", off);
        if (x == e)
          break;
        off = 0;
        b = x + 1;
      }
      if (r != N)
        break;
    }
  }
  return 0;
}

The x namespace contains wrappers for some POSIX functions that throw exceptions on error. The function g::find() wraps different versions of std::find().
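For illustration, the x::open() wrapper might look roughly like this (a sketch of the idea, not the exact code from the repository):

#include <fcntl.h>
#include <cerrno>
#include <system_error>

namespace x {
  // open() wrapper that throws on error instead of returning -1
  inline int open(const char *pathname, int flags)
  {
    int fd = ::open(pathname, flags);
    if (fd == -1)
      throw std::system_error(errno, std::generic_category(), "open");
    return fd;
  }
}

The different g::find() variants are the subject of the following sections.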

A read size of 128 KiB is used because it is a multiple of the common 4 KiB and 16 KiB page sizes and usually gives good throughput on modern systems (cf. e.g. the comments in the GNU coreutils implementation of cp).

The buffer is 64 bytes bigger than strictly necessary to allow for implementations that read over the right boundary (i.e. up to 64 bytes).

The outermost loop is there for benchmarking purposes, i.e. to simulate large files with many records.

In the example input format, records are delimited by a newline and an important prefix is delimited by a pipe character. The simplified purpose is to print the prefix length of each record.
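For illustration, two hypothetical records could look like this (not part of the benchmark input):

foo bar|some suffix text
second record prefix|rest

For these two records the program would print 7 and 20, i.e. the length of each prefix up to the pipe character.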

Basically, its functionality is equivalent to:

for i in $(seq $n); do awk -F'|' '{print length($1)}' $input; done

Or this Python 3 program:

import sys

n = int(sys.argv[1])
for i in range(n):
  with open(sys.argv[2], 'r') as f:
    for line in f:
      print(line.find('|'))

The test program and other material are contained in a Git repository.

Methods

We generate some sample records via dumping a dictionary and reformatting it a little bit:

aspell -d en dump master | aspell -l en expand \
  | paste '-d,,,,|,,'  - - - - - - - - > in

On a Fedora 23 system, this yields a file with 14972 records (i.e. lines) where the average prefix and suffix sizes are 47 and 27 characters.

Unless noted otherwise:

  • The test input is generated on a Fedora 23 system as described above and it is processed with 30000 iterations.
  • Throughout the text, the rounded median runtimes on an Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz (6th generation Core i7) are reported. This CPU notably comes with the AVX2 SIMD extension, but not with the AVX-512 extension. More details and runtime results for other systems are available in the Section Measurements.
  • The main test system runs Fedora 23 (64 Bit).
  • All examples are compiled with GCC 5.3.1 (Fedora 23) with the options -O3 -march=native.

To get an upper bound, the following GNU awk call finishes in 60 seconds (on the described system):

time for i in {1..3000}; do cat in ; done \
  | awk -F'|' '{print length($1)}' > /dev/null

While the Python version finishes in 41 seconds:

time ./find.py 3000 in > /dev/null

Note that the following benchmarks use 10 times more iterations, because the test programs run much faster.

The test program, the used input, the different std::find and memchr() versions and the benchmark scripts are available in a Git repository.

How fast is std::find()?

The straightforward, idiomatic C++ definition of g::find() just calls std::find():

#include <algorithm>

namespace g {
  const char *find(const char *b, const char *e, char c)
  {
    return std::find(b, e, c);
  }
}

That version has a median runtime of 46 seconds with our 30000 times 14972 records test input. This is 9 times or so faster than the Awk and Python implementations.

Does std::find() call memchr()?

The easiest way to find out is to let GCC emit the assembler output like this:

g++ -O3 -march=native find_find_impl.cc -S -o - | grep -v '^'$'\t''\.'

Which produces:

g::find(char const*, char const*, char):
        movq        %rsi, %r9
        movl        %edx, %ecx
        subq        %rdi, %r9
        movq        %r9, %r8
        sarq        $2, %r8
        testq        %r8, %r8
        jle        .L14
        cmpb        (%rdi), %dl
        je        .L15
        cmpb        1(%rdi), %dl
        je        .L31
        cmpb        2(%rdi), %dl
        je        .L32
        cmpb        3(%rdi), %dl
        je        .L33
        leaq        (%rdi,%r8,4), %r8
        jmp        .L9
.L10:
        cmpb        4(%rdi), %cl
        je        .L29
        cmpb        1(%rax), %cl
        je        .L34
        cmpb        2(%rax), %cl
        je        .L35
        cmpb        3(%rax), %cl
        je        .L36
        movq        %rax, %rdi
.L9:
        leaq        4(%rdi), %rax
        cmpq        %r8, %rax
        jne        .L10
        movq        %rsi, %r9
        subq        %r8, %r9
.L2:
        cmpq        $2, %r9
        je        .L11
        cmpq        $3, %r9
        je        .L12
        cmpq        $1, %r9
        je        .L13
        movq        %rsi, %rax
.L29:
        rep ret
.L34:
        leaq        5(%rdi), %rax
        ret
.L35:
        leaq        6(%rdi), %rax
        ret
.L36:
        leaq        7(%rdi), %rax
        ret
.L12:
        cmpb        (%rax), %dl
        je        .L29
        addq        $1, %rax
.L11:
        cmpb        (%rax), %dl
        je        .L29
        addq        $1, %rax
.L13:
        cmpb        %dl, (%rax)
        cmovne        %rsi, %rax
        ret
.L15:
        movq        %rdi, %rax
        ret
.L14:
        movq        %rdi, %rax
        jmp        .L2
.L33:
        leaq        3(%rdi), %rax
        ret
.L32:
        leaq        2(%rdi), %rax
        ret
.L31:
        leaq        1(%rdi), %rax
        ret

Alternatively, one can also copy and paste the C++ code into the GCC explorer and look at the assembler output. This is also convenient for comparing different compilers.

So the generated code does not simply call memchr().

This also shows:

  • the function has 8 return instructions, which hints at loop unrolling
  • the loop body starting at .L10 contains 4 similar looking compare instructions - again a sign of an unrolled loop
  • no SIMD instructions are used

Who unrolled the loop?

A look at the GNU STL source code in /usr/include/c++/5.3.1/bits/stl_algo.h gives us the answer:

template<typename _InputIterator, typename _Tp>
  inline _InputIterator
  find(_InputIterator __first, _InputIterator __last,
       const _Tp& __val)
  {
    // concept requirements
    __glibcxx_function_requires(_InputIteratorConcept<_InputIterator>)
    __glibcxx_function_requires(_EqualOpConcept<
              typename iterator_traits<_InputIterator>::value_type, _Tp>)
    __glibcxx_requires_valid_range(__first, __last);
    return std::__find_if(__first, __last,
                          __gnu_cxx::__ops::__iter_equals_val(__val));
  }

Which calls a generic __find_if() implementation:

template<typename _Iterator, typename _Predicate>
  inline _Iterator
  __find_if(_Iterator __first, _Iterator __last, _Predicate __pred)
  {
    return __find_if(__first, __last, __pred,
                     std::__iterator_category(__first));
  }

Which calls a version specialized for random access iterators:

template<typename _RandomAccessIterator, typename _Predicate>
  _RandomAccessIterator
  __find_if(_RandomAccessIterator __first, _RandomAccessIterator __last,
            _Predicate __pred, random_access_iterator_tag)
  {
    typename iterator_traits<_RandomAccessIterator>::difference_type
      __trip_count = (__last - __first) >> 2;

    for (; __trip_count > 0; --__trip_count)
      {
        if (__pred(__first))
          return __first;
        ++__first;

        if (__pred(__first))
          return __first;
        ++__first;

        if (__pred(__first))
          return __first;
        ++__first;

        if (__pred(__first))
          return __first;
        ++__first;
      }

    switch (__last - __first)
      {
      case 3:
        if (__pred(__first))
          return __first;
        ++__first;
      case 2:
        if (__pred(__first))
          return __first;
        ++__first;
      case 1:
        if (__pred(__first))
          return __first;
        ++__first;
      case 0:
      default:
        return __last;
      }
  }

Indeed, the loop is manually unrolled. The 4 compare instructions are a direct result of having 4 if statements in the loop body. The 8 return statements are a consequence of that and the fall-through switch-case statement that deals with the last characters of the range.

This optimization is promising because on a superscalar architecture, multiple comparisons might be executed (speculatively) in parallel.

Update (2016-09-20): Interestingly, the LLVM libc++ doesn't implement any specializations for std::find(). There is just one implementation that uses a simple loop, i.e. no specialization for random access iterators.

Update (2016-09-28): When compiling the naive implementation with GCC and -funroll-loops (or via __attribute__((optimize("unroll-loops"))) - cf. find_unroll_impl.cc), GCC automatically unrolls the loop 8 times. This unroll factor turns out to be sub-optimal on the Intel i7, i.e. this version is 9 percent slower than the naive one.

Taking this result into account, one may question whether an unroll factor of 4 is really the best choice. For this benchmark, on the Intel i7, the sweet spot is indeed an unroll factor of 3 (cf. find_unroll2_impl.cc). This version runs 1 second or so faster than the 4 times unrolled find_find version.
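Such a 3 times unrolled variant might look roughly like this (a sketch; the actual find_unroll2_impl.cc in the repository may differ in details):

#include <stddef.h>

const char *find_unroll3(const char *b, const char *e, char c)
{
  const char *i = b;
  // each trip checks 3 characters, mirroring the GNU STL scheme
  for (ptrdiff_t trip_count = (e - b) / 3; trip_count > 0; --trip_count) {
    if (i[0] == c) return i;
    if (i[1] == c) return i + 1;
    if (i[2] == c) return i + 2;
    i += 3;
  }
  // at most 2 characters are left over
  for (; i != e; ++i)
    if (*i == c)
      return i;
  return e;
}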

Doesn't a compiler optimize a naive loop on its own?

Modern compilers implement sophisticated loop optimizations, including auto-vectorization; thus, perhaps a naive loop already yields good code.

Example implementation:

const char *find_naive(const char *b, const char *e, char c)
{
  for (const char *i = b; i != e; ++i)
    if (*i == c)
      return i;
  return e;
}

With that GCC yields:

find_naive(char const*, char const*, char):
        cmpq    %rsi, %rdi
        movl    %edx, %eax
        jnb     .L6
        cmpb    (%rdi), %dl
        jne     .L4
        jmp     .L8
.L5:
        cmpb    %al, (%rdi)
        je      .L8
.L4:
        addq    $1, %rdi
        cmpq    %rdi, %rsi
        jne     .L5
.L6:
        movq    %rsi, %rax
        ret
.L8:
        movq    %rdi, %rax
        ret

(See also GCC explorer.)

Which is a pretty straightforward translation of the loop. There is no auto-vectorization and no loop unrolling going on.

Adding the compiler switch -fopt-info-vec-all yields diagnostics of the different auto-vectorization phases:

note: not vectorized: control flow in loop.
note: bad loop form.
[..]
note: not vectorized: not enough data-refs in basic block.
[..]

But how bad is such a naive loop?

Its median runtime is 50 seconds - just a few seconds more than the 47 seconds of std::find(), which implements loop unrolling.

Is std::find() faster than memchr()?

The following definition

#include <string.h>

const char *find_memchr(const char *b, const char *e, char c)
{
  const char *r = static_cast<const char*>(memchr(b, c, e-b));
  return r ? r : e;
}

yields a median runtime of 32 seconds. Thus, memchr() is 1.6 times or so faster than the naive implementation (and about 1.5 times faster than std::find()). Clearly, its implementation differs from the std::find() one.

Looking at the generated code

find_memchr(char const*, char const*, char):
    pushq   %rbx
    movq    %rsi, %rbx
    movl    %edx, %esi
    movq    %rbx, %rdx
    movsbl  %sil, %esi
    subq    %rdi, %rdx
    call    memchr
    testq   %rax, %rax
    cmove   %rbx, %rax
    popq    %rbx
    ret

(cf. GCC explorer)

we see that the compiler doesn't directly inline the memchr() call but just calls the version from glibc.

Does SIMD really speed things up?

The AVX2 extension is a relatively recent SIMD extension Intel offers on many of its CPUs - e.g. also on a standard Skylake CPU. It features registers of 256 bit width, i.e. 32 characters can be packed into one register. Also, in contrast to some older SIMD extensions, it has relaxed alignment requirements, which simplifies its usage and enables more use cases in general. Although some instructions might perform better with aligned data (as documented), this isn't the general rule.

With GCC, there are a few alternatives for using those SIMD instructions: the GCC vector extensions, inline (or external) assembler, and compiler intrinsics.

The idea of the vector extensions is to abstract away the different vector units of different architectures in order to write portable vector code. This approach sounds elegant, but as-is the GCC vector extension syntax seems to be mainly focused on arithmetic operations, and some useful operations are lacking (see the sketch below).

Using inline assembly or even an external assembler file is OK - but interfacing with the rest of the program requires some care.

The intrinsics are available via a header that is standardized by Intel, thus they are portable to other x86 compilers. GCC maps them to its builtins (which are one-to-one mappings to the underlying instructions). In addition, there is good documentation available online, including a searchable reference and tutorials.
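To illustrate the first alternative: a find() based on the GCC vector extensions might look like the following sketch (function name and details are assumptions). The element-wise compare is easy to express, but since there is no portable movemask-like operation, locating the match position falls back to scalar code:

#include <string.h>

typedef signed char v32sc __attribute__((vector_size(32)));

const char *find_vec_ext(const char *b, const char *e, char c)
{
  v32sc q;
  memset(&q, c, sizeof(q));              // broadcast the needle into all lanes
  const char *i = b;
  for (; i+32 < e; i+=32) {
    v32sc x;
    memcpy(&x, i, sizeof(x));            // unaligned 32 byte load
    v32sc r = (x == q);                  // element-wise compare: -1 where equal
    unsigned long long m[4];
    memcpy(m, &r, sizeof(m));
    if (m[0] | m[1] | m[2] | m[3]) {     // no movemask equivalent, thus ...
      for (unsigned k = 0; k < 32; ++k)  // ... locate the match bytewise
        if (i[k] == c)
          return i + k;
    }
  }
  for (; i < e; ++i)
    if (*i == c)
      return i;
  return e;
}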

A quick consultation of the Intel Intrinsics Guide leads to the following AVX2 find() version:

#include <immintrin.h>

const char *find_avx2(const char *b, const char *e, char c)
{
  const char *i = b;

  __m256i q = _mm256_set1_epi8(c);

  for (; i+32 < e; i+=32) {
    __m256i x = _mm256_lddqu_si256(
        reinterpret_cast<const __m256i*>(i));
    __m256i r = _mm256_cmpeq_epi8(x, q);
    int z     = _mm256_movemask_epi8(r);
    if (z)
      return i + __builtin_ffs(z) - 1;
  }

  for (; i < e; ++i)
    if (*i == c)
      return i;

  return e;
}

The __builtin_ffs function is a standard GCC builtin for the find-first-set bit operation. It returns the one-based index of the least significant set bit (and 0 if no bit is set), hence the subtraction of 1 when computing the match offset.

Its median runtime is 30 seconds, which is a tiny bit (2 seconds or so) faster than memchr() and thus about 1.7 times faster than the naive implementation.

Looking at the assembler output

find_avx2(char const*, char const*, char):
        leaq    8(%rsp), %r10
        andq    $-32, %rsp
        movl    %edx, %eax
        pushq   -8(%r10)
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r10
        movb    %dl, -17(%rbp)
        vpbroadcastb    -17(%rbp), %ymm1
        jmp     .L5
.L2:
        vlddqu  (%rdi), %ymm0
        vpcmpeqb        %ymm0, %ymm1, %ymm0
        vpmovmskb       %ymm0, %ecx
        testl   %ecx, %ecx
        jne     .L17
        movq    %r8, %rdi
.L5:
        leaq    32(%rdi), %r8
        cmpq    %rsi, %r8
        jb      .L2
        cmpq    %rsi, %rdi
        jnb     .L9
        cmpb    %dl, (%rdi)
        jne     .L7
        jmp     .L12
.L8:
        cmpb    (%rdi), %al
        je      .L12
.L7:
        addq    $1, %rdi
        cmpq    %rdi, %rsi
        jne     .L8
.L9:
        movq    %rsi, %rax
        vzeroupper
        popq    %r10
        popq    %rbp
        leaq    -8(%r10), %rsp
        ret
.L17:
        tzcntl  %ecx, %ecx
        addl    $1, %ecx
        movslq  %ecx, %rcx
        leaq    -1(%rdi,%rcx), %rax
        vzeroupper
        popq    %r10
        popq    %rbp
        leaq    -8(%r10), %rsp
        ret
.L12:
        movq    %rdi, %rax
        vzeroupper
        popq    %r10
        popq    %rbp
        leaq    -8(%r10), %rsp
        ret

(cf. GCC explorer)

we see how the intrinsics are directly translated to the vpbroadcastb, vlddqu, vpcmpeqb and vpmovmskb instructions. Perhaps surprisingly, each ret is prefixed with a vzeroupper instruction. Apparently, that instruction is emitted to avoid slowdowns when mixing AVX with SSE code (which occur when the upper register halves aren't zeroed between the last AVX instruction and the next SSE one). See also the Introduction to Intel Advanced Vector Extensions, Section Instruction Set Overview:

Another performance concern besides unaligned data issues is that mixing legacy XMM-only instructions and newer Intel AVX instructions causes delays, so minimize transitions between VEX-encoded instructions and legacy Intel SSE code. Said another way, do not mix VEX-prefixed instructions and non-VEX-prefixed instructions for optimal throughput. If you must do so, minimize transitions between the two by grouping instructions of the same VEX/non-VEX class. Alternatively, there is no transition penalty if the upper YMM bits are set to zero via VZEROUPPER or VZEROALL, which compilers should automatically insert. This insertion requires an extra instruction, so profiling is recommended.

This behavior can be disabled with the -mno-vzeroupper option. Doing so improves the runtime a tiny bit on the Fedora 23 Core i7 system, but significantly increases it on the CentOS 7 Intel Core i5 system (i.e. the no-vzeroupper version is 1.6 times slower, cf. Section Measurements).

Also, the vzeroupper instructions aren't emitted for every architecture, e.g. they aren't when compiling with -march=knl and GCC 6.2.

Update (2016-09-28): When compiling the intrinsic SIMD version in a separate translation unit, without any link-time optimization (LTO), the reinterpret_cast shouldn't cause any strict-aliasing related issues, since the aliased pointer is only used for reading and not for writing. A look at the assembler output confirms this understanding.

Of course, the reinterpret_cast can be amended with a may_alias attribute (when compiling with GCC; see the sketch further below). It can even be eliminated by issuing a memcpy that GCC and Clang are able to optimize away (cf. find_avx2_memcpy_impl.cc):

__m256i x;
memcpy(&x, i, sizeof(__m256i));
__m256i r = _mm256_cmpeq_epi8(x, q);

Which just yields:

vpcmpeqb -32(%r8), %ymm1, %ymm0

This is similar to the previous version, but there is no separate vlddqu instruction; instead, the (unaligned) load is folded into the memory operand of vpcmpeqb.
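For completeness, a may_alias based variant could look roughly like the following sketch: an unaligned, may_alias vector typedef is dereferenced directly, which replaces the _mm256_lddqu_si256() call (note that GCC's own headers already declare __m256i with the may_alias attribute, so this is mostly belt and braces):

// unaligned vector type that may alias anything (an assumption:
// this mirrors the __m256i_u typedef of newer GCC versions)
typedef __m256i m256i_u_alias
    __attribute__((may_alias, aligned(1)));

__m256i x = *reinterpret_cast<const m256i_u_alias*>(i);
__m256i r = _mm256_cmpeq_epi8(x, q);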

Another way to eliminate the reinterpret_cast is to implement the function outside of C/C++ without any intrinsics, i.e. as a pure assembler implementation.

How fast is just SSE?

In contrast to AVX, the registers of the SSE SIMD extension are 'only' 128 bit wide. Otherwise, the code doesn't change much:

#include <immintrin.h>

const char *find_sse(const char *b, const char *e, char c)
{
  const char *i = b;
  __m128i q = _mm_set1_epi8(c);
  for (; i+16 < e; i+=16) {
    __m128i x = _mm_lddqu_si128(
      reinterpret_cast<const __m128i*>(i));
    __m128i r = _mm_cmpeq_epi8(x, q);
    int     z = _mm_movemask_epi8(r);
    if (z)
      return i + __builtin_ffs(z) - 1;
  }

  for (; i < e; ++i)
    if (*i == c)
      return i;

  return e;
}

Its assembler output is not much different from the previous one; basically, the SIMD register names just change from ymmX to xmmX. And since there is no possibility of mixing AVX with SSE code, GCC doesn't emit any vzeroupper instructions.

The median runtime increases a little bit to 31 seconds, thus it is similar to the runtime of memchr().

What about AVX-512?

Unfortunately, normal Skylake CPUs don't include AVX-512; it is only available in server variants and later CPU generations. With AVX-512, not only is the vector register size doubled to 512 bit, but - perhaps more importantly - there are new instructions that compare and compute the mask at once:

#include <immintrin.h>

const char *find_avx512(const char *b, const char *e, char c)
{
  const char *i = b;

  __m256i q = _mm256_set1_epi8(c);

  for (; i+32 < e; i+=32) {
    __m256i   x = _mm256_lddqu_si256(
      reinterpret_cast<const __m256i*>(i));
    // compare and produce the bit mask in one instruction
    __mmask32 z = _mm256_cmpeq_epi8_mask(x, q);

    if (z)
      return i + __builtin_ffs(z) - 1;
  }

  for (; i < e; ++i)
    if (*i == c)
      return i;

  return e;
}

(compile with e.g. -march=skylake-avx512 - GCC explorer)
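A variant that actually uses the full 512 bit registers might look like this (a sketch, untested on real AVX-512 hardware; the function name is an assumption, and the byte-wise mask compare requires the AVX512BW subset):

#include <immintrin.h>

const char *find_avx512_512(const char *b, const char *e, char c)
{
  const char *i = b;

  __m512i q = _mm512_set1_epi8(c);

  for (; i+64 < e; i+=64) {
    __m512i   x = _mm512_loadu_si512(i);          // unaligned 64 byte load
    __mmask64 z = _mm512_cmpeq_epi8_mask(x, q);   // compare + mask in one step
    if (z)
      return i + __builtin_ffsll(z) - 1;
  }

  for (; i < e; ++i)
    if (*i == c)
      return i;

  return e;
}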

Glibc memchr()

When compiling with GCC's g++ against glibc with optimizations turned on, the /usr/include/string.h header defines memchr() like this:

__extern_always_inline const void *
memchr (const void *__s, int __c, size_t __n) __THROW
{
  return __builtin_memchr (__s, __c, __n);
}

__builtin_memchr() is a GCC builtin. This enables the compiler to apply memchr() specific optimizations like constant folding. For example, if all arguments are known at compile time, the compiler can evaluate the function at compile time and just emit the result as a constant. Or some variants can be replaced with special code, e.g. if the length is a constant zero.
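For example, GCC is typically able to fold a call like the following into a constant, so that no memchr() call is emitted at all (a minimal illustration, not part of the test program):

#include <string.h>

bool contains_pipe()
{
  // all arguments are compile-time constants, so the compiler can
  // evaluate the search at compile time and reduce the function to
  // 'return true'
  return memchr("foo|bar", '|', 7) != NULL;
}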

On the Fedora 23 system, the __builtin_memchr() ultimately yields a shared library symbol lookup for memchr, which gets resolved to sysdeps/x86_64/memchr.S. This is an SSE version (i.e. one that works with 128 bit registers) implemented in ~ 300 lines of assembler. In the main loop, 4 comparisons are done in a row, i.e. each loop iteration processes 64 bytes in a chunked fashion. There is a prologue dealing with an unaligned prefix, whereas the main loop accesses memory in an aligned fashion.

The main loop basically does 4 vector comparisons in a row and then computes the byte-wise maximum of the result vectors:

movdqa  (%rdi), %xmm0
movdqa  16(%rdi), %xmm2
movdqa  32(%rdi), %xmm3
movdqa  48(%rdi), %xmm4

pcmpeqb %xmm1, %xmm0
pcmpeqb %xmm1, %xmm2
pcmpeqb %xmm1, %xmm3
pcmpeqb %xmm1, %xmm4

pmaxub  %xmm0, %xmm3
pmaxub  %xmm2, %xmm4
pmaxub  %xmm3, %xmm4
pmovmskb %xmm4, %eax

add     $64, %rdi

test    %eax, %eax
jz      L(align64_loop)

sub     $64, %rdi

# .. if there is a match further masking/comparing ..

Glibc also comes with various system specific optimized versions for other architectures like i386, PowerPC, MIPS, SPARC, m68k, etc.

On Fedora, the source can be found under

/usr/src/debug/glibc-2.22/sysdeps/x86_64/memchr.S

if the glibc debug package was installed via e.g.

dnf debuginfo-install glibc-...

or in the glibc source package, e.g.:

dnf download --source glibc
rpm2cpio glibc-2.22-18.fc23.src.rpm | cpio --extract --make-directories
tar xf glibc-2.22.tar.gz
cd glibc-2.22
vim sysdeps/x86_64/memchr.S

To verify which implementation actually ends up being used, setting a breakpoint with gdb and stepping through instructions (i.e. stepi, or si for short) is useful. Of course, this is more convenient if the glibc debug package is installed.

Controlled Buffer Overflow

The AVX find version is fast because it searches in 32 byte increments. Its weak point is the tail of the search range, i.e. when the range size is not a multiple of 32. In the worst case, the 2nd loop has to iterate over 31 remaining bytes. If we could guarantee that the range size modulo 32 is always zero, then the 2nd loop could be eliminated. We can't - but the next best thing is to possibly read some additional bytes past the end and ignore any matches in those.

For that, the contract with the caller has to be changed such that the end of the range isn't also the end of the allocated buffer. Otherwise there is the risk of reading into a guard page. The change to our main function is minimal:

enum { N = 128 * 1024 };
char buffer[N+31];
size_t off = 0;
for (;;) {
  auto r = x::read(fd, buffer, N);
  const char *b = buffer;
  auto e = b + r;
  for (;;) {
    auto x = g::find(b, e, '\n');
    auto y = g::find(b, x, '|');

The minimal read buffer is just extended by 31 bytes.

The AVX find function can thus be simplified to:

#include <immintrin.h>

const char *find_avx2_overflow(const char *b, const char *e, char c)
{
  const char *i = b;

  __m256i q = _mm256_set1_epi8(c);

  for (; i < e; i+=32) {
    __m256i x = _mm256_lddqu_si256(
        reinterpret_cast<const __m256i*>(i));
    __m256i r = _mm256_cmpeq_epi8(x, q);
    int z     = _mm256_movemask_epi8(r);
    if (z) {
      const char *r = i + __builtin_ffs(z) - 1;
      return r < e ? r : e;
    }
  }

  return e;
}

For the test data and the main test machine the change in median runtime is only marginal, less than 1 second or so.

Do more faster?

Since it isn't always practical to allocate a larger buffer, another promising variant is to eliminate the second loop by doing one overlapping SIMD comparison for the remaining bytes. On the other hand, in the worst case 31 bytes are compared twice.

Our example thus changes to:

#include <immintrin.h>

const char *find_avx2_more(const char *b, const char *e, char c)
{
  const char *i = b;

  __m256i q = _mm256_set1_epi8(c);

  for (; i+32 < e; i+=32) {
    __m256i x = _mm256_lddqu_si256(
        reinterpret_cast<const __m256i*>(i));
    __m256i r = _mm256_cmpeq_epi8(x, q);
    int z     = _mm256_movemask_epi8(r);
    if (z)
      return i + __builtin_ffs(z) - 1;
  }
  if (i<e) {
    if (e-b < 32) {
      for (; i < e; ++i)
        if (*i == c)
          return i;
    } else {
      i = e-32;
      __m256i x = _mm256_lddqu_si256(
          reinterpret_cast<const __m256i*>(i));
      __m256i r = _mm256_cmpeq_epi8(x, q);
      int z     = _mm256_movemask_epi8(r);
      if (z)
        return i + __builtin_ffs(z) - 1;
    }
  }
  return e;
}

With that, the median runtime basically stays the same in comparison with the overflow version.

Alignment

The Intel documentation hints that, although alignment requirements are relaxed with AVX2, some instructions might perform faster when accessing aligned data. For example, there is an indirect hint in the documentation of _mm256_lddqu_si256:

This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary.

A sufficiently aligned datum never crosses a cache line boundary, while an unaligned one may.

And, more specifically, the Introduction to Intel Advanced Vector Extensions (Section Instruction Set Overview):

Intel® AVX has relaxed some memory alignment requirements, so now Intel AVX by default allows unaligned access; however, this access may come at a performance slowdown, so the old rule of designing your data to be memory aligned is still good practice (16-byte aligned for 128-bit access and 32-byte aligned for 256-bit access). The main exceptions are the VEX-extended versions of the SSE instructions that explicitly required memory-aligned data: These instructions still require aligned data. Other specific instructions requiring aligned access are listed in Table 2.4 of the Intel® Advanced Vector Extensions Programming Reference (see "For More Information" for a link).

In our example, the records naturally are unaligned due to their variable length. However, we can do the following: access the first 32 bytes unaligned and then increment by less than 32 bytes such that all following loads are aligned. In the worst case this yields an overlap of 31 bytes - but it still might pay off.

Since we are using 256 bit registers in the AVX example, the address should be aligned to 32 bytes. An address is 32 byte aligned (i.e. it is a multiple of 32) if its 5 least significant bits are zero (since log2(32)=5).

Thus, our example changes to:

#include <immintrin.h>
#include <stdint.h>

const char *find_avx2_align(const char *b, const char *e, char c)
{
  const char *i = b;

  __m256i q = _mm256_set1_epi8(c);

  __m256i x = _mm256_lddqu_si256(
      reinterpret_cast<const __m256i*>(i));
  __m256i r = _mm256_cmpeq_epi8(x, q);
  int     z = _mm256_movemask_epi8(r);
  if (z) {
    const char *r = i + __builtin_ffs(z) - 1;
    return r < e ? r : e;
  }
  const void *a = b + 32;
  uintptr_t ai = reinterpret_cast<uintptr_t>(a);
  ai &= ~uintptr_t(0b11111u);
  a = reinterpret_cast<const void*>(ai);
  i = static_cast<const char*>(a);
  for (; i < e; i+=32) {
    __m256i x = _mm256_lddqu_si256(
        reinterpret_cast<const __m256i*>(i));
    __m256i r = _mm256_cmpeq_epi8(x, q);
    int     z = _mm256_movemask_epi8(r);
    if (z) {
      const char *r = i + __builtin_ffs(z) - 1;
      return r < e ? r : e;
    }
  }

  return e;
}

It turns out that this doesn't improve the runtime much (in comparison with the first AVX version).

With that version we can also replace _mm256_lddqu_si256 with _mm256_load_si256 in the main loop (which does require alignment), but - at least on this CPU - this doesn't change the runtime much.

Inlining

Putting an AVX find version into the same translation unit (or into a header) and adding static inline to its definition allows the compiler to inline it into the main read loop such that function call and return overhead is eliminated. With find_avx2_overflow there is no runtime improvement, though. In fact, it runs half a second or so longer.

Other C libraries

Dietlibc

Dietlibc (0.33-8), a libc optimized for static linking, also comes with some architecture specific memchr() versions. The x86-64 version is written in assembler and uses SSE, i.e. 128 bit registers. In contrast to glibc, the main loop contains just one vector comparison instruction instead of four (dietlibc/x86_64/memchr.S):

  pshufd $0,%xmm0,%xmm0        /* xmm0 = ch x 16 */
1:
  movdqa (%rdi),%xmm1
  pcmpeqb %xmm0,%xmm1
  pmovmskb %xmm1,%ecx
  and %eax,%ecx
  jnz .Lfound        /* found something */
  lea 16(%rdi),%rdi
  or $-1,%eax
  cmp %rsi,%rdi
  jb 1b
  jmp .Lnull
.Lfound:

Its median runtime is similar to the glibc one, i.e. 32 seconds.

uClibc

uClibc (0.9.33.2), a C library optimized for embedded systems (i.e. for space), comes with some architecture specific memchr() implementations, but not for x86-64. Its generic version in libc/string/generic/memchr.c originates from glibc and implements chunked looping (without vector instructions). The chunks are of size sizeof(size_t), and this is also the alignment the prologue establishes. With comments, this implementation is about 170 lines long. Its median runtime is a few seconds worse than that of the naive implementation, i.e. 54 seconds vs. 50 seconds.

It also comes with an Intel 386 specific memchr() implementation that uses the REPNE prefix with the SCASB instruction (libc/string/i386/memchr.c). The mnemonics stand for repeat-while-not-equal and scan-string-byte. This instruction combination basically replaces a whole loop, which is very CISC and seems to stem from a time of very limited compilers. It is still available on x86-64 in 64 bit mode. A 64 bit version with REPNE SCASB has suboptimal runtime, though - its runtime is 71 seconds, i.e. 21 seconds longer than the naive implementation. The Intel 64 and IA-32 Architectures Optimization Reference Manual (page 3-68, June 2016) also recommends against its use:

Using a REP prefix with string move instructions can provide high performance in the situations described above. However, using a REP prefix with string scan instructions (SCASB, SCASW, SCASD, SCASQ) or compare instructions (CMPSB, CMPSW, SMPSD, SMPSQ) is not recommended for high performance. Consider using SIMD instructions instead.

Apparently, it is just there for symmetry with the existing non-64 Bit instruction set.
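For reference, such a 64 bit REPNE SCASB search can be written with GCC inline assembly roughly like this (a sketch; the benchmarked find_uclibcx86 variant in the repository may differ in details):

#include <stddef.h>

const char *find_repne_scasb(const char *b, const char *e, char c)
{
  size_t n = e - b;
  if (!n)
    return e;
  const char *p = b;
  // AL holds the needle, RDI the cursor, RCX the remaining count;
  // REPNE SCASB stops at the first equal byte or when RCX reaches 0
  asm volatile("repne scasb"
               : "+D" (p), "+c" (n)
               : "a" (c)
               : "cc", "memory");
  // RDI now points one byte past the last examined byte
  return p[-1] == c ? p - 1 : e;
}

Note that the x86-64 System V ABI guarantees a cleared direction flag on function entry, so no explicit cld is necessary.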

Musl

Musl, a robust C library that strives for standards conformance, also doesn't include an x86-64 specific memchr() version (as of 2016-09-10). But its generic version is more concise:

#include <string.h>
#include <stdint.h>
#include <limits.h>

#define SS (sizeof(size_t))
#define ALIGN (sizeof(size_t)-1)
#define ONES ((size_t)-1/UCHAR_MAX)
#define HIGHS (ONES * (UCHAR_MAX/2+1))
#define HASZERO(x) ((x)-ONES & ~(x) & HIGHS)

void *memchr(const void *src, int c, size_t n)
{
    const unsigned char *s = src;
    c = (unsigned char)c;
    for (; ((uintptr_t)s & ALIGN) && n && *s != c; s++, n--);
    if (n && *s != c) {
        const size_t *w;
        size_t k = ONES * c;
        for (w = (const void *)s; n>=SS && !HASZERO(*w^k); w++, n-=SS);
        for (s = (const void *)w; n && *s != c; s++, n--);
    }
    return n ? (void *)s : 0;
}

Again, this version implements chunked looping with size_t-sized chunks. Thus, there is a prologue that takes care of alignment and an epilogue that deals with trailing bytes. All the extra bit operations and tests certainly look expensive. And indeed, the runtime is 4 seconds or so above that of the naive implementation, i.e. 54 seconds in total.
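For reference, assuming a 64 bit size_t, the macros expand to the usual SWAR (SIMD within a register) bit-trick constants:

// ONES  == 0x0101010101010101  - a 1 in every byte
// HIGHS == 0x8080808080808080  - the high bit of every byte
// HASZERO(v) is non-zero iff at least one byte of v is zero;
// since *w ^ k zeroes exactly the bytes that equal c, the
// HASZERO(*w ^ k) test detects a match anywhere in the 8 byte chunk.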

Update (2016-09-20): related discussion on the musl mailing list

Measurements

The runtime results for the Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz machine (20 repetitions à 300000 records; Fedora 23, GCC 4.9), in seconds:

exe min median mean max sdev speedup
find_avx2_overflow_ext 28.33 28.53 28.69 30.07 0.42 1.74
find_avx2_more 28.63 28.79 29.07 30.88 0.55 1.72
find_avx2_loop 28.82 29.05 29.23 30.42 0.47 1.71
find_avx2_align2 28.98 29.16 29.70 32.24 1.02 1.70
find_avx2_overflow 29.08 29.18 29.36 30.39 0.35 1.70
find_avx2_align 29.05 29.26 29.43 30.45 0.42 1.70
find_avx2_nozero 28.93 29.30 29.53 30.73 0.58 1.69
find_avx2 29.08 29.45 29.56 30.69 0.54 1.69
find_avx2_memcpy 29.27 29.52 29.65 30.39 0.34 1.68
find_sse 30.56 30.86 31.13 32.43 0.59 1.61
find_diet 31.66 31.81 31.99 33.35 0.47 1.56
find_memchr 31.63 31.95 32.12 33.73 0.54 1.55
find_unroll2 45.20 45.68 45.80 46.75 0.43 1.09
find_find 46.48 46.87 47.39 50.45 1.04 1.06
find_naive 49.41 49.65 50.00 51.91 0.69 1.00
find_uclibc 53.89 54.24 54.64 57.76 0.98 0.92
find_musl 53.70 54.30 54.64 56.38 0.82 0.91
find_unroll 53.76 54.50 54.74 56.64 0.75 0.91
find_uclibcx86 70.76 71.11 71.43 73.12 0.78 0.70

The runtime results for a Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz machine (20 repetitions à 300000 records; CentOS 7, GCC 4.8), in seconds:

exe min median mean max sdev speedup
find_avx2_overflow 43.77 43.82 43.92 44.66 0.26 1.72
find_avx2_overflow_ext 45.49 45.55 45.57 45.74 0.07 1.66
find_avx2_more 46.50 46.62 46.65 47.06 0.13 1.62
find_avx2_align2 46.67 46.76 46.78 47.14 0.09 1.61
find_avx2_align 46.70 46.76 46.76 46.89 0.04 1.61
find_avx2 47.03 47.09 47.11 47.25 0.06 1.60
find_avx2_loop 48.20 48.33 48.37 48.69 0.11 1.56
find_sse 49.21 49.52 49.50 49.85 0.16 1.52
find_memchr 50.56 50.76 50.78 50.99 0.08 1.49
find_diet 51.19 51.41 51.44 51.98 0.17 1.47
find_find 69.03 69.21 69.35 70.32 0.36 1.09
find_avx2_nozero 75.29 75.38 75.52 76.47 0.34 1.00
find_naive 75.27 75.41 75.42 75.56 0.09 1.00
find_musl 75.92 76.05 76.07 76.23 0.08 0.99
find_uclibc 79.37 79.56 79.60 80.17 0.19 0.95
find_uclibcx86 101.6 102.3 102.3 102.5 0.16 0.74

The runtime results for an AMD Phenom(tm) 9750 Quad-Core Processor (@ 2.40 GHz, no AVX) machine (20 repetitions à 300000 records; CentOS 7, GCC 4.8), in seconds:

exe min median mean max sdev speedup
find_sse 87.39 91.48 91.48 95.47 2.51 1.90
find_diet 90.95 92.55 93.02 98.61 2.13 1.88
find_memchr 89.31 92.80 92.92 101.22 2.87 1.88
find_find 116.3 119.2 119.8 126.5 2.77 1.46
find_uclibc 118.4 120.0 121.2 127.2 2.53 1.45
find_musl 118.1 120.7 120.9 125.0 1.71 1.44
find_uclibcx86 125.7 129.0 129.0 136.4 2.46 1.35
find_naive 170.9 174.2 174.2 178.2 2.01 1.00

Note that, in contrast to the more recent Intel CPUs above, this is a CPU where the chunked versions are faster than the naive one. Even the REPNE SCASB version is faster than the naive one, i.e. it doesn't have a runtime penalty on this CPU. Also, the more recent i5-4250U delivers considerably more performance at about 1 GHz less - the naive version runs more than twice as fast on that CPU - which makes it significantly more energy efficient.

Update (2016-09-18): results for SPARC and PPC

Conclusion

A good default choice for implementing std::find() over char ranges is to just call memchr(), since it is highly optimized on many systems (e.g. in glibc on x86 and other architectures).

On x86, in case the libc memchr() doesn't use AVX, it pays off to switch to an AVX version. Further tricks like the buffer overflowing variant or inlining only bring marginal gains.

It makes sense to change the GNU STL and add specialisations of std::find() for char and wchar_t. The specialisation for char could directly call __builtin_memchr() (the builtin, to save an include) or even std::char_traits<char>::find(), because that specialisation already calls __builtin_memchr().
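Such an overload (shown here in terms of plain std::memchr() for brevity; the real change could call __builtin_memchr() or std::char_traits<char>::find() as mentioned above) might look roughly like this sketch:

#include <cstring>

namespace sketch {
  // hypothetical char overload that a library could add next to the
  // generic std::find() template
  inline const char *find(const char *first, const char *last, char value)
  {
    const void *r = std::memchr(first, value, last - first);
    return r ? static_cast<const char*>(r) : last;
  }
}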

Another thing to look into is which CPU-feature-dependent runtime dispatching mechanisms glibc provides, such that an AVX2 or even AVX-512 version is automatically selected at runtime (if the CPU supports the extension).

When dealing with an AVX-512 capable CPU, it makes sense to not only test with 512 bit registers but also test the new compare instructions that directly return the mask (e.g. _mm256_cmpeq_epi8_mask()), thus saving one vector instruction in comparison to previous Intel SIMD variants.

Modern CPUs are relatively forgiving of a naive implementation, while classic manual unrolling as in the GNU STL's std::find() doesn't help much, and some chunking techniques may even yield worse runtimes.

As always, do your own performance tests: CPU architectures change, and techniques that were beneficial on one architecture generation (e.g. aligned accesses) may no longer be, or may even be counterproductive (e.g. excessive chunking). Of course, all this also depends on the range size and the character distribution of the search input, i.e. how many characters an implementation has to search through to find a match, on average.