[v4] generic string skeleton.

Message ID 20150529190952.GA23952@domone
State New

Commit Message

Ondřej Bílka May 29, 2015, 7:09 p.m. UTC
Hi, this is the next iteration of the string skeleton.

I moved these to generic code with a brief description of the primitives.
Adding hardware instructions for a fast zero/equality check should be easy.

However, the bytewise minimum would not be. It is one of the main
advantages of x64 string handling, as it offers a considerable speedup,
but we would need different skeletons for it.

What also needs to be solved on a per-arch basis is how fast clz/ctz is
for finding the matching byte. I don't know, so testing the alternatives
is needed.
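
For example, on little endian the conversion from a mask to a byte index
could be a single ctz (an untested sketch assuming a 64-bit vector_int
and gcc's __builtin_ctzl; the name is just for illustration):

static inline size_t
first_nonzero_byte_ctz (unsigned long int mask)
{
  /* The mask has bit 7 set in every matching byte, so for a nonzero mask
     the index of the lowest set bit divided by 8 is the index of the
     first match.  */
  return __builtin_ctzl (mask) / 8;
}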

I also have an idea that one could first do a byteswap on big-endian
architectures to be able to use the little-endian tricks. How fast is a
byteswap there, and would it work?
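
Roughly what I mean (an untested sketch, assuming a 64-bit vector_int
and gcc's __builtin_bswap64):

#if __BYTE_ORDER == __BIG_ENDIAN
/* Swap an aligned word after loading so the cheap little-endian tricks
   (x ^ (x - 1), ctz, ...) apply unchanged.  */
static inline vector_int
load_swapped (const char *p)
{
  return __builtin_bswap64 (*(const vector_int *) p);
}
#endif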

I also encoded into the skeleton some tricks from x64, which also need
to be checked. There are several tunable variables that need to be
adjusted. My plan is to do that automatically and use a dryrun profile
trace to find the optimal settings without much effort.

That results in a more macroized skeleton, which is needed to specify
unrolling and other options.

Next, I added support for multibyte patterns. That will improve strstr,
strcasestr and memmem. These could also be adapted for a generic
strcpy.

Comments?

Comments

Joseph Myers May 29, 2015, 8:36 p.m. UTC | #1
On Fri, 29 May 2015, Ondřej Bílka wrote:

> +#ifndef VECTOR_INT
> +# define VECTOR_INT unsigned long int
> +#endif

I think a separate header for this would be better to avoid the #ifndef 
pattern.  I also wonder if actually this information about register size 
(which is effectively what this is) really ought to go in bits/wordsize.h 
in some way, with other headers then working from that.  Because it's not 
just this code that can use such information - gmp-mparam.h can (it ought 
to be possible to eliminate machine-specific versions of gmp-mparam.h) as 
can sfp-machine.h (quite a bit of the sfp-machine.h files is actually 
generic).  But since bits/wordsize.h is installed, there's a case for this 
going in a non-installed header, where the default version just uses 
bits/wordsize.h.

> +static const vector_int ones = (~0UL / 255); /* 0x0101...*/
> +static const vector_int add = 127 * (~0UL / 255);
> +static const vector_int high_bits = 128 * (~0UL / 255);

These need to use ((vector_int) -1) or similar instead of ~0UL, for when 
the type is wider than int.

> +#define LSIZE sizeof (vector_int)
> +#ifdef PAGE_SIZE
> +# define CROSS_PAGE(x, n) (((uintptr_t) x) % PAGE_SIZE > PAGE_SIZE - n)

If PAGE_SIZE might sometimes be nonconstant (see sysdeps/mach/pagecopy.h), 
I'd tend to think a separate macro (for the minimum size of a page, always 
constant) would be better here.

> +#ifndef BOUND
> +# define BOUND(x) 0
> +#endif

I've no idea what the semantics of this macro are.  It definitely needs a 
comment.  Similarly for subsequent macros in this header.
Ondřej Bílka May 30, 2015, 8:23 p.m. UTC | #2
On Fri, May 29, 2015 at 08:36:23PM +0000, Joseph Myers wrote:
> On Fri, 29 May 2015, Ondřej Bílka wrote:
> 
> > +#ifndef VECTOR_INT
> > +# define VECTOR_INT unsigned long int
> > +#endif
> 
> I think a separate header for this would be better to avoid the #ifndef 
> pattern.  I also wonder if actually this information about register size 
> (which is effectively what this is) really ought to go in bits/wordsize.h 
> in some way, with other headers then working from that.  Because it's not 
> just this code that can use such information - gmp-mparam.h can (it ought 
> to be possible to eliminate machine-specific versions of gmp-mparam.h) as 
> can sfp-machine.h (quite a bit of the sfp-machine.h files is actually 
> generic).  But since bits/wordsize.h is installed, there's a case for this 
> going in a non-installed header, where the default version just uses 
> bits/wordsize.h.
>
Possible, but I don't know how to do that yet. A problem I try to avoid
is interdependencies. Without the #ifndef making it optional you would
need to include another file. It could result in having five phases of
this header, and you would need to add redefinitions between the right
headers for it to work.
 
> > +static const vector_int ones = (~0UL / 255); /* 0x0101...*/
> > +static const vector_int add = 127 * (~0UL / 255);
> > +static const vector_int high_bits = 128 * (~0UL / 255);
> 
> These need to use ((vector_int) -1) or similar instead of ~0UL, for when 
> the type is wider than int.
> 
will do.
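
Something like this should stay correct even when vector_int is wider
than unsigned long int (untested):

static const vector_int ones = ((vector_int) -1 / 255); /* 0x0101...*/
static const vector_int add = 127 * ((vector_int) -1 / 255);
static const vector_int high_bits = 128 * ((vector_int) -1 / 255);
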
> > +#define LSIZE sizeof (vector_int)
> > +#ifdef PAGE_SIZE
> > +# define CROSS_PAGE(x, n) (((uintptr_t) x) % PAGE_SIZE > PAGE_SIZE - n)
> 
> If PAGE_SIZE might sometimes be nonconstant (see sysdeps/mach/pagecopy.h), 
> I'd tend to think a separate macro (for the minimum size of a page, always 
> constant) would be better here.
> 
Will add MIN_PAGE_SIZE in next version.
> > +#ifndef BOUND
> > +# define BOUND(x) 0
> > +#endif
> 
> I've no idea what the semantics of this macro are.  It definitely needs a 
> comment.  Similarly for subsequent macros in this header.
> 
These are for the strn* functions to check whether we read past the
input. My original implementation of these checks in memchr was
overkill, so I will simplify them.
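
For illustration, a bounded function could plug in like this (an
untested sketch of how I imagine memchr using the skeleton; the include
names and the wrapper are only an example, and a real memchr would also
have to reject a match found past s + n):

#define EXPRESSION(v, cmask)         bytes_equal (v, cmask)
#define EXPRESSION_NOCARRY(v, cmask) bytes_equal_nocarry (v, cmask)
#define BOUND(x)                     ((char *) (x) >= end)
#include <string_vector.h>
#include <string_vector_skeleton.h>

void *
__memchr_sketch (const void *s, int c, size_t n)
{
  return n ? string_skeleton (s, c, (char *) s + n) : NULL;
}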
Ondřej Bílka May 31, 2015, 2:05 p.m. UTC | #3
Hi,

Now that we have a string skeleton that we could optimize, we need to
adjust the workflow.

With almost zero effort we could probably make these routines ten
percent faster.

There are three main issues; the first one is relatively easy: it is
autogenerating ifuncs. A machine maintainer would need to write a
function that checks which foo in gcc -march=foo matches the current cpu.

Then we would compile every function for each -march combination and the
ifunc would select the variant for the given architecture.
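
For example, on x86 such a check could be a thin wrapper over gcc's cpu
builtins (an untested sketch; the function name is just an example):

static int
march_matches_haswell (void)
{
  __builtin_cpu_init ();
  return __builtin_cpu_is ("haswell");
}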

Second is profile feedback. That gives the ten percent that I promised.

A problem with that is that the string functions need to be standalone.

Unless we allow compiling libc with -fprofile-generate, we first need to
compile these into a standalone library that we run to collect profile data.

I would use the call traces that I collected with dryrun.

A third issue is that I want to make the tuning more generic. I added
several tunable variables and it isn't clear which options are better.

I wrote a simple evolutionary algorithm to optimize these when there are
too many combinations of tunables.

A maintainer would need to run it for a day on a machine to find the
optimum. With ifuncs he would need to do this for each architecture.

As an example of the possible gains, run the following benchmark. On
Haswell the running times with different compilations are as follows:


gcc -O3 -I.. -c memchr.c
gcc -O3 testm.c memchr.o
./a.out



real	0m0.701s
user	0m0.701s
sys	0m0.000s

real	0m0.700s
user	0m0.701s
sys	0m0.000s

real	0m0.700s
user	0m0.700s
sys	0m0.000s

gcc -O3 -march=native -I.. -c memchr.c
gcc -O3 testm.c memchr.o
./a.out



real	0m0.651s
user	0m0.651s
sys	0m0.000s

real	0m0.650s
user	0m0.647s
sys	0m0.003s

real	0m0.650s
user	0m0.650s
sys	0m0.000s

gcc -O3 -I.. -c memchr.c -fprofile-generate
gcc -O3 testm.c memchr.o -fprofile-generate
./a.out
gcc -O3 -I.. -S memchr.c -fprofile-use
gcc -O3 testm.c memchr.s

real	0m0.705s
user	0m0.705s
sys	0m0.000s

real	0m0.703s
user	0m0.704s
sys	0m0.000s

real	0m0.703s
user	0m0.704s
sys	0m0.000s

gcc -march=native -O3 -I.. -c memchr.c -fprofile-generate
gcc -march=native -O3 testm.c memchr.o -fprofile-generate
./a.out
gcc -march=native -O3 -I.. -S memchr.c -fprofile-use
gcc -O3 testm.c memchr.s

real	0m0.635s
user	0m0.635s
sys	0m0.000s

real	0m0.634s
user	0m0.634s
sys	0m0.000s

real	0m0.633s
user	0m0.634s
sys	0m0.000s
Ondřej Bílka May 31, 2015, 6:36 p.m. UTC | #4
I also forgot to mention the different implementations of the builtins;
these also need to be selected by benchmarking, so we are in the same
situation as with a tunable that has multiple values.

Here an architecture maintainer could supply a custom builtin, but it
may be available only for some architectures, there could be several
alternatives, or the instruction may be too slow. There are also several
alternative ways to implement the generic builtins.

So there should be some system to test these.

I would like to keep the system that I use: for each builtin we would
make a directory sysdeps/generic/builtin where each file contains an
implementation. An arch maintainer would make a builtin directory in his
sysdeps.

Then we would first run a benchmark that enumerates the files in
sysdeps/generic/builtin and sysdeps/arch/builtin and creates a symlink
to the builtin that should be used.



As an example question for the primitives, right now I don't know
whether broadcasting a byte is done faster by:

x * 0x0101010101010101

or

x |= x << 8
x |= x << 16
x |= x << 32
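
As self-contained versions to benchmark (a sketch assuming a 64-bit
vector_int; the function names are just for illustration):

#include <stdint.h>

static inline uint64_t
broadcast_mul (unsigned char c)
{
  /* One multiplication by 0x0101010101010101.  */
  return c * 0x0101010101010101ULL;
}

static inline uint64_t
broadcast_shift (unsigned char c)
{
  /* Three shift-and-or steps, no multiplier needed.  */
  uint64_t x = c;
  x |= x << 8;
  x |= x << 16;
  x |= x << 32;
  return x;
}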

Also for first_nonzero_byte there are questions like how fast clz is
and how to exploit the fact that only the highest bit of each byte is
used, and these could vary based on the cpu.


Comments?

Patch

diff --git a/sysdeps/generic/string_vector.h b/sysdeps/generic/string_vector.h
new file mode 100644
index 0000000..5a835e1
--- /dev/null
+++ b/sysdeps/generic/string_vector.h
@@ -0,0 +1,211 @@ 
+/* Vectorized arithmetic used in string functions.
+   Copyright (C) 1991-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stdint.h>
+
+/* This is intended for optimizing string functions by using vectorization.
+
+   The main idea is simple.  Most string functions can be described as
+   searching for the first byte that satisfies a certain expression,
+   followed by a simple output conversion.  For example in strlen we
+   look for the first byte x where x == 0, then subtract the start pointer.
+
+   Once you have an expression, the skeleton evaluates it in parallel for
+   eight bytes at once, which gives a considerable speedup.
+
+   You need to build the expression from primitives that allow vectorization.
+
+   Bitwise arithmetic (&,|,^,~) is allowed.  For most tricks you also need
+   addition and subtraction, where you must be more careful.
+   If you can ensure that your expression doesn't overflow then you can
+   use it as well.  However expressions that could overflow are considerably
+   faster and you can use them when you are a bit careful.  When you only
+   want to find the first byte where the expression is true, you mostly
+   don't care that on success it could corrupt the following bytes; you
+   must not overflow on failure though.  You need to supply two versions
+   of the expression: EXPRESSION, which may overflow on success, and
+   EXPRESSION_NOCARRY, which may not.
+
+   Use vector arithmetic tricks.  The idea is to take an expression that
+   works on an unsigned byte and evaluates to 0 for a nonzero byte and to
+   nonzero for a zero byte.  Our expression is ((s - 1) & (~s)) & 128.
+   Now we evaluate this expression on each byte in parallel, and at the
+   first byte that matches, the result will have a nonzero value.
+
+   We need to provide a version of the expression that doesn't cause carry
+   propagation, so the operations can be done on each byte in parallel.
+   However that is not needed on little endian architectures, as we end on
+   the first byte that succeeds and don't care that later bytes get corrupted.
+
+   Then you can use the premade predicates.  There is contains_zero(x),
+   which returns 128 for a zero byte and 0 otherwise, and bytes_equal(x, y),
+   which returns 128 when x == y and zero otherwise; these come with
+   variants that don't cause overflow.
+
+   For performance an architecture with hardware support should redefine
+   the primitives.  Most often one wants to supply its own first_nonzero_byte
+   and define CUSTOM_FIRST_NONZERO_BYTE to avoid defining the default one.
+
+   Others are contains_zero and bytes_equal, which need to be redefined
+   along with their nocarry counterparts.
+
+   Having a fast hardware bytewise minimum is a game changer as it allows
+   a considerable speedup.  However it would need separate skeletons, as
+   the main benefit is in faster aggregation.
+
+   After that comes tuning, as there are several variables that affect
+   performance, like the number of times unrolled.  This could be
+   automated by running with different options against a dryrun profile
+   trace and selecting the best one.
+ */
+
+#ifndef VECTOR_INT
+# define VECTOR_INT unsigned long int
+#endif
+
+#if _STRING_ARCH_unaligned
+# define UNALIGNED_HEADER_UNROLL 4
+#else
+# define UNALIGNED_HEADER_UNROLL 0
+#endif
+#define ALIGNED_HEADER_UNROLL 4
+#define LOOP_UNROLL 4
+
+typedef VECTOR_INT vector_int;
+
+static const vector_int ones = (~0UL / 255); /* 0x0101...*/
+static const vector_int add = 127 * (~0UL / 255);
+static const vector_int high_bits = 128 * (~0UL / 255);
+
+
+#define LSIZE sizeof (vector_int)
+#ifdef PAGE_SIZE
+# define CROSS_PAGE(x, n) (((uintptr_t) x) % PAGE_SIZE > PAGE_SIZE - n)
+#else
+# define CROSS_PAGE(x, n) (((uintptr_t) x) % 4096 > 4096 - n)
+#endif
+
+#if __BYTE_ORDER == __BIG_ENDIAN
+# define SHIFT_BYTES(x, n) ((x) << (8 * (n)))
+# define SHIFT_BYTES_UP(x, n) ((x) >> (8 * (n)))
+#else
+# define SHIFT_BYTES(x, n) ((x) >> (8 * (n)))
+# define SHIFT_BYTES_UP(x, n) ((x) << (8 * (n)))
+#endif
+
+/* Set the first n bytes to zero.  */
+#define FORGET_BYTES(x, n) SHIFT_BYTES_UP (SHIFT_BYTES (x, n), n)
+
+
+/* Load a vector.  Needs to be a macro for the cast.
+   While LOAD(x) requires x to be aligned, LOADU does not.
+   Unaligned loads are emulated on platforms that don't support them.  */
+
+#define LOAD(x) (*((vector_int *) (x)))
+#if _STRING_ARCH_unaligned == 0
+/* Here we could combine the shifts if the architecture defines x << 64 as zero.  */
+
+# define LOADU(x) ({ \
+		     char *aligned = PTR_ALIGN_DOWN ((char *) (x), LSIZE);                  \
+		     unsigned int align = ((char *) (x)) - aligned;                         \
+		     (SHIFT_BYTES (SHIFT_BYTES (LOAD (aligned), LSIZE - 1 - align), 1)      \
+		      | SHIFT_BYTES_UP (LOAD (aligned + LSIZE), align));                    \
+		   })
+
+#else
+# define LOADU(x) LOAD (x)
+#endif
+
+
+#ifndef CUSTOM_CONTAINS_ZERO
+# if __BYTE_ORDER == __LITTLE_ENDIAN
+/* A possible question is how fast a byteswap is on big endian
+   architectures.  If it can be done within a cycle it might be
+   profitable to emulate little endian there by overriding LOAD and LOADU.  */
+
+static __always_inline
+vector_int
+contains_zero (vector_int s)
+{
+  return (s - ones) & ~s & high_bits;
+}
+# else
+#  define contains_zero contains_zero_nocarry
+# endif
+
+static __always_inline
+vector_int
+contains_zero_nocarry (vector_int s)
+{
+  return (((s | high_bits) - ones) ^ high_bits) & ~s & high_bits;
+}
+#endif
+
+#ifndef CUSTOM_BYTES_EQUAL
+static __always_inline
+vector_int
+bytes_equal (vector_int x, vector_int y)
+{
+  return contains_zero (x ^ y);
+}
+
+static __always_inline
+vector_int
+bytes_equal_nocarry (vector_int x, vector_int y)
+{
+  return contains_zero_nocarry (x ^ y);
+}
+#endif
+
+/*
+  When you have hardware ctz/clz it is probably the best bet.  However
+  for software emulation you can do better than the generic routine, as
+  you don't need to consider each bit, just the highest bit of each byte,
+  which can be calculated more effectively.
+ */
+#ifndef CUSTOM_FIRST_NONZERO_BYTE
+static __always_inline
+size_t
+first_nonzero_byte (vector_int u)
+{
+# if __BYTE_ORDER == __BIG_ENDIAN
+#  ifdef NEED_BITWISE
+  u = u | (u >> 1);
+  u = u | (u >> 2);
+  u = u | (u >> 4);
+#  else
+  u = u >> 7;
+#  endif
+  u = u | (u >> 8);
+  u = u | (u >> 16);
+  u = u | (u >> 32);
+#  ifdef NEED_BITWISE
+  u = u & ones;
+#  endif
+  u = u * ones;
+  return 8 - (u >> (8 * LSIZE - 8));
+
+# else
+  /* Note that this works also in bitwise case.  */
+  u = u ^ (u - 1);
+  u = u & ones;
+  u = u * ones;
+  return (u >> (8 * LSIZE - 8)) - 1;
+# endif
+}
+#endif
diff --git a/sysdeps/generic/string_vector_skeleton.h b/sysdeps/generic/string_vector_skeleton.h
new file mode 100644
index 0000000..5f9bfeb
--- /dev/null
+++ b/sysdeps/generic/string_vector_skeleton.h
@@ -0,0 +1,239 @@ 
+/* Skeleton of generic string functions.
+   Copyright (C) 1991-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <string.h>
+#include <libc-internal.h>
+#include <stdint.h>
+
+#ifndef BOUND
+# define BOUND(x) 0
+#endif
+
+/* On big endian a positive could cause a false positive in the previous byte.  */
+
+#if __BYTE_ORDER == __BIG_ENDIAN
+# undef EXPRESSION
+# define EXPRESSION(x, y) EXPRESSION_NOCARRY (x, y)
+#endif
+
+#ifdef CUSTOM_CMASK
+# define CMASK_PARAM_MASK CMASK_PARAM
+#else
+# define CMASK_PARAM int c_in
+# define CMASK_PARAM_MASK vector_int cmask
+#endif
+
+#ifdef LOOKAHEAD
+# define BYTE(n) \
+  (SHIFT_BYTES (SHIFT_BYTES (previous, LSIZE - (LOOKAHEAD - n) - 1), 1)  \
+   | SHIFT_BYTES_UP (v, LOOKAHEAD - n))
+#endif
+
+#ifdef LOOKAHEAD
+# define SHIFT_LOOKAHEAD(x) FORGET_BYTES (x, LOOKAHEAD)
+# define ADJUST LOOKAHEAD
+#else
+# define SHIFT_LOOKAHEAD(x) x
+# define ADJUST 0
+#endif
+
+static __always_inline
+char *
+string_skeleton (const char *s_in, CMASK_PARAM, char *end)
+{
+  vector_int mask;
+  char *s = (char *) s_in;
+  vector_int v = 0;
+  vector_int __attribute__ ((unused)) previous = 0;
+#ifndef CUSTOM_CMASK
+  unsigned char c = (unsigned char) c_in;
+  vector_int __attribute__ ((unused)) cmask = c * ones;
+#endif
+
+  /* We fetch 32 bytes while not crossing a page boundary.
+     Most strings in practice are of that size and we avoid a loop.
+     This looks best in practice; the alternative below uses an aligned
+     load but is slower when the string starts just a few bytes before
+     a 32 byte boundary.  A tradeoff is that we could occasionally fetch
+     an extra cache line without needing it, but this optimization
+     does pay for that.  */
+
+  if (!CROSS_PAGE (s, 32) && UNALIGNED_HEADER_UNROLL > 0)
+    {
+#define UNALIGNED_HEADER_CHECK(i) \
+  if (UNALIGNED_HEADER_UNROLL >= i)                                      \
+    {                                                                    \
+      previous = v;                                                      \
+      v = LOADU (s + (i - 1) * LSIZE);                                   \
+      mask = EXPRESSION (v, cmask);                                      \
+      if (i == 1)                                                        \
+	mask = SHIFT_LOOKAHEAD (mask);                                   \
+      if (mask)                                                          \
+	return s + (i - 1) * LSIZE + first_nonzero_byte (mask) - ADJUST; \
+    }
+
+      UNALIGNED_HEADER_CHECK (1);
+      UNALIGNED_HEADER_CHECK (2);
+      UNALIGNED_HEADER_CHECK (3);
+      UNALIGNED_HEADER_CHECK (4);
+      UNALIGNED_HEADER_CHECK (5);
+      UNALIGNED_HEADER_CHECK (6);
+      UNALIGNED_HEADER_CHECK (7);
+      UNALIGNED_HEADER_CHECK (8);
+
+      if (BOUND (s + LSIZE * UNALIGNED_HEADER_UNROLL))
+	return NULL;
+    }
+  else
+    {
+      /* We need to use aligned loads.  For the first load we read some bytes
+         before the start that we discard by shifting them down.  */
+
+      char *s_aligned = PTR_ALIGN_DOWN (s, LSIZE);
+      v = LOAD (s_aligned);
+      /* We need to be careful here as bytes before the start can corrupt it.  */
+      mask = SHIFT_BYTES ((EXPRESSION_NOCARRY (v, cmask)), s - s_aligned);
+      mask = SHIFT_LOOKAHEAD (mask);
+      if (mask)
+	return s + first_nonzero_byte (mask) - ADJUST;
+
+      /* When the lookahead is i we need to ignore matches in the first i
+         bytes as false positives.  For a lookahead of 2 or more they could
+         cross a word boundary and we need to mask these.  */
+
+#if defined (LOOKAHEAD) && LOOKAHEAD > 1
+# define FIX_LOOKAHEAD(i) \
+  if (i == 2 && s + LOOKAHEAD > s_aligned + LSIZE)                     \
+    mask = FORGET_BYTES (mask, s + LOOKAHEAD - (s_aligned + LSIZE));
+#else
+# define FIX_LOOKAHEAD(i)
+#endif
+
+      /* We need to check for crossing the end of the string after each
+         iteration, as each one could cross a page boundary.  Note that we
+         don't have to check after the last iteration, as that would
+         duplicate the check in the loop.  */
+
+#define ALIGNED_HEADER_CHECK(i) \
+  if (ALIGNED_HEADER_UNROLL >= i)                                       \
+    {                                                                   \
+      if (BOUND (s_aligned + (i - 1) * LSIZE))                          \
+	return NULL;                                                     \
+      previous = v;                                                     \
+      v = LOAD (s_aligned + (i - 1) * LSIZE);                           \
+      mask = EXPRESSION (v, cmask);                                     \
+      FIX_LOOKAHEAD (i);                                                \
+      if (mask)                                                         \
+	return s_aligned + (i - 1) * LSIZE + first_nonzero_byte (mask)  \
+	       - ADJUST;                                                \
+    }
+      ALIGNED_HEADER_CHECK (2);
+      ALIGNED_HEADER_CHECK (3);
+      ALIGNED_HEADER_CHECK (4);
+      ALIGNED_HEADER_CHECK (5);
+      ALIGNED_HEADER_CHECK (6);
+      ALIGNED_HEADER_CHECK (7);
+      ALIGNED_HEADER_CHECK (8);
+    }
+
+  /* Now we have read enough bytes to start the loop, assuming the following:  */
+
+  assert (UNALIGNED_HEADER_UNROLL <= 8);
+  assert (ALIGNED_HEADER_UNROLL <= 8);
+
+  assert (LOOP_UNROLL <= ALIGNED_HEADER_UNROLL);
+  assert (LOOP_UNROLL <= UNALIGNED_HEADER_UNROLL
+	  || UNALIGNED_HEADER_UNROLL == 0);
+
+  char *s_loop = PTR_ALIGN_DOWN (s, LOOP_UNROLL * LSIZE);
+
+#ifdef LOOKAHEAD
+  v = LOAD (s_loop + (LOOP_UNROLL - 1) * LSIZE);
+#endif
+
+  while (!BOUND (s_loop + LOOP_UNROLL * LSIZE))
+    {
+      s_loop += LOOP_UNROLL * LSIZE;
+      vector_int masks[9];
+      masks[0] = 0;
+
+#define MERGE_MASK(i) \
+  if (LOOP_UNROLL >= i)                                                 \
+    {                                                                   \
+      previous = v;                                                     \
+      v = LOAD (s_loop + (i - 1) * LSIZE);                              \
+      masks[i] = masks[i - 1] | EXPRESSION (v, cmask);                  \
+    }
+
+      MERGE_MASK (1);
+      MERGE_MASK (2);
+      MERGE_MASK (3);
+      MERGE_MASK (4);
+      MERGE_MASK (5);
+      MERGE_MASK (6);
+      MERGE_MASK (7);
+      MERGE_MASK (8);
+
+      if (masks[LOOP_UNROLL])
+	{
+
+          /* Here we have two possibilities depending on register pressure.
+             When you don't have enough registers, recalculating the
+             results creates the fastest loop for large inputs.  However
+             it is likely affected by a gcc bug where gcc tries to save
+             intermediate results, causing a spill in each iteration to
+             speed up the final iteration a bit.  To avoid that we need a
+             compiler barrier here.  */
+
+#ifdef RECALCULATE_ON_TAIL
+          asm volatile ("" : : : "memory");
+# define CHECK_MASK(i) \
+      previous = v;                                                     \
+      v = LOAD (s_loop + (i - 1) * LSIZE);                              \
+      mask = EXPRESSION (v, cmask);                                     \
+      if (mask)                                                         \
+	return s_loop + (i - 1) * LSIZE + first_nonzero_byte (mask)     \
+	       - ADJUST;                                            
+
+# else
+          /* On the other hand, when you have enough free registers you can
+             save the intermediate ors.  When you check the masks, since to
+             reach each one the previous ones must be zero, the or of the
+             previous masks will be exactly the current one.  */
+
+# define CHECK_MASK(i) \
+  if (masks[i])                                                              \
+    return s_loop + (i - 1) * LSIZE + first_nonzero_byte (masks[i]) - ADJUST;
+# endif
+
+#ifdef LOOKAHEAD
+	  v = LOAD (s_loop - LSIZE);
+#endif
+	  CHECK_MASK (1);
+	  CHECK_MASK (2);
+	  CHECK_MASK (3);
+	  CHECK_MASK (4);
+	  CHECK_MASK (5);
+	  CHECK_MASK (6);
+	  CHECK_MASK (7);
+	  CHECK_MASK (8);
+	}
+    }
+  return NULL;
+}