
[libcpp]: Use asm flag outputs in search_line_sse42 main loop

Message ID CAFULd4bHyQQvSbocbs2ReW+7YonKq4ocJj_AmCwNQx9QT07V7w@mail.gmail.com
State New

Commit Message

Uros Bizjak June 29, 2015, 7:07 p.m. UTC
Hello!

The attached patch introduces asm flag outputs in the search_line_sse42 main
loop to handle the carry flag value from the pcmpestri insn. The slightly
improved old code that uses the asm loop compiles to:

      96:    66 0f 6f 05 00 00 00     movdqa 0x0(%rip),%xmm0
      9d:    00
      9e:    48 83 ef 10              sub    $0x10,%rdi
      a2:    ba 10 00 00 00           mov    $0x10,%edx
      a7:    b8 04 00 00 00           mov    $0x4,%eax
      ac:    0f 1f 40 00              nopl   0x0(%rax)
      b0:    48 83 c7 10              add    $0x10,%rdi
      b4:    66 0f 3a 61 07 00        pcmpestri $0x0,(%rdi),%xmm0
      ba:    73 f4                    jae    b0 <_ZL17search_line_sse42PKhS0_+0x20>
      bc:    48 8d 04 0f              lea    (%rdi,%rcx,1),%rax
      c0:    c3                       retq

and new code results in:

      96:    66 0f 6f 05 00 00 00     movdqa 0x0(%rip),%xmm0
      9d:    00
      9e:    ba 10 00 00 00           mov    $0x10,%edx
      a3:    b8 04 00 00 00           mov    $0x4,%eax
      a8:    66 0f 3a 61 07 00        pcmpestri $0x0,(%rdi),%xmm0
      ae:    72 0c                    jb     bc <_ZL17search_line_sse42PKhS0_+0x2c>
      b0:    48 83 c7 10              add    $0x10,%rdi
      b4:    66 0f 3a 61 07 00        pcmpestri $0x0,(%rdi),%xmm0
      ba:    73 f4                    jae    b0 <_ZL17search_line_sse42PKhS0_+0x20>
      bc:    48 8d 04 0f              lea    (%rdi,%rcx,1),%rax
      c0:    c3                       retq

which looks like an improvement to me.
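
(For readers new to the feature, here is a minimal standalone sketch of the
flag-output syntax, not part of the patch: a "=@ccc" output binds a variable
to the carry flag left behind by the asm, so no setc/test sequence is needed.
The function and its name are purely illustrative.)

#ifdef __GCC_ASM_FLAG_OUTPUTS__
static int
add_carries (unsigned int a, unsigned int b)
{
  unsigned int sum;
  char carry;

  /* "=@ccc" asks the compiler to materialize the carry flag set by
     the add; no explicit setc or flag test is emitted.  */
  __asm ("add %2, %0" : "=r"(sum), "=@ccc"(carry) : "r"(b), "0"(a));
  return carry;
}
#endif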

2015-06-29  Uros Bizjak  <ubizjak@gmail.com>

    * lex.c (search_line_sse42) [__GCC_ASM_FLAG_OUTPUTS__]: New main
    loop using asm flag outputs.

The patch was bootstrapped and regression tested on x86_64-linux-gnu
{,m32} (ivybridge), so both code paths were exercised.

Since this is a new feature, does the approach look OK?

Uros.

Comments

Richard Henderson June 30, 2015, 6:11 a.m. UTC | #1
On 06/29/2015 08:07 PM, Uros Bizjak wrote:
> Index: lex.c
> ===================================================================
> --- lex.c	(revision 225138)
> +++ lex.c	(working copy)
> @@ -450,15 +450,30 @@ search_line_sse42 (const uchar *s, const uchar *en
>         s = (const uchar *)((si + 16) & -16);
>       }
>
> -  /* Main loop, processing 16 bytes at a time.  By doing the whole loop
> -     in inline assembly, we can make proper use of the flags set.  */
> -  __asm (      "sub $16, %1\n"
> -	"	.balign 16\n"
> +  /* Main loop, processing 16 bytes at a time.  */
> +#ifdef __GCC_ASM_FLAG_OUTPUTS__
> +  while (1)
> +    {
> +      char f;
> +      __asm ("%vpcmpestri\t$0, %2, %3"
> +	     : "=c"(index), "=@ccc"(f)
> +	     : "m"(*s), "x"(search), "a"(4), "d"(16));
> +      if (f)
> +	break;
> +
> +      s += 16;
> +    }

This change looks good.  Modulo keeping a comment mentioning why we can't use 
the builtin.

> +#else
> +  s -= 16;
> +  /* By doing the whole loop in inline assembly,
> +     we can make proper use of the flags set.  */
> +  __asm (      ".balign 16\n"
>   	"0:	add $16, %1\n"
> -	"	%vpcmpestri $0, (%1), %2\n"
> +	"	%vpcmpestri\t$0, (%1), %2\n"
>   	"	jnc 0b"
>   	: "=&c"(index), "+r"(s)
>   	: "x"(search), "a"(4), "d"(16));
> +#endif

I do wonder about keeping this bit around.  Surely we only really care about 
the performance of search_line after a full bootstrap, at which point we've got 
the new path.

I think maybe better to adjust the #ifdef HAVE_SSE4 line above to include the 
G_A_F_O check.
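
I.e. a guard along the lines of (just a sketch):

#if defined(HAVE_SSE4) && defined(__GCC_ASM_FLAG_OUTPUTS__)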


r~
Uros Bizjak June 30, 2015, 6:46 a.m. UTC | #2
On Tue, Jun 30, 2015 at 8:11 AM, Richard Henderson <rth@redhat.com> wrote:
> On 06/29/2015 08:07 PM, Uros Bizjak wrote:
>>
>> Index: lex.c
>> ===================================================================
>> --- lex.c       (revision 225138)
>> +++ lex.c       (working copy)
>> @@ -450,15 +450,30 @@ search_line_sse42 (const uchar *s, const uchar *en
>>         s = (const uchar *)((si + 16) & -16);
>>       }
>>
>> -  /* Main loop, processing 16 bytes at a time.  By doing the whole loop
>> -     in inline assembly, we can make proper use of the flags set.  */
>> -  __asm (      "sub $16, %1\n"
>> -       "       .balign 16\n"
>> +  /* Main loop, processing 16 bytes at a time.  */
>> +#ifdef __GCC_ASM_FLAG_OUTPUTS__
>> +  while (1)
>> +    {
>> +      char f;
>> +      __asm ("%vpcmpestri\t$0, %2, %3"
>> +            : "=c"(index), "=@ccc"(f)
>> +            : "m"(*s), "x"(search), "a"(4), "d"(16));
>> +      if (f)
>> +       break;
>> +
>> +      s += 16;
>> +    }
>
>
> This change looks good.  Modulo keeping a comment mentioning why we can't
> use the builtin.

OK, I'll say something like "By using inline assembly instead of the
builtin, we can use the result, as well as the flags set." here,
similar to the moved comment.
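
(As an aside, a purely illustrative sketch of why the builtins fall short:
the index and the carry flag come from two separate intrinsics,
_mm_cmpestri and _mm_cmpestrc, and nothing guarantees the two calls combine
into a single pcmpestri.  The function name and loop shape below are mine,
not part of the patch.)

#include <smmintrin.h>

/* Hypothetical intrinsics version, for contrast only; compile with
   -msse4.2.  Each of _mm_cmpestrc and _mm_cmpestri may expand to its
   own pcmpestri, so the comparison can run twice per iteration.  Like
   the asm loop, it relies on a guaranteed match (sentinel).  */
static const unsigned char *
search_line_sse42_intrin (const unsigned char *s)
{
  const __m128i search
    = _mm_setr_epi8 ('\n', '\r', '?', '\\',
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);

  for (;; s += 16)
    {
      __m128i data = _mm_loadu_si128 ((const __m128i *) s);
      if (_mm_cmpestrc (search, 4, data, 16, 0))  /* carry = match */
        return s + _mm_cmpestri (search, 4, data, 16, 0);
    }
}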

>> +#else
>> +  s -= 16;
>> +  /* By doing the whole loop in inline assembly,
>> +     we can make proper use of the flags set.  */
>> +  __asm (      ".balign 16\n"
>>         "0:     add $16, %1\n"
>> -       "       %vpcmpestri $0, (%1), %2\n"
>> +       "       %vpcmpestri\t$0, (%1), %2\n"
>>         "       jnc 0b"
>>         : "=&c"(index), "+r"(s)
>>         : "x"(search), "a"(4), "d"(16));
>> +#endif
>
>
> I do wonder about keeping this bit around.  Surely we only really care about
> the performance of search_line after a full bootstrap, at which point we've
> got the new path.

According to [1], " ... the instructions for building libgccjit
recommend --disable-bootstrap, ...". IMO, we could leave this part in
for now; it is not that much of a maintenance burden.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66593#c0

Thanks,
Uros.
Richard Henderson June 30, 2015, 6:55 a.m. UTC | #3
On 06/30/2015 07:46 AM, Uros Bizjak wrote:
> According to [1], " ... the instructions for building libgccjit
> recommend --disable-bootstrap, ...". IMO, we could leave this part for
> now, it is not that much of a maintenance burden.
>
> [1]https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66593#c0

Fair enough.


r~
Ondřej Bílka July 2, 2015, 8:50 a.m. UTC | #4
On Mon, Jun 29, 2015 at 09:07:22PM +0200, Uros Bizjak wrote:
> Hello!
> 
> Attached patch introduces asm flag outputs in seach_line_sse42 main
> loop to handle carry flag value from pcmpestri insn. Slightly improved
> old code that uses asm loop compiles to:
>
Using SSE4.2 here is a bit dubious, as pcmpestri has horrible latency, and
checking four characters is near the boundary where replacing it with an
SSE2 sequence is faster.

So I looked closer and wrote a program that counts the number of lines in a source file.

I found that there is almost no difference between the SSE2 code, the
SSE4.2 code, and just calling strpbrk.

But there are significant performance mistakes in the SSE2 code. The first
one concerns this comment:

  /* Create a mask for the bytes that are valid within the first
     16-byte block.  The Idea here is that the AND with the mask
     within the loop is "free", since we need some AND or TEST
     insn in order to set the flags for the branch anyway.  */

The claim that the AND is "free" is false, as gcc repeats setting the mask
(to all-ones) in each iteration instead of doing it only on the first.

Then there is the problem that jumping directly into the loop is a bad idea
due to branch misprediction. It is better to use a loop header when the
loop is likely to end in the first iteration.

The worst problem is that using an aligned load plus masking makes the loop
unpredictable: depending on the alignment, the first iteration may check as
little as one byte.

The correct approach here is to check whether we would cross a page
boundary and, if not, use an unaligned load. That always checks 16 bytes,
instead of 8 on average when the alignment is completely random.
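
(A minimal sketch of that page-boundary test, assuming 4096-byte pages as
the existing sse4.2 code below also does; the helper name is mine:)

#include <stdint.h>

/* A 16-byte unaligned load starting at S is safe as long as it does
   not extend into the next page (4096-byte pages assumed).  */
static inline int
load16_stays_on_page (const unsigned char *s)
{
  return ((uintptr_t) s & 0xfff) <= 0xff0;
}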

That improved the SSE2 code to be around 5% faster than the SSE4.2 code.

A second optimization is that most lines are less than 80 characters long,
so don't bother with the loop; just do the checks in the header. That gives
another 5%.

The benchmark is a bit ugly; usage is

./benchmark file function repeat

where you need to supply a source file (file) that will be scanned repeat
times. The functions tested are the following:

./benchmark foo.c 1 100000 # strpbrk
./benchmark foo.c 2 100000 # current sse2
./benchmark foo.c 3 100000 # current sse4.2
./benchmark foo.c 4 100000 # improved sse2 with unaligned check of the first 16 bytes
./benchmark foo.c 5 100000 # improved sse2 with an unrolled header covering the first 80 bytes


I will send a patch later; do you have comments about these improvements?
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

char *next_line(char *x, char *y)
{
  return strpbrk(x,"\r\n?\\");
}

#include <emmintrin.h>
#define __v16qi v16qi
#define uchar unsigned char


/* Replicated character data to be shared between implementations.
   Recall that outside of a context with vector support we can't
   define compatible vector types, therefore these are all defined
   in terms of raw characters.  */
static const char repl_chars[4][16] __attribute__((aligned(16))) = {
  { '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n' },
  { '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r' },
  { '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\' },
  { '?', '?', '?', '?', '?', '?', '?', '?',
    '?', '?', '?', '?', '?', '?', '?', '?' },
};

/* A version of the fast scanner using SSE2 vectorized byte compare insns.  */

static const uchar *
#ifndef __SSE2__
__attribute__((__target__("sse2")))
#endif
search_line_sse2 (const uchar *s, const uchar *end )
{
  typedef char v16qi __attribute__ ((__vector_size__ (16)));

  const v16qi repl_nl = *(const v16qi *)repl_chars[0];
  const v16qi repl_cr = *(const v16qi *)repl_chars[1];
  const v16qi repl_bs = *(const v16qi *)repl_chars[2];
  const v16qi repl_qm = *(const v16qi *)repl_chars[3];

  unsigned int misalign, found, mask;
  const v16qi *p;
  v16qi data, t;

  /* Align the source pointer.  */
  misalign = (uintptr_t)s & 15;
  p = (const v16qi *)((uintptr_t)s & -16);
  data = *p;

  /* Create a mask for the bytes that are valid within the first
     16-byte block.  The Idea here is that the AND with the mask
     within the loop is "free", since we need some AND or TEST
     insn in order to set the flags for the branch anyway.  */
  mask = -1u << misalign;

  /* Main loop processing 16 bytes at a time.  */
  goto start;
  do
    {
      data = *++p;
      mask = -1;

    start:
      t  = __builtin_ia32_pcmpeqb128(data, repl_nl);
      t |= __builtin_ia32_pcmpeqb128(data, repl_cr);
      t |= __builtin_ia32_pcmpeqb128(data, repl_bs);
      t |= __builtin_ia32_pcmpeqb128(data, repl_qm);
      found = __builtin_ia32_pmovmskb128 (t);
      found &= mask;
    }
  while (!found);

  /* FOUND contains 1 in bits for which we matched a relevant
     character.  Conversion to the byte index is trivial.  */
  found = __builtin_ctz(found);
  return (const uchar *)p + found;
}
#define OR(x,y) ((x)|(y))
static const uchar *
#ifndef __SSE2__
__attribute__((__target__("sse2")))
#endif
search_line_sse2v3 (const uchar *s, const uchar *end )
{
  typedef char v16qi __attribute__ ((__vector_size__ (16)));

  const v16qi repl_nl = *(const v16qi *)repl_chars[0];
  const v16qi repl_cr = *(const v16qi *)repl_chars[1];
  const v16qi repl_bs = *(const v16qi *)repl_chars[2];
  const v16qi repl_qm = *(const v16qi *)repl_chars[3];

  unsigned long misalign, found, mask;
  const v16qi *p;
  v16qi data, t;
 
  if (s + 96 < end)
    {
      v16qi x0 = (v16qi) _mm_loadu_si128((__m128i *) s);
      v16qi tx;
      tx =  __builtin_ia32_pcmpeqb128(x0, repl_nl);
      tx =  OR(__builtin_ia32_pcmpeqb128(x0, repl_cr),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x0, repl_bs),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x0, repl_qm),tx);

      found =  __builtin_ia32_pmovmskb128 (tx);
      if (found)
      {
      found = __builtin_ctz(found);
      return (const uchar *)s + found;
      }
      v16qi x1=  (v16qi) _mm_loadu_si128((__m128i *) (s+16));
      v16qi x2=  (v16qi) _mm_loadu_si128((__m128i *) (s+32));
      v16qi x3=  (v16qi) _mm_loadu_si128((__m128i *) (s+48));
      v16qi x4=  (v16qi) _mm_loadu_si128((__m128i *) (s+64));

      tx =  __builtin_ia32_pcmpeqb128(x1, repl_nl);
      tx =  OR(__builtin_ia32_pcmpeqb128(x1, repl_cr),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x1, repl_bs),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x1, repl_qm),tx);

      found =  __builtin_ia32_pmovmskb128 (tx);


      if (found)
      {
      found = __builtin_ctzl(found);
      return (const uchar *)s + 16 + found;
      }

      tx =  __builtin_ia32_pcmpeqb128(x2, repl_nl);
      tx =  OR(__builtin_ia32_pcmpeqb128(x2, repl_cr),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x2, repl_bs),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x2, repl_qm),tx);

      found |=  ((unsigned long) __builtin_ia32_pmovmskb128 (tx))<<16;
      tx =  __builtin_ia32_pcmpeqb128(x3, repl_nl);

      if (found)
      {
      found = __builtin_ctzl(found);
      return (const uchar *)s + 16 + found;
      }


      tx =  OR(__builtin_ia32_pcmpeqb128(x3, repl_cr),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x3, repl_bs),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x3, repl_qm),tx);

      found |=  ((unsigned long) __builtin_ia32_pmovmskb128 (tx))<<32;

      if (found)
      {
      found = __builtin_ctzl(found);
      return (const uchar *)s + 16 + found;
      }

      tx =  __builtin_ia32_pcmpeqb128(x4, repl_nl);
      tx =  OR(__builtin_ia32_pcmpeqb128(x4, repl_cr),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x4, repl_bs),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x4, repl_qm),tx);

      found |=  ((unsigned long) __builtin_ia32_pmovmskb128 (tx))<<48;

      if (found)
      {
      found = __builtin_ctzl(found);
      return (const uchar *)s + 16 + found;
      }

    s += 80;
   }

  /* Align the source pointer.  */
  misalign = (uintptr_t)s & 15;
  p = (const v16qi *)((uintptr_t)s & -16);
  data = *p;

  /* Create a mask for the bytes that are valid within the first
     16-byte block.  The Idea here is that the AND with the mask
     within the loop is "free", since we need some AND or TEST
     insn in order to set the flags for the branch anyway.  */
  mask = -1u << misalign;

  /* Main loop processing 16 bytes at a time.  */
  goto start;
  do
    {
      data = *++p;
      mask = -1;

    start:
      t  = __builtin_ia32_pcmpeqb128(data, repl_nl);
      t |= __builtin_ia32_pcmpeqb128(data, repl_cr);
      t |= __builtin_ia32_pcmpeqb128(data, repl_bs);
      t |= __builtin_ia32_pcmpeqb128(data, repl_qm);
      found = __builtin_ia32_pmovmskb128 (t);
      found &= mask;
    }
  while (!found);

  /* FOUND contains 1 in bits for which we matched a relevant
     character.  Conversion to the byte index is trivial.  */
  found = __builtin_ctz(found);
  return (const uchar *)p + found;
}



static const uchar *
#ifndef __SSE2__
__attribute__((__target__("sse2")))
#endif
search_line_sse2v2 (const uchar *s, const uchar *end )
{
  typedef char v16qi __attribute__ ((__vector_size__ (16)));

  const v16qi repl_nl = *(const v16qi *)repl_chars[0];
  const v16qi repl_cr = *(const v16qi *)repl_chars[1];
  const v16qi repl_bs = *(const v16qi *)repl_chars[2];
  const v16qi repl_qm = *(const v16qi *)repl_chars[3];

  unsigned long misalign, found, mask;
  const v16qi *p;
  v16qi data, t;
 
  if (s + 96 < end)
    {
      v16qi x0 = (v16qi) _mm_loadu_si128((__m128i *) s);
      v16qi tx;
      tx =  __builtin_ia32_pcmpeqb128(x0, repl_nl);
      tx =  OR(__builtin_ia32_pcmpeqb128(x0, repl_cr),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x0, repl_bs),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x0, repl_qm),tx);

      found =  __builtin_ia32_pmovmskb128 (tx);
      if (found)
      {
      found = __builtin_ctz(found);
      return (const uchar *)s + found;
      }

      s += 16;
      /* Variant 4 checks only the first 16 bytes unaligned; the goto
         skips the unrolled 80-byte block below, which is exercised by
         search_line_sse2v3 (variant 5) instead.  */
      goto next;
      v16qi x1=  (v16qi) _mm_loadu_si128((__m128i *) (s+16));
      v16qi x2=  (v16qi) _mm_loadu_si128((__m128i *) (s+32));
      v16qi x3=  (v16qi) _mm_loadu_si128((__m128i *) (s+48));
      v16qi x4=  (v16qi) _mm_loadu_si128((__m128i *) (s+64));

      tx =  __builtin_ia32_pcmpeqb128(x1, repl_nl);
      tx =  OR(__builtin_ia32_pcmpeqb128(x1, repl_cr),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x1, repl_bs),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x1, repl_qm),tx);

      found =  __builtin_ia32_pmovmskb128 (tx);


      if (found)
      {
      found = __builtin_ctzl(found);
      return (const uchar *)s + 16 + found;
      }

      tx =  __builtin_ia32_pcmpeqb128(x2, repl_nl);
      tx =  OR(__builtin_ia32_pcmpeqb128(x2, repl_cr),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x2, repl_bs),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x2, repl_qm),tx);

      found |=  ((unsigned long) __builtin_ia32_pmovmskb128 (tx))<<16;
      tx =  __builtin_ia32_pcmpeqb128(x3, repl_nl);

      if (found)
      {
      found = __builtin_ctzl(found);
      return (const uchar *)s + 16 + found;
      }


      tx =  OR(__builtin_ia32_pcmpeqb128(x3, repl_cr),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x3, repl_bs),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x3, repl_qm),tx);

      found |=  ((unsigned long) __builtin_ia32_pmovmskb128 (tx))<<32;

      if (found)
      {
      found = __builtin_ctzl(found);
      return (const uchar *)s + 16 + found;
      }

      tx =  __builtin_ia32_pcmpeqb128(x4, repl_nl);
      tx =  OR(__builtin_ia32_pcmpeqb128(x4, repl_cr),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x4, repl_bs),tx);
      tx =  OR(__builtin_ia32_pcmpeqb128(x4, repl_qm),tx);

      found |=  ((unsigned long) __builtin_ia32_pmovmskb128 (tx))<<48;

      if (found)
      {
      found = __builtin_ctzl(found);
      return (const uchar *)s + 16 + found;
      }

    s += 80;
   }
   next:
  /* Align the source pointer.  */
  misalign = (uintptr_t)s & 15;
  p = (const v16qi *)((uintptr_t)s & -16);
  data = *p;

  /* Create a mask for the bytes that are valid within the first
     16-byte block.  The Idea here is that the AND with the mask
     within the loop is "free", since we need some AND or TEST
     insn in order to set the flags for the branch anyway.  */
  mask = -1u << misalign;

  /* Main loop processing 16 bytes at a time.  */
  goto start;
  do
    {
      data = *++p;
      mask = -1;

    start:
      t  = __builtin_ia32_pcmpeqb128(data, repl_nl);
      t |= __builtin_ia32_pcmpeqb128(data, repl_cr);
      t |= __builtin_ia32_pcmpeqb128(data, repl_bs);
      t |= __builtin_ia32_pcmpeqb128(data, repl_qm);
      found = __builtin_ia32_pmovmskb128 (t);
      found &= mask;
    }
  while (!found);

  /* FOUND contains 1 in bits for which we matched a relevant
     character.  Conversion to the byte index is trivial.  */
  found = __builtin_ctz(found);
  return (const uchar *)p + found;
}

#ifdef HAVE_SSE4
/* A version of the fast scanner using SSE 4.2 vectorized string insns.  */

static const uchar *
#ifndef __SSE4_2__
__attribute__((__target__("sse4.2")))
#endif
search_line_sse42 (const uchar *s, const uchar *end)
{
  typedef char v16qi __attribute__ ((__vector_size__ (16)));
  static const v16qi search = { '\n', '\r', '?', '\\' };

  uintptr_t si = (uintptr_t)s;
  uintptr_t index;

  /* Check for unaligned input.  */
  if (si & 15)
    {
      if (__builtin_expect (end - s < 16, 0)
	  && __builtin_expect ((si & 0xfff) > 0xff0, 0))
	{
	  /* There are less than 16 bytes left in the buffer, and less
	     than 16 bytes left on the page.  Reading 16 bytes at this
	     point might generate a spurious page fault.  Defer to the
	     SSE2 implementation, which already handles alignment.  */
	  return search_line_sse2 (s, end);
	}

      /* ??? The builtin doesn't understand that the PCMPESTRI read from
	 memory need not be aligned.  */
      __asm ("%vpcmpestri $0, (%1), %2"
	     : "=c"(index) : "r"(s), "x"(search), "a"(4), "d"(16));
      if (__builtin_expect (index < 16, 0))
	goto found;

      /* Advance the pointer to an aligned address.  We will re-scan a
	 few bytes, but we no longer need care for reading past the
	 end of a page, since we're guaranteed a match.  */
      s = (const uchar *)((si + 16) & -16);
    }

  /* Main loop, processing 16 bytes at a time.  By doing the whole loop
     in inline assembly, we can make proper use of the flags set.  */
  __asm (      "sub $16, %1\n"
	"	.balign 16\n"
	"0:	add $16, %1\n"
	"	%vpcmpestri $0, (%1), %2\n"
	"	jnc 0b"
	: "=&c"(index), "+r"(s)
	: "x"(search), "a"(4), "d"(16));

 found:
  return s + index;
}

#else
/* Work around out-dated assemblers without sse4 support.  */
#define search_line_sse42 search_line_sse2
#endif




int line_count(char *start, char *end)
{
  int c = 0;
  while (start != end)
    {
      start = next_line(start+1,end);
      c++;  
    }
  return c;
}
int line_count2(char *start, char *end)
{
  int c = 0;
  while (start != end)
    {
      start = (char *) search_line_sse2(start+1,end);
      c++;  
    }
  return c;
}
int line_count3(char *start, char *end)
{
  int c = 0;
  while (start != end)
    {
      start = (char *) search_line_sse42(start+1,end);
      c++;  
    }
  return c;
}
int line_count4(char *start, char *end)
{
  int c = 0;
  while (start != end)
    {
      start = (char *) search_line_sse2v2(start+1,end);
      c++;
    }
  return c;
}
int line_count5(char *start, char *end)
{
  int c = 0;
  while (start != end)
    {
      start = (char *) search_line_sse2v3(start+1,end);
      c++;
    }
  return c;
}
int
main(int argc, char *argv[])
{
    char *addr;
    int fd;
    struct stat sb;

    if (argc != 4) {
        fprintf(stderr, "usage: %s file function repeat\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        handle_error("open");
    if (fstat(fd, &sb) == -1)           /* To obtain file size */
        handle_error("fstat");


    /* read() into a heap buffer instead of the original mmap: a private
       file mapping faults on the sentinel store below when the file size
       is an exact page multiple.  The 16 bytes of slack cover vectorized
       readers that may read slightly past the sentinel.  */
    addr = malloc(sb.st_size + 16);
    if (addr == NULL)
        handle_error("malloc");
    if (read(fd, addr, sb.st_size) != sb.st_size)
        handle_error("read");
    addr[sb.st_size] = '\n';
    int sum = 0;
    int i;
    if (atoi(argv[2]) == 1)
      for (i=0;i<atoi(argv[3]);i++)
        sum += line_count(addr, addr+sb.st_size);
    if (atoi(argv[2]) == 2)
      for (i=0;i<atoi(argv[3]);i++)
        sum += line_count2(addr, addr+sb.st_size);
    if (atoi(argv[2]) == 3)
      for (i=0;i<atoi(argv[3]);i++)
        sum += line_count3(addr, addr+sb.st_size);
    if (atoi(argv[2]) == 4)
      for (i=0;i<atoi(argv[3]);i++)
        sum += line_count4(addr, addr+sb.st_size);
    if (atoi(argv[2]) == 5)
      for (i=0;i<atoi(argv[3]);i++)
        sum += line_count5(addr, addr+sb.st_size);


    return sum;
}

Patch

Index: lex.c
===================================================================
--- lex.c	(revision 225138)
+++ lex.c	(working copy)
@@ -450,15 +450,30 @@  search_line_sse42 (const uchar *s, const uchar *en
       s = (const uchar *)((si + 16) & -16);
     }
 
-  /* Main loop, processing 16 bytes at a time.  By doing the whole loop
-     in inline assembly, we can make proper use of the flags set.  */
-  __asm (      "sub $16, %1\n"
-	"	.balign 16\n"
+  /* Main loop, processing 16 bytes at a time.  */
+#ifdef __GCC_ASM_FLAG_OUTPUTS__
+  while (1)
+    {
+      char f;
+      __asm ("%vpcmpestri\t$0, %2, %3"
+	     : "=c"(index), "=@ccc"(f)
+	     : "m"(*s), "x"(search), "a"(4), "d"(16));
+      if (f)
+	break;
+      
+      s += 16;
+    }
+#else
+  s -= 16;
+  /* By doing the whole loop in inline assembly,
+     we can make proper use of the flags set.  */
+  __asm (      ".balign 16\n"
 	"0:	add $16, %1\n"
-	"	%vpcmpestri $0, (%1), %2\n"
+	"	%vpcmpestri\t$0, (%1), %2\n"
 	"	jnc 0b"
 	: "=&c"(index), "+r"(s)
 	: "x"(search), "a"(4), "d"(16));
+#endif
 
  found:
   return s + index;