Improve rotation by mode bitsize - 1

Message ID	20130509184556.GI1377@tucnak.redhat.com
State	New
Headers	show Return-Path: <gcc-patches-return-341310-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:cc:subject:message-id:reply-to:mime-version :content-type; q=dns; s=default; b=NxqT1WqcLZPGQuWi8hVB342c9ual+ zcC5KLNhC4Wi8AsqhBpCICTFDCUErfDgAfM5nx1VQnqM/cYudwv1dtrx7saUt47v E0G4aixTTH5prHkDcnazFYkU9UpPIyWHUt21YItSBdhgCjUmsOIisrcGnzdhLAw3 OaAshF1DOBWhFI= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org Date: Thu, 9 May 2013 20:45:56 +0200 From: Jakub Jelinek <jakub@redhat.com> To: Uros Bizjak <ubizjak@gmail.com> Cc: gcc-patches@gcc.gnu.org Subject: [PATCH] Improve rotation by mode bitsize - 1 Message-ID: <20130509184556.GI1377@tucnak.redhat.com> Reply-To: Jakub Jelinek <jakub@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15)

Jakub Jelinek May 9, 2013, 6:45 p.m. UTC

Hi!

This is something I've noticed while working on the rotate recognizer
patch I've just posted.  We emit say
  roll %eax
instead of
  roll $1, %eax
because the former is shorter, but emit
  roll $31, %eax
instead of the equivalent, but shorter
  rorl %eax
The following patch let us optimize even those.  Bootstrapped/regtested
on x86_64-linux and i686-linux, ok for trunk?

2013-05-09  Jakub Jelinek  <jakub@redhat.com>

	* config/i386/i386.md (rotateinv): New code attr.
	(*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
	*<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
	roll $31, %eax, etc.


	Jakub

Steven Bosscher May 9, 2013, 6:50 p.m. UTC | #1

On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek wrote:
> Hi!
>
> This is something I've noticed while working on the rotate recognizer
> patch I've just posted.  We emit say
>   roll %eax
> instead of
>   roll $1, %eax
> because the former is shorter, but emit
>   roll $31, %eax
> instead of the equivalent, but shorter
>   rorl %eax

Wouldn't this be better done as one or more peephole2s?

Ciao!
Steven

Jakub Jelinek May 9, 2013, 6:54 p.m. UTC | #2

On Thu, May 09, 2013 at 08:50:59PM +0200, Steven Bosscher wrote:
> On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek wrote:
> > This is something I've noticed while working on the rotate recognizer
> > patch I've just posted.  We emit say
> >   roll %eax
> > instead of
> >   roll $1, %eax
> > because the former is shorter, but emit
> >   roll $31, %eax
> > instead of the equivalent, but shorter
> >   rorl %eax
> 
> Wouldn't this be better done as one or more peephole2s?

Given that the output routine is in all cases already C code,
I think peephole2s would be only slower and more code.

	Jakub

Uros Bizjak May 9, 2013, 7:47 p.m. UTC | #3

On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek <jakub@redhat.com> wrote:

> This is something I've noticed while working on the rotate recognizer
> patch I've just posted.  We emit say
>   roll %eax
> instead of
>   roll $1, %eax
> because the former is shorter, but emit
>   roll $31, %eax
> instead of the equivalent, but shorter
>   rorl %eax
> The following patch let us optimize even those.  Bootstrapped/regtested
> on x86_64-linux and i686-linux, ok for trunk?
>
> 2013-05-09  Jakub Jelinek  <jakub@redhat.com>
>
>         * config/i386/i386.md (rotateinv): New code attr.
>         (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
>         *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
>         roll $31, %eax, etc.

OK.

Thanks,
Uros.

Jakub Jelinek May 9, 2013, 8:24 p.m. UTC | #4

On Thu, May 09, 2013 at 09:47:49PM +0200, Uros Bizjak wrote:
> On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> > 2013-05-09  Jakub Jelinek  <jakub@redhat.com>
> >
> >         * config/i386/i386.md (rotateinv): New code attr.
> >         (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
> >         *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
> >         roll $31, %eax, etc.
> 
> OK.

Thanks.  While talking about rotates, seems they are quite expensive
on SandyBridge if the destination (== source1) is memory.

With -O3 -mavx:

unsigned int a[1024] = { 0, 1, 2, 3, 4, 5, 6 };

__attribute__((noinline, noclone)) void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      int j = i & 31;
      a[i] = (a[i] << j) ^ (a[i] >> ((-j) & 31));
    }
}

int
main ()
{
  int i;
  for (i = 0; i < 1000000; i++)
    foo ();
  return 0;
}

the timing for vanilla gcc, gcc with the rotate recognizer patch and
the latter hand editted to use rotate on register gives:
Intel i7-2600	0m1.421s	0m1.914s	0m0.826s
AMD 8354 	0m2.353s	0m0.988s	0m1.407s

The vanilla gcc loop is long:
.L2:
	movl	a(,%rax,4), %edi
	movl	%eax, %esi
	andl	$31, %esi
	movl	%esi, %ecx
	negl	%ecx
	movl	%edi, %edx
	shrl	%cl, %edx
	movl	%esi, %ecx
	sall	%cl, %edi
	xorl	%edi, %edx
	movl	%edx, a(,%rax,4)
	addq	$1, %rax
	cmpq	$1024, %rax
	jne	.L2
With rotate using mem:
.L2:
	roll	%cl, a(,%rcx,4)
	addq	$1, %rcx
	cmpq	$1024, %rcx
	jne	.L2
and with the hand tweaked rotate:
.L2:
	movl	a(,%rcx,4), %eax
	roll	%cl, %eax
	addq	$1, %rcx
	cmpq	$1024, %rcx
	movl	%eax, a-4(,%rcx,4)
	jne	.L2

So, on SandyBridge (haven't tried other Intel CPUs) it perhaps might be
beneficial to disallow rotate with =m , 0 constraints, while for the above
mentioned AMD it is best to use it.  Guess the above isn't sufficient data
for that, we'd need info on more CPUs and more rotation sizes.

BTW, with -mavx2 we vectorize the loop without the rotate recognition patch
and don't vectorize anymore, I'll need to adjust the pattern recognizer to
handle those.  The question is if for missing vector rotate optab
we can replace it with the probably faster and shorter
x r<< y  -> (x << y) | (x >> (32 - y)) - which is invalid for y = 0,
but perhaps if the only thing the vector shift would do for shift by 32
would be either return 0, or x, it would still work, or if we want to go
for the (x << y) | (x >> ((-y) & 31)).  The i?86/x86_64 vector shifts don't
truncate the shift count, so perhaps best would be a vector rotate expander
that expands it into some some special insn like vectorized y ? x >> (32 - y) : 0

One question is whether our LROTATE_EXPR or RROTATE_EXPR is only defined for
rotation counts 0 through bitsize - 1 (as is the case when we pattern detect
it, do we create *ROTATE_EXPR other way?), or whether we allow arbitrary
rotation counts (would assume the rotation be done always with modulo
bitsize).  Because if the rotation count could be arbitrary, we'd need
to vectorize it as (x << (y & 31)) | (x >> ((-y) & 31))

	Jakub

Richard Biener May 10, 2013, 9:07 a.m. UTC | #5

On Thu, May 9, 2013 at 10:24 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, May 09, 2013 at 09:47:49PM +0200, Uros Bizjak wrote:
>> On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>> > 2013-05-09  Jakub Jelinek  <jakub@redhat.com>
>> >
>> >         * config/i386/i386.md (rotateinv): New code attr.
>> >         (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
>> >         *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
>> >         roll $31, %eax, etc.
>>
>> OK.
>
> Thanks.  While talking about rotates, seems they are quite expensive
> on SandyBridge if the destination (== source1) is memory.
>
> With -O3 -mavx:
>
> unsigned int a[1024] = { 0, 1, 2, 3, 4, 5, 6 };
>
> __attribute__((noinline, noclone)) void
> foo (void)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     {
>       int j = i & 31;
>       a[i] = (a[i] << j) ^ (a[i] >> ((-j) & 31));
>     }
> }
>
> int
> main ()
> {
>   int i;
>   for (i = 0; i < 1000000; i++)
>     foo ();
>   return 0;
> }
>
> the timing for vanilla gcc, gcc with the rotate recognizer patch and
> the latter hand editted to use rotate on register gives:
> Intel i7-2600   0m1.421s        0m1.914s        0m0.826s
> AMD 8354        0m2.353s        0m0.988s        0m1.407s
>
> The vanilla gcc loop is long:
> .L2:
>         movl    a(,%rax,4), %edi
>         movl    %eax, %esi
>         andl    $31, %esi
>         movl    %esi, %ecx
>         negl    %ecx
>         movl    %edi, %edx
>         shrl    %cl, %edx
>         movl    %esi, %ecx
>         sall    %cl, %edi
>         xorl    %edi, %edx
>         movl    %edx, a(,%rax,4)
>         addq    $1, %rax
>         cmpq    $1024, %rax
>         jne     .L2
> With rotate using mem:
> .L2:
>         roll    %cl, a(,%rcx,4)
>         addq    $1, %rcx
>         cmpq    $1024, %rcx
>         jne     .L2
> and with the hand tweaked rotate:
> .L2:
>         movl    a(,%rcx,4), %eax
>         roll    %cl, %eax
>         addq    $1, %rcx
>         cmpq    $1024, %rcx
>         movl    %eax, a-4(,%rcx,4)
>         jne     .L2
>
> So, on SandyBridge (haven't tried other Intel CPUs) it perhaps might be
> beneficial to disallow rotate with =m , 0 constraints, while for the above
> mentioned AMD it is best to use it.  Guess the above isn't sufficient data
> for that, we'd need info on more CPUs and more rotation sizes.
>
> BTW, with -mavx2 we vectorize the loop without the rotate recognition patch
> and don't vectorize anymore, I'll need to adjust the pattern recognizer to
> handle those.  The question is if for missing vector rotate optab
> we can replace it with the probably faster and shorter
> x r<< y  -> (x << y) | (x >> (32 - y)) - which is invalid for y = 0,
> but perhaps if the only thing the vector shift would do for shift by 32
> would be either return 0, or x, it would still work, or if we want to go
> for the (x << y) | (x >> ((-y) & 31)).  The i?86/x86_64 vector shifts don't
> truncate the shift count, so perhaps best would be a vector rotate expander
> that expands it into some some special insn like vectorized y ? x >> (32 - y) : 0
>
> One question is whether our LROTATE_EXPR or RROTATE_EXPR is only defined for
> rotation counts 0 through bitsize - 1 (as is the case when we pattern detect
> it, do we create *ROTATE_EXPR other way?), or whether we allow arbitrary
> rotation counts (would assume the rotation be done always with modulo
> bitsize).  Because if the rotation count could be arbitrary, we'd need
> to vectorize it as (x << (y & 31)) | (x >> ((-y) & 31))

The rotation count should be 0 to bitsize - 1 similar to shifts.

Richard.

>         Jakub

Jan Hubicka May 10, 2013, 1:48 p.m. UTC | #6

> Hi!
> 
> This is something I've noticed while working on the rotate recognizer
> patch I've just posted.  We emit say
>   roll %eax
> instead of
>   roll $1, %eax
> because the former is shorter, but emit
>   roll $31, %eax
> instead of the equivalent, but shorter
>   rorl %eax
> The following patch let us optimize even those.  Bootstrapped/regtested
> on x86_64-linux and i686-linux, ok for trunk?
> 
> 2013-05-09  Jakub Jelinek  <jakub@redhat.com>
> 
> 	* config/i386/i386.md (rotateinv): New code attr.
> 	(*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
> 	*<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
> 	roll $31, %eax, etc.

Going this route you will also need to update

  [(set_attr "type" "rotate1")
   (set (attr "length_immediate")
     (if_then_else
       (and (match_operand 1 "const1_operand")
            (ior (match_test "TARGET_SHIFT1")
                 (match_test "optimize_function_for_size_p (cfun)")))
       (const_string "0")
       (const_string "*")))
   (set_attr "mode" "QI")])

computing the immediate size.  Why we can't canonicalize this in
folding/simplify_rtx/combiner? I do not see much benefit allowing both
forms of instructions in our insn streams...

Honza

Jakub Jelinek May 10, 2013, 2:09 p.m. UTC | #7

Hi!

Note, the patch has been already committed.

On Fri, May 10, 2013 at 03:48:50PM +0200, Jan Hubicka wrote:
> Going this route you will also need to update
> 
>   [(set_attr "type" "rotate1")
>    (set (attr "length_immediate")
>      (if_then_else
>        (and (match_operand 1 "const1_operand")
>             (ior (match_test "TARGET_SHIFT1")
>                  (match_test "optimize_function_for_size_p (cfun)")))
>        (const_string "0")
>        (const_string "*")))
>    (set_attr "mode" "QI")])
> 
> computing the immediate size.  Why we can't canonicalize this in
> folding/simplify_rtx/combiner? I do not see much benefit allowing both
> forms of instructions in our insn streams...

Both directions of rotation make sense for variable length rotation, and
why would the generic simplify-rtx.c code want to prefer one rotate over
the other direction for constant rotation counts?  That is clearly target
specific preference.

What we could do instead of the patch I've committed just tweak expansion
of the patterns instead, assuming it is unlikely RTL optimizations
materialize a constant rotation out of non-constant rotations from expansion
time.

Or the length_immediate attributes could be adjusted, shouldn't be too hard
(just replace that (match_operand 1 "const1_operand") in there with
(and (match_operand 1 "const_int_operand")
     (ior (match_operand 1 "const1_operand")
	  (match_test "INTVAL (operands[1]) == GET_MODE_BITSIZE (<MODE>mode) - 1")))
or similar (untested).

	Jakub

Jan Hubicka May 10, 2013, 5:15 p.m. UTC | #8

> Hi!
> 
> Note, the patch has been already committed.
> 
> On Fri, May 10, 2013 at 03:48:50PM +0200, Jan Hubicka wrote:
> > Going this route you will also need to update
> > 
> >   [(set_attr "type" "rotate1")
> >    (set (attr "length_immediate")
> >      (if_then_else
> >        (and (match_operand 1 "const1_operand")
> >             (ior (match_test "TARGET_SHIFT1")
> >                  (match_test "optimize_function_for_size_p (cfun)")))
> >        (const_string "0")
> >        (const_string "*")))
> >    (set_attr "mode" "QI")])
> > 
> > computing the immediate size.  Why we can't canonicalize this in
> > folding/simplify_rtx/combiner? I do not see much benefit allowing both
> > forms of instructions in our insn streams...
> 
> Both directions of rotation make sense for variable length rotation, and
> why would the generic simplify-rtx.c code want to prefer one rotate over
> the other direction for constant rotation counts?  That is clearly target
> specific preference.

It seems to me that it is not different from normalizing reg-10 into reg+(-10)
we do for years (and for good reason).  It is still target preference when use
add and when sub to perform the arithmetic, but it makes sense to keep single
canonical form of the expression both in RTL and Gimple.

For example we may want to be able to prove that 
  (rotate reg 31) == (rotatert reg 1)
is true or
  (rotate reg 30) == (rotatert reg 2)
is also true or cross jump both variants into one instruction.

So it seems to me that canonicalizing constant rotations to get the operand
into range 0....bitsize/2 makes sense in general and will make the extra
special case in i386.md unnecesary.
> 
> What we could do instead of the patch I've committed just tweak expansion
> of the patterns instead, assuming it is unlikely RTL optimizations
> materialize a constant rotation out of non-constant rotations from expansion
> time.
> 
> Or the length_immediate attributes could be adjusted, shouldn't be too hard
> (just replace that (match_operand 1 "const1_operand") in there with
> (and (match_operand 1 "const_int_operand")
>      (ior (match_operand 1 "const1_operand")
> 	  (match_test "INTVAL (operands[1]) == GET_MODE_BITSIZE (<MODE>mode) - 1")))
> or similar (untested).

Yep, we should either update the pattern or add canonicalization.  I still think
the second variant is better...

Honza

Improve rotation by mode bitsize - 1

Commit Message

Comments

Patch