
[RFA,PR,tree-optimization/64910] Fix reassociation of binary bitwise operations with 3 operands

Message ID 56788DA6.80904@redhat.com

Commit Message

Jeff Law Dec. 21, 2015, 11:39 p.m. UTC
As outlined in the BZ, given

(x & y) & C

reassociation will turn that into

(x & C) & y

which inhibits the use of bit instructions on various targets.

Associating as

(x & y) & C

is what we want.
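
A concrete sketch of the shape we're after (function name invented
for illustration):

/* With the constant last, the final AND against the single-bit mask
   is exposed directly, so targets with bit-test instructions
   (btst on m68k, test on x86) can use them.  */
int masked_test (unsigned x, unsigned y)
{
  return ((x & y) & 0x20) != 0;
}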

My original patch did this for all binary operations.  However,
reviewing the assembly code & dump files before/after made it pretty
clear that doing this for arithmetic was a loss (mostly in that we were
missing CSE opportunities).

Restricting to logicals means there are far fewer triggering cases, but in
every one I looked at the resultant code was clearly better.

The testcase is a bit unusual in that it's in tree-ssa.  It checks the 
output of reassociation.  However, on x86 and m68k it also checks the 
resulting assembly code, which I believe is unique in the tree-ssa 
directory.

Bootstrapped and regression tested on x86_64-linux-gnu.  The test has 
been verified on x86_64-linux-gnu, i686-linux-gnu and m68k-linux-gnu.

OK for the trunk?

Comments

Richard Biener Jan. 8, 2016, 9:36 a.m. UTC | #1
On Tue, Dec 22, 2015 at 12:39 AM, Jeff Law <law@redhat.com> wrote:
>
> As outlined in the BZ, given
>
> (x & y) & C
>
> reassociation will turn that into
>
> (x & C) & y
>
> Which inhibits bit operations on various targets.
>
> Associating as
>
> (x & y) & C
>
> is what we want.
>
> My original patch did this for all binary operations.  However, reviewing
> the assembly code & dump files before/after it was pretty clear that doing
> this for arithmetic was losing (mostly in that we were missing CSE
> opportunities).

The missed CSE opportunities really are a CSE missed optimization in
general with respect to two reassociatable chains with some (but not all)
common operands.
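
A hypothetical example of such a pair of chains (function and variable
names invented):

/* Both chains contain a & b, but unless the two chains are associated
   so that (a & b) forms a common subtree, CSE has nothing to share.  */
unsigned two_chains (unsigned a, unsigned b, unsigned c, unsigned d)
{
  unsigned e1 = (a & c) & b;   /* could be (a & b) & c ...            */
  unsigned e2 = (a & d) & b;   /* ... and (a & b) & d, sharing a & b  */
  return e1 ^ e2;
}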

> Restricting to logicals means there's far fewer triggering cases, but in
> every one I looked at the resultant code was clearly better.
>
> The testcase is a bit unusual in that it's in tree-ssa.  It checks the
> output of reassociation.  However, on x86 and m68k it also checks the
> resulting assembly code, which I believe is unique in the tree-ssa
> directory.
>
> Bootstrapped and regression tested on x86_64-linux-gnu.  The test has been
> verified on x86_64-linux-gnu, i686-linux-gnu and m68k-linux-gnu.
>
> OK for the trunk?

So you are really trying to make RTL expansion see a different pattern?

It seems to me that in this case TER and get_def_for_expr /
get_gimple_for_ssa_name should be used there.  [or for future direction
instruction selection should be performed on GIMPLE with some
match-and-simplify patterns creating IFNs matching optabs if available]
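
A minimal sketch of that direction (get_def_for_expr is the static
helper in expr.c that looks through TER via get_gimple_for_ssa_name;
the surrounding expansion context is elided, so treat this as an
untested outline):

/* When expanding a use of _tmp, peek through TER at its defining
   statement.  If _tmp was defined by a BIT_AND_EXPR with a constant
   mask, expansion can pick a bit-test sequence regardless of how
   reassoc ordered the operands.  */
gimple *def = get_def_for_expr (treeop0, BIT_AND_EXPR);
if (def && TREE_CODE (gimple_assign_rhs2 (def)) == INTEGER_CST)
  {
    /* gimple_assign_rhs1 (def) is the inner chain,
       gimple_assign_rhs2 (def) the constant mask.  */
  }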

Richard.

> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index c7626ff..f5ca85e 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,9 @@
> +2015-12-20  Jeff Law  <law@redhat.com>
> +
> +       PR tree-optimization/64910
> +       * tree-ssa-reassoc.c (swap_ops_for_binary_stmt): Make sure constant
> +       is last for binary bit operations.
> +
>  2015-12-21  Pierre-Marie de Rodat  <derodat@adacore.com>
>
>         * dwarf2out.c (add_data_member_location_attribute): Do not
> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
> index bb2ed22..e2f55f3 100644
> --- a/gcc/testsuite/ChangeLog
> +++ b/gcc/testsuite/ChangeLog
> @@ -1,3 +1,8 @@
> +2015-12-20  Jeff Law  <law@redhat.com>
> +
> +       PR tree-optimization/64910
> +       * gcc.dg/tree-ssa/pr64910-2.c: New test.
> +
>  2015-12-21  Claudiu Zissulescu  <claziss@synopsys.com>
>
>         * gcc.target/arc/builtin_general.c: New test.
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c
> new file mode 100644
> index 0000000..2e3d679
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c
> @@ -0,0 +1,85 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-reassoc1" } */
> +
> +/* We want to make sure that we reassociate in a way that has the
> +   constant last.  With the constant last, it's more likely to result
> +   in a bitfield test on targets with such capabilities.  */
> +
> +extern void boo ();
> +
> +int b2b_uc (unsigned char u, unsigned char w)
> +{
> +  if ((u & w) & 0x20)
> +    boo ();
> +}
> +
> +int b2b_us (unsigned short u, unsigned short w)
> +{
> +  if ((u & w) & 0x20)
> +    boo ();
> +}
> +
> +int b2b_ui (unsigned int u, unsigned int w)
> +{
> +  if ((u & w) & 0x20)
> +    boo ();
> +}
> +int b2b_ul (unsigned long u, unsigned long w)
> +{
> +  if ((u & w) & 0x20)
> +    boo ();
> +}
> +int b2b_ull (unsigned long long u, unsigned long long w)
> +{
> +  if ((u & w) & 0x20)
> +    boo ();
> +}
> +
> +int b2b_sc (signed char u, signed char w)
> +{
> +  if ((u & w) & 0x20)
> +    boo ();
> +}
> +
> +int b2b_ss (signed short u, signed short w)
> +{
> +  if ((u & w) & 0x20)
> +    boo ();
> +}
> +
> +int b2b_si (signed int u, signed int w)
> +{
> +  if ((u & w) & 0x20)
> +    boo ();
> +}
> +int b2b_sl (signed long u, signed long w)
> +{
> +  if ((u & w) & 0x20)
> +    boo ();
> +}
> +int b2b_sll (signed long long u, signed long long w)
> +{
> +  if ((u & w) & 0x20)
> +    boo ();
> +}
> +
> +/* The AND of U & W should go into a temporary, which is then ANDed
> +   with the constant.
> +
> +   First verify that we have the right number of ANDs between U and W.  */
> +/* { dg-final { scan-tree-dump-times "\[uw\]_\[0-9\]+.D. \& \[uw\]_\[0-9\]+.D.;" 10 "reassoc1"} } */
> +
> +/* Then verify that we have the right number of ANDs between a temporary
> +   and the constant.  */
> +/* { dg-final { scan-tree-dump-times "_\[0-9]+ \& 32;" 10 "reassoc1"} } */
> +
> +/* Each function has one AND.  It will have either a second AND or TEST.  So
> +   we can count the number of AND and TEST instructions.  They must be 2X
> +   the number of test functions in this file.  */
> +/* { dg-final { scan-assembler-times "and|test" 20 { target { i?86-*-* x86_64-*-*} } } } */
> +
> +/* Similarly on the m68k.  The code for the long long tests is suboptimal,
> +   which we catch via the second pattern and xfail.  */
> +/* { dg-final { scan-assembler-times "and|btst" 20 { target { m68k-*-* } } } } */
> +/* { dg-final { scan-assembler-not "or" { target { m68k-*-* } xfail { *-*-* } } } } */
> +
> diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
> index e54700e..1dcfce3 100644
> --- a/gcc/tree-ssa-reassoc.c
> +++ b/gcc/tree-ssa-reassoc.c
> @@ -3449,6 +3449,26 @@ swap_ops_for_binary_stmt (vec<operand_entry *> ops,
>        oe1->op = temp.op;
>        oe1->rank = temp.rank;
>      }
> +  /* For X OP Y OP C, associate as (X OP Y) OP C, but only for
> +     logicals, which encourages bit operations.  Experimentation
> +     has shown this generally isn't a win for arithmetic.  */
> +  else if (stmt)
> +    {
> +      enum tree_code opcode = gimple_assign_rhs_code (stmt);
> +      if ((opcode == BIT_AND_EXPR
> +          || opcode == BIT_IOR_EXPR
> +          || opcode == BIT_XOR_EXPR)
> +         && TREE_CODE (oe1->op) != INTEGER_CST
> +         && TREE_CODE (oe2->op) != INTEGER_CST
> +         && TREE_CODE (oe3->op) == INTEGER_CST)
> +       {
> +         operand_entry temp = *oe3;
> +         oe3->op = oe1->op;
> +         oe3->rank = oe1->rank;
> +         oe1->op = temp.op;
> +         oe1->rank = temp.rank;
> +       }
> +    }
>  }
>
>  /* If definition of RHS1 or RHS2 dominates STMT, return the later of those
>
Jeff Law Jan. 8, 2016, 11:20 p.m. UTC | #2
On 01/08/2016 02:36 AM, Richard Biener wrote:
>> My original patch did this for all binary operations.  However, reviewing
>> the assembly code & dump files before/after it was pretty clear that doing
>> this for arithmetic was losing (mostly in that we were missing CSE
>> opportunities).
>
> The missed CSE opportunities really are a CSE missed optimization in
> general with respect to two reassociatable chains with some (but not all)
> common operands.
Agreed.  It was a choice between expanding the scope and tracking down
the missed CSEs, or nailing down 64910 without regressing and putting
the missed CSEs on the TODO list.

How about I build some testcases for the missed CSEs, xfailed with a BZ 
too, so that we at least don't lose track of 'em.

> So you are really trying to make RTL expansion see a different pattern?
Yes, absolutely.  The code we generate now does

(x OP C) OP y

which makes it bloody hard for the RTL/target bits to recognize bit 
operations.  After the patch we generate

(x OP y) OP C

Which the RTL & backends can easily see as bitwise operations.


>
> It seems to me that in this case using TER and get_def_for_expr /
> get_gimple_for_ssa_name
> should be used there.  [or for future direction instruction selection
> should be performed on
> GIMPLE with some match-and-simplify patterns creating IFNs matching
> optabs if available]
That seems backwards to me -- we have all the infrastructure in place in
reassoc.c; we just make a terrible decision for the ordering of the
operands.  In fact, the code is fine going into reassoc; it's reassoc
that mucks things up:

Before reassoc we have:


;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
;;    prev block 0, next block 3, flags: (NEW, REACHABLE)
;;    pred:       ENTRY [100.0%]  (FALLTHRU,EXECUTABLE)
   _4 = u_2(D) & w_3(D);
   _7 = _4 & 32;
   if (_7 != 0)
     goto <bb 3>;
   else
     goto <bb 4>;


Which looks just like a bit test.  Sadly reassoc comes around and turns 
it into:

;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
;;    prev block 0, next block 3, flags: (NEW, REACHABLE)
;;    pred:       ENTRY [100.0%]  (FALLTHRU,EXECUTABLE)
   _8 = w_3(D) & 32;
   _7 = _8 & u_2(D);
   if (_7 != 0)
     goto <bb 3>;
   else
     goto <bb 4>;


And we've lost.

Plus doing it in TER is almost certainly more complex than getting it 
right in reassoc to begin with.

Jeff
Richard Biener Jan. 11, 2016, 10:32 a.m. UTC | #3
On Sat, Jan 9, 2016 at 12:20 AM, Jeff Law <law@redhat.com> wrote:
> [...]
> That seems backwards to me -- we have all the infrastructure in place in
> reassoc.c we just make a terrible decision for the ordering of the operands.
> In fact, the code is fine going into reassoc, it's reassoc that mucks things
> up:
>
> Before reassoc we have:
>
>
> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
> ;;    prev block 0, next block 3, flags: (NEW, REACHABLE)
> ;;    pred:       ENTRY [100.0%]  (FALLTHRU,EXECUTABLE)
>   _4 = u_2(D) & w_3(D);
>   _7 = _4 & 32;
>   if (_7 != 0)
>     goto <bb 3>;
>   else
>     goto <bb 4>;
>
>
> Which looks just like a bit test.  Sadly reassoc comes around and turns it
> into:
>
> ;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
> ;;    prev block 0, next block 3, flags: (NEW, REACHABLE)
> ;;    pred:       ENTRY [100.0%]  (FALLTHRU,EXECUTABLE)
>   _8 = w_3(D) & 32;
>   _7 = _8 & u_2(D);
>   if (_7 != 0)
>     goto <bb 3>;
>   else
>     goto <bb 4>;
>
>
> And we've lost.

Yeah, reassoc is largely about canonicalization.

> Plus doing it in TER is almost certainly more complex than getting it right
> in reassoc to begin with.

I guess canonicalizing differently is ok, but you'll still create
((a & b) & 1) & c if you only change the above place.

So I'm not sure what pattern the backend is looking for?

Richard.

Jeff Law Jan. 12, 2016, 5:10 a.m. UTC | #4
On 01/11/2016 03:32 AM, Richard Biener wrote:

>
> Yeah, reassoc is largely about canonicalization.
>
>> Plus doing it in TER is almost certainly more complex than getting it right
>> in reassoc to begin with.
>
> I guess canonicalizing differently is ok but you'll still create
> ((a & b) & 1) & c then if you only change the above place.
What's best for that expression would depend on factors like whether or
not the target can exploit ILP.  I.e. (a & b) & (1 & c) exposes more
parallelism, while (((a & b) & c) & 1) is not good for parallelism but
does expose the bit test.

reassoc currently generates ((a & 1) & b) & c which is dreadful as 
there's no ILP or chance of creating a bit test.  My patch shuffles 
things around, but still doesn't expose the ILP or bit test in the 4 
operand case.  Based on the comments in reassoc, it didn't seem like the 
author thought anything beyond the 3-operand case was worth handling. 
So my patch just handles the 3-operand case.
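
To make the shapes concrete (an illustrative sketch of the 4-operand
chain a & b & c & 1 discussed above):

unsigned four_ops (unsigned a, unsigned b, unsigned c)
{
  unsigned p = (a & b) & (1 & c);   /* two independent ANDs: exposes ILP  */
  unsigned q = ((a & b) & c) & 1;   /* serial, but ends in a bit test     */
  unsigned r = ((a & 1) & b) & c;   /* current reassoc output: neither    */
  return p ^ q ^ r;
}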



>
> So I'm not sure what pattern the backend is looking for?
It just wants the constant last in the sequence.  That exposes bit 
clear, set, flip, test, etc idioms.
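
For reference, those idioms look like this (a sketch; each keeps the
constant last so the final operation can map onto a single bit
instruction such as bclr/bset/bchg/btst on the m68k):

unsigned bit_clear (unsigned a, unsigned b) { return (a & b) & ~0x20u; }
unsigned bit_set   (unsigned a, unsigned b) { return (a | b) |  0x20u; }
unsigned bit_flip  (unsigned a, unsigned b) { return (a ^ b) ^  0x20u; }
int      bit_test  (unsigned a, unsigned b) { return ((a & b) & 0x20u) != 0; }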



Jeff
Richard Biener Jan. 12, 2016, 3:11 p.m. UTC | #5
On Tue, Jan 12, 2016 at 6:10 AM, Jeff Law <law@redhat.com> wrote:
> On 01/11/2016 03:32 AM, Richard Biener wrote:
> [...]
>> So I'm not sure what pattern the backend is looking for?
>
> It just wants the constant last in the sequence.  That exposes bit clear,
> set, flip, test, etc idioms.

But those don't feed another bit operation, right?  Thus we'd like to see
((a & b) & c) & 1, not ((a & b) & 1) & c?  It sounds like the instructions
are designed to feed conditionals (aka CC consuming ops)?

Richard.

Jeff Law Jan. 13, 2016, 6:39 a.m. UTC | #6
On 01/12/2016 08:11 AM, Richard Biener wrote:
> On Tue, Jan 12, 2016 at 6:10 AM, Jeff Law <law@redhat.com> wrote:
>> [...]
>> It just wants the constant last in the sequence.  That exposes bit clear,
>> set, flip, test, etc idioms.
>
> But those don't feed another bit operation, right?  Thus we'd like to see
> ((a & b) & c) & 1, not ((a & b) & 1) & c?  It sounds like the instructions
> are designed to feed conditionals (aka CC consuming ops)?
At the gimple level they could feed a conditional, or be part of a 
series of ops on an SSA_NAME that eventually gets stored to memory, etc. 
At the RTL level they'll feed CC consumers and bit manipulations of
pseudos or memory.

For the 3-op case, we always want the constant last.  For the 4-op case 
it's less clear.  Though ((a & b) & c) & 1 is certainly better than ((a 
& b) & 1) & c.

Jeff
Richard Biener Jan. 13, 2016, 12:30 p.m. UTC | #7
On Wed, Jan 13, 2016 at 7:39 AM, Jeff Law <law@redhat.com> wrote:
> On 01/12/2016 08:11 AM, Richard Biener wrote:
>> [...]
>> But those don't feed another bit operation, right?  Thus we'd like to see
>> ((a & b) & c) & 1, not ((a & b) & 1) & c?  It sounds like the instructions
>> are designed to feed conditionals (aka CC consuming ops)?
>
> At the gimple level they could feed a conditional, or be part of a series of
> ops on an SSA_NAME that eventually gets stored to memory, etc.  At the RTL
> level they'll feed CC consumers and bit manipulations of pseudos or memory.
>
> For the 3-op case, we always want the constant last.  For the 4-op case it's
> less clear.  Though ((a & b) & c) & 1 is certainly better than ((a & b) & 1)
> & c.

Ok, so handling it in swap_ops_for_binary_stmt is merely a convenient place
to special-case bitwise ops.  The "real" fix to the sorting heuristic would be
to sort constants at the opposite end.

That might be too invasive right now but there is another "convenient" place:

              /* If the operand vector is now empty, all operands were
                 consumed by the __builtin_powi optimization.  */
...
              else
                {
                  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
                  int ops_num = ops.length ();
                  int width = get_reassociation_width (ops_num, rhs_code, mode);
                  tree new_lhs = lhs;

                  if (dump_file && (dump_flags & TDF_DETAILS))
                    fprintf (dump_file,
                             "Width = %d was chosen for
reassociation\n", width);

at this point you can check rhs_code and move the (single) constant
entry in ops (if there is any constant entry) from .last () to the beginning.

That'll get the 4 operand case correct as well and properly models
"constant last" for the specified operation kind.

Richard.


Jeff Law Sept. 3, 2017, 2:44 p.m. UTC | #8
On 01/13/2016 05:30 AM, Richard Biener wrote:
> On Wed, Jan 13, 2016 at 7:39 AM, Jeff Law <law@redhat.com> wrote:
>> [...]
> 
> Ok, so handling it in swap_ops_for_binary_stmt is merely a convenient place
> to special-case bitwise ops.  The "real" fix to the sorting heuristic would be
> to sort constants at the opposite end.
> 
> That might be too invasive right now but there is another "convenient" place:
> 
>               /* If the operand vector is now empty, all operands were
>                  consumed by the __builtin_powi optimization.  */
> ...
>               else
>                 {
>                   machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
>                   int ops_num = ops.length ();
>                   int width = get_reassociation_width (ops_num, rhs_code, mode);
>                   tree new_lhs = lhs;
> 
>                   if (dump_file && (dump_flags & TDF_DETAILS))
>                     fprintf (dump_file,
>                              "Width = %d was chosen for reassociation\n", width);
> 
> at this point you can check rhs_code and move the (single) constant
> entry in ops (if there is any constant entry) from .last () to the beginning.
> 
> That'll get the 4 operand case correct as well and properly models
> "constant last" for the specified operation kind.
Resurrecting an old thread...  Just now getting around to flushing this
out of the queue.

To recap, given something like (x & y) & C reassociation will turn that
into (x & C) & y.  It's functionally equivalent, but it will inhibit
generation of bit test instructions.

I originally hacked up swap_ops_for_binary_stmt.  You requested that
change be made in reassociate_bb so that it would apply to cases where
there are more than 3 args.

So that's what this patch does.   OK for the trunk now?

Bootstrapped and regression tested on x86_64.  Also tested the new
testcase on m68k.

commit c10ae0339674c27c89a1fa1904217a55bf530cb3
Author: Jeff Law <law@torsion.usersys.redhat.com>
Date:   Sun Sep 3 10:42:30 2017 -0400

    2017-09-03  Jeff Law  <law@redhat.com>
    
            PR tree-optimization/64910
            * tree-ssa-reassoc.c (reassociate_bb): For bitwise binary ops,
            swap the first and last operand if the last is a constant.
    
            PR tree-optimization/64910
            * gcc.dg/tree-ssa/pr64910-2.c: New test.

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 3f632ca31c2..2c9a8c8265a 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2017-09-03  Jeff Law  <law@redhat.com>
+
+	PR tree-optimization/64910
+	* tree-ssa-reassoc.c (reassociate_bb): For bitwise binary ops,
+	swap the first and last operand if the last is a constant.
+
 2017-09-01  Segher Boessenkool  <segher@kernel.crashing.org>
 
 	PR rtl-optimization/82024
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 4ead57edfa2..766677c899b 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2017-09-03  Jeff Law  <law@redhat.com>
+
+	PR tree-optimization/64910
+	* gcc.dg/tree-ssa/pr64910-2.c: New test.
+
 2017-09-01  Jakub Jelinek  <jakub@redhat.com>
 
 	PR target/81766
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c
new file mode 100644
index 00000000000..2e3d6790776
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c
@@ -0,0 +1,85 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+
+/* We want to make sure that we reassociate in a way that has the
+   constant last.  With the constant last, it's more likely to result
+   in a bitfield test on targets with such capabilities.  */
+
+extern void boo ();
+
+int b2b_uc (unsigned char u, unsigned char w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+int b2b_us (unsigned short u, unsigned short w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+int b2b_ui (unsigned int u, unsigned int w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+int b2b_ul (unsigned long u, unsigned long w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+int b2b_ull (unsigned long long u, unsigned long long w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+int b2b_sc (signed char u, signed char w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+int b2b_ss (signed short u, signed short w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+int b2b_si (signed int u, signed int w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+int b2b_sl (signed long u, signed long w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+int b2b_sll (signed long long u, signed long long w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+/* The AND of U & W should go into a temporary, which is then ANDed
+   with the constant.
+
+   First verify that we have the right number of ANDs between U and W.  */
+/* { dg-final { scan-tree-dump-times "\[uw\]_\[0-9\]+.D. \& \[uw\]_\[0-9\]+.D.;" 10 "reassoc1"} } */
+
+/* Then verify that we have the right number of ANDs between a temporary
+   and the constant.  */
+/* { dg-final { scan-tree-dump-times "_\[0-9]+ \& 32;" 10 "reassoc1"} } */
+
+/* Each function has one AND.  It will have either a second AND or TEST.  So
+   we can count the number of AND and TEST instructions.  They must be 2X
+   the number of test functions in this file.  */
+/* { dg-final { scan-assembler-times "and|test" 20 { target { i?86-*-* x86_64-*-*} } } } */
+
+/* Similarly on the m68k.  The code for the long long tests is suboptimal,
+   which we catch via the second pattern and xfail.  */
+/* { dg-final { scan-assembler-times "and|btst" 20 { target { m68k-*-* } } } } */
+/* { dg-final { scan-assembler-not "or" { target { m68k-*-* } xfail { *-*-* } } } } */
+
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index 561acea4dcc..76048196b27 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -5762,6 +5762,18 @@ reassociate_bb (basic_block bb)
 		    fprintf (dump_file,
 			     "Width = %d was chosen for reassociation\n", width);
 
+
+		  /* For binary bit operations, if the last operand in
+		     OPS is a constant, move it to the front.  This
+		     helps ensure that we generate (X & Y) & C rather
+		     than (X & C) & Y.  The former will often match
+		     a canonical bit test when we get to RTL.  */
+		  if ((rhs_code == BIT_AND_EXPR
+		       || rhs_code == BIT_IOR_EXPR
+		       || rhs_code == BIT_XOR_EXPR)
+		      && TREE_CODE (ops.last ()->op) == INTEGER_CST)
+		    std::swap (*ops[0], *ops[ops_num - 1]);
+
 		  if (width > 1
 		      && ops.length () > 3)
 		    rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
Richard Biener Sept. 4, 2017, 9:36 a.m. UTC | #9
On Sun, Sep 3, 2017 at 4:44 PM, Jeff Law <law@redhat.com> wrote:
> On 01/13/2016 05:30 AM, Richard Biener wrote:
>> [...]
> Resurrecting an old thread...  Just now getting around to flushing this
> out of the queue.
>
> To recap, given something like (x & y) & C reassociation will turn that
> into (x & C) & y.  It's functionally equivalent, but it will inhibit
> generation of bit test instructions.
>
> I originally hacked up swap_ops_for_binary_stmt.  You requested that
> change be made in reassociate_bb so that it would apply to cases where
> there are more than 3 args.
>
> So that's what this patch does.   OK for the trunk now?

Ok.

Thanks,
Richard.

Christophe Lyon Sept. 5, 2017, 6:38 a.m. UTC | #10
Hi Jeff,


On 3 September 2017 at 16:44, Jeff Law <law@redhat.com> wrote:
> On 01/13/2016 05:30 AM, Richard Biener wrote:
>> [...]
> Resurrecting an old thread...  Just now getting around to flushing this
> out of the queue.
>
> To recap, given something like (x & y) & C reassociation will turn that
> into (x & C) & y.  It's functionally equivalent, but it will inhibit
> generation of bit test instructions.
>
> I originally hacked up swap_ops_for_binary_stmt.  You requested that
> change be made in reassociate_bb so that it would apply to cases where
> there are more than 3 args.
>
> So that's what this patch does.   OK for the trunk now?
>
> Bootstrapped and regression tested on x86_64.  Also tested the new
> testcase on m68k.
> [...]
> +/* Then verify that we have the right number of ANDs between a temporary
> +   and the constant.  */
> +/* { dg-final { scan-tree-dump-times "_\[0-9]+ \& 32;" 10 "reassoc1"} } */

This part of the new testcase fails on aarch64 & arm:
FAIL: gcc.dg/tree-ssa/pr64910-2.c scan-tree-dump-times reassoc1 "_[0-9]+ & 32;" 10

Christophe

Jeff Law Sept. 5, 2017, 5:26 p.m. UTC | #11
On 09/05/2017 12:38 AM, Christophe Lyon wrote:
> Hi Jeff,
> 
> 
> On 3 September 2017 at 16:44, Jeff Law <law@redhat.com> wrote:
>> [...]
> 
> This part of the new testcase fails on aarch64 & arm:
> FAIL: gcc.dg/tree-ssa/pr64910-2.c scan-tree-dump-times reassoc1 "_[0-9]+ & 32;" 10
Thanks.  I'm investigating.

Jeff
Jeff Law Sept. 6, 2017, 5:21 a.m. UTC | #12
On 09/05/2017 11:26 AM, Jeff Law wrote:
> On 09/05/2017 12:38 AM, Christophe Lyon wrote:
>> Hi Jeff,
>>
>>
>> On 3 September 2017 at 16:44, Jeff Law <law@redhat.com> wrote:
>>> On 01/13/2016 05:30 AM, Richard Biener wrote:
>>>> On Wed, Jan 13, 2016 at 7:39 AM, Jeff Law <law@redhat.com> wrote:
>>>>> On 01/12/2016 08:11 AM, Richard Biener wrote:
>>>>>>
>>>>>> On Tue, Jan 12, 2016 at 6:10 AM, Jeff Law <law@redhat.com> wrote:
>>>>>>>
>>>>>>> On 01/11/2016 03:32 AM, Richard Biener wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Yeah, reassoc is largely about canonicalization.
>>>>>>>>
>>>>>>>>> Plus doing it in TER is almost certainly more complex than getting it
>>>>>>>>> right
>>>>>>>>> in reassoc to begin with.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I guess canonicalizing differently is ok but you'll still create
>>>>>>>> ((a & b) & 1) & c then if you only change the above place.
>>>>>>>
>>>>>>>
>>>>>>> What's best for that expression would depend on factors like whether or
>>>>>>> not
>>>>>>> the target can exploit ILP.  ie (a & b) & (1 & c) exposes more
>>>>>>> parallelism
>>>>>>> while (((a & b) & c) & 1) is not good for parallelism, but does expose
>>>>>>> the
>>>>>>> bit test.
>>>>>>>
>>>>>>> reassoc currently generates ((a & 1) & b) & c which is dreadful as
>>>>>>> there's
>>>>>>> no ILP or chance of creating a bit test.  My patch shuffles things
>>>>>>> around,
>>>>>>> but still doesn't expose the ILP or bit test in the 4 operand case.
>>>>>>> Based
>>>>>>> on the comments in reassoc, it didn't seem like the author thought
>>>>>>> anything
>>>>>>> beyond the 3-operand case was worth handling. So my patch just handles
>>>>>>> the
>>>>>>> 3-operand case.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> So I'm not sure what pattern the backend is looking for?
>>>>>>>
>>>>>>>
>>>>>>> It just wants the constant last in the sequence.  That exposes bit clear,
>>>>>>> set, flip, test, etc idioms.
>>>>>>
>>>>>>
>>>>>> But those don't feed another bit operation, right?  Thus we'd like to see
>>>>>> ((a & b) & c) & 1, not ((a & b) & 1) & c?  It sounds like the instructions
>>>>>> are designed to feed conditionals (aka CC consuming ops)?
>>>>>
>>>>> At the gimple level they could feed a conditional, or be part of a series of
>>>>> ops on an SSA_NAME that eventually gets stored to memory, etc.  At the RTL
>>>>> level they'll feed CC consumers and bit manipulations of pseudos or memory.
>>>>>
>>>>> For the 3-op case, we always want the constant last.  For the 4-op case it's
>>>>> less clear.  Though ((a & b) & c) & 1 is certainly better than ((a & b) & 1)
>>>>> & c.
>>>>
>>>> Ok, so handling it in swap_ops_for_binary_stmt is merely a convenient place
>>>> to special-case bitwise ops.  The "real" fix to the sorting heuristic would be
>>>> to sort constants at the opposite end.
>>>>
>>>> That might be too invasive right now but there is another "convenient" place:
>>>>
>>>>               /* If the operand vector is now empty, all operands were
>>>>                  consumed by the __builtin_powi optimization.  */
>>>> ...
>>>>               else
>>>>                 {
>>>>                   machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
>>>>                   int ops_num = ops.length ();
>>>>                   int width = get_reassociation_width (ops_num, rhs_code, mode);
>>>>                   tree new_lhs = lhs;
>>>>
>>>>                   if (dump_file && (dump_flags & TDF_DETAILS))
>>>>                     fprintf (dump_file,
>>>>                              "Width = %d was chosen for reassociation\n", width);
>>>>
>>>> at this point you can check rhs_code and move the (single) constant
>>>> entry in ops (if there is any constant entry) from .last () to the beginning.
>>>>
>>>> That'll get the 4 operand case correct as well and properly models
>>>> "constant last" for the specified operation kind.
>>> Resurrecting an old thread...  Just now getting around to flushing this
>>> out of the queue.
>>>
>>> To recap, given something like (x & y) & C reassociation will turn that
>>> into (x & C) & y.  It's functionally equivalent, but it will inhibit
>>> generation of bit test instructions.
>>>
>>> I originally hacked up swap_ops_for_binary_stmt.  You requested that
>>> change be made in reassociate_bb so that it would apply to cases where
>>> there are more than 3 args.
>>>
>>> So that's what this patch does.   OK for the trunk now?
>>>
>>> Bootstrapped and regression tested on x86_64.  Also tested the new
>>> testcase on m68k.
>>>
>>>
>>> commit c10ae0339674c27c89a1fa1904217a55bf530cb3
>>> Author: Jeff Law <law@torsion.usersys.redhat.com>
>>> Date:   Sun Sep 3 10:42:30 2017 -0400
>>>
>>>     2017-09-03  Jeff Law  <law@redhat.com>
>>>
>>>             PR tree-optimization/64910
>>>             * tree-ssa-reassoc.c (reassociate_bb): For bitwise binary ops,
>>>             swap the first and last operand if the last is a constant.
>>>
>>>             PR tree-optimization/64910
>>>             * gcc.dg/tree-ssa/pr64910-2.c: New test.
>>>
>>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>>> index 3f632ca31c2..2c9a8c8265a 100644
>>> --- a/gcc/ChangeLog
>>> +++ b/gcc/ChangeLog
>>> @@ -1,3 +1,9 @@
>>> +2017-09-03  Jeff Law  <law@redhat.com>
>>> +
>>> +       PR tree-optimization/64910
>>> +       * tree-ssa-reassoc.c (reassociate_bb): For bitwise binary ops,
>>> +       swap the first and last operand if the last is a constant.
>>> +
>>>  2017-09-01  Segher Boessenkool  <segher@kernel.crashing.org>
>>>
>>>         PR rtl-optimization/82024
>>> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
>>> index 4ead57edfa2..766677c899b 100644
>>> --- a/gcc/testsuite/ChangeLog
>>> +++ b/gcc/testsuite/ChangeLog
>>> @@ -1,3 +1,8 @@
>>> +2017-09-03  Jeff Law  <law@redhat.com>
>>> +
>>> +       PR tree-optimization/64910
>>> +       * gcc.dg/tree-ssa/pr64910-2.c: New test.
>>> +
>>>  2017-09-01  Jakub Jelinek  <jakub@redhat.com>
>>>
>>>         PR target/81766
>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c
>>> new file mode 100644
>>> index 00000000000..2e3d6790776
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c
>>> @@ -0,0 +1,85 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -fdump-tree-reassoc1" } */
>>> +
>>> +/* We want to make sure that we reassociate in a way that has the
>>> +   constant last.  With the constant last, it's more likely to result
>>> +   in a bitfield test on targets with such capabilities.  */
>>> +
>>> +extern void boo ();
>>> +
>>> +int b2b_uc (unsigned char u, unsigned char w)
>>> +{
>>> +  if ((u & w) & 0x20)
>>> +    boo ();
>>> +}
>>> +
>>> +int b2b_us (unsigned short u, unsigned short w)
>>> +{
>>> +  if ((u & w) & 0x20)
>>> +    boo ();
>>> +}
>>> +
>>> +int b2b_ui (unsigned int u, unsigned int w)
>>> +{
>>> +  if ((u & w) & 0x20)
>>> +    boo ();
>>> +}
>>> +int b2b_ul (unsigned long u, unsigned long w)
>>> +{
>>> +  if ((u & w) & 0x20)
>>> +    boo ();
>>> +}
>>> +int b2b_ull (unsigned long long u, unsigned long long w)
>>> +{
>>> +  if ((u & w) & 0x20)
>>> +    boo ();
>>> +}
>>> +
>>> +int b2b_sc (signed char u, signed char w)
>>> +{
>>> +  if ((u & w) & 0x20)
>>> +    boo ();
>>> +}
>>> +
>>> +int b2b_ss (signed short u, signed short w)
>>> +{
>>> +  if ((u & w) & 0x20)
>>> +    boo ();
>>> +}
>>> +
>>> +int b2b_si (signed int u, signed int w)
>>> +{
>>> +  if ((u & w) & 0x20)
>>> +    boo ();
>>> +}
>>> +int b2b_sl (signed long u, signed long w)
>>> +{
>>> +  if ((u & w) & 0x20)
>>> +    boo ();
>>> +}
>>> +int b2b_sll (signed long long u, signed long long w)
>>> +{
>>> +  if ((u & w) & 0x20)
>>> +    boo ();
>>> +}
>>> +
>>> +/* The AND of U & W should go into a temporary, which is then ANDed
>>> +   with the constant.
>>> +
>>> +   First verify that we have the right number of ANDs between U and W.  */
>>> +/* { dg-final { scan-tree-dump-times "\[uw\]_\[0-9\]+.D. \& \[uw\]_\[0-9\]+.D.;" 10 "reassoc1"} } */
>>> +
>>> +/* Then verify that we have the right number of ANDS between a temporary
>>> +   and the constant.  */
>>> +/* { dg-final { scan-tree-dump-times "_\[0-9]+ \& 32;" 10 "reassoc1"} } */
>>
>> This part of the new testcase fails on aarch64 & arm:
>> FAIL: gcc.dg/tree-ssa/pr64910-2.c scan-tree-dump-times reassoc1 "_[0-9]+ & 32;" 10
> Thanks.  I'm investigating.
So when there are precisely 2 operands, the most recent change will
create non-canonical gimple on some targets (aarch64, arm, and likely
others).  Specifically, we can end up with

cst OP ssa_name
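
For illustration, a minimal hedged sketch of the two forms in the
reassoc1 dump (the SSA names here are hypothetical):

  _3 = 32 & _2;   /* non-canonical: constant first; the testcase's
                     "_[0-9]+ & 32;" pattern does not match this  */
  _3 = _2 & 32;   /* canonical: constant last; this is what matches  */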

Nothing actually cared -- the form got rewritten at some point and
aarch64 at least generates the desired assembly code.  But since the
test was looking at the reassoc1 dump, it failed.

The fix is easy.  Just restrict the operand swapping to cases where
there are 3 or more ops.  Bootstrapped and regression tested on aarch64.

Committing.

Jeff
commit 077cf883c3e983369e68b27307ec4944c7097393
Author: law <law@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Wed Sep 6 05:20:25 2017 +0000

            PR tree-optimization/64910
            * tree-ssa-reassoc.c (reassociate_bb): Restrict last change to
            cases where we have 3 or more operands.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@251751 138bc75d-0d04-0410-961f-82ee72b054a4

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 53637905722..1b3bddbb806 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2017-09-05  Jeff Law  <law@redhat.com>
+
+	PR tree-optimization/64910
+	* tree-ssa-reassoc.c (reassociate_bb): Restrict last change to
+	cases where we have 3 or more operands.
+
 2017-09-05  Jakub Jelinek  <jakub@redhat.com>
 
 	PR middle-end/81768
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index 76048196b27..2fb6aef51d7 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -5763,14 +5763,15 @@ reassociate_bb (basic_block bb)
 			     "Width = %d was chosen for reassociation\n", width);
 
 
-		  /* For binary bit operations, if the last operand in
-		     OPS is a constant, move it to the front.  This
-		     helps ensure that we generate (X & Y) & C rather
-		     than (X & C) & Y.  The former will often match
-		     a canonical bit test when we get to RTL.  */
-		  if ((rhs_code == BIT_AND_EXPR
-		       || rhs_code == BIT_IOR_EXPR
-		       || rhs_code == BIT_XOR_EXPR)
+		  /* For binary bit operations, if there are at least 3
+		     operands and the last operand in OPS is a constant,
+		     move it to the front.  This helps ensure that we generate
+		     (X & Y) & C rather than (X & C) & Y.  The former will
+		     often match a canonical bit test when we get to RTL.  */
+		  if (ops.length () != 2
+		      && (rhs_code == BIT_AND_EXPR
+		          || rhs_code == BIT_IOR_EXPR
+		          || rhs_code == BIT_XOR_EXPR)
 		      && TREE_CODE (ops.last ()->op) == INTEGER_CST)
 		    std::swap (*ops[0], *ops[ops_num - 1]);
Jakub Jelinek Sept. 6, 2017, 9:26 a.m. UTC | #13
On Tue, Sep 05, 2017 at 11:21:48PM -0600, Jeff Law wrote:
> --- a/gcc/tree-ssa-reassoc.c
> +++ b/gcc/tree-ssa-reassoc.c
> @@ -5763,14 +5763,15 @@ reassociate_bb (basic_block bb)
>  			     "Width = %d was chosen for reassociation\n", width);
>  
>  
> -		  /* For binary bit operations, if the last operand in
> -		     OPS is a constant, move it to the front.  This
> -		     helps ensure that we generate (X & Y) & C rather
> -		     than (X & C) & Y.  The former will often match
> -		     a canonical bit test when we get to RTL.  */
> -		  if ((rhs_code == BIT_AND_EXPR
> -		       || rhs_code == BIT_IOR_EXPR
> -		       || rhs_code == BIT_XOR_EXPR)
> +		  /* For binary bit operations, if there are at least 3
> +		     operands and the last operand in OPS is a constant,
> +		     move it to the front.  This helps ensure that we generate
> +		     (X & Y) & C rather than (X & C) & Y.  The former will
> +		     often match a canonical bit test when we get to RTL.  */
> +		  if (ops.length () != 2

So wouldn't it be clearer to write ops.length () > 2 ?  The
if (ops.length () == 0)
and
else if (ops.length () == 1)
cases come earlier, so it is the same thing, but > 2 might help the reader.

> +		      && (rhs_code == BIT_AND_EXPR
> +		          || rhs_code == BIT_IOR_EXPR
> +		          || rhs_code == BIT_XOR_EXPR)
>  		      && TREE_CODE (ops.last ()->op) == INTEGER_CST)
>  		    std::swap (*ops[0], *ops[ops_num - 1]);

Don't you then want to put the constant as second operand rather than first,
i.e. swap with *ops[1]?
And doesn't swap_ops_for_binary_stmt undo it again?

	Jakub
Jeff Law Oct. 13, 2017, 4:31 p.m. UTC | #14
On 09/06/2017 03:26 AM, Jakub Jelinek wrote:
> On Tue, Sep 05, 2017 at 11:21:48PM -0600, Jeff Law wrote:
>> --- a/gcc/tree-ssa-reassoc.c
>> +++ b/gcc/tree-ssa-reassoc.c
>> @@ -5763,14 +5763,15 @@ reassociate_bb (basic_block bb)
>>  			     "Width = %d was chosen for reassociation\n", width);
>>  
>>  
>> -		  /* For binary bit operations, if the last operand in
>> -		     OPS is a constant, move it to the front.  This
>> -		     helps ensure that we generate (X & Y) & C rather
>> -		     than (X & C) & Y.  The former will often match
>> -		     a canonical bit test when we get to RTL.  */
>> -		  if ((rhs_code == BIT_AND_EXPR
>> -		       || rhs_code == BIT_IOR_EXPR
>> -		       || rhs_code == BIT_XOR_EXPR)
>> +		  /* For binary bit operations, if there are at least 3
>> +		     operands and the last operand in OPS is a constant,
>> +		     move it to the front.  This helps ensure that we generate
>> +		     (X & Y) & C rather than (X & C) & Y.  The former will
>> +		     often match a canonical bit test when we get to RTL.  */
>> +		  if (ops.length () != 2
> 
> So wouldn't it be clearer to write ops.length () > 2 ?  The
> if (ops.length () == 0)
> and
> else if (ops.length () == 1)
> cases come earlier, so it is the same thing, but > 2 might help the reader.
Agreed.  It's a trivial change, but I'll spin it with some other testing
that's in progress.


> 
>> +		      && (rhs_code == BIT_AND_EXPR
>> +		          || rhs_code == BIT_IOR_EXPR
>> +		          || rhs_code == BIT_XOR_EXPR)
>>  		      && TREE_CODE (ops.last ()->op) == INTEGER_CST)
>>  		    std::swap (*ops[0], *ops[ops_num - 1]);
> 
> Don't you then want to put the constant as second operand rather than first,
> i.e. swap with *ops[1]?
I don't think it matters in practice. I recall being concerned that we
could end up with non-canonical gimple, but we don't AFAICT.
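
For illustration, a hedged sketch of the layouts in question for
(u & w) & 0x20, ranks elided.  The guard requires the constant to be
ops.last () on entry:

  /* On entry (constant last, per the guard):  { u, w, 0x20 }  */
  /* Committed swap with *ops[0]:              { 0x20, w, u }  */
  /* Suggested swap with *ops[1]:              { u, 0x20, w }  */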

> And doesn't swap_ops_for_binary_stmt undo it again?
No.  It doesn't.  The rank checks get in the way IIRC.  FWIW that's
where I'd originally put the transformation, but Richi wanted it in the
caller.
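
For concreteness, a hedged sketch of the pass's net effect on one of
the b2b_* tests below (the SSA names are hypothetical); the dump scans
in the testcase look for the second form:

  /* Old behavior: constant paired innermost, (u & 32) & w.  */
  _1 = u_3(D) & 32;
  _2 = _1 & w_4(D);

  /* New behavior: constant emitted outermost, (u & w) & 32.  */
  _1 = u_3(D) & w_4(D);
  _2 = _1 & 32;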

Jeff
diff mbox

Patch

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index c7626ff..f5ca85e 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@ 
+2015-12-20  Jeff Law  <law@redhat.com>
+
+	PR tree-optimization/64910
+	* tree-ssa-reassoc.c (swap_ops_for_binary_stmt): Make sure constant
+	is last for binary bit operations.
+
 2015-12-21  Pierre-Marie de Rodat  <derodat@adacore.com>
 
 	* dwarf2out.c (add_data_member_location_attribute): Do not
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index bb2ed22..e2f55f3 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@ 
+2015-12-20  Jeff Law  <law@redhat.com>
+
+	PR tree-optimization/64910
+	* gcc.dg/tree-ssa/pr64910-2.c: New test.
+
 2015-12-21  Claudiu Zissulescu  <claziss@synopsys.com>
 
 	* gcc.target/arc/builtin_general.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c
new file mode 100644
index 0000000..2e3d679
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64910-2.c
@@ -0,0 +1,85 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+
+/* We want to make sure that we reassociate in a way that has the
+   constant last.  With the constant last, it's more likely to result
+   in a bitfield test on targets with such capabilities.  */
+
+extern void boo ();
+
+int b2b_uc (unsigned char u, unsigned char w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+int b2b_us (unsigned short u, unsigned short w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+int b2b_ui (unsigned int u, unsigned int w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+int b2b_ul (unsigned long u, unsigned long w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+int b2b_ull (unsigned long long u, unsigned long long w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+int b2b_sc (signed char u, signed char w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+int b2b_ss (signed short u, signed short w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+int b2b_si (signed int u, signed int w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+int b2b_sl (signed long u, signed long w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+int b2b_sll (signed long long u, signed long long w)
+{
+  if ((u & w) & 0x20)
+    boo ();
+}
+
+/* The AND of U & W should go into a temporary, which is then ANDed
+   with the constant.
+
+   First verify that we have the right number of ANDs between U and W.  */
+/* { dg-final { scan-tree-dump-times "\[uw\]_\[0-9\]+.D. \& \[uw\]_\[0-9\]+.D.;" 10 "reassoc1"} } */
+
+/* Then verify that we have the right number of ANDS between a temporary
+   and the constant.  */
+/* { dg-final { scan-tree-dump-times "_\[0-9]+ \& 32;" 10 "reassoc1"} } */
+
+/* Each function has one AND.  It will have either a second AND or TEST.  So
+   we can count the number of AND and TEST instructions.  They must be 2X
+   the number of test functions in this file.  */
+/* { dg-final { scan-assembler-times "and|test" 20 { target { i?86-*-* x86_64-*-*} } } } */
+
+/* Similarly on the m68k.  The code for the long long tests is suboptimal,
+   which we catch via the second pattern and xfail.  */
+/* { dg-final { scan-assembler-times "and|btst" 20 { target { m68k-*-* } } } } */
+/* { dg-final { scan-assembler-not "or" { target { m68k-*-* } xfail { *-*-* } } } } */
+
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index e54700e..1dcfce3 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -3449,6 +3449,26 @@  swap_ops_for_binary_stmt (vec<operand_entry *> ops,
       oe1->op = temp.op;
       oe1->rank = temp.rank;
     }
+  /* For X OP Y OP C, associate as (X OP Y) OP C, but only for
+     logicals, which encourages bit operations.  Experimentation
+     has shown this generally isn't a win for arithmetic.  */
+  else if (stmt)
+    {
+      enum tree_code opcode = gimple_assign_rhs_code (stmt);
+      if ((opcode == BIT_AND_EXPR
+	   || opcode == BIT_IOR_EXPR
+	   || opcode == BIT_XOR_EXPR)
+	  && TREE_CODE (oe1->op) != INTEGER_CST
+	  && TREE_CODE (oe2->op) != INTEGER_CST
+	  && TREE_CODE (oe3->op) == INTEGER_CST)
+	{
+	  operand_entry temp = *oe3;
+	  oe3->op = oe1->op;
+	  oe3->rank = oe1->rank;
+	  oe1->op = temp.op;
+	  oe1->rank = temp.rank;
+	}
+    }
 }
 
 /* If definition of RHS1 or RHS2 dominates STMT, return the later of those