
PR 62173, re-shuffle insns for RTL loop invariant hoisting

Message ID n99vbgp35kl.fsf@arm.com
State New

Commit Message

Jiong Wang April 21, 2015, 2:24 p.m. UTC
Jiong Wang writes:

> 2015-04-14 18:24 GMT+01:00 Jeff Law <law@redhat.com>:
>> On 04/14/2015 10:48 AM, Steven Bosscher wrote:
>>>>
>>>> So I think this stage2/3 binary difference is acceptable?
>>>
>>>
>>> No, they should be identical. If there's a difference, then there's a
>>> bug - which, it seems, you've already found, too.
>>
>> RIght.  And so the natural question is how to fix.
>>
>> At first glance it would seem like having this new code ignore dependencies
>> rising from debug insns would work.
>>
>> Which then begs the question, what happens to the debug insn -- it's
>> certainly not going to be correct anymore if the transformation is made.
>
> Exactly.
>
> The debug_insn 2776 in my example is to record the base address of a
> local array. the new code is doing correctly here by not shuffling the
> operands of insn 2556 and 2557 as there is additional reference of
> reg:1473 from debug insn, although the code will still execute correctly
> if we do the transformation.
>
> my understanding to fix this:
>
>   * delete the out-of-date mismatch debug_insn? as there is no guarantee
>     to generate accurate debug info under -O2.
>
>     IMO, this debug_insn may affect "DW_AT_location" field for variable
>     descrption of "classes" in .debug_info section, but it's omitted in
>     the final output already.
>
>     <3><38a4d>: Abbrev Number: 137 (DW_TAG_variable)
>     <38a4f>   DW_AT_name : (indirect string, offset: 0x18db): classes
>     <38a53>   DW_AT_decl_file   : 1
>     <38a54>   DW_AT_decl_line   : 548
>     <38a56>   DW_AT_type        : <0x38cb4>
>
>   * update the debug_insn? if the following change is OK with dwarf standard
>
>    from
>
>      insn0: reg0 = fp + reg1
>      debug_insn: var_loc = reg0 + const_off
>      insn1: reg2 = reg0 + const_off
>
>    to
>
>      insn0: reg0 = fp + const_off
>      debug_insn: var_loc = reg0 + reg1
>      insn1: reg2 = reg0 + reg1
>
> Thanks,
>

And attachment is the new patch which will update debug_insn as
described in the second solution above.

Now the stage2/3 binary differences on AArch64 have gone away, and
bootstrap is OK.

On AArch64, this patch finds 600+ new RTL loop invariants across
spec2k6 float, and gives a +4.5% perf improvement on 436.cactusADM
because four new invariants are found in the critical function
"regex_compile".

Similar improvements may be achievable on other RISC backends like
powerpc/mips, I guess.

One thing to mention: on AArch64, one minor glitch in
aarch64_legitimize_address needs to be fixed to let this patch take
effect; I will send out that patch later as it's a separate issue.
Powerpc/MIPS don't have this glitch in their LEGITIMIZE_ADDRESS hooks,
so they should be OK, and I verified that the base address of the local
array in the testcase given by Seb in pr62173 is hoisted on ppc64 now.
I think pr62173 is fixed on those 64-bit arches by this patch.

Thoughts?

Thanks.

2015-04-21  Jiong Wang  <jiong.wang@arm.com>

gcc/
  * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
  (vfp_const_iv): New hash table.
  (expensive_addr_check_p): New boolean.
  (init_inv_motion_data): Initialize new variables.
  (free_inv_motion_data): Release hash table.
  (create_new_invariant): Set cheap_address to false for iv in
  vfp_const_iv table.
  (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
  table.
  (use_for_single_du): New function.
  (reshuffle_insn_with_vfp): Likewise.
  (find_invariants_bb): Call reshuffle_insn_with_vfp.

gcc/testsuite/
   * gcc.dg/pr62173.c: New testcase.

Comments

Jeff Law April 24, 2015, 1:55 a.m. UTC | #1
On 04/21/2015 08:24 AM, Jiong Wang wrote:
>
> Jiong Wang writes:
>
>> 2015-04-14 18:24 GMT+01:00 Jeff Law <law@redhat.com>:
>>> On 04/14/2015 10:48 AM, Steven Bosscher wrote:
>>>>>
>>>>> So I think this stage2/3 binary difference is acceptable?
>>>>
>>>>
>>>> No, they should be identical. If there's a difference, then there's a
>>>> bug - which, it seems, you've already found, too.
>>>
>>> RIght.  And so the natural question is how to fix.
>>>
>>> At first glance it would seem like having this new code ignore dependencies
>>> rising from debug insns would work.
>>>
>>> Which then begs the question, what happens to the debug insn -- it's
>>> certainly not going to be correct anymore if the transformation is made.
>>
>> Exactly.
>>
>> The debug_insn 2776 in my example is to record the base address of a
>> local array. the new code is doing correctly here by not shuffling the
>> operands of insn 2556 and 2557 as there is additional reference of
>> reg:1473 from debug insn, although the code will still execute correctly
>> if we do the transformation.
>>
>> my understanding to fix this:
>>
>>    * delete the out-of-date mismatch debug_insn? as there is no guarantee
>>      to generate accurate debug info under -O2.
>>
>>      IMO, this debug_insn may affect "DW_AT_location" field for variable
>>      descrption of "classes" in .debug_info section, but it's omitted in
>>      the final output already.
>>
>>      <3><38a4d>: Abbrev Number: 137 (DW_TAG_variable)
>>      <38a4f>   DW_AT_name : (indirect string, offset: 0x18db): classes
>>      <38a53>   DW_AT_decl_file   : 1
>>      <38a54>   DW_AT_decl_line   : 548
>>      <38a56>   DW_AT_type        : <0x38cb4>
>>
>>    * update the debug_insn? if the following change is OK with dwarf standard
>>
>>     from
>>
>>       insn0: reg0 = fp + reg1
>>       debug_insn: var_loc = reg0 + const_off
>>       insn1: reg2 = reg0 + const_off
>>
>>     to
>>
>>       insn0: reg0 = fp + const_off
>>       debug_insn: var_loc = reg0 + reg1
>>       insn1: reg2 = reg0 + reg1
>>
>> Thanks,
>>
>
> And attachment is the new patch which will update debug_insn as
> described in the second solution above.
>
> Now the stage2/3 binary differences on AArch64 gone away. Bootstrap OK.
>
> On AArch64, this patch give 600+ new rtl loop invariants found across
> spec2k6 float. +4.5% perf improvement on 436.cactusADM because four new
> invariants found in the critical function "regex_compile".
>
> The similar improvements may be achieved on other RISC backends like
> powerpc/mips I guess.
>
> One thing to mention, for AArch64, one minor glitch in
> aarch64_legitimize_address needs to be fixed to let this patch take
> effect, I will send out that patch later as it's a seperate issue.
> Powerpc/Mips don't have this glitch in LEGITIMIZE_ADDRESS hook, so
> should be OK, and I verified the base address of local array in the
> testcase given by Seb on pr62173 do hoisted on ppc64 now. I think
> pr62173 is fixed on those 64bit arch by this patch.
>
> Thoughts?
>
> Thanks.
>
> 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>
> gcc/
>    * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>    (vfp_const_iv): New hash table.
>    (expensive_addr_check_p): New boolean.
>    (init_inv_motion_data): Initialize new variables.>
>    (free_inv_motion_data): Release hash table.
>    (create_new_invariant): Set cheap_address to false for iv in
>    vfp_const_iv table.
>    (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
>    table.
>    (use_for_single_du): New function.
>    (reshuffle_insn_with_vfp): Likewise.
>    (find_invariants_bb): Call reshuffle_insn_with_vfp.
>
> gcc/testsuite/
>     * gcc.dg/pr62173.c: New testcase.
So ultimately isn't this just a specialized version of reassociation of 
pointer arithmetic?  And if so, don't we have all the usual problems 
around introducing overflow?

ISTM if this is going to go forward (ie, we somehow convince ourselves 
that this kind of reassociation is OK), then should it be made to apply 
on pointers in general rather than restricting to stack, frame, 
virtual-frame?

I wish I'd looked more closely at the patch the first time around so 
that I could have raised these questions sooner.

Jeff
Jiong Wang April 24, 2015, 4:18 p.m. UTC | #2
Jeff Law writes:

> On 04/21/2015 08:24 AM, Jiong Wang wrote:
>>
>> Jiong Wang writes:
>>
>>> 2015-04-14 18:24 GMT+01:00 Jeff Law <law@redhat.com>:
>>>> On 04/14/2015 10:48 AM, Steven Bosscher wrote:
>>>>>>
>>>>>> So I think this stage2/3 binary difference is acceptable?
>>>>>
>>>>>
>>>>> No, they should be identical. If there's a difference, then there's a
>>>>> bug - which, it seems, you've already found, too.
>>>>
>>>> RIght.  And so the natural question is how to fix.
>>>>
>>>> At first glance it would seem like having this new code ignore dependencies
>>>> rising from debug insns would work.
>>>>
>>>> Which then begs the question, what happens to the debug insn -- it's
>>>> certainly not going to be correct anymore if the transformation is made.
>>>
>>> Exactly.
>>>
>>> The debug_insn 2776 in my example is to record the base address of a
>>> local array. the new code is doing correctly here by not shuffling the
>>> operands of insn 2556 and 2557 as there is additional reference of
>>> reg:1473 from debug insn, although the code will still execute correctly
>>> if we do the transformation.
>>>
>>> my understanding to fix this:
>>>
>>>    * delete the out-of-date mismatch debug_insn? as there is no guarantee
>>>      to generate accurate debug info under -O2.
>>>
>>>      IMO, this debug_insn may affect "DW_AT_location" field for variable
>>>      descrption of "classes" in .debug_info section, but it's omitted in
>>>      the final output already.
>>>
>>>      <3><38a4d>: Abbrev Number: 137 (DW_TAG_variable)
>>>      <38a4f>   DW_AT_name : (indirect string, offset: 0x18db): classes
>>>      <38a53>   DW_AT_decl_file   : 1
>>>      <38a54>   DW_AT_decl_line   : 548
>>>      <38a56>   DW_AT_type        : <0x38cb4>
>>>
>>>    * update the debug_insn? if the following change is OK with dwarf standard
>>>
>>>     from
>>>
>>>       insn0: reg0 = fp + reg1
>>>       debug_insn: var_loc = reg0 + const_off
>>>       insn1: reg2 = reg0 + const_off
>>>
>>>     to
>>>
>>>       insn0: reg0 = fp + const_off
>>>       debug_insn: var_loc = reg0 + reg1
>>>       insn1: reg2 = reg0 + reg1
>>>
>>> Thanks,
>>>
>>
>> And attachment is the new patch which will update debug_insn as
>> described in the second solution above.
>>
>> Now the stage2/3 binary differences on AArch64 gone away. Bootstrap OK.
>>
>> On AArch64, this patch give 600+ new rtl loop invariants found across
>> spec2k6 float. +4.5% perf improvement on 436.cactusADM because four new
>> invariants found in the critical function "regex_compile".
>>
>> The similar improvements may be achieved on other RISC backends like
>> powerpc/mips I guess.
>>
>> One thing to mention, for AArch64, one minor glitch in
>> aarch64_legitimize_address needs to be fixed to let this patch take
>> effect, I will send out that patch later as it's a seperate issue.
>> Powerpc/Mips don't have this glitch in LEGITIMIZE_ADDRESS hook, so
>> should be OK, and I verified the base address of local array in the
>> testcase given by Seb on pr62173 do hoisted on ppc64 now. I think
>> pr62173 is fixed on those 64bit arch by this patch.
>>
>> Thoughts?
>>
>> Thanks.
>>
>> 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>>
>> gcc/
>>    * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>>    (vfp_const_iv): New hash table.
>>    (expensive_addr_check_p): New boolean.
>>    (init_inv_motion_data): Initialize new variables.>
>>    (free_inv_motion_data): Release hash table.
>>    (create_new_invariant): Set cheap_address to false for iv in
>>    vfp_const_iv table.
>>    (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
>>    table.
>>    (use_for_single_du): New function.
>>    (reshuffle_insn_with_vfp): Likewise.
>>    (find_invariants_bb): Call reshuffle_insn_with_vfp.
>>
>> gcc/testsuite/
>>     * gcc.dg/pr62173.c: New testcase.
> So ultimately isn't this just a specialized version of reassociation of 
> pointer arithmetic?  And if so, don't we have all the usual problems 
> around introducing overflow?
>
>
> ISTM if this is going to go forward (ie, we somehow convince ourselves 
> that this kind of reassociation is OK), then should it be made to apply 
> on pointers in general rather than restricting to stack, frame, 
> virtual-frame?
>

Jeff,

  Thanks for the review.

  This transformation is not reassociation of pointer arithmetic.
  
  The idea of this patch is: given two instructions with variable
  values, we may get a new instruction sequence with a fixed value by
  reassociating their operands. Currently GCC only generates such
  instruction sequences for local array accesses, as far as I know.

  For the statement "D.2707 = A[i]", GCC generates the following
  instruction sequence:

  (insn 92 91 93 6 (set (reg/f:DI 148)
        (plus:DI (reg/f:DI 64 sfp)
            (reg:DI 147 [ i ])))
     (expr_list:REG_DEAD (reg:DI 147 [ i ])
        (nil)))
        
  (insn 93 92 94 6 (set (reg:SI 149 [ D.2707 ])
        (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 148)
                    (const_int -16 [0xfffffffffffffff0])) [0 A S1 A8])))
     (expr_list:REG_DEAD (reg/f:DI 148)
        (nil)))


  Both insn 92 and insn 93 have variable values, but
  "(reg/f:DI 64 sfp)" in insn 92 and "const_int -16" in insn 93 are
  fixed values.

  So my patch will transform the above sequence into the following if
  "apply_change_group" succeeds. Then insn 92 has a fixed value and is
  thus loop invariant.
  
  (insn 92 88 97 5 (set (reg/f:DI 148)
        (plus:DI (reg/f:DI 64 sfp)
            (const_int -16 [0xfffffffffffffff0])))
     (nil))
     
  (insn 93 131 94 6 (set (reg:SI 149 [ D.2707 ])
        (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 148)
                    (reg:DI 147 [ i ])) [0 A S1 A8])))
     (expr_list:REG_DEAD (reg:DI 147 [ i ])
        (expr_list:REG_DEAD (reg/f:DI 148)
            (nil))))

  This transformation will only be done if the def of r148 in insn 92
  is the single def reaching its use in insn 93, and that use is also
  the only use.

  And if there were an overflow, it would also happen in the original
  insn 93, which then should not have been generated in the first
  place. So I think there is no need to do an overflow check.

  Does this explanation make sense?
Wang Jiong April 28, 2015, 12:06 p.m. UTC | #3
Hi Matthew,

2015-04-21 15:24 GMT+01:00 Jiong Wang <jiong.wang@arm.com>:

>
> 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>
> gcc/
>   * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>   (vfp_const_iv): New hash table.
>   (expensive_addr_check_p): New boolean.
>   (init_inv_motion_data): Initialize new variables.>
>   (free_inv_motion_data): Release hash table.
>   (create_new_invariant): Set cheap_address to false for iv in
>   vfp_const_iv table.
>   (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
>   table.
>   (use_for_single_du): New function.
>   (reshuffle_insn_with_vfp): Likewise.
>   (find_invariants_bb): Call reshuffle_insn_with_vfp.

Is it possible for you to test this patch on spec2k6 on MIPS,
especially 436.cactusADM? I am interested in the performance impact on
other load/store machines.

Thanks.

Regards,
Jiong
Matthew Fortune April 28, 2015, 1:56 p.m. UTC | #4
> Hi Matthew,
>
> 2015-04-21 15:24 GMT+01:00 Jiong Wang <jiong.wang@arm.com>:
>
> >
> > 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
> >
> > gcc/
> >   * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
> >   (vfp_const_iv): New hash table.
> >   (expensive_addr_check_p): New boolean.
> >   (init_inv_motion_data): Initialize new variables.>
> >   (free_inv_motion_data): Release hash table.
> >   (create_new_invariant): Set cheap_address to false for iv in
> >   vfp_const_iv table.
> >   (find_invariant_insn): Skip dependencies check for iv in
> vfp_const_iv
> >   table.
> >   (use_for_single_du): New function.
> >   (reshuffle_insn_with_vfp): Likewise.
> >   (find_invariants_bb): Call reshuffle_insn_with_vfp.
>
> Is it possible for you to test this patch on spec2k6 on MIPS?
> especially 436.cactusADM.
> I am interested in the performance impact on other load/store machines.

I will certainly try and fit it in. It may take me a week to get back to
you though as I'm away on vacation for a few days.

Matthew
Wang Jiong April 28, 2015, 2 p.m. UTC | #5
2015-04-28 14:56 GMT+01:00 Matthew Fortune <Matthew.Fortune@imgtec.com>:
>> Hi Matthew,
>>
>> 2015-04-21 15:24 GMT+01:00 Jiong Wang <jiong.wang@arm.com>:
>>
>> >
>> > 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>> >
>> > gcc/
>> >   * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>> >   (vfp_const_iv): New hash table.
>> >   (expensive_addr_check_p): New boolean.
>> >   (init_inv_motion_data): Initialize new variables.>
>> >   (free_inv_motion_data): Release hash table.
>> >   (create_new_invariant): Set cheap_address to false for iv in
>> >   vfp_const_iv table.
>> >   (find_invariant_insn): Skip dependencies check for iv in
>> vfp_const_iv
>> >   table.
>> >   (use_for_single_du): New function.
>> >   (reshuffle_insn_with_vfp): Likewise.
>> >   (find_invariants_bb): Call reshuffle_insn_with_vfp.
>>
>> Is it possible for you to test this patch on spec2k6 on MIPS?
>> especially 436.cactusADM.
>> I am interested in the performance impact on other load/store machines.
>
> I will certainly try and fit it in. It may take me a week to get back to
> you though as I'm away on vacation for a few days.

Thanks very much. I appreciate that.

And you can add "-fdump-rtl-loop2_invariant"; then we can see how many
new loop invariants are found at the RTL level across spec2k6 on MIPS.

Thanks.

Regards,
Jiong
Jeff Law May 14, 2015, 7:51 p.m. UTC | #6
On 04/24/2015 10:18 AM, Jiong Wang wrote:
>>> 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>>>
>>> gcc/
>>>     * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>>>     (vfp_const_iv): New hash table.
>>>     (expensive_addr_check_p): New boolean.
>>>     (init_inv_motion_data): Initialize new variables.>
>>>     (free_inv_motion_data): Release hash table.
>>>     (create_new_invariant): Set cheap_address to false for iv in
>>>     vfp_const_iv table.
>>>     (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
>>>     table.
>>>     (use_for_single_du): New function.
>>>     (reshuffle_insn_with_vfp): Likewise.
>>>     (find_invariants_bb): Call reshuffle_insn_with_vfp.
>>>
>>> gcc/testsuite/
>>>      * gcc.dg/pr62173.c: New testcase.
>> So ultimately isn't this just a specialized version of reassociation of
>> pointer arithmetic?  And if so, don't we have all the usual problems
>> around introducing overflow?
>>
>>
>> ISTM if this is going to go forward (ie, we somehow convince ourselves
>> that this kind of reassociation is OK), then should it be made to apply
>> on pointers in general rather than restricting to stack, frame,
>> virtual-frame?
>>
>
> Jeff,
>
>    Thanks for the review.
>
>    This transformation is not reassociation of pointer arithmetic.
                             ^^^


>
>    The idea of this patch is, given two instructions with variable value,
>    we may get new instruction sequences with fixed value by reassociating
>    their operands. And currently GCC only generate such instruction
>    sequences for local array accessing as far as I known.
Which is precisely reassociation of pointer arithmetic.

Given:
x = a + b
y = *(x + c)

You're generating:
x = a + c
y = *(x + b)

That's reassociation of pointer arithmetic.

For all kinds of reassociation we have to concern ourselves with adding 
overflow where it didn't already occur.  Assuming a 32-bit architecture, 
we could get overflow if a is 0x7fffffff, b is -4 and c is 3:

0x7fffffff + -4 = 0x7ffffffb
0x7ffffffb + 3 = 0x7ffffffe


If you make the transformation you're suggesting we get

0x7fffffff + 3 = 0x80000002  OVERFLOW
0x80000002 - 4 = 0x7ffffffe

Now if you always know pointers are unsigned, then the overflow is 
defined and you'd be OK.  But that's a property of the target and one 
that's not well modeled within GCC (we have POINTER_EXTEND_UNSIGNED 
which kind of tells us something in this space).

In addition to worrying about overflow, you have to worry about 
segmented architectures with implicit segment selection -- especially if 
the segment selection comes from the base register rather than from the 
entire effective address.

On such architectures a + b != b + a when used in a memory reference. 
The HPPA and mn103 ports immediately come to mind; I'm sure there's at 
least one more, since I recall helping someone with these issues in the 
past.  Anyway:

So a sequence like
x = a + b;
y = *(x + c)

can only be transformed into
x = a + c
y = *(x + b)

if and only if a + b would choose the same segment as a + c.  On the PA 
that wouldn't be true for something like

a = 0x40000000
b = 0x10
c = -4

a + b is 0x40000010
a + c is 0x3ffffffc

Those select different segments, even though the full effective address 
(a + b + c = 0x4000000c) is the same either way.

Sadly we don't expose this aspect of target characteristics at all.


To proceed, you'd probably need ways to conditionalize these 
transformations.  ie, target hooks.

Jeff
Jiong Wang May 14, 2015, 9:13 p.m. UTC | #7
Jeff Law writes:

> For all kinds of reassociation we have to concern ourselves with adding 
> overflow where it didn't already occur.  Assuming a 32 bit architecture 
> we could get overflow if A is 0x7fffffff, b is -4 and and c = 3
>
> 0x7fffffff + -4 = 0x7ffffffb
> 0x7ffffffb + 3 = 0x7ffffffe
>
>
> If you make the transformation you're suggesting we get
>
> 0x7fffffff + 3 = 0x80000002  OVERFLOW
> 0x80000002 - 4 = 0x7ffffffe
>
> Now if you always know pointers are unsigned, then the overflow is 
> defined and you'd be OK.  But that's a property of the target and one 
> that's not well modeled within GCC (we have POINTER_EXTEND_UNSIGNED 
> which kind of tells us something in this space).

I see, understood, cool! Thanks for such a detailed explanation.

The above scenario may indeed happen for general pointer arithmetic
reassociation.

One thing may make life easier: my reassociation is restricted to the
frame pointer. The "(plus (plus fp, index_reg) + const_off)" pattern
addresses some variable on the stack; index_reg and const_off are both
parts of that variable's stack offset, so reassociating them just
reorders the two parts of the stack offset. There may be a way to
prove the transformation will not add extra overflow risk, especially
when index_reg is unsigned.

I understand that for general pointer arithmetic reassociation there
is a big risk: the involved operands largely come from unrelated
instructions, and with no relationship between their values we can
deduce nothing.

>
> In addition to worrying about overflow, you have to worry about 
> segmented architectures with implicit segment selection -- especially if 
> the segment selection comes from the base register than the entire 
> effective address.
>

Hmm, understood!

This lets me recall something as dark as x86 segment descriptors in protected mode...

>
> Sadly we don't expose this aspect of target characteristics at all.
>
>
> To proceed, you'd probably need ways to conditionalize these 
> transformations.  ie, target hooks.
>
> Jeff
Jeff Law May 14, 2015, 10:21 p.m. UTC | #8
On 05/14/2015 03:13 PM, Jiong Wang wrote:
>
> Jeff Law writes:
>
>> For all kinds of reassociation we have to concern ourselves with adding
>> overflow where it didn't already occur.  Assuming a 32 bit architecture
>> we could get overflow if A is 0x7fffffff, b is -4 and and c = 3
>>
>> 0x7fffffff + -4 = 0x7ffffffb
>> 0x7ffffffb + 3 = 0x7ffffffe
>>
>>
>> If you make the transformation you're suggesting we get
>>
>> 0x7fffffff + 3 = 0x80000002  OVERFLOW
>> 0x80000002 - 4 = 0x7ffffffe
>>
>> Now if you always know pointers are unsigned, then the overflow is
>> defined and you'd be OK.  But that's a property of the target and one
>> that's not well modeled within GCC (we have POINTER_EXTEND_UNSIGNED
>> which kind of tells us something in this space).
>
> I see, understood, cool! Thanks for such detailed explanation.
>
> Above scenario do may happen for general pointer arith
> reassociation.
>
> One thing may make life easier as my reassociation is restricted within
> frame pointer. the "(plus (plus fp, index_reg) + const_off)" pattern was
> to address some variable on stack. index_reg, const_off were part of
> the stack offset of the variable. Reassociate them means reorder two
> parts of the stack offset. There may be way to prove the transformation
> will not add extra overflow risk, especially when the index_reg is
> unsigned.
>
> I understand for general pointer arith reassociation, there do have big
> risk, as the involved operands largely come from irrelevant instruction,
> no relationship between the values from those operands, we can deduce nothing.
Given the special status of SP, FP and ARGP and a known constant part, 
we can probably do something here.  More below...



>
>>
>> In addition to worrying about overflow, you have to worry about
>> segmented architectures with implicit segment selection -- especially if
>> the segment selection comes from the base register than the entire
>> effective address.
>>
>
> Hmm, understood!
>
> This let me recall something as dark as x86 segment descriptor in protecting mode...
Possibly; I've actually never studied the segmented aspects of the x86, 
but I'm painfully familiar with the others mentioned :(

My recollection for the segmented stuff on the PA is we only had a 
single guard page at both ends of the segment.  So we only allowed an 
offset of +-4k when doing address reassociations in legitimize_address. 
  This was possible because we had callouts from the right places in the 
RTL generators/optimizers to allow targets to rewrite address 
arithmetic.  So we could naturally bury the target details away from the 
code generator/optimizers.

So we could possibly parameterize the transformation around similar 
concepts.  The design issue here is it's introducing more target 
dependencies in places where we've really wanted to avoid them.  In 
theory the gimple optimizers are supposed to be target independent. 
Reality is some stuff bleeds into them (the one that's mentioned the 
most often is branch costing, but there's others).

*If* we decide to go forward with using some target hooks here, I'd be 
tempted to do two.  One would effectively be a tri-state: full 
reassociation, limited reassociation, no reassociation.  The second 
would bound the constants in the limited reassociation case.

Thoughts?

Jeff
Jiong Wang May 21, 2015, 8:46 p.m. UTC | #9
Jeff Law writes:

> On 05/14/2015 03:13 PM, Jiong Wang wrote:
>>
>> Jeff Law writes:
>>
>>> For all kinds of reassociation we have to concern ourselves with adding
>>> overflow where it didn't already occur.  Assuming a 32 bit architecture
>>> we could get overflow if A is 0x7fffffff, b is -4 and and c = 3
>>>
>>> 0x7fffffff + -4 = 0x7ffffffb
>>> 0x7ffffffb + 3 = 0x7ffffffe
>>>
>>>
>>> If you make the transformation you're suggesting we get
>>>
>>> 0x7fffffff + 3 = 0x80000002  OVERFLOW
>>> 0x80000002 - 4 = 0x7ffffffe
>>>
>>> Now if you always know pointers are unsigned, then the overflow is
>>> defined and you'd be OK.  But that's a property of the target and one
>>> that's not well modeled within GCC (we have POINTER_EXTEND_UNSIGNED
>>> which kind of tells us something in this space).
>>
>> I see, understood, cool! Thanks for such detailed explanation.
>>
>> Above scenario do may happen for general pointer arith
>> reassociation.
>>
>> One thing may make life easier as my reassociation is restricted within
>> frame pointer. the "(plus (plus fp, index_reg) + const_off)" pattern was
>> to address some variable on stack. index_reg, const_off were part of
>> the stack offset of the variable. Reassociate them means reorder two
>> parts of the stack offset. There may be way to prove the transformation
>> will not add extra overflow risk, especially when the index_reg is
>> unsigned.
>>
>> I understand for general pointer arith reassociation, there do have big
>> risk, as the involved operands largely come from irrelevant instruction,
>> no relationship between the values from those operands, we can deduce nothing.
> Given the special status of SP, FP and ARGP and a known constant part, 
> we can probably do something here.  More below...
>
>
>
>>
>>>
>>> In addition to worrying about overflow, you have to worry about
>>> segmented architectures with implicit segment selection -- especially if
>>> the segment selection comes from the base register than the entire
>>> effective address.
>>>
>>
>> Hmm, understood!
>>
>> This let me recall something as dark as x86 segment descriptor in protecting mode...
> Possibly, I've actually never studied the segmented aspects of the x86. 
>   But I'm painfully familiar with the others mentioned :(
>
> My recollection for the segmented stuff on the PA is we only had a 
> single guard page at both ends of the segment.  So we only allowed an 
> offset of +-4k when doing address reassociations in legitimize_address. 
>   This was possible because we had callouts from the right places in the 
> RTL generators/optimizers to allow targets to rewrite address 
> arithmetic.  So we could naturally bury the target details away from the 
> code generator/optimizers.
>
> So we could possibly parameterize the transformation around similar 
> concepts.  The design issue here is it's introducing more target 
> dependencies in places where we've really wanted to avoid them.  In 
> theory the gimple optimizers are supposed to be target independent. 
> Reality is some stuff bleeds into them (the one that's mentioned the 
> most often is branch costing, but there's others).
>
> *If* we decide to go forward with using some target hooks here.  I'd be 
> tempted to do 2.  One that's effective a tri-state.  Full reassociation, 
> limited reassociation, no reassociation.  The second would bound the 
> constants in the limited reassociation case.
>
> Thoughts?

Thanks for these thoughts.

I tried, but still can't prove this transformation will not introduce
extra pointer overflow, even given that it's reassociation with the
virtual frame pointer, although my first impression is that it will
not introduce extra risk in real applications.

I have done a quick check on hppa's legitimize_address. I see that for
(plus sym_ref, const_int), if const_int is beyond +-4K, that hook will
force the constant into a register, and then (plus reg, reg) is always
OK.

So for target hooks, my understanding of your idea is something like:

 A new hook, targetm.pointer_arith_reassociate (): if it returns -1 we
 support full reassociation, 0 means limited, and 1 means no
 reassociation at all. The default version returns -1, as most targets
 are OK with reassociation given that we can prove no overflow risk is
 introduced, while a target like HPPA should define this hook to
 return 0 for limited support.

 Then, if targetm.pointer_arith_reassociate () returns 0, we should
 further invoke a second hook, targetm.limited_reassociate_p (rtx x),
 to check that the reassociated rtx 'x' meets any restrictions; for
 example, on HPPA the constant part shouldn't be beyond +-4K.

I'm not sure whether my understanding is correct.
Jeff Law May 27, 2015, 4:04 p.m. UTC | #10
On 05/21/2015 02:46 PM, Jiong Wang wrote:
>
> Thanks for these thoughts.
>
> I tried but still can't prove this transformation will not introduce
> extra pointer overflow even given it's reassociation with vfp, although
> my first impression is it do will not introduce extra risk in real
> application.
>
> Have done a quick check on hppa's legitimize_address. I see for (plus
> sym_ref, const_int), if const_int is beyond +-4K, then that hook will
> force them into register, then (plus reg, reg) is always OK.
I'm virtually certain the PA's legitimize_address is not overflow safe. 
It was written long before we started worrying about overflows in 
address computations.  It was mostly concerned with trying to generate 
good addressing modes without running afoul of the implicit space 
register selection issues.

A SYMBOL_REF is always a valid base register.  However, as the comment 
in hppa_legitimize_address notes, we might be given a MEM for something 
like:  x[n-100000].

We don't want to rewrite that as (x-100000) + n, even though doing so 
would be beneficial for LICM.
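The hazard in that rewrite can be sketched on raw addresses (this is an
illustration, not GCC code): both evaluation orders reach the same final
address in modular arithmetic, but the reassociated order first forms
base - 100000, an intermediate that can point far outside the object and
can even wrap around the bottom of the address space.

```c
#include <stdint.h>

/* x[n - 100000] evaluated as written: with n >= 100000 the
   intermediate n - 100000 is a small non-negative index, so no
   out-of-range address is ever formed.  */
static uintptr_t
addr_as_written (uintptr_t base, uintptr_t n)
{
  return base + (n - 100000);
}

/* The LICM-friendly rewrite (x - 100000) + n: base - 100000 is
   computed first and may underflow when base is near the start of
   the address space.  */
static uintptr_t
addr_reassociated (uintptr_t base, uintptr_t n)
{
  return (base - 100000) + n;
}
```

For example, with base = 0x1000 the intermediate 0x1000 - 100000 wraps
to a huge value even though the final sums agree, which is exactly the
kind of intermediate a space-register-sensitive target cannot tolerate.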


>
> So for the target hooks, my understanding of your idea is something like:
>
>   a new hook targetm.pointer_arith_reassociate (): if it returns -1,
>   full reassociation is supported; 0 means limited support; 1 means no
>   reassociation should be done.  The default returns -1, as most targets
>   are OK with reassociation provided we can prove it introduces no
>   overflow risk.  For a target like HPPA, we should define this hook to
>   return 0 for limited support.
Right.  Rather than use magic constants, I'd suggest an enum for the 
tri-state.  FULL_PTR_REASSOCIATION, PARTIAL_PTR_REASSOCIATION, 
NO_PTR_REASSOCIATION.


>
>   Then, if targetm.pointer_arith_reassociate () returns 0, we should
>   further invoke a second hook, targetm.limited_reassociate_p (rtx x),
>   to check that the reassociated rtx 'x' meets the target's
>   restrictions; for HPPA, the constant part shouldn't be beyond +-4K.
Right.

Jeff
Jiong Wang Sept. 2, 2015, 1:36 p.m. UTC | #11
Jeff Law writes:

> On 05/21/2015 02:46 PM, Jiong Wang wrote:
>>
>> Thanks for these thoughts.
>>
>> I tried but still can't prove that this transformation will not
>> introduce extra pointer overflow, even given that it is reassociation
>> with the virtual frame pointer, although my first impression is that it
>> will not introduce extra risk in real applications.
>>
>> I have done a quick check on hppa's legitimize_address.  I see that for
>> (plus sym_ref, const_int), if const_int is beyond +-4K, that hook will
>> force the constant into a register, and then (plus reg, reg) is always OK.
> I'm virtually certain the PA's legitimize_address is not overflow safe.
> It was written long before we started worrying about overflows in
> address computations.  It was mostly concerned with trying to generate
> good addressing modes without running afoul of the implicit space
> register selection issues.
>
> A SYMBOL_REF is always a valid base register.  However, as the comment 
> in hppa_legitimize_address notes, we might be given a MEM for something 
> like:  x[n-100000].
>
> We don't want to rewrite that as (x-100000) + n, even though doing so 
> would be beneficial for LICM.
>
>
>>
>> So for the target hooks, my understanding of your idea is something like:
>>
>>   a new hook targetm.pointer_arith_reassociate (): if it returns -1,
>>   full reassociation is supported; 0 means limited support; 1 means no
>>   reassociation should be done.  The default returns -1, as most targets
>>   are OK with reassociation provided we can prove it introduces no
>>   overflow risk.  For a target like HPPA, we should define this hook to
>>   return 0 for limited support.
> Right.  Rather than use magic constants, I'd suggest an enum for the 
> tri-state.  FULL_PTR_REASSOCIATION, PARTIAL_PTR_REASSOCIATION, 
> NO_PTR_REASSOCIATION.
>
>
>>
>>   Then, if targetm.pointer_arith_reassociate () returns 0, we should
>>   further invoke a second hook, targetm.limited_reassociate_p (rtx x),
>>   to check that the reassociated rtx 'x' meets the target's
>>   restrictions; for HPPA, the constant part shouldn't be beyond +-4K.
> Right.
>
> Jeff

For the record, after Bin's recent tree-ssa-ivopts improvement, which
originated from PR62173, this patch is no longer beneficial.

I can't see such re-shuffling opportunities at the RTL level anymore.
This patch was trying to hoist the RTL sequences generated for local
array base addresses that hadn't been hoisted out of the loop at the
tree level; those are now handled quite well by tree-ssa-ivopts.

During both aarch64 and mips64 bootstraps, the optimization in this
patch was never triggered, while it triggered quite a few times before
Bin's tree-level fix.

I have stopped working on this patch.  Thanks for the time spent
reviewing and discussing it.
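For reference, the loop shape this patch targeted looks roughly like the
sketch below (names are made up for illustration): the base address of a
local array is a frame-pointer-relative computation that, before the
ivopts fix, could be left inside the loop at the RTL level, which is
what the re-shuffling was meant to expose to LICM.

```c
/* Illustrative only: accesses to a local (stack) array inside a loop.
   Each access needs fp + offset(classes) + i * sizeof (int); the
   fp + offset(classes) part is loop invariant once reassociated.  */
static int
sum_local_array (int n)
{
  int classes[100];
  int s = 0;

  for (int i = 0; i < n; i++)
    classes[i] = i;

  for (int i = 0; i < n; i++)
    s += classes[i];

  return s;
}
```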
Jeff Law Sept. 2, 2015, 8:51 p.m. UTC | #12
On 09/02/2015 07:36 AM, Jiong Wang wrote:

> For the record, after Bin's recent tree-ssa-ivopts improvement, which
> originated from PR62173, this patch is no longer beneficial.
It happens sometimes.

>
> I have stopped working on this patch.  Thanks for the time spent
> reviewing and discussing it.
No problem, thanks for testing after Bin's changes and letting me know 
that this has become a non-issue.

Jeff
diff mbox

Patch

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index f79b497..f70dfb0 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -203,6 +203,8 @@  typedef struct invariant *invariant_p;
 /* The invariants.  */
 
 static vec<invariant_p> invariants;
+static hash_table <pointer_hash <rtx_insn> > *vfp_const_iv;
+static bool need_expensive_addr_check_p;
 
 /* Check the size of the invariant table and realloc if necessary.  */
 
@@ -695,7 +697,7 @@  find_defs (struct loop *loop)
 
   df_remove_problem (df_chain);
   df_process_deferred_rescans ();
-  df_chain_add_problem (DF_UD_CHAIN);
+  df_chain_add_problem (DF_UD_CHAIN + DF_DU_CHAIN);
   df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
   df_analyze_loop (loop);
   check_invariant_table_size ();
@@ -742,6 +744,9 @@  create_new_invariant (struct def *def, rtx_insn *insn, bitmap depends_on,
 	 See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
 					 ADDR_SPACE_GENERIC, speed) < 3;
+
+      if (need_expensive_addr_check_p && vfp_const_iv->find (insn))
+	inv->cheap_address = false;
     }
   else
     {
@@ -952,7 +957,8 @@  find_invariant_insn (rtx_insn *insn, bool always_reached, bool always_executed)
     return;
 
   depends_on = BITMAP_ALLOC (NULL);
-  if (!check_dependencies (insn, depends_on))
+  if (!vfp_const_iv->find (insn)
+      && !check_dependencies (insn, depends_on))
     {
       BITMAP_FREE (depends_on);
       return;
@@ -1007,6 +1013,180 @@  find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
   record_uses (insn);
 }
 
+/*  Find the single use of the def of INSN.  At the same time, make
+    sure the def of INSN is the single def reaching that use.
+    NOTE: debug insns can affect DF; a var_location debug insn may
+    reference the same rtl expr as its variable location.  Such a debug
+    insn should not block the insn shuffling, but its location must be
+    updated afterwards, so we record it here.  */
+
+static rtx_insn *
+use_for_single_du (rtx_insn *insn, rtx_insn **debug_insn, rtx *debug_expr_loc)
+{
+  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  df_ref def;
+
+  FOR_EACH_INSN_INFO_DEF (def, insn_info)
+    {
+      struct df_link *chain = DF_REF_CHAIN (def);
+
+      if (!chain)
+	return NULL;
+
+      /* For multiple uses, only allow the second use to be a debug insn.  */
+      if (chain->next
+	  && (chain->next->next
+	      || NONDEBUG_INSN_P (DF_REF_INSN (chain->next->ref))))
+	return NULL;
+
+      if (chain->next)
+	{
+	  rtx_insn *insn = DF_REF_INSN (chain->next->ref);
+	  rtx debug_pat = PATTERN (insn);
+	  if (GET_CODE (debug_pat) != VAR_LOCATION)
+	    return NULL;
+	  else
+	    {
+	      *debug_insn = insn;
+	      *debug_expr_loc = PAT_VAR_LOCATION_LOC (debug_pat);
+	    }
+	}
+
+      df_ref use = chain->ref;
+      chain = DF_REF_CHAIN (use);
+
+      if (!chain
+	  || chain->next)
+	return NULL;
+
+      return DF_REF_INSN (use);
+    }
+
+  return NULL;
+}
+
+/* This function tries to transform
+
+     RA <- fixed_reg + RC
+     RD <- MEM (RA + const_offset)
+
+   into:
+
+     RA <- fixed_reg + const_offset
+     RD <- MEM (RA + RC) <- pos0
+
+  The transformation applies if the use of RA in the second insn is its
+  single use, and the def in the first insn is its single reaching def.
+
+  After this change, the first instruction is loop invariant.  */
+
+static void
+reshuffle_insn_with_vfp (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+
+  if (!set
+      || GET_CODE (SET_SRC (set)) != PLUS)
+    return;
+
+  rtx dest = SET_DEST (set);
+  rtx op0 = XEXP (SET_SRC (set), 0);
+  rtx op1 = XEXP (SET_SRC (set), 1);
+  rtx non_vfp_op = op1;
+
+  if (op1 == frame_pointer_rtx
+      || op1 == stack_pointer_rtx
+      || op1 == virtual_stack_vars_rtx)
+    {
+      non_vfp_op = op0;
+      std::swap (op0, op1);
+    }
+
+  if (GET_CODE (op1) == SIGN_EXTEND
+      || GET_CODE (op1) == ZERO_EXTEND)
+    op1 = XEXP (op1, 0);
+
+  if (!(REG_P (dest) && REG_P (op0) && REG_P (op1)
+	&& (op0 == frame_pointer_rtx
+	    || op0 == stack_pointer_rtx
+	    || op0 == virtual_stack_vars_rtx)))
+    return;
+
+  rtx_insn *use_insn, *debug_insn = NULL;
+  rtx debug_expr_loc;
+
+  if (!(use_insn = use_for_single_du (insn, &debug_insn, &debug_expr_loc)))
+    return;
+
+  rtx u_set = single_set (use_insn);
+
+  if (!(u_set && (REG_P (SET_DEST (u_set)) || MEM_P (SET_DEST (u_set)))))
+    return;
+
+  rtx mem_addr;
+
+  if (REG_P (SET_DEST (u_set)))
+    mem_addr = SET_SRC (u_set);
+  else
+    mem_addr = SET_DEST (u_set);
+
+  if (debug_insn && !rtx_equal_p (debug_expr_loc, mem_addr))
+    return;
+
+  if (GET_CODE (mem_addr) == ZERO_EXTEND
+      || GET_CODE (mem_addr) == SIGN_EXTEND
+      || GET_CODE (mem_addr) == TRUNCATE)
+    mem_addr = XEXP (mem_addr, 0);
+
+  if (!MEM_P (mem_addr)
+      || GET_CODE (XEXP (mem_addr, 0)) != PLUS)
+    return;
+
+  rtx *mem_plus_loc = &XEXP (mem_addr, 0);
+  rtx u_op0 = XEXP (XEXP (mem_addr, 0), 0);
+  rtx u_op1 = XEXP (XEXP (mem_addr, 0), 1);
+
+  if (!(REG_P (u_op0) && CONST_INT_P (u_op1)
+	&& REGNO (dest) == REGNO (u_op0)))
+    return;
+
+  rtx new_src = plus_constant (GET_MODE (op0), op0, INTVAL (u_op1));
+  validate_change (insn, &SET_SRC (set), new_src, 1);
+  new_src = gen_rtx_PLUS (GET_MODE (u_op0), u_op0, non_vfp_op);
+  validate_change (use_insn, mem_plus_loc, new_src, 1);
+  if (debug_insn)
+    validate_change (debug_insn, &XEXP(debug_expr_loc, 0), copy_rtx (new_src),
+		     1);
+
+  if (apply_change_group ())
+    {
+      rtx_insn **slot = vfp_const_iv->find_slot (insn, INSERT);
+
+      if (*slot)
+	gcc_unreachable ();
+      else
+	*slot = insn;
+
+      need_expensive_addr_check_p = true;
+
+      /* We should update REG_DEAD info also.  */
+      rtx note;
+      if ((note = find_reg_note (insn, REG_DEAD, non_vfp_op)))
+	{
+	  remove_note (insn, note);
+	  add_shallow_copy_of_reg_note (use_insn, note);
+	}
+
+      if (dump_file)
+	fprintf (dump_file,
+		 "\nRe-associate insn %d and %d for later"
+		 " RTL loop invariant hoisting.\n",
+		 INSN_UID (insn), INSN_UID (use_insn));
+    }
+
+  return;
+}
+
 /* Finds invariants in basic block BB.  ALWAYS_REACHED is true if the
    basic block is always executed.  ALWAYS_EXECUTED is true if the basic
    block is always executed, unless the program ends due to a function
@@ -1022,6 +1197,8 @@  find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
       if (!NONDEBUG_INSN_P (insn))
 	continue;
 
+      reshuffle_insn_with_vfp (insn);
+
       find_invariants_insn (insn, always_reached, always_executed);
 
       if (always_reached
@@ -1647,6 +1824,9 @@  move_invariants (struct loop *loop)
 static void
 init_inv_motion_data (void)
 {
+  need_expensive_addr_check_p = false;
+  vfp_const_iv = new hash_table <pointer_hash <rtx_insn> > (4);
+
   actual_stamp = 1;
 
   invariants.create (100);
@@ -1682,6 +1862,8 @@  free_inv_motion_data (void)
       free (inv);
     }
   invariants.release ();
+
+  delete vfp_const_iv;
 }
 
 /* Move the invariants out of the LOOP.  */