
PR 62173, re-shuffle insns for RTL loop invariant hoisting

Message ID 54803EBE.2060607@arm.com
State New

Commit Message

Jiong Wang Dec. 4, 2014, 11 a.m. UTC
For PR 62173, the ideal solution is to resolve the problem in the tree-level ivopts pass.

However, apart from the tree-level issue, PR 62173 also exposed two other RTL-level issues.
One of them is that it looks like we could improve RTL-level loop invariant hoisting by re-shuffling insns.
For Seb's testcase:

void bar(int i) {
   char A[10];
   int d = 0;
   while (i > 0)
   A[d++] = i--;

   while (d > 0)
   foo(A[d--]);
}

The insn sequence to calculate A[i]'s address looks like:

(insn 76 75 77 22 (set (reg/f:DI 109)
   (plus:DI (reg/f:DI 64 sfp)
   (reg:DI 108 [ i ]))) seb-pop.c:8 84 {*adddi3_aarch64}
   (expr_list:REG_DEAD (reg:DI 108 [ i ])
   (nil)))
(insn 77 76 78 22 (set (reg:SI 110 [ D.2633 ])
   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 109)
   (const_int -16 [0xfffffffffffffff0])) [0 A S1 A8]))) seb-pop.c:8 76 {*zero_extendqisi2_aarch64}
   (expr_list:REG_DEAD (reg/f:DI 109)
   (nil)))

For most RISC archs, reg + reg addressing is typical, so we could re-shuffle
the instruction sequence into the following:

(insn 96 94 97 22 (set (reg/f:DI 129)
   (plus:DI (reg/f:DI 64 sfp)
   (const_int -16 [0xfffffffffffffff0]))) seb-pop.c:8 84 {*adddi3_aarch64}
   (nil))
(insn 97 96 98 22 (set (reg:DI 130 [ i ])
   (sign_extend:DI (reg/v:SI 97 [ i ]))) seb-pop.c:8 70 {*extendsidi2_aarch64}
   (expr_list:REG_DEAD (reg/v:SI 97 [ i ])
   (nil)))
(insn 98 97 99 22 (set (reg:SI 131 [ D.2633 ])
   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 129)
   (reg:DI 130 [ i ])) [0 A S1 A8]))) seb-pop.c:8 76 {*zero_extendqisi2_aarch64}
   (expr_list:REG_DEAD (reg:DI 130 [ i ])
   (expr_list:REG_DEAD (reg/f:DI 129)
   (nil))))

which means re-associating the constant immediate with the virtual frame pointer.

transform

      RA <- fixed_reg + RC
      RD <- MEM (RA + const_offset)

   into:

      RA <- fixed_reg + const_offset
      RD <- MEM (RA + RC)

RA <- fixed_reg + const_offset is then loop invariant, so the later RTL
GCSE/PRE pass can catch it and do the hoisting, and thus ameliorate what
tree-level ivopts could not sort out.

This patch only tries to re-shuffle instructions within a single basic block that
forms an inner loop, which is where this is performance critical.

I am reusing the loop info in fwprop because loop info is already available there
and fwprop runs before GCSE.

Verified on aarch64 and mips64: the array base address is hoisted out of the loop.

Bootstrap OK on x86-64 and aarch64.

Comments?

Thanks.

gcc/
   PR 62173
   * fwprop.c (prepare_for_gcse_pre): New function.
   (fwprop_done): Call it.

Comments

Richard Biener Dec. 4, 2014, 11:07 a.m. UTC | #1
On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
> For PR62173, the ideal solution is to resolve the problem on tree level
> ivopt pass.
>
> While, apart from the tree level issue, PR 62173 also exposed another two
> RTL level issues.
> one of them is looks like we could improve RTL level loop invariant hoisting
> by re-shuffle insns.
>
> for Seb's testcase
>
> void bar(int i) {
>   char A[10];
>   int d = 0;
>   while (i > 0)
>   A[d++] = i--;
>
>   while (d > 0)
>   foo(A[d--]);
> }
>
> the insn sequences to calculate A[I]'s address looks like:
>
> (insn 76 75 77 22 (set (reg/f:DI 109)
>   (plus:DI (reg/f:DI 64 sfp)
>   (reg:DI 108 [ i ]))) seb-pop.c:8 84 {*adddi3_aarch64}
>   (expr_list:REG_DEAD (reg:DI 108 [ i ])
>   (nil)))
> (insn 77 76 78 22 (set (reg:SI 110 [ D.2633 ])
>   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 109)
>   (const_int -16 [0xfffffffffffffff0])) [0 A S1 A8]))) seb-pop.c:8 76
> {*zero_extendqisi2_aarch64}
>   (expr_list:REG_DEAD (reg/f:DI 109)
>   (nil)))
>
> while for most RISC archs, reg + reg addressing is typical, so if we
> re-shuffle
> the instruction sequences into the following:
>
> (insn 96 94 97 22 (set (reg/f:DI 129)
>   (plus:DI (reg/f:DI 64 sfp)
>   (const_int -16 [0xfffffffffffffff0]))) seb-pop.c:8 84 {*adddi3_aarch64}
>   (nil))
> (insn 97 96 98 22 (set (reg:DI 130 [ i ])
>   (sign_extend:DI (reg/v:SI 97 [ i ]))) seb-pop.c:8 70
> {*extendsidi2_aarch64}
>   (expr_list:REG_DEAD (reg/v:SI 97 [ i ])
>   (nil)))
> (insn 98 97 99 22 (set (reg:SI 131 [ D.2633 ])
>   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 129)
>   (reg:DI 130 [ i ])) [0 A S1 A8]))) seb-pop.c:8 76
> {*zero_extendqisi2_aarch64}
>   (expr_list:REG_DEAD (reg:DI 130 [ i ])
>   (expr_list:REG_DEAD (reg/f:DI 129)
>   (nil))))
>
> which means re-associate the constant imm with the virtual frame pointer.
>
> transform
>
>      RA <- fixed_reg + RC
>      RD <- MEM (RA + const_offset)
>
>   into:
>
>      RA <- fixed_reg + const_offset
>      RD <- MEM (RA + RC)
>
> then RA <- fixed_reg + const_offset is actually loop invariant, so the later
> RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
> what tree
> level ivopts could not sort out.

There is a LIM pass after GIMPLE ivopts - if the invariantness is already
visible there, why not handle it there, similar to the special cases in
rewrite_bittest and rewrite_reciprocal?

And of course similar tricks could be applied on the RTL level to
RTL invariant motion?

Thanks,
Richard.

> and this patch only tries to re-shuffle instructions within single basic
> block which
> is a inner loop which is perf critical.
>
> I am reusing the loop info in fwprop because there is loop info and it's run
> before
> GCSE.
>
> verified on aarch64 and mips64, the array base address hoisted out of loop.
>
> bootstrap ok on x86-64 and aarch64.
>
> comments?
>
> thanks.
>
> gcc/
>   PR62173
>   fwprop.c (prepare_for_gcse_pre): New function.
>   (fwprop_done): Call it.
Richard Biener Dec. 4, 2014, 11:07 a.m. UTC | #2
On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>> For PR62173, the ideal solution is to resolve the problem on tree level
>> ivopt pass.
>>
>> While, apart from the tree level issue, PR 62173 also exposed another two
>> RTL level issues.
>> one of them is looks like we could improve RTL level loop invariant hoisting
>> by re-shuffle insns.
>>
>> for Seb's testcase
>>
>> void bar(int i) {
>>   char A[10];
>>   int d = 0;
>>   while (i > 0)
>>   A[d++] = i--;
>>
>>   while (d > 0)
>>   foo(A[d--]);
>> }
>>
>> the insn sequences to calculate A[I]'s address looks like:
>>
>> (insn 76 75 77 22 (set (reg/f:DI 109)
>>   (plus:DI (reg/f:DI 64 sfp)
>>   (reg:DI 108 [ i ]))) seb-pop.c:8 84 {*adddi3_aarch64}
>>   (expr_list:REG_DEAD (reg:DI 108 [ i ])
>>   (nil)))
>> (insn 77 76 78 22 (set (reg:SI 110 [ D.2633 ])
>>   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 109)
>>   (const_int -16 [0xfffffffffffffff0])) [0 A S1 A8]))) seb-pop.c:8 76
>> {*zero_extendqisi2_aarch64}
>>   (expr_list:REG_DEAD (reg/f:DI 109)
>>   (nil)))
>>
>> while for most RISC archs, reg + reg addressing is typical, so if we
>> re-shuffle
>> the instruction sequences into the following:
>>
>> (insn 96 94 97 22 (set (reg/f:DI 129)
>>   (plus:DI (reg/f:DI 64 sfp)
>>   (const_int -16 [0xfffffffffffffff0]))) seb-pop.c:8 84 {*adddi3_aarch64}
>>   (nil))
>> (insn 97 96 98 22 (set (reg:DI 130 [ i ])
>>   (sign_extend:DI (reg/v:SI 97 [ i ]))) seb-pop.c:8 70
>> {*extendsidi2_aarch64}
>>   (expr_list:REG_DEAD (reg/v:SI 97 [ i ])
>>   (nil)))
>> (insn 98 97 99 22 (set (reg:SI 131 [ D.2633 ])
>>   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 129)
>>   (reg:DI 130 [ i ])) [0 A S1 A8]))) seb-pop.c:8 76
>> {*zero_extendqisi2_aarch64}
>>   (expr_list:REG_DEAD (reg:DI 130 [ i ])
>>   (expr_list:REG_DEAD (reg/f:DI 129)
>>   (nil))))
>>
>> which means re-associate the constant imm with the virtual frame pointer.
>>
>> transform
>>
>>      RA <- fixed_reg + RC
>>      RD <- MEM (RA + const_offset)
>>
>>   into:
>>
>>      RA <- fixed_reg + const_offset
>>      RD <- MEM (RA + RC)
>>
>> then RA <- fixed_reg + const_offset is actually loop invariant, so the later
>> RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
>> what tree
>> level ivopts could not sort out.
>
> There is a LIM pass after gimple ivopts - if the invariantness is already
> visible there why not handle it there similar to the special-cases in
> rewrite_bittest and rewrite_reciprocal?
>
> And of course similar tricks could be applied on the RTL level to
> RTL invariant motion?

Oh, and the patch misses a testcase.

> Thanks,
> Richard.
>
>> and this patch only tries to re-shuffle instructions within single basic
>> block which
>> is a inner loop which is perf critical.
>>
>> I am reusing the loop info in fwprop because there is loop info and it's run
>> before
>> GCSE.
>>
>> verified on aarch64 and mips64, the array base address hoisted out of loop.
>>
>> bootstrap ok on x86-64 and aarch64.
>>
>> comments?
>>
>> thanks.
>>
>> gcc/
>>   PR62173
>>   fwprop.c (prepare_for_gcse_pre): New function.
>>   (fwprop_done): Call it.
Jiong Wang Dec. 4, 2014, 7:32 p.m. UTC | #3
On 04/12/14 11:07, Richard Biener wrote:

> On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>>
>>
>> which means re-associate the constant imm with the virtual frame pointer.
>>
>> transform
>>
>>       RA <- fixed_reg + RC
>>       RD <- MEM (RA + const_offset)
>>
>>    into:
>>
>>       RA <- fixed_reg + const_offset
>>       RD <- MEM (RA + RC)
>>
>> then RA <- fixed_reg + const_offset is actually loop invariant, so the later
>> RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
>> what tree
>> level ivopts could not sort out.
>> There is a LIM pass after gimple ivopts - if the invariantness is already
>> visible there why not handle it there similar to the special-cases in
>> rewrite_bittest and rewrite_reciprocal?

Maybe; that needs further checking.

>>
>> And of course similar tricks could be applied on the RTL level to
>> RTL invariant motion?

Thanks. I just checked the code; yes, the loop invariant motion pass
is the natural place to integrate such a multi-insn invariant analysis trick.

That code could be integrated into loop-invariant.c cleanly, but then I found
that although the invariant is detected successfully, it is not moved out of
the loop because of a cost issue, and it looks like the patch committed to fix
PR 33928 tries to prevent such cheap addresses from being hoisted, to reduce
register pressure.


  805       /* ??? Try to determine cheapness of address computation.  Unfortunately
  806          the address cost is only a relative measure, we can't really compare
  807          it with any absolute number, but only with other address costs.
  808          But here we don't have any other addresses, so compare with a magic
  809          number anyway.  It has to be large enough to not regress PR33928
  810          (by avoiding to move reg+8,reg+16,reg+24 invariants), but small
  811          enough to not regress 410.bwaves either (by still moving reg+reg
  812          invariants).
  813          See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
  814       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
  815                                          ADDR_SPACE_GENERIC, speed) < 3;


I think that may be necessary for x86, which is short of registers, while for RISC
it may not be, especially when the overall register pressure is not high.

Currently, what I can think of is that for the transformation below we should
increase the cost:
  
A
==
      RA <- virtual_stack_var + RC
      RD <- MEM (RA + const_offset)

   into:

B
==
      RA <- virtual_stack_var + const_offset  <--B
      RD <- MEM (RA + RC)

because the cost is not that cheap: if there is no re-association of virtual_stack_var
with const_offset, then LRA elimination will create another instruction to hold the
elimination result, so form A will actually be

      RT <- real_stack_pointer + elimination_offset
      RA <- RT + RC
      RD <- MEM (RA + const_offset)

So the re-association and later hoisting of invariant B can actually save two instructions
in the loop; this is why there is a 15% perf gap for bzip2 in some situations.

On aarch64, for Seb's testcase,

before IV hoisting
===
         b       .L4
.L37:
         sub     w19, w19, #1
.L4:
	add	x1, x29, 48
	add	x0, x1, x0, sxtw
	ldrb	w0, [x0, -16]
         bl      foo
         mov     w0, w19
         cbnz    w19, .L37
         ldr     x19, [sp, 16]
         ldp     x29, x30, [sp], 48
         ret

after this transformation and the IV hoisting
(needs one extra callee-saved register, x20)
===
.L3:
         add     x20, x29, 32  <-- IV hoisted
         b       .L4
.L37:
         sub     w19, w19, #1
.L4:
         ldrb    w0, [x20, w0, sxtw]
         bl      foo
         mov     w0, w19
         cbnz    w19, .L37
         ldp     x19, x20, [sp, 16]
         ldp     x29, x30, [sp], 48
         ret

> Oh, and the patch misses a testcase.

It's at the start of this email; I will include it in the patch.

void bar(int i) {
   char A[10];
   int d = 0;
   while (i > 0)
   A[d++] = i--;

   while (d > 0)
   foo(A[d--]);
}

>
>> Thanks,
>> Richard.
>>
>>> and this patch only tries to re-shuffle instructions within single basic
>>> block which
>>> is a inner loop which is perf critical.
>>>
>>> I am reusing the loop info in fwprop because there is loop info and it's run
>>> before
>>> GCSE.
>>>
>>> verified on aarch64 and mips64, the array base address hoisted out of loop.
>>>
>>> bootstrap ok on x86-64 and aarch64.
>>>
>>> comments?
>>>
>>> thanks.
>>>
>>> gcc/
>>>    PR62173
>>>    fwprop.c (prepare_for_gcse_pre): New function.
>>>    (fwprop_done): Call it.
Jiong Wang Dec. 15, 2014, 3:28 p.m. UTC | #4
On 04/12/14 19:32, Jiong Wang wrote:
> On 04/12/14 11:07, Richard Biener wrote:
>
>> On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>>>
>>>
>>> which means re-associate the constant imm with the virtual frame pointer.
>>>
>>> transform
>>>
>>>        RA <- fixed_reg + RC
>>>        RD <- MEM (RA + const_offset)
>>>
>>>     into:
>>>
>>>        RA <- fixed_reg + const_offset
>>>        RD <- MEM (RA + RC)
>>>
>>> then RA <- fixed_reg + const_offset is actually loop invariant, so the later
>>> RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
>>> what tree
>>> level ivopts could not sort out.
>>> There is a LIM pass after gimple ivopts - if the invariantness is already
>>> visible there why not handle it there similar to the special-cases in
>>> rewrite_bittest and rewrite_reciprocal?
> maybe, needs further check.
>
>>> And of course similar tricks could be applied on the RTL level to
>>> RTL invariant motion?
> Thanks. I just checked the code, yes, loop invariant motion pass
> is the natural place to integrate such multi-insns invariant analysis trick.
>
> those code could be integrated into loop-invariant.c cleanly, but
> then I found although the invariant detected successfully but it's not moved
> out of loop because of cost issue, and looks like the patch committed to fix PR33928
> is trying to prevent such cheap address be hoisted to reduce register pressure.
>
>
>    805       /* ??? Try to determine cheapness of address computation.  Unfortunately
>    806          the address cost is only a relative measure, we can't really compare
>    807          it with any absolute number, but only with other address costs.
>    808          But here we don't have any other addresses, so compare with a magic
>    809          number anyway.  It has to be large enough to not regress PR33928
>    810          (by avoiding to move reg+8,reg+16,reg+24 invariants), but small
>    811          enough to not regress 410.bwaves either (by still moving reg+reg
>    812          invariants).
>    813          See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
>    814       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
>    815                                          ADDR_SPACE_GENERIC, speed) < 3;
>
>
> I think that maybe necessary for x86 which is short of register, while for RISC,
> it may not be that necessary, especially the whole register pressure is not big.
>
> currently, what I could think of is for this transformation below, we should
> increase the costs:
>    
> A
> ==
>        RA <- virtual_stack_var + RC
>        RD <- MEM (RA + const_offset)
>
>     into:
>
> B
> ==
>        RA <- virtual_stack_var + const_offset  <--B
>        RD <- MEM (RA + RC)
>
> because the cost is not that cheap, if there is not re-assocation of virtual_stack_var
> with const_offset, then lra elimination will create another instruction to hold the
> elimination result, so format A will actually be
>
>        RT <- real_stack_pointer + elimination_offset
>        RA <- RT + RC
>        RD <- MEM (RA + const_offset)
>
> so, the re-assocation and later hoisting of invariant B could actually save two instructions
> in the loop, this is why there are 15% perf gap for bzip2 under some situation.

Updated patch.

I moved this instruction shuffling trick to the RTL loop invariant pass.
As described above, this patch tries to transform A into B, so that
after the transformation:

   * RA <- virtual_stack_var + const_offset can be hoisted out of the loop, and
   * the work of LRA elimination is eased, as virtual_stack_var is already associated
     with const_offset, so the elimination offset can be combined with const_offset
     automatically.

The current RTL loop invariant pass treats "reg <- reg + off" as a cheap address, and although
"reg <- virtual_stack_var + offset" falls into the same form, it is not that cheap, because
hoisting it also saves one LRA elimination instruction.  So this patch marks
"reg <- virtual_stack_var + offset" addresses produced by the A-to-B transformation as
expensive, so that they can be hoisted later.
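
For illustration only, here is a rough sketch (not the actual patch, which per the
ChangeLog tracks such addresses through an expensive_addr hash table filled in during
the re-shuffle) of the kind of check involved: recognizing a single_set of the form
"reg <- fixed_base + const_int" so the caller could avoid marking it as a cheap address.
The function name reshuffled_stack_address_p is made up for this sketch.

/* Sketch only: return true if INSN is a single_set of the form
   "reg <- fixed_base + const_int", where fixed_base is the frame
   pointer, stack pointer or virtual stack vars pointer.  Such a set,
   when produced by the A-to-B re-association above, should not be
   treated as a cheap address by the invariant cost model.  */

static bool
reshuffled_stack_address_p (rtx_insn *insn)
{
  rtx set = single_set (insn);
  if (!set || GET_CODE (SET_SRC (set)) != PLUS)
    return false;

  rtx op0 = XEXP (SET_SRC (set), 0);
  rtx op1 = XEXP (SET_SRC (set), 1);

  return (REG_P (SET_DEST (set))
          && CONST_INT_P (op1)
          && (op0 == frame_pointer_rtx
              || op0 == stack_pointer_rtx
              || op0 == virtual_stack_vars_rtx));
}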

After the patch, PR 62173 is fixed on powerpc64, while it is *still not fixed on aarch64*,
because there is a glitch in aarch64_legitimize_address which causes unnecessarily complex
instruction sequences to be generated when legitimizing some addresses.  If we fix that, we
get cheaper addresses for those cases, which is generally good, and the cheaper addresses
cause tree-level IVOPTS to do more optimization, which is also generally good; but the
SPEC2000 results show around a 1% code size regression in two cases.  The reason is that on
targets which support post-index addressing, doing IVOPTS may not always be the best choice,
because we lose the advantage of post-index addressing.

On aarch64, for the following testcase, the ivopted version is more complex than the non-ivopted version:

     while (oargc--) *(nargv++) = *(oargv++);
  
So I am sending only the generic fix here, as it is an independent patch and can benefit other
targets such as powerpc64, even though the issue on aarch64 remains unresolved until we figure
out how to make the correct fix in aarch64_legitimize_address avoid the benchmark code size
regression caused by sub-optimal tree-level IVOPTS.

The testcase is marked to run on powerpc64 only for now.
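
For reference, a rough sketch of what gcc.dg/pr62173.c could look like; the dg directives,
dump name and scan string here are guesses, not copied from the actual patch:

/* { dg-do compile { target powerpc64*-*-* } } */
/* { dg-options "-O2 -fdump-rtl-loop2_invariant" } */

extern void foo (char);

void
bar (int i)
{
  char A[10];
  int d = 0;
  while (i > 0)
    A[d++] = i--;

  while (d > 0)
    foo (A[d--]);
}

/* Expect the re-associated base address to be hoisted by RTL loop
   invariant motion; the scan string below is a guess at the pass's
   dump output.  */
/* { dg-final { scan-rtl-dump "Decided to move invariant" "loop2_invariant" } } */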

Bootstrap OK on x86-64, no regressions.
Bootstrap OK on AArch64, and the SPEC2000 compile dumps show that a few more RTL loop
invariants do get hoisted.

OK for trunk?

gcc/
   PR 62173
   * loop-invariant.c (expensive_addr): New hash_table.
   (need_expensive_addr_check_p): New bool.
   (find_exits): Rename to find_exits_and_reshuffle.
   Support re-shuffling instructions for better loop invariant hoisting.
   (create_new_invariant): Mark an address as expensive if it is generated by the re-shuffle.
   (init_inv_motion_data): Initialize expensive_addr and need_expensive_addr_check_p.
   (free_inv_motion_data): Release expensive_addr.

gcc/testsuite/
   PR 62173
   * gcc.dg/pr62173.c: New test.

>> Oh, and the patch misses a testcase.
> It's at the start of the email, will include in the patch.
>
> void bar(int i) {
>     char A[10];
>     int d = 0;
>     while (i > 0)
>     A[d++] = i--;
>
>     while (d > 0)
>     foo(A[d--]);
> }
>
>>> Thanks,
>>> Richard.
>>>
>>>> and this patch only tries to re-shuffle instructions within single basic
>>>> block which
>>>> is a inner loop which is perf critical.
>>>>
>>>> I am reusing the loop info in fwprop because there is loop info and it's run
>>>> before
>>>> GCSE.
>>>>
>>>> verified on aarch64 and mips64, the array base address hoisted out of loop.
>>>>
>>>> bootstrap ok on x86-64 and aarch64.
>>>>
>>>> comments?
>>>>
>>>> thanks.
>>>>
>>>> gcc/
>>>>     PR62173
>>>>     fwprop.c (prepare_for_gcse_pre): New function.
>>>>     (fwprop_done): Call it.
>
>

Patch

diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 377b33c..b2a5918 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1399,6 +1399,133 @@  forward_propagate_into (df_ref use)
   return false;
 }
 
+/* Loop invariant variable hoisting for critical code has
+   important impact on the performance.
+
+   The RTL GCSE PRE pass could detect more hoisting opportunities
+   if we re-shuffle the instructions to associate fixed registers
+   with constant.
+
+   This function try to transform
+
+     RA <- RB_fixed + RC
+     RD <- MEM (RA + const_offset)
+
+  into:
+
+     RA <- RB_fixed + const_offset
+     RD <- MEM (RA + RC)
+
+  If RA is DEAD after the second instruction.
+
+  After this change, the first instruction is loop invariant.  */
+
+static void
+prepare_for_gcse_pre ()
+{
+  struct loop *loop;
+
+  if (! current_loops)
+    return;
+
+  FOR_EACH_LOOP (loop, LI_INCLUDE_ROOT)
+    {
+      if (loop && loop->header && loop->latch
+	  && loop->header->index == loop->latch->index)
+	{
+	  rtx_insn *insn, *next_insn;
+	  rtx single_set1, single_set2, old_dest;
+	  rtx op0, op0_;
+	  rtx op1, op1_;
+	  rtx inner;
+	  rtx *mem_plus_loc;
+
+	  basic_block bb = BASIC_BLOCK_FOR_FN (cfun, loop->header->index);
+
+	  FOR_BB_INSNS (bb, insn)
+	    {
+	      if (! NONDEBUG_INSN_P (insn))
+		continue;
+
+	      single_set1 = single_set (insn);
+
+	      if (! single_set1
+		  || GET_CODE (SET_SRC (single_set1)) != PLUS)
+		continue;
+
+	      old_dest = SET_DEST (single_set1);
+	      op0 = XEXP (SET_SRC (single_set1), 0);
+	      op1 = XEXP (SET_SRC (single_set1), 1);
+
+	      if (op1 == frame_pointer_rtx
+		  || op1 == stack_pointer_rtx
+		  || op1 == virtual_stack_vars_rtx)
+		std::swap (op0, op1);
+
+	      if (! (REG_P (old_dest) && REG_P (op0) && REG_P (op1)
+		     && (op0 == frame_pointer_rtx
+			 || op0 == stack_pointer_rtx
+			 || op0 == virtual_stack_vars_rtx)))
+		continue;
+
+	      if (! (next_insn = next_real_insn (insn)))
+		break;
+
+	      do
+		{
+		  if (DEBUG_INSN_P (next_insn))
+		    continue;
+
+		  single_set2 = single_set (next_insn);
+
+		  if (!single_set2 || ! REG_P (SET_DEST (single_set2)))
+		    continue;
+
+		  inner = SET_SRC (single_set2);
+
+		  if (GET_CODE (inner) == ZERO_EXTEND
+		      || GET_CODE (inner) == SIGN_EXTEND
+		      || GET_CODE (inner) == TRUNCATE)
+		    inner = XEXP (inner, 0);
+
+		  if (! MEM_P (inner)
+		      || GET_CODE (XEXP (inner, 0)) != PLUS)
+		    continue;
+
+		  mem_plus_loc = &XEXP (inner, 0);
+		  op0_ = XEXP (XEXP (inner, 0), 0);
+		  op1_ = XEXP (XEXP (inner, 0), 1);
+
+		  if (REG_P (op0_) && CONST_INT_P (op1_)
+		      && rtx_equal_p (op0_, old_dest)
+		      && GET_MODE (op0_) == GET_MODE (op1))
+		    {
+		      rtx new_src;
+
+		      if (find_regno_note (next_insn, REG_DEAD,
+					   REGNO (old_dest)))
+			{
+			  new_src = plus_constant (GET_MODE (op0), op0,
+						   INTVAL (op1_));
+			  validate_change (insn, &SET_SRC (single_set1),
+					   new_src, 1);
+			  new_src = gen_rtx_PLUS (GET_MODE (op0_), op0_, op1);
+			  validate_change (next_insn, mem_plus_loc, new_src, 1);
+			  if (apply_change_group () && dump_file)
+			    fprintf (dump_file,
+				     "\nRe-associate insn %d and %d for later"
+				     " RTL loop invariant hoisting.\n",
+				     INSN_UID (insn), INSN_UID (next_insn));
+			}
+		      break;
+		    }
+		} while ((next_insn = next_real_insn (next_insn))
+			 && bb == BLOCK_FOR_INSN (next_insn));
+	    }
+	}
+    }
+}
+
 
 static void
 fwprop_init (void)
@@ -1424,6 +1551,7 @@  fwprop_init (void)
 static void
 fwprop_done (void)
 {
+  prepare_for_gcse_pre ();
   loop_optimizer_finalize ();
 
   use_def_ref.release ();