Patchwork Fix PR46556 (poor address generation)

login
register
mail settings
Submitter William J. Schmidt
Date Oct. 5, 2011, 4:13 p.m.
Message ID <1317831212.4199.45.camel@oc2474580526.ibm.com>
Download mbox | patch
Permalink /patch/117887/
State New
Headers show

Comments

William J. Schmidt - Oct. 5, 2011, 4:13 p.m.
This patch addresses the poor code generation in PR46556 for the
following code:

struct x
{
  int a[16];
  int b[16];
  int c[16];
};

extern void foo (int, int, int);

void
f (struct x *p, unsigned int n)
{
  foo (p->a[n], p->c[n], p->b[n]);
}

Prior to the fix for PR32698, gcc calculated the offset for accessing
the array elements as:  n*4; 64+n*4; 128+n*4.

Following that fix, the offsets are calculated as:  n*4; (n+16)*4; (n
+32)*4.  This led to poor code generation on powerpc64 targets, among
others.

The poor code generation was observed to not occur in loops, as the
IVOPTS code does a good job of lowering these expressions to MEM_REFs.
It was previously suggested that perhaps a general pass to lower memory
accesses to MEM_REFs in GIMPLE would solve not only this, but other
similar problems.  I spent some time looking into various approaches to
this, and reviewing some previous attempts to do similar things.  In the
end, I've concluded that this is a bad idea in practice because of the
loss of useful aliasing information.  In particular, early lowering of
component references causes us to lose the ability to disambiguate
non-overlapping references in the same structure, and there is no simple
way to carry the necessary aliasing information along with the
replacement MEM_REFs to avoid this.  While some performance gains are
available with GIMPLE lowering of memory accesses, there are also
offsetting performance losses, and I suspect this would just be a
continuous source of bug reports into the future.

Therefore the current patch is a much simpler approach to solve the
specific problem noted in the PR.  There are two pieces to the patch:

 * The offending addressing pattern is matched in GIMPLE and transformed
into a restructured MEM_REF that distributes the multiply, so that (n
+32)*4 becomes 4*n+128 as before.  This is done during the reassociation
pass, for reasons described below.  The transformation only occurs in
non-loop blocks, since IVOPTS does a good job on such things within
loops.
 * A tweak is added to the RTL forward-propagator to avoid propagating
into memory references based on a single base register with no offset,
under certain circumstances.  This improves sharing of base registers
for accesses within the same structure and slightly lowers register
pressure.

It would be possible to separate these into two patches if that's
preferred.  I chose to combine them because together they provide the
ideal code generation that the new test cases test for.

I initially implemented the pattern matcher during expand, but I found
that the expanded code for two accesses to the same structure was often
not being CSEd well.  So I moved it back into the GIMPLE phases prior to
DOM to take advantage of its CSE.  To avoid an additional complete pass
over the IL, I chose to piggyback on the reassociation pass.  This
transformation is not technically a reassociation, but it is related
enough to not be a complete wart.

One noob question about this:  It would probably be preferable to have
this transformation only take place during the second reassociation
pass, so the ARRAY_REFs are seen by earlier optimization phases.  Is
there an easy way to detect that it's the second pass without having to
generate a separate pass entry point?

One other general question about the pattern-match transformation:  Is
this an appropriate transformation for all targets, or should it be
somehow gated on available addressing modes on the target processor?

Bootstrapped and regression tested on powerpc64-linux-gnu.  Verified no
performance degradations on that target for SPEC CPU2000 and CPU2006.

I'm looking for eventual approval for trunk after any comments are
resolved.  Thanks!

Bill


2011-10-05  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

gcc:
	
	PR rtl-optimization/46556
	* fwprop.c (fwprop_bb_aux_d): New struct.
	(MEM_PLUS_REGS): New macro.
	(record_mem_plus_reg): New function.
	(record_mem_plus_regs): Likewise.
	(single_def_use_enter_block): Record mem-plus-reg patterns.
	(build_single_def_use_links): Allocate aux storage.
	(locally_poor_mem_replacement): New function.
	(forward_propagate_and_simplify): Call
	locally_poor_mem_replacement.
	(fwprop_init): Free storage.
	* tree.h (copy_ref_info): Expose existing function.
	* tree-ssa-loop-ivopts.c (copy_ref_info): Remove static token.
	* tree-ssa-reassoc.c (restructure_base_and_offset): New function.
	(restructure_mem_ref): Likewise.
	(reassociate_bb): Look for opportunities to call
	restructure_mem_ref; clean up immediate use lists.

gcc/testsuite:
	
	PR rtl-optimization/46556
	* gcc.target/powerpc/ppc-pr46556-1.c: New testcase.
	* gcc.target/powerpc/ppc-pr46556-2.c: Likewise.
	* gcc.target/powerpc/ppc-pr46556-3.c: Likewise.
	* gcc.target/powerpc/ppc-pr46556-4.c: Likewise.
	* gcc.dg/tree-ssa/pr46556-1.c: Likewise.
	* gcc.dg/tree-ssa/pr46556-2.c: Likewise.
	* gcc.dg/tree-ssa/pr46556-3.c: Likewise.
Paolo Bonzini - Oct. 5, 2011, 4:21 p.m.
On 10/05/2011 06:13 PM, William J. Schmidt wrote:
> One other general question about the pattern-match transformation:  Is
> this an appropriate transformation for all targets, or should it be
> somehow gated on available addressing modes on the target processor?
>
> Bootstrapped and regression tested on powerpc64-linux-gnu.  Verified no
> performance degradations on that target for SPEC CPU2000 and CPU2006.
>
> I'm looking for eventual approval for trunk after any comments are
> resolved.  Thanks!

How do the costs look like for the two transforms you mention in the 
head comment of locally_poor_mem_replacement?

Paolo
Steven Bosscher - Oct. 5, 2011, 4:29 p.m.
On Wed, Oct 5, 2011 at 6:13 PM, William J. Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
>        * tree-ssa-loop-ivopts.c (copy_ref_info): Remove static token.

Rather than this, why not move the function to common code somewhere?

Ciao!
Steven
William J. Schmidt - Oct. 5, 2011, 5:14 p.m.
On Wed, 2011-10-05 at 18:29 +0200, Steven Bosscher wrote:
> On Wed, Oct 5, 2011 at 6:13 PM, William J. Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
> >        * tree-ssa-loop-ivopts.c (copy_ref_info): Remove static token.
> 
> Rather than this, why not move the function to common code somewhere?
> 
> Ciao!
> Steven

An alternative would be to move it into tree-ssa-address.c, where there
is already a simpler version called copy_mem_ref_info.  I'm open to that
if it's preferable.

Bill
William J. Schmidt - Oct. 5, 2011, 5:22 p.m.
On Wed, 2011-10-05 at 18:21 +0200, Paolo Bonzini wrote:
> On 10/05/2011 06:13 PM, William J. Schmidt wrote:
> > One other general question about the pattern-match transformation:  Is
> > this an appropriate transformation for all targets, or should it be
> > somehow gated on available addressing modes on the target processor?
> >
> > Bootstrapped and regression tested on powerpc64-linux-gnu.  Verified no
> > performance degradations on that target for SPEC CPU2000 and CPU2006.
> >
> > I'm looking for eventual approval for trunk after any comments are
> > resolved.  Thanks!
> 
> How do the costs look like for the two transforms you mention in the 
> head comment of locally_poor_mem_replacement?
> 
> Paolo
> 

I don't know off the top of my head -- I'll have to gather that
information.  The issue is that the profitability is really
context-sensitive, so just the isolated costs of insns aren't enough.
The forward propagation of the add into (mem (reg REG)) looks like a
slam dunk in the absence of other information, but if there are other
nearby references using nonzero offsets from REG, this just extends the
lifetimes of X and Y without eliminating the need for REG.
Paolo Bonzini - Oct. 5, 2011, 7:01 p.m.
On 10/05/2011 07:22 PM, William J. Schmidt wrote:
> I don't know off the top of my head -- I'll have to gather that
> information.  The issue is that the profitability is really
> context-sensitive, so just the isolated costs of insns aren't enough.
> The forward propagation of the add into (mem (reg REG)) looks like a
> slam dunk in the absence of other information, but if there are other
> nearby references using nonzero offsets from REG, this just extends the
> lifetimes of X and Y without eliminating the need for REG.

True, however there are other passes that do this kind of un-CSE and 
lifetime reduction.

Paolo
William J. Schmidt - Oct. 5, 2011, 8:16 p.m.
On Wed, 2011-10-05 at 21:01 +0200, Paolo Bonzini wrote:
> On 10/05/2011 07:22 PM, William J. Schmidt wrote:
> > I don't know off the top of my head -- I'll have to gather that
> > information.  The issue is that the profitability is really
> > context-sensitive, so just the isolated costs of insns aren't enough.
> > The forward propagation of the add into (mem (reg REG)) looks like a
> > slam dunk in the absence of other information, but if there are other
> > nearby references using nonzero offsets from REG, this just extends the
> > lifetimes of X and Y without eliminating the need for REG.
> 
> True, however there are other passes that do this kind of un-CSE and 
> lifetime reduction.
> 
> Paolo

OK, I see.  If there's a better place downstream to make a swizzle, I'm
certainly fine with that.

I disabled locally_poor_mem_replacement and added some dump information
in should_replace_address to show the costs for the replacement I'm
trying to avoid:

In should_replace_address:
  old_rtx = (reg/f:DI 125 [ D.2036 ])
  new_rtx = (plus:DI (reg/v/f:DI 126 [ p ])
        (reg:DI 128))
  address_cost (old_rtx) = 0
  address_cost (new_rtx) = 0
  set_src_cost (old_rtx) = 0
  set_src_cost (new_rtx) = 4

In insn 11, replacing
 (mem/s:SI (reg/f:DI 125 [ D.2036 ]) [2 p_1(D)->a S4 A32])
 with (mem/s:SI (plus:DI (reg/v/f:DI 126 [ p ])
            (reg:DI 128)) [2 p_1(D)->a S4 A32])
Changed insn 11
deferring rescan insn with uid = 11.
deferring rescan insn with uid = 11.

Hope this helps,
Bill
Richard Guenther - Oct. 6, 2011, 10:13 a.m.
On Wed, 5 Oct 2011, William J. Schmidt wrote:

> This patch addresses the poor code generation in PR46556 for the
> following code:
> 
> struct x
> {
>   int a[16];
>   int b[16];
>   int c[16];
> };
> 
> extern void foo (int, int, int);
> 
> void
> f (struct x *p, unsigned int n)
> {
>   foo (p->a[n], p->c[n], p->b[n]);
> }
> 
> Prior to the fix for PR32698, gcc calculated the offset for accessing
> the array elements as:  n*4; 64+n*4; 128+n*4.
> 
> Following that fix, the offsets are calculated as:  n*4; (n+16)*4; (n
> +32)*4.  This led to poor code generation on powerpc64 targets, among
> others.
> 
> The poor code generation was observed to not occur in loops, as the
> IVOPTS code does a good job of lowering these expressions to MEM_REFs.
> It was previously suggested that perhaps a general pass to lower memory
> accesses to MEM_REFs in GIMPLE would solve not only this, but other
> similar problems.  I spent some time looking into various approaches to
> this, and reviewing some previous attempts to do similar things.  In the
> end, I've concluded that this is a bad idea in practice because of the
> loss of useful aliasing information.  In particular, early lowering of
> component references causes us to lose the ability to disambiguate
> non-overlapping references in the same structure, and there is no simple
> way to carry the necessary aliasing information along with the
> replacement MEM_REFs to avoid this.  While some performance gains are
> available with GIMPLE lowering of memory accesses, there are also
> offsetting performance losses, and I suspect this would just be a
> continuous source of bug reports into the future.
> 
> Therefore the current patch is a much simpler approach to solve the
> specific problem noted in the PR.  There are two pieces to the patch:
> 
>  * The offending addressing pattern is matched in GIMPLE and transformed
> into a restructured MEM_REF that distributes the multiply, so that (n
> +32)*4 becomes 4*n+128 as before.  This is done during the reassociation
> pass, for reasons described below.  The transformation only occurs in
> non-loop blocks, since IVOPTS does a good job on such things within
> loops.
>  * A tweak is added to the RTL forward-propagator to avoid propagating
> into memory references based on a single base register with no offset,
> under certain circumstances.  This improves sharing of base registers
> for accesses within the same structure and slightly lowers register
> pressure.
> 
> It would be possible to separate these into two patches if that's
> preferred.  I chose to combine them because together they provide the
> ideal code generation that the new test cases test for.
> 
> I initially implemented the pattern matcher during expand, but I found
> that the expanded code for two accesses to the same structure was often
> not being CSEd well.  So I moved it back into the GIMPLE phases prior to
> DOM to take advantage of its CSE.  To avoid an additional complete pass
> over the IL, I chose to piggyback on the reassociation pass.  This
> transformation is not technically a reassociation, but it is related
> enough to not be a complete wart.
> 
> One noob question about this:  It would probably be preferable to have
> this transformation only take place during the second reassociation
> pass, so the ARRAY_REFs are seen by earlier optimization phases.  Is
> there an easy way to detect that it's the second pass without having to
> generate a separate pass entry point?
> 
> One other general question about the pattern-match transformation:  Is
> this an appropriate transformation for all targets, or should it be
> somehow gated on available addressing modes on the target processor?
> 
> Bootstrapped and regression tested on powerpc64-linux-gnu.  Verified no
> performance degradations on that target for SPEC CPU2000 and CPU2006.
> 
> I'm looking for eventual approval for trunk after any comments are
> resolved.  Thanks!

People have already commented on pieces, so I'm looking only
at the tree-ssa-reassoc.c pieces (did you consider piggy-backing
on IVOPTs instead?  The idea is to expose additional CSE
opportunities, right?  So it's sort-of a strength-reduction
optimization on scalar code (classically strength reduction
in loops transforms for (i) { z = i*x; } to z = 0; for (i) { z += x }).
That might be worth in general, even for non-address cases.
So - if you rename that thing to tree-ssa-strength-reduce.c you
can get away without piggy-backing on anything ;)  If you
structure it to detect a strength reduction opportunity
(thus, you'd need to match two/multiple of the patterns at the same time)
that would be a bonus ... generalizing it a little bit would be
another.

Now some comments on the patch ...

> Bill
> 
> 
> 2011-10-05  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> 
> gcc:
> 	
> 	PR rtl-optimization/46556
> 	* fwprop.c (fwprop_bb_aux_d): New struct.
> 	(MEM_PLUS_REGS): New macro.
> 	(record_mem_plus_reg): New function.
> 	(record_mem_plus_regs): Likewise.
> 	(single_def_use_enter_block): Record mem-plus-reg patterns.
> 	(build_single_def_use_links): Allocate aux storage.
> 	(locally_poor_mem_replacement): New function.
> 	(forward_propagate_and_simplify): Call
> 	locally_poor_mem_replacement.
> 	(fwprop_init): Free storage.
> 	* tree.h (copy_ref_info): Expose existing function.
> 	* tree-ssa-loop-ivopts.c (copy_ref_info): Remove static token.
> 	* tree-ssa-reassoc.c (restructure_base_and_offset): New function.
> 	(restructure_mem_ref): Likewise.
> 	(reassociate_bb): Look for opportunities to call
> 	restructure_mem_ref; clean up immediate use lists.
> 
> gcc/testsuite:
> 	
> 	PR rtl-optimization/46556
> 	* gcc.target/powerpc/ppc-pr46556-1.c: New testcase.
> 	* gcc.target/powerpc/ppc-pr46556-2.c: Likewise.
> 	* gcc.target/powerpc/ppc-pr46556-3.c: Likewise.
> 	* gcc.target/powerpc/ppc-pr46556-4.c: Likewise.
> 	* gcc.dg/tree-ssa/pr46556-1.c: Likewise.
> 	* gcc.dg/tree-ssa/pr46556-2.c: Likewise.
> 	* gcc.dg/tree-ssa/pr46556-3.c: Likewise.

> Index: gcc/tree-ssa-reassoc.c
> ===================================================================
> --- gcc/tree-ssa-reassoc.c	(revision 179317)
> +++ gcc/tree-ssa-reassoc.c	(working copy)
> @@ -2361,6 +2361,116 @@ break_up_subtract_bb (basic_block bb)
>      break_up_subtract_bb (son);
>  }
>  
> +/* Given BASE and OFFSET derived from *EXPR occurring in stmt *GSI,
> +   determine whether there is a profitable opportunity to restructure
> +   address arithmetic within BASE and OFFSET.  If so, produce such
> +   a restructuring in *MEM_REF and return true; else return false.  */
> +
> +static tree
> +restructure_base_and_offset (tree *expr, gimple_stmt_iterator *gsi,
> +			     tree base, tree offset, HOST_WIDE_INT bitpos)
> +{
> +  tree mult_op0, mult_op1, t1, t2, offset_type, mult_expr, add_expr, mem_ref;
> +  HOST_WIDE_INT c, c1, c2, c3, c4;
> +
> +  /* Look for the following pattern:
> +
> +       base   = MEM_REF (T1, C1);
> +       offset = MULT_EXPR (PLUS_EXPR (T2, C2), C3)
> +       bitpos = C4 * BITS_PER_UNIT
> +
> +     If found, convert it to:
> +
> +       MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
> +                C1 + (C2 * C3) + C4)
> +   */
> +  if (!base
> +      || !offset
> +      || TREE_CODE (base) != MEM_REF
> +      || TREE_CODE (TREE_OPERAND (base, 1)) != INTEGER_CST
> +      || TREE_CODE (offset) != MULT_EXPR)
> +    return NULL_TREE;
> +
> +  mult_op0 = TREE_OPERAND (offset, 0);
> +  mult_op1 = TREE_OPERAND (offset, 1);
> +
> +  if (TREE_CODE (mult_op0) != PLUS_EXPR
> +      || TREE_CODE (mult_op1) != INTEGER_CST
> +      || TREE_CODE (TREE_OPERAND (mult_op0, 1)) != INTEGER_CST)
> +    return NULL_TREE;
> +
> +  t1 = TREE_OPERAND (base, 0);
> +  t2 = TREE_OPERAND (mult_op0, 0);
> +  c1 = TREE_INT_CST_LOW (TREE_OPERAND (base, 1));
> +  c2 = TREE_INT_CST_LOW (TREE_OPERAND (mult_op0, 1));
> +  c3 = TREE_INT_CST_LOW (mult_op1);

Before accessing TREE_INT_CST_LOW you need to make sure the
constants fit into a HWI using host_integerp () (which
conveniently includes the check for INTEGER_CST).

Note that you need to sign-extend the MEM_REF offset,
thus use mem_ref_offset (base).low instead of
TREE_INT_CST_LOW (TREE_OPERAND (base, 1)).  Might be worth
to add a testcase with negative offset ;)

> +  c4 = bitpos / BITS_PER_UNIT;
> +  c = c1 + c2 * c3 + c4;

And you don't know whether this operation overflows.  Thus it's
probably easiest to use double_ints instead of HOST_WIDE_INTs
in all of the code.

> +
> +  offset_type = TREE_TYPE (TREE_OPERAND (base, 1));
> +  
> +  mult_expr = fold_build2 (MULT_EXPR, sizetype, t2,
> +			   build_int_cst (sizetype, c3));
> +  mult_expr = force_gimple_operand_gsi (gsi, mult_expr, true, NULL,
> +					true, GSI_SAME_STMT);
> +  add_expr = fold_build2 (POINTER_PLUS_EXPR, TREE_TYPE (t1), t1, mult_expr);
> +  add_expr = force_gimple_operand_gsi (gsi, add_expr, true, NULL,
> +				       true, GSI_SAME_STMT);
> +  mem_ref = fold_build2 (MEM_REF, TREE_TYPE (*expr), add_expr,
> +			 build_int_cst (offset_type, c));
> +  return mem_ref;
> +}
> +
> +/* If *EXPR contains a memory reference, determine whether there is an
> +   opportunity to restructure the base and offset expressions for
> +   improved performance.  */
> +
> +static bool
> +restructure_mem_ref (tree *expr, gimple_stmt_iterator *gsi)
> +{
> +  enum tree_code code = TREE_CODE (*expr);
> +  tree base, offset, mem_ref;
> +  HOST_WIDE_INT bitsize, bitpos;
> +  enum machine_mode mode;
> +  int unsignedp, volatilep;
> +
> +  /* Only expressions that reference memory are of interest.  Bitfield
> +     references should eventually be lowered prior to this point and
> +     are not dealt with.  */
> +  if (! handled_component_p (*expr)
> +      || code == BIT_FIELD_REF
> +      || (code == COMPONENT_REF && DECL_BIT_FIELD (TREE_OPERAND (*expr, 1))))
> +    return false;

I'd say you want to handle references with a non-BLKmode only here,
thus || TYPE_MODE (TREE_TYPE (*expr)) == BLKmode

> +  /* Determine whether the expression can be represented with base and
> +     offset components.  */
> +  base = get_inner_reference (*expr, &bitsize, &bitpos, &offset, &mode,
> +			      &unsignedp, &volatilep, false);
> +  if (!base || !offset)
> +    return false;
> +
> +  /* Look for a restructuring opportunity.  */
> +  if ((mem_ref = restructure_base_and_offset (expr, gsi, base,
> +					      offset, bitpos)) == NULL_TREE)
> +    return false;

What I'm missing is a check whether the old address computation stmts
will be dead after the transform.

> +  /* Found one.  Record the replacement in the dump file and complete
> +     the replacement.  */
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "\nOriginal_expr = ");
> +      print_generic_expr (dump_file, *expr, TDF_VOPS|TDF_MEMSYMS);
> +      fprintf (dump_file, "\nmem_ref = ");
> +      print_generic_expr (dump_file, mem_ref, TDF_VOPS|TDF_MEMSYMS);
> +      fprintf (dump_file, "\n\n");
> +    }
> +
> +  copy_ref_info (mem_ref, *expr);
> +  *expr = mem_ref;
> +
> +  return true;
> +}
> +
>  /* Reassociate expressions in basic block BB and its post-dominator as
>     children.  */
>  
> @@ -2369,14 +2479,30 @@ reassociate_bb (basic_block bb)
>  {
>    gimple_stmt_iterator gsi;
>    basic_block son;
> +  bool chgd_mem_ref = false;
> +  bool in_loop = loop_depth (bb->loop_father) != 0;
>  
>    for (gsi = gsi_last_bb (bb); !gsi_end_p (gsi); gsi_prev (&gsi))
>      {
>        gimple stmt = gsi_stmt (gsi);
>  
> -      if (is_gimple_assign (stmt)
> -	  && !stmt_could_throw_p (stmt))
> +      /* Look for restructuring opportunities within an expression
> +	 that references memory.  We only do this for blocks not
> +         contained in loops, since the ivopts machinery does a 
> +         good job on loop expressions, and we don't want to interfere
> +	 with other loop optimizations.  */
> +      if (!in_loop && gimple_vuse (stmt) && gimple_assign_single_p (stmt))
>  	{
> +	  tree *lhs, *rhs;
> +	  lhs = gimple_assign_lhs_ptr (stmt);
> +	  chgd_mem_ref = restructure_mem_ref (lhs, &gsi) || chgd_mem_ref;
> +	  rhs = gimple_assign_rhs1_ptr (stmt);
> +	  chgd_mem_ref = restructure_mem_ref (rhs, &gsi) || chgd_mem_ref;

It will either be a store or a load, but never both (unless it's an
aggregate copy which I think we should not handle).  So ...

  if (gimple_vdef (stmt))
    ... lhs
  else if (gimple_vuse (stmt))
    ... rhs

> +	}
> +
> +      else if (is_gimple_assign (stmt)
> +	       && !stmt_could_throw_p (stmt))
> +	{
>  	  tree lhs, rhs1, rhs2;
>  	  enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
>  
> @@ -2489,6 +2615,12 @@ reassociate_bb (basic_block bb)
>  	    }
>  	}
>      }
> +  /* If memory references have been restructured, immediate uses need
> +     to be cleaned up.  */
> +  if (chgd_mem_ref)
> +    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +      update_stmt (gsi_stmt (gsi));

ICK.  Definitely a no ;)

Why does a update_stmt () after the restructure_mem_ref call not work?

Richard.
William J. Schmidt - Oct. 6, 2011, 1:43 p.m.
On Thu, 2011-10-06 at 12:13 +0200, Richard Guenther wrote:
> People have already commented on pieces, so I'm looking only
> at the tree-ssa-reassoc.c pieces (did you consider piggy-backing
> on IVOPTs instead?  The idea is to expose additional CSE
> opportunities, right?  So it's sort-of a strength-reduction
> optimization on scalar code (classically strength reduction
> in loops transforms for (i) { z = i*x; } to z = 0; for (i) { z += x }).
> That might be worth in general, even for non-address cases.
> So - if you rename that thing to tree-ssa-strength-reduce.c you
> can get away without piggy-backing on anything ;)  If you
> structure it to detect a strength reduction opportunity
> (thus, you'd need to match two/multiple of the patterns at the same time)
> that would be a bonus ... generalizing it a little bit would be
> another.

These are all good ideas.  I will think about casting this as a more
general strength reduction over extended basic blocks outside of loops.
First I'll put together some simple tests to see whether we're currently
missing some non-address opportunities.

<snip>

> > +  mult_op0 = TREE_OPERAND (offset, 0);
> > +  mult_op1 = TREE_OPERAND (offset, 1);
> > +
> > +  if (TREE_CODE (mult_op0) != PLUS_EXPR
> > +      || TREE_CODE (mult_op1) != INTEGER_CST
> > +      || TREE_CODE (TREE_OPERAND (mult_op0, 1)) != INTEGER_CST)
> > +    return NULL_TREE;
> > +
> > +  t1 = TREE_OPERAND (base, 0);
> > +  t2 = TREE_OPERAND (mult_op0, 0);
> > +  c1 = TREE_INT_CST_LOW (TREE_OPERAND (base, 1));
> > +  c2 = TREE_INT_CST_LOW (TREE_OPERAND (mult_op0, 1));
> > +  c3 = TREE_INT_CST_LOW (mult_op1);
> 
> Before accessing TREE_INT_CST_LOW you need to make sure the
> constants fit into a HWI using host_integerp () (which
> conveniently includes the check for INTEGER_CST).
> 
> Note that you need to sign-extend the MEM_REF offset,
> thus use mem_ref_offset (base).low instead of
> TREE_INT_CST_LOW (TREE_OPERAND (base, 1)).  Might be worth
> to add a testcase with negative offset ;)

D'oh! >.<

> 
> > +  c4 = bitpos / BITS_PER_UNIT;
> > +  c = c1 + c2 * c3 + c4;
> 
> And you don't know whether this operation overflows.  Thus it's
> probably easiest to use double_ints instead of HOST_WIDE_INTs
> in all of the code.

OK, thanks, will do.

<snip>

> 
> > +  /* Determine whether the expression can be represented with base and
> > +     offset components.  */
> > +  base = get_inner_reference (*expr, &bitsize, &bitpos, &offset, &mode,
> > +			      &unsignedp, &volatilep, false);
> > +  if (!base || !offset)
> > +    return false;
> > +
> > +  /* Look for a restructuring opportunity.  */
> > +  if ((mem_ref = restructure_base_and_offset (expr, gsi, base,
> > +					      offset, bitpos)) == NULL_TREE)
> > +    return false;
> 
> What I'm missing is a check whether the old address computation stmts
> will be dead after the transform.

Hm, not quite sure what to do here.  Prior to the transformation I'll
have an assignment with something like:

ARRAY_REF (COMPONENT_REF (MEM_REF (Ta, Cb), FIELD_DECL c), Td)

on LHS or RHS.  Ta and Td will be part of the replacement.  What should
I be checking for?

<snip>

> >  
> > -      if (is_gimple_assign (stmt)
> > -	  && !stmt_could_throw_p (stmt))
> > +      /* Look for restructuring opportunities within an expression
> > +	 that references memory.  We only do this for blocks not
> > +         contained in loops, since the ivopts machinery does a 
> > +         good job on loop expressions, and we don't want to interfere
> > +	 with other loop optimizations.  */
> > +      if (!in_loop && gimple_vuse (stmt) && gimple_assign_single_p (stmt))
> >  	{
> > +	  tree *lhs, *rhs;
> > +	  lhs = gimple_assign_lhs_ptr (stmt);
> > +	  chgd_mem_ref = restructure_mem_ref (lhs, &gsi) || chgd_mem_ref;
> > +	  rhs = gimple_assign_rhs1_ptr (stmt);
> > +	  chgd_mem_ref = restructure_mem_ref (rhs, &gsi) || chgd_mem_ref;
> 
> It will either be a store or a load, but never both (unless it's an
> aggregate copy which I think we should not handle).  So ...
> 
>   if (gimple_vdef (stmt))
>     ... lhs
>   else if (gimple_vuse (stmt))
>     ... rhs

OK, with your suggested gating on non-BLKmode I agree.

> > +	}
> > +
> > +      else if (is_gimple_assign (stmt)
> > +	       && !stmt_could_throw_p (stmt))
> > +	{
> >  	  tree lhs, rhs1, rhs2;
> >  	  enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
> >  
> > @@ -2489,6 +2615,12 @@ reassociate_bb (basic_block bb)
> >  	    }
> >  	}
> >      }
> > +  /* If memory references have been restructured, immediate uses need
> > +     to be cleaned up.  */
> > +  if (chgd_mem_ref)
> > +    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > +      update_stmt (gsi_stmt (gsi));
> 
> ICK.  Definitely a no ;)
> 
> Why does a update_stmt () after the restructure_mem_ref call not work?

Ah, yeah, I meant to check again on that before submitting.  >.< 

IIRC, at some point the update_stmt () following restructure_mem_ref was
still giving me verify errors.  I thought perhaps the statements created
by force_gimple_operand_gsi might be giving me trouble -- threw this in
to work around it and missed it on my pre-submission review.

I'll re-check this.  If for some reason it is the inserted statements, I
can more carefully update just those -- but I just need to go back and
sort it out.  Sorry for letting that slip through.

Thanks for the review!

Bill

> 
> Richard.
Richard Guenther - Oct. 6, 2011, 2:16 p.m.
On Thu, 6 Oct 2011, William J. Schmidt wrote:

> On Thu, 2011-10-06 at 12:13 +0200, Richard Guenther wrote:
> > People have already commented on pieces, so I'm looking only
> > at the tree-ssa-reassoc.c pieces (did you consider piggy-backing
> > on IVOPTs instead?  The idea is to expose additional CSE
> > opportunities, right?  So it's sort-of a strength-reduction
> > optimization on scalar code (classically strength reduction
> > in loops transforms for (i) { z = i*x; } to z = 0; for (i) { z += x }).
> > That might be worth in general, even for non-address cases.
> > So - if you rename that thing to tree-ssa-strength-reduce.c you
> > can get away without piggy-backing on anything ;)  If you
> > structure it to detect a strength reduction opportunity
> > (thus, you'd need to match two/multiple of the patterns at the same time)
> > that would be a bonus ... generalizing it a little bit would be
> > another.
> 
> These are all good ideas.  I will think about casting this as a more
> general strength reduction over extended basic blocks outside of loops.
> First I'll put together some simple tests to see whether we're currently
> missing some non-address opportunities.
> 
> <snip>
> 
> > > +  mult_op0 = TREE_OPERAND (offset, 0);
> > > +  mult_op1 = TREE_OPERAND (offset, 1);
> > > +
> > > +  if (TREE_CODE (mult_op0) != PLUS_EXPR
> > > +      || TREE_CODE (mult_op1) != INTEGER_CST
> > > +      || TREE_CODE (TREE_OPERAND (mult_op0, 1)) != INTEGER_CST)
> > > +    return NULL_TREE;
> > > +
> > > +  t1 = TREE_OPERAND (base, 0);
> > > +  t2 = TREE_OPERAND (mult_op0, 0);
> > > +  c1 = TREE_INT_CST_LOW (TREE_OPERAND (base, 1));
> > > +  c2 = TREE_INT_CST_LOW (TREE_OPERAND (mult_op0, 1));
> > > +  c3 = TREE_INT_CST_LOW (mult_op1);
> > 
> > Before accessing TREE_INT_CST_LOW you need to make sure the
> > constants fit into a HWI using host_integerp () (which
> > conveniently includes the check for INTEGER_CST).
> > 
> > Note that you need to sign-extend the MEM_REF offset,
> > thus use mem_ref_offset (base).low instead of
> > TREE_INT_CST_LOW (TREE_OPERAND (base, 1)).  Might be worth
> > to add a testcase with negative offset ;)
> 
> D'oh! >.<
> 
> > 
> > > +  c4 = bitpos / BITS_PER_UNIT;
> > > +  c = c1 + c2 * c3 + c4;
> > 
> > And you don't know whether this operation overflows.  Thus it's
> > probably easiest to use double_ints instead of HOST_WIDE_INTs
> > in all of the code.
> 
> OK, thanks, will do.
> 
> <snip>
> 
> > 
> > > +  /* Determine whether the expression can be represented with base and
> > > +     offset components.  */
> > > +  base = get_inner_reference (*expr, &bitsize, &bitpos, &offset, &mode,
> > > +			      &unsignedp, &volatilep, false);
> > > +  if (!base || !offset)
> > > +    return false;
> > > +
> > > +  /* Look for a restructuring opportunity.  */
> > > +  if ((mem_ref = restructure_base_and_offset (expr, gsi, base,
> > > +					      offset, bitpos)) == NULL_TREE)
> > > +    return false;
> > 
> > What I'm missing is a check whether the old address computation stmts
> > will be dead after the transform.
> 
> Hm, not quite sure what to do here.  Prior to the transformation I'll
> have an assignment with something like:
> 
> ARRAY_REF (COMPONENT_REF (MEM_REF (Ta, Cb), FIELD_DECL c), Td)
> 
> on LHS or RHS.  Ta and Td will be part of the replacement.  What should
> I be checking for?

Doh, I thought you were matching gimple stmts that do the address
computation.  But now I see you are matching the tree returned from
get_inner_reference.  So no need to check anything for that case.

But that keeps me wondering what you'll do if the accesses were
all pointer arithmetic, not arrays.  Thus,

extern void foo (int, int, int);

void
f (int *p, unsigned int n)
{
 foo (p[n], p[n+64], p[n+128]);
}

wouldn't that have the same issue and you wouldn't handle it?

Richard.
William J. Schmidt - Oct. 6, 2011, 3:49 p.m.
On Thu, 2011-10-06 at 16:16 +0200, Richard Guenther wrote:

<snip>

> 
> Doh, I thought you were matching gimple stmts that do the address
> computation.  But now I see you are matching the tree returned from
> get_inner_reference.  So no need to check anything for that case.
> 
> But that keeps me wondering what you'll do if the accesses were
> all pointer arithmetic, not arrays.  Thus,
> 
> extern void foo (int, int, int);
> 
> void
> f (int *p, unsigned int n)
> {
>  foo (p[n], p[n+64], p[n+128]);
> }
> 
> wouldn't that have the same issue and you wouldn't handle it?
> 
> Richard.
> 

Good point.  This indeed gets missed here, and that's more fuel for
doing a generalized strength reduction along with the special cases like
p->a[n] that are only exposed with get_inner_reference.

(The pointer arithmetic cases were picked up in my earlier "big-hammer"
approach using the aff-comb machinery, but that had too many problems in
the end, as you know.)

So for the long term I will look into a full strength reducer for
non-loop code.  For the short term, what do you think about keeping this
single transformation in reassoc to make sure it gets into 4.7?  I would
plan to strip it back out and fold it into the strength reducer
thereafter, which might or might not make 4.7 depending on my other
responsibilities and how the 4.7 schedule goes.  I haven't seen anything
official, but I'm guessing we're getting towards the end of 4.7 stage 1?
Jeff Law - Oct. 6, 2011, 5:35 p.m.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/06/11 04:13, Richard Guenther wrote:

> 
> People have already commented on pieces, so I'm looking only at the
> tree-ssa-reassoc.c pieces (did you consider piggy-backing on IVOPTs
> instead?  The idea is to expose additional CSE opportunities,
> right?  So it's sort-of a strength-reduction optimization on scalar
> code (classically strength reduction in loops transforms for (i) {
> z = i*x; } to z = 0; for (i) { z += x }). That might be worth in
> general, even for non-address cases. So - if you rename that thing
> to tree-ssa-strength-reduce.c you can get away without
> piggy-backing on anything ;)  If you structure it to detect a
> strength reduction opportunity (thus, you'd need to match
> two/multiple of the patterns at the same time) that would be a
> bonus ... generalizing it a little bit would be another.
There's a variety of literature that uses PRE to detect and optimize
straightline code strength reduction.  I poked at it at one time (RTL
gcse framework) and it looked reasonably promising.  Never pushed it
all the way through.

jeff
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJOjebJAAoJEBRtltQi2kC71ogH/AkMNzXpYK1GXp2EhoS+3Dhn
T1mWDKdHT5+ozpuAxRFzuCSQ8HmkbLJk8fGpOyUuLr15zEnT1isE7cU3i4ZzY3o0
lduo9Ck23rMWNroYgxbV+zPvArW5MG9qrGO6XSBynfipmlpznEo8zQPiaoaASlHz
8G7gd9P2la1QHha9OVtiCMKs0zgckU55RqiwV7d8DMi5tgoq5wkN+qcKCoSI7+b0
jxAukIcp6O8QZ6ADcHyAdav+zZzGDBycEhgakam71WifjFlysah2TG05SsK75Dxi
h3S13yPpx/A8zBuex5osL0qOGn0H7L93uAsTxcv4dTEpUl4Jx7Y5FoPOEp5D1Z4=
=LcZy
-----END PGP SIGNATURE-----
William J. Schmidt - Oct. 6, 2011, 6:02 p.m.
On Thu, 2011-10-06 at 11:35 -0600, Jeff Law wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 10/06/11 04:13, Richard Guenther wrote:
> 
> > 
> > People have already commented on pieces, so I'm looking only at the
> > tree-ssa-reassoc.c pieces (did you consider piggy-backing on IVOPTs
> > instead?  The idea is to expose additional CSE opportunities,
> > right?  So it's sort-of a strength-reduction optimization on scalar
> > code (classically strength reduction in loops transforms for (i) {
> > z = i*x; } to z = 0; for (i) { z += x }). That might be worth in
> > general, even for non-address cases. So - if you rename that thing
> > to tree-ssa-strength-reduce.c you can get away without
> > piggy-backing on anything ;)  If you structure it to detect a
> > strength reduction opportunity (thus, you'd need to match
> > two/multiple of the patterns at the same time) that would be a
> > bonus ... generalizing it a little bit would be another.
> There's a variety of literature that uses PRE to detect and optimize
> straightline code strength reduction.  I poked at it at one time (RTL
> gcse framework) and it looked reasonably promising.  Never pushed it
> all the way through.
> 
> jeff

I ran across http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22586 and
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35308 that show this
question has come up before.  The former also suggested a PRE-based
approach.
Jeff Law - Oct. 6, 2011, 6:06 p.m.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/06/11 12:02, William J. Schmidt wrote:
> On Thu, 2011-10-06 at 11:35 -0600, Jeff Law wrote:
>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>> 
>> On 10/06/11 04:13, Richard Guenther wrote:
>> 
>>> 
>>> People have already commented on pieces, so I'm looking only at
>>> the tree-ssa-reassoc.c pieces (did you consider piggy-backing
>>> on IVOPTs instead?  The idea is to expose additional CSE
>>> opportunities, right?  So it's sort-of a strength-reduction
>>> optimization on scalar code (classically strength reduction in
>>> loops transforms for (i) { z = i*x; } to z = 0; for (i) { z +=
>>> x }). That might be worth in general, even for non-address
>>> cases. So - if you rename that thing to
>>> tree-ssa-strength-reduce.c you can get away without 
>>> piggy-backing on anything ;)  If you structure it to detect a 
>>> strength reduction opportunity (thus, you'd need to match 
>>> two/multiple of the patterns at the same time) that would be a 
>>> bonus ... generalizing it a little bit would be another.
>> There's a variety of literature that uses PRE to detect and
>> optimize straightline code strength reduction.  I poked at it at
>> one time (RTL gcse framework) and it looked reasonably promising.
>> Never pushed it all the way through.
>> 
>> jeff
> 
> I ran across http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22586 and 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35308 that show this 
> question has come up before.  The former also suggested a
> PRE-based approach.
Yea.  We've kicked it around several times over the last 15 or so
years.

When I briefly looked at it, I was doing so more in the context of
eliminating all the optimize_related_values crap, purely as a cleanup
and utlimately couldn't justify the time.

IIRC Morgan & Muchnick both had write-ups on the basic concepts.
There's probably other literature as well.

jeff
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJOje4wAAoJEBRtltQi2kC7q9UIAIXdiGG5Reu75PBkMPO9KKhn
RRGKQMMNSinGDyyORGqxfqwrirtFuQqzn+ITRsfjegHydUTwwDDaAtTyqEqFmpt0
2phGYBOS5pN4VImKzrd2fxlwbuW0jUlOpDuWFK+K10W8jU3SlJSIZhfaSMh1PwC5
IQm6FLDiuRuNSgyYattUnI5KZ2chN2QEkfUBQgDvbxHXfPDqjNQymIfv1K9iymrG
j3Wq7i47fBkYbnPYtAQ9GCYsmT6Wo2v+2/ZeFWE417FYYhgCdBeYu2iZPE6Nm8Pb
SypPDyi1AQ3QRfg+LPiN1bdQk40MhfPlMhHZtnVtq8nEa9+fLTgO/ERzCD0G+r8=
=XWLf
-----END PGP SIGNATURE-----
Richard Guenther - Oct. 7, 2011, 8 a.m.
On Thu, 6 Oct 2011, William J. Schmidt wrote:

> On Thu, 2011-10-06 at 16:16 +0200, Richard Guenther wrote:
> 
> <snip>
> 
> > 
> > Doh, I thought you were matching gimple stmts that do the address
> > computation.  But now I see you are matching the tree returned from
> > get_inner_reference.  So no need to check anything for that case.
> > 
> > But that keeps me wondering what you'll do if the accesses were
> > all pointer arithmetic, not arrays.  Thus,
> > 
> > extern void foo (int, int, int);
> > 
> > void
> > f (int *p, unsigned int n)
> > {
> >  foo (p[n], p[n+64], p[n+128]);
> > }
> > 
> > wouldn't that have the same issue and you wouldn't handle it?
> > 
> > Richard.
> > 
> 
> Good point.  This indeed gets missed here, and that's more fuel for
> doing a generalized strength reduction along with the special cases like
> p->a[n] that are only exposed with get_inner_reference.
> 
> (The pointer arithmetic cases were picked up in my earlier "big-hammer"
> approach using the aff-comb machinery, but that had too many problems in
> the end, as you know.)

Yeah, I know.  It's a very tricky area ;)

> So for the long term I will look into a full strength reducer for
> non-loop code.  For the short term, what do you think about keeping this
> single transformation in reassoc to make sure it gets into 4.7?  I would
> plan to strip it back out and fold it into the strength reducer
> thereafter, which might or might not make 4.7 depending on my other
> responsibilities and how the 4.7 schedule goes.  I haven't seen anything
> official, but I'm guessing we're getting towards the end of 4.7 stage 1?

It's a reasonable plan - you'd have to introduce a late reassoc
pass though.  Can you separate out the RTL fwprop changes?  So
we can iterate over the tree parts separately.

Thanks,
Richard.
Richard Guenther - Oct. 7, 2011, 8:06 a.m.
On Thu, 6 Oct 2011, Jeff Law wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 10/06/11 04:13, Richard Guenther wrote:
> 
> > 
> > People have already commented on pieces, so I'm looking only at the
> > tree-ssa-reassoc.c pieces (did you consider piggy-backing on IVOPTs
> > instead?  The idea is to expose additional CSE opportunities,
> > right?  So it's sort-of a strength-reduction optimization on scalar
> > code (classically strength reduction in loops transforms for (i) {
> > z = i*x; } to z = 0; for (i) { z += x }). That might be worth in
> > general, even for non-address cases. So - if you rename that thing
> > to tree-ssa-strength-reduce.c you can get away without
> > piggy-backing on anything ;)  If you structure it to detect a
> > strength reduction opportunity (thus, you'd need to match
> > two/multiple of the patterns at the same time) that would be a
> > bonus ... generalizing it a little bit would be another.
> There's a variety of literature that uses PRE to detect and optimize
> straightline code strength reduction.  I poked at it at one time (RTL
> gcse framework) and it looked reasonably promising.  Never pushed it
> all the way through.

I poked at it in the tree PRE framework at some point but never
pushed it through either.  At least it didn't feel like it "fits" PRE
(and PRE should also be able to handle the loop case).

Richard.
Paolo Bonzini - Oct. 7, 2011, 9:17 a.m.
On 10/07/2011 10:00 AM, Richard Guenther wrote:
> It's a reasonable plan - you'd have to introduce a late reassoc
> pass though.  Can you separate out the RTL fwprop changes?  So
> we can iterate over the tree parts separately.

That's indeed better, also because they became CSE changes.

Paolo
William J. Schmidt - Oct. 7, 2011, 12:09 p.m.
On Fri, 2011-10-07 at 11:17 +0200, Paolo Bonzini wrote:
> On 10/07/2011 10:00 AM, Richard Guenther wrote:
> > It's a reasonable plan - you'd have to introduce a late reassoc
> > pass though.  Can you separate out the RTL fwprop changes?  So
> > we can iterate over the tree parts separately.
> 
> That's indeed better, also because they became CSE changes.
> 
> Paolo
> 

Yes, I'll plan to work these two separately.

Paolo, I should be able to shepherd your suggested patch through.
Normally company policy frowns on committing other people's patches, but
in a case like this where you're no longer actively doing GCC
development, I should be able to overcome that.

BTW, Richard, whatever my former issues with update_stmt were are gone,
so I'll be able to clean up that wart.

Thanks,
Bill

Patch

Index: gcc/fwprop.c
===================================================================
--- gcc/fwprop.c	(revision 179317)
+++ gcc/fwprop.c	(working copy)
@@ -131,6 +131,15 @@  static VEC(df_ref,heap) *reg_defs_stack;
 static bitmap local_md;
 static bitmap local_lr;
 
+/* Auxiliary information for each block for this pass.  */
+typedef struct GTY(()) fwprop_bb_aux_d
+{
+  /* Registers appearing in (mem (plus (reg ... patterns in this block.  */
+  bitmap mem_plus_regs;
+} fwprop_bb_aux, *fwprop_bb_aux_t;
+
+#define MEM_PLUS_REGS(BB) ((fwprop_bb_aux_t) ((BB)->aux))->mem_plus_regs
+
 /* Return the only def in USE's use-def chain, or NULL if there is
    more than one def in the chain.  */
 
@@ -212,7 +221,43 @@  process_uses (df_ref *use_rec, int top_flag)
 }
 
 
+/* Record whether EXPR in block BB is of the form (mem (plus (reg ...  */
 static void
+record_mem_plus_reg (basic_block bb, rtx expr)
+{
+  rtx addr, reg;
+
+  if (GET_CODE (expr) != MEM)
+    return;
+
+  addr = XEXP (expr, 0);
+  if (GET_CODE (addr) != PLUS)
+    return;
+
+  reg = XEXP (addr, 0);
+  if (!REG_P (reg))
+    return;
+
+  (void)bitmap_set_bit (MEM_PLUS_REGS (bb), REGNO (reg));
+}
+
+
+/* Record whether INSN in block BB contains any patterns of the form
+   (mem (plus (reg ...  */
+static void
+record_mem_plus_regs (basic_block bb, rtx insn)
+{
+  rtx insn_set = single_set (insn);
+
+  if (insn_set)
+    {
+      record_mem_plus_reg (bb, SET_DEST (insn_set));
+      record_mem_plus_reg (bb, SET_SRC (insn_set));
+    }
+}
+
+
+static void
 single_def_use_enter_block (struct dom_walk_data *walk_data ATTRIBUTE_UNUSED,
 			    basic_block bb)
 {
@@ -230,6 +275,8 @@  single_def_use_enter_block (struct dom_walk_data *
   process_uses (df_get_artificial_uses (bb_index), DF_REF_AT_TOP);
   process_defs (df_get_artificial_defs (bb_index), DF_REF_AT_TOP);
 
+  MEM_PLUS_REGS (bb) = BITMAP_ALLOC (NULL);
+
   /* We don't call df_simulate_initialize_forwards, as it may overestimate
      the live registers if there are unused artificial defs.  We prefer
      liveness to be underestimated.  */
@@ -242,8 +289,15 @@  single_def_use_enter_block (struct dom_walk_data *
         process_uses (DF_INSN_UID_EQ_USES (uid), 0);
         process_defs (DF_INSN_UID_DEFS (uid), 0);
 	df_simulate_one_insn_forwards (bb, insn, local_lr);
+	record_mem_plus_regs (bb, insn);
       }
 
+  if (dump_file)
+    {
+      fprintf (dump_file, "mem_plus_regs (%d): ", bb->index);
+      dump_bitmap (dump_file, MEM_PLUS_REGS (bb));
+    }
+
   process_uses (df_get_artificial_uses (bb_index), 0);
   process_defs (df_get_artificial_defs (bb_index), 0);
 }
@@ -295,6 +349,8 @@  build_single_def_use_links (void)
   local_md = BITMAP_ALLOC (NULL);
   local_lr = BITMAP_ALLOC (NULL);
 
+  alloc_aux_for_blocks (sizeof (fwprop_bb_aux));
+
   /* Walk the dominator tree looking for single reaching definitions
      dominating the uses.  This is similar to how SSA form is built.  */
   walk_data.dom_direction = CDI_DOMINATORS;
@@ -1207,6 +1263,69 @@  forward_propagate_asm (df_ref use, rtx def_insn, r
   return true;
 }
 
+/* We are proposing to replace a USE of REG in USE_SET.  Determine
+   whether we have a situation where two storage uses of REG occur
+   in the same block as follows.  Each use can be either a memory
+   store or a memory load.
+
+     ... (mem (reg REG))
+
+     ... (mem (plus (reg REG)
+                    (...)))
+
+   The problem is that it will look profitable to propagate something
+   like
+
+     (set (reg REG)
+          (plus (reg X)
+                (reg Y)))
+
+   into the first use, but not into the second one.  This breaks a 
+   CSE opportunity and raises register pressure by extending the
+   lifetimes of X and Y.  To avoid this, don't propagate into the
+   first use when this very specific situation arises.  */
+static bool
+locally_poor_mem_replacement (df_ref use, rtx use_set, rtx reg, rtx def_set)
+{
+  rtx rhs, addend, mem, base;
+  unsigned i;
+
+  /* We're only concerned if the RHS of def_set is the sum of two
+     registers.  */
+  rhs = SET_SRC (def_set);
+
+  if (GET_CODE (rhs) != PLUS)
+    return false;
+
+  for (i = 0; i < 2; i++)
+    {
+      addend = XEXP (rhs, i);
+      if (!REG_P (addend))
+	return false;
+    }
+
+  /* The use must have a single SET and be used in addressing.  */
+  if (!use_set || !DF_REF_REG_MEM_P (use))
+    return false;
+
+  if (DF_REF_REG_MEM_STORE_P (use))
+    mem = SET_DEST (use_set);
+  else
+    mem = SET_SRC (use_set);
+
+  if (GET_CODE (mem) != MEM)
+    return false;
+
+  /* We are specifically interested in the base address.  */
+  base = XEXP (mem, 0);
+  if (reg != base)
+    return false;
+
+  /* We've found a use of the first form.  Check whether uses of the
+     second form exist in the same block.  If found, don't replace.  */
+  return bitmap_bit_p (MEM_PLUS_REGS (DF_REF_BB (use)), DF_REF_REGNO (use));
+}
+
 /* Try to replace USE with SRC (defined in DEF_INSN) and simplify the
    result.  */
 
@@ -1273,6 +1392,11 @@  forward_propagate_and_simplify (df_ref use, rtx de
       return false;
     }
 
+  /* Check for a specific unprofitable propagation into a memory
+     reference, where the propagation will break a CSE opportunity.  */
+  if (locally_poor_mem_replacement (use, use_set, reg, def_set))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
@@ -1403,6 +1527,12 @@  fwprop_init (void)
 static void
 fwprop_done (void)
 {
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    BITMAP_FREE (MEM_PLUS_REGS (bb));
+  free_aux_for_blocks ();
+
   loop_optimizer_finalize ();
 
   VEC_free (df_ref, heap, use_def_ref);
Index: gcc/tree.h
===================================================================
--- gcc/tree.h	(revision 179317)
+++ gcc/tree.h	(working copy)
@@ -5775,6 +5775,9 @@  tree target_for_debug_bind (tree);
 extern tree tree_mem_ref_addr (tree, tree);
 extern void copy_mem_ref_info (tree, tree);
 
+/* In tree-ssa-loop-ivopts.c.  */
+extern void copy_ref_info (tree, tree);
+
 /* In tree-vrp.c */
 extern bool ssa_name_nonnegative_p (const_tree);
 
Index: gcc/testsuite/gcc.target/powerpc/ppc-pr46556-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/ppc-pr46556-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/ppc-pr46556-1.c	(revision 0)
@@ -0,0 +1,21 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+}
+
+/* { dg-final { scan-assembler-times "lw.*3," 1 { target powerpc*-*-* } } } */
+/* { dg-final { scan-assembler-times "lw.*4,128\\(.*\\)" 1 { target powerpc*-*-* } } } */
+/* { dg-final { scan-assembler-times "lw.*5,64\\(.*\\)" 1 { target powerpc*-*-* } } } */
Index: gcc/testsuite/gcc.target/powerpc/ppc-pr46556-2.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/ppc-pr46556-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/ppc-pr46556-2.c	(revision 0)
@@ -0,0 +1,24 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 12)
+    foo (p->a[n], p->c[n], p->b[n]);
+  else if (n > 3)
+    foo (p->b[n], p->a[n], p->c[n]);
+}
+
+/* { dg-final { scan-assembler-times "lw.*4,0\\(.*\\)" 1 { target powerpc*-*-* } } } */
+/* { dg-final { scan-assembler-times "lw.*3,0\\(.*\\)" 1 { target powerpc*-*-* } } } */
Index: gcc/testsuite/gcc.target/powerpc/ppc-pr46556-3.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/ppc-pr46556-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/ppc-pr46556-3.c	(revision 0)
@@ -0,0 +1,26 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 3)
+    {
+      foo (p->a[n], p->c[n], p->b[n]);
+      if (n > 12)
+	foo (p->b[n], p->a[n], p->c[n]);
+    }
+}
+
+/* { dg-final { scan-assembler-times "lw.*3,0\\(.*\\)" 1 { target powerpc*-*-* } } } */
+/* { dg-final { scan-assembler-times "lw.*4,0\\(.*\\)" 1 { target powerpc*-*-* } } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c	(revision 0)
@@ -0,0 +1,22 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-reassoc2" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+}
+
+/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "reassoc2" } } */
+/* { dg-final { scan-tree-dump-times "p_1\\(D\\) \\+ D" 1 "reassoc2" } } */
+/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 2 "reassoc2" } } */
+/* { dg-final { cleanup-tree-dump "reassoc2" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c	(revision 0)
@@ -0,0 +1,26 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-reassoc2" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 12)
+    foo (p->a[n], p->c[n], p->b[n]);
+  else if (n > 3)
+    foo (p->b[n], p->a[n], p->c[n]);
+}
+
+/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "reassoc2" } } */
+/* { dg-final { scan-tree-dump-times "p_1\\(D\\) \\+ D" 1 "reassoc2" } } */
+/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 6 "reassoc2" } } */
+/* { dg-final { cleanup-tree-dump "reassoc2" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c	(revision 0)
@@ -0,0 +1,28 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-reassoc2" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 3)
+    {
+      foo (p->a[n], p->c[n], p->b[n]);
+      if (n > 12)
+	foo (p->b[n], p->a[n], p->c[n]);
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "reassoc2" } } */
+/* { dg-final { scan-tree-dump-times "p_1\\(D\\) \\+ D" 1 "reassoc2" } } */
+/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 6 "reassoc2" } } */
+/* { dg-final { cleanup-tree-dump "reassoc2" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-4.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-4.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-4.c	(revision 0)
@@ -0,0 +1,20 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-fwprop1" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+}
+
+/* { dg-final { scan-rtl-dump-times "\\(mem.\*:.\* \\(reg.\*p_1\\(D\\)->a" 1 "fwprop1" } } */
+/* { dg-final { cleanup-rtl-dump "fwprop1" } } */
Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 179317)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -6280,7 +6280,7 @@  rewrite_use_nonlinear_expr (struct ivopts_data *da
 
 /* Copies the reference information from OLD_REF to NEW_REF.  */
 
-static void
+void
 copy_ref_info (tree new_ref, tree old_ref)
 {
   tree new_ptr_base = NULL_TREE;
Index: gcc/tree-ssa-reassoc.c
===================================================================
--- gcc/tree-ssa-reassoc.c	(revision 179317)
+++ gcc/tree-ssa-reassoc.c	(working copy)
@@ -2361,6 +2361,116 @@  break_up_subtract_bb (basic_block bb)
     break_up_subtract_bb (son);
 }
 
+/* Given BASE and OFFSET derived from *EXPR occurring in stmt *GSI,
+   determine whether there is a profitable opportunity to restructure
+   address arithmetic within BASE and OFFSET.  If so, produce such
+   a restructuring in *MEM_REF and return true; else return false.  */
+
+static tree
+restructure_base_and_offset (tree *expr, gimple_stmt_iterator *gsi,
+			     tree base, tree offset, HOST_WIDE_INT bitpos)
+{
+  tree mult_op0, mult_op1, t1, t2, offset_type, mult_expr, add_expr, mem_ref;
+  HOST_WIDE_INT c, c1, c2, c3, c4;
+
+  /* Look for the following pattern:
+
+       base   = MEM_REF (T1, C1);
+       offset = MULT_EXPR (PLUS_EXPR (T2, C2), C3)
+       bitpos = C4 * BITS_PER_UNIT
+
+     If found, convert it to:
+
+       MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
+                C1 + (C2 * C3) + C4)
+   */
+  if (!base
+      || !offset
+      || TREE_CODE (base) != MEM_REF
+      || TREE_CODE (TREE_OPERAND (base, 1)) != INTEGER_CST
+      || TREE_CODE (offset) != MULT_EXPR)
+    return NULL_TREE;
+
+  mult_op0 = TREE_OPERAND (offset, 0);
+  mult_op1 = TREE_OPERAND (offset, 1);
+
+  if (TREE_CODE (mult_op0) != PLUS_EXPR
+      || TREE_CODE (mult_op1) != INTEGER_CST
+      || TREE_CODE (TREE_OPERAND (mult_op0, 1)) != INTEGER_CST)
+    return NULL_TREE;
+
+  t1 = TREE_OPERAND (base, 0);
+  t2 = TREE_OPERAND (mult_op0, 0);
+  c1 = TREE_INT_CST_LOW (TREE_OPERAND (base, 1));
+  c2 = TREE_INT_CST_LOW (TREE_OPERAND (mult_op0, 1));
+  c3 = TREE_INT_CST_LOW (mult_op1);
+  c4 = bitpos / BITS_PER_UNIT;
+  c = c1 + c2 * c3 + c4;
+
+  offset_type = TREE_TYPE (TREE_OPERAND (base, 1));
+  
+  mult_expr = fold_build2 (MULT_EXPR, sizetype, t2,
+			   build_int_cst (sizetype, c3));
+  mult_expr = force_gimple_operand_gsi (gsi, mult_expr, true, NULL,
+					true, GSI_SAME_STMT);
+  add_expr = fold_build2 (POINTER_PLUS_EXPR, TREE_TYPE (t1), t1, mult_expr);
+  add_expr = force_gimple_operand_gsi (gsi, add_expr, true, NULL,
+				       true, GSI_SAME_STMT);
+  mem_ref = fold_build2 (MEM_REF, TREE_TYPE (*expr), add_expr,
+			 build_int_cst (offset_type, c));
+  return mem_ref;
+}
+
+/* If *EXPR contains a memory reference, determine whether there is an
+   opportunity to restructure the base and offset expressions for
+   improved performance.  */
+
+static bool
+restructure_mem_ref (tree *expr, gimple_stmt_iterator *gsi)
+{
+  enum tree_code code = TREE_CODE (*expr);
+  tree base, offset, mem_ref;
+  HOST_WIDE_INT bitsize, bitpos;
+  enum machine_mode mode;
+  int unsignedp, volatilep;
+
+  /* Only expressions that reference memory are of interest.  Bitfield
+     references should eventually be lowered prior to this point and
+     are not dealt with.  */
+  if (! handled_component_p (*expr)
+      || code == BIT_FIELD_REF
+      || (code == COMPONENT_REF && DECL_BIT_FIELD (TREE_OPERAND (*expr, 1))))
+    return false;
+
+  /* Determine whether the expression can be represented with base and
+     offset components.  */
+  base = get_inner_reference (*expr, &bitsize, &bitpos, &offset, &mode,
+			      &unsignedp, &volatilep, false);
+  if (!base || !offset)
+    return false;
+
+  /* Look for a restructuring opportunity.  */
+  if ((mem_ref = restructure_base_and_offset (expr, gsi, base,
+					      offset, bitpos)) == NULL_TREE)
+    return false;
+
+  /* Found one.  Record the replacement in the dump file and complete
+     the replacement.  */
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\nOriginal_expr = ");
+      print_generic_expr (dump_file, *expr, TDF_VOPS|TDF_MEMSYMS);
+      fprintf (dump_file, "\nmem_ref = ");
+      print_generic_expr (dump_file, mem_ref, TDF_VOPS|TDF_MEMSYMS);
+      fprintf (dump_file, "\n\n");
+    }
+
+  copy_ref_info (mem_ref, *expr);
+  *expr = mem_ref;
+
+  return true;
+}
+
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.  */
 
@@ -2369,14 +2479,30 @@  reassociate_bb (basic_block bb)
 {
   gimple_stmt_iterator gsi;
   basic_block son;
+  bool chgd_mem_ref = false;
+  bool in_loop = loop_depth (bb->loop_father) != 0;
 
   for (gsi = gsi_last_bb (bb); !gsi_end_p (gsi); gsi_prev (&gsi))
     {
       gimple stmt = gsi_stmt (gsi);
 
-      if (is_gimple_assign (stmt)
-	  && !stmt_could_throw_p (stmt))
+      /* Look for restructuring opportunities within an expression
+	 that references memory.  We only do this for blocks not
+         contained in loops, since the ivopts machinery does a 
+         good job on loop expressions, and we don't want to interfere
+	 with other loop optimizations.  */
+      if (!in_loop && gimple_vuse (stmt) && gimple_assign_single_p (stmt))
 	{
+	  tree *lhs, *rhs;
+	  lhs = gimple_assign_lhs_ptr (stmt);
+	  chgd_mem_ref = restructure_mem_ref (lhs, &gsi) || chgd_mem_ref;
+	  rhs = gimple_assign_rhs1_ptr (stmt);
+	  chgd_mem_ref = restructure_mem_ref (rhs, &gsi) || chgd_mem_ref;
+	}
+
+      else if (is_gimple_assign (stmt)
+	       && !stmt_could_throw_p (stmt))
+	{
 	  tree lhs, rhs1, rhs2;
 	  enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
 
@@ -2489,6 +2615,12 @@  reassociate_bb (basic_block bb)
 	    }
 	}
     }
+  /* If memory references have been restructured, immediate uses need
+     to be cleaned up.  */
+  if (chgd_mem_ref)
+    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+      update_stmt (gsi_stmt (gsi));
+
   for (son = first_dom_son (CDI_POST_DOMINATORS, bb);
        son;
        son = next_dom_son (CDI_POST_DOMINATORS, son))