
Address lowering [1/3] Main patch

Message ID: 1309444782.26980.52.camel@oc2474580526.ibm.com
State: New

Commit Message

Bill Schmidt June 30, 2011, 2:39 p.m. UTC
This is the first of three patches related to lowering addressing
expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
contains the new pass together with supporting changes in existing
modules.  The second patch contains an independent change to the RTL
forward propagator to keep it from undoing an optimization made in the
first patch.  The third patch contains new test cases and changes to
existing test cases.

Although I've broken it up into three patches to make the review easier,
it would be best to commit at least the first and third together to
avoid regressions.  The second can stand alone.

I've done regression tests on powerpc64 and x86_64, and have asked
Andreas Krebbel to test against the IBM z (390) platform.  I've done
performance regression testing on powerpc64.  The only performance
regression of note is the 2% degradation to 188.ammp due to loss of
field disambiguation information.  As discussed in another thread,
fixing this introduces more complexity than it's worth.

I look forward to your comments!

Thanks,
Bill

2011-06-30  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	PR rtl-optimization/46556
	* dbgcnt.def (lower_addr): New debug counter.
	* tree-ssa-lower-addr.c: New.
	* tree.c (may_be_unaligned_p): Moved from tree-ssa-loop-ivopts.c.
	* tree.h (may_be_unaligned_p): New declaration.
	(copy_ref_info): Likewise.
	* tree-pass.h (pass_tree_lower_addr): Likewise.
	* tree-ssa-loop-ivopts.c (may_be_unaligned_p): Moved to tree.c.
	(copy_ref_info): Moved to tree-ssa-address.c.
	* timevar.def (TV_TREE_LOWER_ADDR): New timevar.
	* tree-ssa-address.c (addr_to_parts): Provide for call on behalf
	of address lowering.
	(copy_ref_info): Moved from tree-ssa-loop-ivopts.c.
	* tree-affine.c (tree-flow.h): New include.
	(mem_tree_to_aff_combination_expand): New.
	* tree-affine.h (mem_tree_to_aff_combination_expand): New decl.
	* common.opt (lower-addr): New -f option.
	* Makefile.in (tree-ssa-lower-addr.o): New.
	(tree-affine.c): New dependency on tree-flow.h.
	* gimple.c (gimple_assign_conversion_p): New.
	* gimple.h (gimple_assign_conversion_p): New decl.
	* passes.c (init_optimization_passes): Add new pass.

Comments

Richard Biener July 4, 2011, 2:21 p.m. UTC | #1
On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> This is the first of three patches related to lowering addressing
> expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
> contains the new pass together with supporting changes in existing
> modules.  The second patch contains an independent change to the RTL
> forward propagator to keep it from undoing an optimization made in the
> first patch.  The third patch contains new test cases and changes to
> existing test cases.
>
> Although I've broken it up into three patches to make the review easier,
> it would be best to commit at least the first and third together to
> avoid regressions.  The second can stand alone.
>
> I've done regression tests on powerpc64 and x86_64, and have asked
> Andreas Krebbel to test against the IBM z (390) platform.  I've done
> performance regression testing on powerpc64.  The only performance
> regression of note is the 2% degradation to 188.ammp due to loss of
> field disambiguation information.  As discussed in another thread,
> fixing this introduces more complexity than it's worth.

Are there also performance improvements?  What about code size?

I tried to get an understanding of what kind of optimizations this patch
produces based on the set of testcases you added, but I have a hard
time here.  Can you outline some, please?

I still do not like the implementation of yet another CSE machinery
given that we already have two.  I think most of the need for CSE
comes from the use of the affine combination framework and
force_gimple_operand.  In fact I'd be interested to see cases that
are optimized that could not be handled by a combine-like
pattern matcher?

Thanks,
Richard.
Michael Matz July 4, 2011, 3:30 p.m. UTC | #2
Hi,

On Mon, 4 Jul 2011, Richard Guenther wrote:

> I still do not like the implementation of yet another CSE machinery
> given that we already have two.

From reading it, it really seems to be a normal block-local CSE, without
anything fancy.  Hence, moving the pass just a little earlier (before
pass_vrp/pass_dominator) should already provide for all optimizations.  If
not, those should be improved.

I see that it is also used for getting rid of the zero-offset statements
in case non-zero offsets follow.  I think that's generally worthwhile, so
it probably should be done in one of the above optimizers.

You handle NOP_EXPR differently from CONVERT_EXPR.  The middle-end doesn't 
distinguish between them (yes, ignore the comment about one generating 
code, the other not).

Your check for small types:

+         if (TYPE_MODE (TREE_TYPE (TREE_OPERAND (expr, 0))) == SImode)
+           ref_found = true;

You probably want != BLKmode .

+  if (changed && is_zero_offset_ref (gimple_assign_lhs (stmt)))
+    VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
+
+  rhs1 = gimple_assign_rhs1_ptr (stmt);
+  rhs_changed = tree_ssa_lower_addr_tree (rhs1, gsi, speed, false);
+
+  /* Record zero-offset mem_refs on the RHS.  */
+  if (rhs_changed && is_zero_offset_ref (gimple_assign_rhs1 (stmt)))
+    VEC_safe_push (gimple, heap, zero_offset_refs, stmt);

This possibly adds stmt twice to zero_offset_refs.  Do you really want 
this?


Ciao,
Michael.
Bill Schmidt July 5, 2011, 1:59 p.m. UTC | #3
(Sorry for the late response; yesterday was a holiday here.)

On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
> > This is the first of three patches related to lowering addressing
> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
> > contains the new pass together with supporting changes in existing
> > modules.  The second patch contains an independent change to the RTL
> > forward propagator to keep it from undoing an optimization made in the
> > first patch.  The third patch contains new test cases and changes to
> > existing test cases.
> >
> > Although I've broken it up into three patches to make the review easier,
> > it would be best to commit at least the first and third together to
> > avoid regressions.  The second can stand alone.
> >
> > I've done regression tests on powerpc64 and x86_64, and have asked
> > Andreas Krebbel to test against the IBM z (390) platform.  I've done
> > performance regression testing on powerpc64.  The only performance
> > regression of note is the 2% degradation to 188.ammp due to loss of
> > field disambiguation information.  As discussed in another thread,
> > fixing this introduces more complexity than it's worth.
> 
> Are there also performance improvements?  What about code size?

Yes, there are performance improvements.  I've been running cpu2000 on
32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
between 1% and 9%, with 187.facerec showing the largest improvements for
both 32 and 64.

I don't have formal code size results, but anecdotally from code
crawling, I have seen code size either neutral or slightly improved.
The largest code size improvements I've seen were on 32-bit code where
the commoning allowed removal of a number of sign-extend and zero-extend
instructions that were otherwise not seen to be redundant.

> 
> I tried to get an understanding of what kind of optimizations this patch
> produces based on the set of testcases you added, but I have a hard
> time here.  Can you outline some, please?
> 

The primary goal is to clean up code such as is shown in the original
post of PR46556.  In late 2007 there were some changes made to address
canonicalization that caused the code gen to be suboptimal on powerpc64.
At that time you and others suggested a pattern recognizer prior to
expand as probably the best solution, similar to what IVopts is doing.

By using the same mem_ref generation machinery used by IVopts, together
with local CSE, the goal was to ensure base registers are properly
shared so that optimal code is generated, particularly for cases of
shared addressability to structures and arrays.  I also observed cases
where it was useful to extend the sharing across the dominator tree.
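
For reference, the kind of source involved looks roughly like the sketch
below (array sizes and names are made up, not necessarily the exact
testcase from the PR).  All three accesses share the base p and the index
n, so the goal is for the generated code to compute the scaled index once
and address the three fields with constant displacements off a single
base register:

struct x
{
  int a[16];
  int b[16];
  int c[16];
};

extern void foo (int, int, int);

void
bar (struct x *p, int n)
{
  foo (p->a[n], p->c[n], p->b[n]);
}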

Secondarily, I noticed that once this was cleaned up we still had the
suboptimal code generation for the zero-offset mem refs scenario
outlined in the code commentary.  The extra logic to clean this up helps
keep register pressure down.  I've observed some spill code reduction
when this is in place.  Again, using expression availability from
dominating blocks helps here in some cases as well.

> I still do not like the implementation of yet another CSE machinery
> given that we already have two.  I think most of the need for CSE
> comes from the use of the affine combination framework and
> force_gimple_operand.  In fact I'd be interested to see cases that
> are optimized that could not be handled by a combine-like
> pattern matcher?
> 

I'm somewhat puzzled by this comment.  Back on 2/4, I posted a skeletal
outline for this pass and asked for comments.  You indicated a concern
that the affine combination expansion would un-CSE a lot of expressions,
and that I should keep track of local candidates during this pass to
ensure that base registers, etc., would be properly shared.  It seemed
to me that a quick and dirty CSE of addressing expressions would be the
best way to handle that.  I posted another RFC on 2/24 with an early
implementation of CSE constrained to a single block, and indicated
shortly thereafter my intent to extend this to a dominator walk.  I
didn't receive any negative comments about using CSE at that time; but
then I didn't get much response at all.

I probably should have pushed harder to get comments at that point, but
I was very new to the community and was feeling my way through the
protocols; I didn't feel comfortable pushing.

In any case, yes, the primary need for CSE is as a result of the affine
combination framework, which creates a large amount of redundant code.
I can certainly take a look at Michael's suggestion to move the pass
earlier and allow pass-dominator to handle the cleanup, and add the
zero-offset mem-ref optimization to that or a subsequent pass.  Do you
agree that this would be a preferable approach?

> Thanks,
> Richard.
Bill Schmidt July 5, 2011, 2:05 p.m. UTC | #4
On Mon, 2011-07-04 at 17:30 +0200, Michael Matz wrote:
> Hi,
> 
> On Mon, 4 Jul 2011, Richard Guenther wrote:
> 
> > I still do not like the implementation of yet another CSE machinery
> > given that we already have two.
> 
> From reading it, it really seems to be a normal block-local CSE, without
> anything fancy.  Hence, moving the pass just a little earlier (before
> pass_vrp/pass_dominator) should already provide for all optimizations.  If
> not, those should be improved.
> 
> I see that it is also used for getting rid of the zero-offset statements
> in case non-zero offsets follow.  I think that's generally worthwhile, so
> it probably should be done in one of the above optimizers.

OK, thanks for the suggestion.  I can certainly look into this.

> 
> You handle NOP_EXPR differently from CONVERT_EXPR.  The middle-end doesn't 
> distinguish between them (yes, ignore the comment about one generating 
> code, the other not).
> 

Ah, thanks, good to know.

> Your check for small types:
> 
> +         if (TYPE_MODE (TREE_TYPE (TREE_OPERAND (expr, 0))) == SImode)
> +           ref_found = true;
> 
> You probably want != BLKmode .
> 

OK.

> +  if (changed && is_zero_offset_ref (gimple_assign_lhs (stmt)))
> +    VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
> +
> +  rhs1 = gimple_assign_rhs1_ptr (stmt);
> +  rhs_changed = tree_ssa_lower_addr_tree (rhs1, gsi, speed, false);
> +
> +  /* Record zero-offset mem_refs on the RHS.  */
> +  if (rhs_changed && is_zero_offset_ref (gimple_assign_rhs1 (stmt)))
> +    VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
> 
> This possibly adds stmt twice to zero_offset_refs.  Do you really want 
> this?
> 

Hm, I didn't think it was (currently) possible for a gimple statement to
have a mem-ref on both RHS and LHS.  Is that incorrect?  This is easily
changed if so, or if the possibility should be left open for the future.

> 
> Ciao,
> Michael.

Thanks,
Bill
Michael Matz July 5, 2011, 2:24 p.m. UTC | #5
Hi,

On Tue, 5 Jul 2011, William J. Schmidt wrote:

> Hm, I didn't think it was (currently) possible for a gimple statement to 
> have a mem-ref on both RHS and LHS.  Is that incorrect?  This is easily 
> changed if so, or if the possibility should be left open for the future.

Think aggregate copies:

void bla (struct S *d, struct S *s) { *d = *s; }


Ciao,
Michael.
Bill Schmidt July 5, 2011, 5:02 p.m. UTC | #6
On Mon, 2011-07-04 at 17:30 +0200, Michael Matz wrote:

> From reading it, it really seems to be a normal block-local CSE, without
> anything fancy.  Hence, moving the pass just a little earlier (before
> pass_vrp/pass_dominator) should already provide for all optimizations.  If
> not, those should be improved.
> 
I did a quick hack-up, and this looks like it will work well, at least
on the few simple examples I threw at it.

> I see that it is also used for getting rid of the zero-offset statements
> in case non-zero offsets follow.  I think that's generally worthwhile, so
> it probably should be done in one of the above optimizers.
> 
I will experiment with adding it to pass_dominator.  Looks like
tree-ssa-dom.c:optimize_stmt is the place to start.

Thanks,
Bill
Richard Biener July 6, 2011, 1:16 p.m. UTC | #7
On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> (Sorry for the late response; yesterday was a holiday here.)
>
> On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
>> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
>> <wschmidt@linux.vnet.ibm.com> wrote:
>> > This is the first of three patches related to lowering addressing
>> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
>> > contains the new pass together with supporting changes in existing
>> > modules.  The second patch contains an independent change to the RTL
>> > forward propagator to keep it from undoing an optimization made in the
>> > first patch.  The third patch contains new test cases and changes to
>> > existing test cases.
>> >
>> > Although I've broken it up into three patches to make the review easier,
>> > it would be best to commit at least the first and third together to
>> > avoid regressions.  The second can stand alone.
>> >
>> > I've done regression tests on powerpc64 and x86_64, and have asked
>> > Andreas Krebbel to test against the IBM z (390) platform.  I've done
>> > performance regression testing on powerpc64.  The only performance
>> > regression of note is the 2% degradation to 188.ammp due to loss of
>> > field disambiguation information.  As discussed in another thread,
>> > fixing this introduces more complexity than it's worth.
>>
>> Are there also performance improvements?  What about code size?
>
> Yes, there are performance improvements.  I've been running cpu2000 on
> 32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
> between 1% and 9%, with 187.facerec showing the largest improvements for
> both 32 and 64.
>
> I don't have formal code size results, but anecdotally from code
> crawling, I have seen code size either neutral or slightly improved.
> The largest code size improvements I've seen were on 32-bit code where
> the commoning allowed removal of a number of sign-extend and zero-extend
> instructions that were otherwise not seen to be redundant.
>>
>> I tried to get an understanding of what kind of optimizations this patch
>> produces based on the set of testcases you added, but I have a hard
>> time here.  Can you outline some, please?
>>
>
> The primary goal is to clean up code such as is shown in the original
> post of PR46556.  In late 2007 there were some changes made to address
> canonicalization that caused the code gen to be suboptimal on powerpc64.
> At that time you and others suggested a pattern recognizer prior to
> expand as probably the best solution, similar to what IVopts is doing.

The PR46556 case looks quite simple.

> By using the same mem_ref generation machinery used by IVopts, together
> with local CSE, the goal was to ensure base registers are properly
> shared so that optimal code is generated, particularly for cases of
> shared addressability to structures and arrays.  I also observed cases
> where it was useful to extend the sharing across the dominator tree.

As you are doing IV selection per individual statement only, using
the affine combination machinery looks like quite a big hammer to me.
Especially as it is hard to imagine what the side-effects are, apart
from re-associating dependencies that do not fit the MEM-REF and
making the MEM-REF as complicated as permitted by the target.

What I thought originally when suggesting to do something similar
to IVOPTs was to build a list of candidates and uses and optimize
that set using a cost function similar to how IVOPTs does.

Doing addressing-mode selection locally per statement seems more like
a task for a few pattern matchers, for example in tree-ssa-forwprop.c
(for its last invocation).  One pattern would be that of PR46556,
MEM[(p + ((n + 16)*4))], which we can transform to
TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was
a single use.
re-use the address-description and target-availability checks from
tree-ssa-address.c.  I would be at least interested in whether
handling the pattern from PR46556 alone (or maybe with a few
similar other cases) is responsible for the performance improvements.
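
At the source level the rewrite amounts to something like the sketch below
(a rough picture only; the actual transform would of course work on the
MEM_REF/TARGET_MEM_REF trees, and the function names here are invented):

int
before (int *p, long n)
{
  return *(int *) ((char *) p + (n + 16) * 4);
}

int
after (int *p, long n)
{
  char *x = (char *) p + n * 4;   /* x = p + n*4, the re-usable base */
  return *(int *) (x + 64);       /* constant displacement of 64 */
}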

Ideally we'd of course have a cost driven machinery that considers
(part of) the whole function.

> Secondarily, I noticed that once this was cleaned up we still had the
> suboptimal code generation for the zero-offset mem refs scenario
> outlined in the code commentary.  The extra logic to clean this up helps
> keep register pressure down.  I've observed some spill code reduction
> when this is in place.  Again, using expression availability from
> dominating blocks helps here in some cases as well.

Yeah, the case is quite odd and doesn't really fit existing optimizers
given that the CSE opportunity is hidden within the TARGET_MEM_REF ...

>> I still do not like the implementation of yet another CSE machinery
>> given that we already have two.  I think most of the need for CSE
>> comes from the use of the affine combination framework and
>> force_gimple_operand.  In fact I'd be interested to see cases that
>> are optimized that could not be handled by a combine-like
>> pattern matcher?
>>
>
> I'm somewhat puzzled by this comment.  Back on 2/4, I posted a skeletal
> outline for this pass and asked for comments.  You indicated a concern
> that the affine combination expansion would un-CSE a lot of expressions,
> and that I should keep track of local candidates during this pass to
> ensure that base registers, etc., would be properly shared.  It seemed
> to me that a quick and dirty CSE of addressing expressions would be the
> best way to handle that.  I posted another RFC on 2/24 with an early
> implementation of CSE constrained to a single block, and indicated
> shortly thereafter my intent to extend this to a dominator walk.  I
> didn't receive any negative comments about using CSE at that time; but
> then I didn't get much response at all.

Sorry about that - I agree that doing a local CSE is one possible
solution, but given that we are only considering a statement
at a time, it looks like a quite bloated approach.  A local CSE
capability would indeed be useful for other passes in GCC as well,
for example loop unrolling, which exposes lots of garbage to later
passes.  I thought of re-using the machinery that DOM provides, avoiding
yet another CSE implementation.

>
> I probably should have pushed harder to get comments at that point, but
> I was very new to the community and was feeling my way through the
> protocols; I didn't feel comfortable pushing.

And pushing doesn't always help either.

> In any case, yes, the primary need for CSE is as a result of the affine
> combination framework, which creates a large amount of redundant code.
> I can certainly take a look at Michael's suggestion to move the pass
> earlier and allow pass-dominator to handle the cleanup, and add the
> zero-offset mem-ref optimization to that or a subsequent pass.  Do you
> agree that this would be a preferable approach?

It would definitely be preferable to a new CSE implementation.  I'm not
sure the zero-offset optimization easily fits in DOM, though.

What I'd really like to better understand is what transformations are
the profitable ones and if we can handle them statement-locally
within our combine machinery (thus, tree-ssa-forwprop.c).  I can
cook up a forwprop patch to fix PR46556 in case you are interested
in more details.

Thanks,
Richard.

>> Thanks,
>> Richard.
>
>
>
Bill Schmidt July 6, 2011, 2:28 p.m. UTC | #8
On Wed, 2011-07-06 at 15:16 +0200, Richard Guenther wrote:
> On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
> > (Sorry for the late response; yesterday was a holiday here.)
> >
> > On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
> >> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
> >> <wschmidt@linux.vnet.ibm.com> wrote:
> >> > This is the first of three patches related to lowering addressing
> >> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
> >> > contains the new pass together with supporting changes in existing
> >> > modules.  The second patch contains an independent change to the RTL
> >> > forward propagator to keep it from undoing an optimization made in the
> >> > first patch.  The third patch contains new test cases and changes to
> >> > existing test cases.
> >> >
> >> > Although I've broken it up into three patches to make the review easier,
> >> > it would be best to commit at least the first and third together to
> >> > avoid regressions.  The second can stand alone.
> >> >
> >> > I've done regression tests on powerpc64 and x86_64, and have asked
> >> > Andreas Krebbel to test against the IBM z (390) platform.  I've done
> >> > performance regression testing on powerpc64.  The only performance
> >> > regression of note is the 2% degradation to 188.ammp due to loss of
> >> > field disambiguation information.  As discussed in another thread,
> >> > fixing this introduces more complexity than it's worth.
> >>
> >> Are there also performance improvements?  What about code size?
> >
> > Yes, there are performance improvements.  I've been running cpu2000 on
> > 32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
> > between 1% and 9%, with 187.facerec showing the largest improvements for
> > both 32 and 64.
> >
> > I don't have formal code size results, but anecdotally from code
> > crawling, I have seen code size either neutral or slightly improved.
> > The largest code size improvements I've seen were on 32-bit code where
> > the commoning allowed removal of a number of sign-extend and zero-extend
> > instructions that were otherwise not seen to be redundant.
> >>
> >> I tried to get an understanding of what kind of optimizations this patch
> >> produces based on the set of testcases you added, but I have a hard
> >> time here.  Can you outline some, please?
> >>
> >
> > The primary goal is to clean up code such as is shown in the original
> > post of PR46556.  In late 2007 there were some changes made to address
> > canonicalization that caused the code gen to be suboptimal on powerpc64.
> > At that time you and others suggested a pattern recognizer prior to
> > expand as probably the best solution, similar to what IVopts is doing.
> 
> The PR46556 case looks quite simple.

It certainly is.  I was personally curious whether there were other
suboptimal sequences that might be hiding out there, that a more general
approach might expose.  There was a comment at the end of the bugzilla
about a pass to expose target addressing modes in gimple for this
purpose.  When I first started looking at this, I looked for some
feedback from the community about whether that should be done, and got a
few favorable comments along with one negative one.  So that's how we
got on this road...

> 
> > By using the same mem_ref generation machinery used by IVopts, together
> > with local CSE, the goal was to ensure base registers are properly
> > shared so that optimal code is generated, particularly for cases of
> > shared addressability to structures and arrays.  I also observed cases
> > where it was useful to extend the sharing across the dominator tree.
> 
> As you are doing IV selection per individual statement only, using
> the affine combination machinery looks like quite a big hammer to me.
> Especially as it is hard to imagine what the side-effects are, apart
> from re-associating dependencies that do not fit the MEM-REF and
> making the MEM-REF as complicated as permitted by the target.
> 
> What I thought originally when suggesting to do something similar
> to IVOPTs was to build a list of candidates and uses and optimize
> that set using a cost function similar to how IVOPTs does.
> 
OK, reading back I can see that now...

>> Doing addressing-mode selection locally per statement seems more like
>> a task for a few pattern matchers, for example in tree-ssa-forwprop.c
>> (for its last invocation).  One pattern would be that of PR46556,
>> MEM[(p + ((n + 16)*4))], which we can transform to
>> TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was
>> a single use.
> re-use the address-description and target-availability checks from
> tree-ssa-address.c.  I would be at least interested in whether
> handling the pattern from PR46556 alone (or maybe with a few
> similar other cases) is responsible for the performance improvements.
> 
Hm, but I don't think forwprop sees the code in this form.  At the time
the last pass of forwprop runs, the gimple for the original problem is:

  D.1997_3 = p_1(D)->a[n_2(D)];
  D.1998_4 = p_1(D)->c[n_2(D)];
  D.1999_5 = p_1(D)->b[n_2(D)];
  foo (D.1997_3, D.1998_4, D.1999_5);

But perhaps this is just a matter of changing the patterns to recognize?

Without running experiments yet, my guess is that this would pick up
some of the benefit, but not all.  As you say, the effects are difficult
to predict; I've seen some surprisingly good results from CSEing some
complex addressing, but I also had to implement several tweaks to avoid
detrimental effects (such as stepping on what IVopts already got right).
It may be that some of the surprising positives (such as removal of
extra sign-extend instructions) indicate some cleanup work that should
be done in the back end instead.

> Ideally we'd of course have a cost driven machinery that considers
> (part of) the whole function.
> 
> > Secondarily, I noticed that once this was cleaned up we still had the
> > suboptimal code generation for the zero-offset mem refs scenario
> > outlined in the code commentary.  The extra logic to clean this up helps
> > keep register pressure down.  I've observed some spill code reduction
> > when this is in place.  Again, using expression availability from
> > dominating blocks helps here in some cases as well.
> 
> Yeah, the case is quite odd and doesn't really fit existing optimizers
> given that the CSE opportunity is hidden within the TARGET_MEM_REF ...
> 
> >> I still do not like the implementation of yet another CSE machinery
> >> given that we already have two.  I think most of the need for CSE
> >> comes from the use of the affine combination framework and
> >> force_gimple_operand.  In fact I'd be interested to see cases that
> >> are optimized that could not be handled by a combine-like
> >> pattern matcher?
> >>
> >
> > I'm somewhat puzzled by this comment.  Back on 2/4, I posted a skeletal
> > outline for this pass and asked for comments.  You indicated a concern
> > that the affine combination expansion would un-CSE a lot of expressions,
> > and that I should keep track of local candidates during this pass to
> > ensure that base registers, etc., would be properly shared.  It seemed
> > to me that a quick and dirty CSE of addressing expressions would be the
> > best way to handle that.  I posted another RFC on 2/24 with an early
> > implementation of CSE constrained to a single block, and indicated
> > shortly thereafter my intent to extend this to a dominator walk.  I
> > didn't receive any negative comments about using CSE at that time; but
> > then I didn't get much response at all.
> 
> Sorry about that - I agree that doing a local CSE is one possible
> solution, but given that we are only considering a statement
> at a time, it looks like a quite bloated approach.  A local CSE
> capability would indeed be useful for other passes in GCC as well,
> for example loop unrolling, which exposes lots of garbage to later
> passes.  I thought of re-using the machinery that DOM provides, avoiding
> yet another CSE implementation.
> 
> >
> > I probably should have pushed harder to get comments at that point, but
> > I was very new to the community and was feeling my way through the
> > protocols; I didn't feel comfortable pushing.
> 
> And pushing doesn't always help either.

That's been my general experience; I try to be careful about nagging
busy people.  I think my timing was poor at the beginning of this
project, as you guys were quite busy with closing off the 4.6 release.

> 
> > In any case, yes, the primary need for CSE is as a result of the affine
> > combination framework, which creates a large amount of redundant code.
> > I can certainly take a look at Michael's suggestion to move the pass
> > earlier and allow pass-dominator to handle the cleanup, and add the
> > zero-offset mem-ref optimization to that or a subsequent pass.  Do you
> > agree that this would be a preferable approach?
> 
> It would definitely be preferable to a new CSE implementation.  I'm not
> sure the zero-offset optimization easily fits in DOM, though.

I took a look at this yesterday and I think it fits easily enough; about
the same as it fit into my prototype pass.  The available expressions
logic is pretty similar to what I was doing, so transplanting the code
wouldn't be difficult.  The only issue I saw is that there isn't a pure
lookup function that doesn't place an expression on the avail-exprs
stack, but that's not hard to add.

> 
> What I'd really like to better understand is what transformations are
> the profitable ones and if we can handle them statement-locally
> within our combine machinery (thus, tree-ssa-forwprop.c).  I can
> cook up a forwprop patch to fix PR46556 in case you are interested
> in more details.
> 
Thanks...let me study forwprop a bit first -- I'm trying to get my head
wrapped around as much of the middle end as I can, and this is a good
excuse to delve into another piece.

I think perhaps the best way forward may be to use my prototype as a
baseline to compare against additions to the combine logic, so we can
tease out what some of the unexpected gains are, and think about how
best to achieve them more simply.

I'll have a look at forwprop and pester you (lightly :) if I have
questions.

Thanks,
Bill

> Thanks,
> Richard.
> 
> >> Thanks,
> >> Richard.
> >
> >
> >
Richard Biener July 6, 2011, 2:58 p.m. UTC | #9
On Wed, Jul 6, 2011 at 4:28 PM, William J. Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Wed, 2011-07-06 at 15:16 +0200, Richard Guenther wrote:
>> On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
>> <wschmidt@linux.vnet.ibm.com> wrote:
>> > (Sorry for the late response; yesterday was a holiday here.)
>> >
>> > On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
>> >> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
>> >> <wschmidt@linux.vnet.ibm.com> wrote:
>> >> > This is the first of three patches related to lowering addressing
>> >> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
>> >> > contains the new pass together with supporting changes in existing
>> >> > modules.  The second patch contains an independent change to the RTL
>> >> > forward propagator to keep it from undoing an optimization made in the
>> >> > first patch.  The third patch contains new test cases and changes to
>> >> > existing test cases.
>> >> >
>> >> > Although I've broken it up into three patches to make the review easier,
>> >> > it would be best to commit at least the first and third together to
>> >> > avoid regressions.  The second can stand alone.
>> >> >
>> >> > I've done regression tests on powerpc64 and x86_64, and have asked
>> >> > Andreas Krebbel to test against the IBM z (390) platform.  I've done
>> >> > performance regression testing on powerpc64.  The only performance
>> >> > regression of note is the 2% degradation to 188.ammp due to loss of
>> >> > field disambiguation information.  As discussed in another thread,
>> >> > fixing this introduces more complexity than it's worth.
>> >>
>> >> Are there also performance improvements?  What about code size?
>> >
>> > Yes, there are performance improvements.  I've been running cpu2000 on
>> > 32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
>> > between 1% and 9%, with 187.facerec showing the largest improvements for
>> > both 32 and 64.
>> >
>> > I don't have formal code size results, but anecdotally from code
>> > crawling, I have seen code size either neutral or slightly improved.
>> > The largest code size improvements I've seen were on 32-bit code where
>> > the commoning allowed removal of a number of sign-extend and zero-extend
>> > instructions that were otherwise not seen to be redundant.
>> >>
>> >> I tried to get an understanding of what kind of optimizations this patch
>> >> produces based on the set of testcases you added, but I have a hard
>> >> time here.  Can you outline some, please?
>> >>
>> >
>> > The primary goal is to clean up code such as is shown in the original
>> > post of PR46556.  In late 2007 there were some changes made to address
>> > canonicalization that caused the code gen to be suboptimal on powerpc64.
>> > At that time you and others suggested a pattern recognizer prior to
>> > expand as probably the best solution, similar to what IVopts is doing.
>>
>> The PR46556 case looks quite simple.
>
> It certainly is.  I was personally curious whether there were other
> suboptimal sequences that might be hiding out there, that a more general
> approach might expose.  There was a comment at the end of the bugzilla
> about a pass to expose target addressing modes in gimple for this
> purpose.  When I first started looking at this, I looked for some
> feedback from the community about whether that should be done, and got a
> few favorable comments along with one negative one.  So that's how we
> got on this road...
>
>>
>> > By using the same mem_ref generation machinery used by IVopts, together
>> > with local CSE, the goal was to ensure base registers are properly
>> > shared so that optimal code is generated, particularly for cases of
>> > shared addressability to structures and arrays.  I also observed cases
>> > where it was useful to extend the sharing across the dominator tree.
>>
>> As you are doing IV selection per individual statement only, using
>> the affine combination machinery looks like quite a big hammer to me.
>> Especially as it is hard to imagine what the side-effects are, apart
>> from re-associating dependencies that do not fit the MEM-REF and
>> making the MEM-REF as complicated as permitted by the target.
>>
>> What I thought originally when suggesting to do something similar
>> to IVOPTs was to build a list of candidates and uses and optimize
>> that set using a cost function similar to how IVOPTs does.
>>
> OK, reading back I can see that now...
>
>> Doing addressing-mode selection locally per statement seems more like
>> a task for a few pattern matchers, for example in tree-ssa-forwprop.c
>> (for its last invocation).  One pattern would be that of PR46556,
>> MEM[(p + ((n + 16)*4))], which we can transform to
>> TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was
>> a single use.
>> re-use the address-description and target-availability checks from
>> tree-ssa-address.c.  I would be at least interested in whether
>> handling the pattern from PR46556 alone (or maybe with a few
>> similar other cases) is responsible for the performance improvements.
>>
> Hm, but I don't think forwprop sees the code in this form.  At the time
> the last pass of forwprop runs, the gimple for the original problem is:
>
>  D.1997_3 = p_1(D)->a[n_2(D)];
>  D.1998_4 = p_1(D)->c[n_2(D)];
>  D.1999_5 = p_1(D)->b[n_2(D)];
>  foo (D.1997_3, D.1998_4, D.1999_5);
>
> But perhaps this is just a matter of changing the patterns to recognize?

Ah, so we still have the ARRAY_REFs here.  Yeah, well - then the
issue boils down to get_inner_reference canonicalizing the offset
according to what fold-const.c implements, and we simply emit
the offset in that form (if you look at the normal_inner_ref: case
in expand_expr_real*).  Thus with the above form at expansion
time the matching would be applied there (or during our RTL
address-generation in fwprop.c ...).

Another idea is of course to lower all handled_component_p
memory accesses - something I did on the old mem-ref branch
and I believe I also suggested at some point.  Without such lowering,
none of the address generation is exposed to the GIMPLE optimizers.

The simplest way of doing the lowering is to do something like

  base = get_inner_reference (rhs, ..., &bitpos, &offset, ...);
  base = fold_build2 (POINTER_PLUS_EXPR, ...,
                               base, offset);
  base = force_gimple_operand (base, ... is_gimple_mem_ref_addr);
  tem = fold_build2 (MEM_REF, TREE_TYPE (rhs),
                             base,
                             build_int_cst (get_alias_ptr_type (rhs), bitpos));

at some point.  I didn't end up doing that because of the loss of
alias information.

> Without running experiments yet, my guess is that this would pick up
> some of the benefit, but not all.  As you say, the effects are difficult
> to predict; I've seen some surprisingly good results from CSEing some
> complex addressing, but I also had to implement several tweaks to avoid
> detrimental effects (such as stepping on what IVopts already got right).
> It may be that some of the surprising positives (such as removal of
> extra sign-extend instructions) indicate some cleanup work that should
> be done in the back end instead.

Indeed - or at some other point in the tree middle-end.  For example
we do not re-associate POINTER_PLUS_EXPR at all, which loses
quite some CSE opportunities for addresses (tree-ssa-reassoc.c
would be the place to enhance).
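
A made-up example of the kind of thing that gets lost: in the function
below the two loads compute p + (i + 1)*4 and p + (i + 2)*4, and without
re-association the common p + i*4 part is never exposed as a CSE candidate.

int
sum (int *p, long i)
{
  /* Addresses are p + (i + 1)*4 and p + (i + 2)*4; rewritten as
     (p + i*4) + 4 and (p + i*4) + 8 they would share a base.  */
  return p[i + 1] + p[i + 2];
}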

>> Ideally we'd of course have a cost driven machinery that considers
>> (part of) the whole function.
>>
>> > Secondarily, I noticed that once this was cleaned up we still had the
>> > suboptimal code generation for the zero-offset mem refs scenario
>> > outlined in the code commentary.  The extra logic to clean this up helps
>> > keep register pressure down.  I've observed some spill code reduction
>> > when this is in place.  Again, using expression availability from
>> > dominating blocks helps here in some cases as well.
>>
>> Yeah, the case is quite odd and doesn't really fit existing optimizers
>> given that the CSE opportunity is hidden within the TARGET_MEM_REF ...
>>
>> >> I still do not like the implementation of yet another CSE machinery
>> >> given that we already have two.  I think most of the need for CSE
>> >> comes from the use of the affine combination framework and
>> >> force_gimple_operand.  In fact I'd be interested to see cases that
>> >> are optimized that could not be handled by a combine-like
>> >> pattern matcher?
>> >>
>> >
>> > I'm somewhat puzzled by this comment.  Back on 2/4, I posted a skeletal
>> > outline for this pass and asked for comments.  You indicated a concern
>> > that the affine combination expansion would un-CSE a lot of expressions,
>> > and that I should keep track of local candidates during this pass to
>> > ensure that base registers, etc., would be properly shared.  It seemed
>> > to me that a quick and dirty CSE of addressing expressions would be the
>> > best way to handle that.  I posted another RFC on 2/24 with an early
>> > implementation of CSE constrained to a single block, and indicated
>> > shortly thereafter my intent to extend this to a dominator walk.  I
>> > didn't receive any negative comments about using CSE at that time; but
>> > then I didn't get much response at all.
>>
>> Sorry about that - I agree that doing a local CSE is one possible
>> solution, but given that we are only considering a statement
>> at a time, it looks like a quite bloated approach.  A local CSE
>> capability would indeed be useful for other passes in GCC as well,
>> for example loop unrolling, which exposes lots of garbage to later
>> passes.  I thought of re-using the machinery that DOM provides, avoiding
>> yet another CSE implementation.
>>
>> >
>> > I probably should have pushed harder to get comments at that point, but
>> > I was very new to the community and was feeling my way through the
>> > protocols; I didn't feel comfortable pushing.
>>
>> And pushing doesn't always help either.
>
> That's been my general experience; I try to be careful about nagging
> busy people.  I think my timing was poor at the beginning of this
> project, as you guys were quite busy with closing off the 4.6 release.
>
>>
>> > In any case, yes, the primary need for CSE is as a result of the affine
>> > combination framework, which creates a large amount of redundant code.
>> > I can certainly take a look at Michael's suggestion to move the pass
>> > earlier and allow pass-dominator to handle the cleanup, and add the
>> > zero-offset mem-ref optimization to that or a subsequent pass.  Do you
>> > agree that this would be a preferable approach?
>>
>> It would definitely be preferable to a new CSE implementation.  I'm not
>> sure the zero-offset optimization easily fits in DOM, though.
>
> I took a look at this yesterday and I think it fits easily enough; about
> the same as it fit into my prototype pass.  The available expressions
> logic is pretty similar to what I was doing, so transplanting the code
> wouldn't be difficult.  The only issue I saw is that there isn't a pure
> lookup function that doesn't place an expression on the avail-exprs
> stack, but that's not hard to add.
>
>>
>> What I'd really like to better understand is what transformations are
>> the profitable ones and if we can handle them statement-locally
>> within our combine machinery (thus, tree-ssa-forwprop.c).  I can
>> cook up a forwprop patch to fix PR46556 in case you are interested
>> in more details.
>>
> Thanks...let me study forwprop a bit first -- I'm trying to get my head
> wrapped around as much of the middle end as I can, and this is a good
> excuse to delve into another piece.
>
> I think perhaps the best way forward may be to use my prototype as a
> baseline to compare against additions to the combine logic, so we can
> tease out what some of the unexpected gains are, and think about how
> best to achieve them more simply.

Yeah.  I will resurrect the simple pass I had to lower all memory
accesses; that will give us another toy to play with.

Thanks,
Richard.

> I'll have a look at forwprop and pester you (lightly :) if I have
> questions.
>
> Thanks,
> Bill
>
>> Thanks,
>> Richard.
>>
>> >> Thanks,
>> >> Richard.
>> >
>> >
>> >
>
>
>
Bill Schmidt July 8, 2011, 1:57 p.m. UTC | #10
On Mon, 2011-07-04 at 17:30 +0200, Michael Matz wrote:
> Hi,
> 
> On Mon, 4 Jul 2011, Richard Guenther wrote:
> 
> > I still do not like the implementation of yet another CSE machinery
> > given that we already have two.
> 
> From reading it, it really seems to be a normal block-local CSE, without
> anything fancy.  Hence, moving the pass just a little earlier (before
> pass_vrp/pass_dominator) should already provide for all optimizations.  If
> not, those should be improved.
> 
> I see that it is also used for getting rid of the zero-offset statements
> in case non-zero offsets follow.  I think that's generally worthwhile, so
> it probably should be done in one of the above optimizers.

Just FYI, I've verified that this works as expected; the zero-offset
optimization can be moved into the dom2 pass without too much
difficulty, and the CSE in the dom2 pass is sufficient for what I've
seen in my limited testing.  This reduces the size and complexity of the
patch considerably -- thanks!

If it turns out that we end up going this route, it will mean modifying
at least 11 more scan tests whose expected output out of dom2 changes.

However, I'll be turning my attention now to some of the alternate
implementations that Richard suggested, to see how much of the gains can
be obtained by other means.

Thanks,
Bill

Patch

Index: gcc/dbgcnt.def
===================================================================
--- gcc/dbgcnt.def	(revision 175664)
+++ gcc/dbgcnt.def	(working copy)
@@ -1,5 +1,5 @@ 
 /* This file contains the list of the debug counter for GCC.
-   Copyright (C) 2006, 2007, 2008, 2010 Free Software Foundation, Inc.
+   Copyright (C) 2006, 2007, 2008, 2010, 2011 Free Software Foundation, Inc.
 
 This file is part of GCC.
 
@@ -184,3 +184,4 @@  DEBUG_COUNTER (sms_sched_loop)
 DEBUG_COUNTER (store_motion)
 DEBUG_COUNTER (split_for_sched2)
 DEBUG_COUNTER (tail_call)
+DEBUG_COUNTER (lower_addr)
Index: gcc/tree-ssa-lower-addr.c
===================================================================
--- gcc/tree-ssa-lower-addr.c	(revision 0)
+++ gcc/tree-ssa-lower-addr.c	(revision 0)
@@ -0,0 +1,1430 @@ 
+/* Expose target addressing modes in late gimple representation.
+   Copyright (C) 2011
+   Free Software Foundation, Inc.
+   Contributed by Bill Schmidt <wschmidt@us.ibm.com>
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* This pass performs a scan over the gimple representation to expose
+   target machine addressing in reference-class expressions.  This is
+   necessary for some targets for which the current address
+   canonicalization is suboptimal.  Induction variable optimizations
+   already expose target addressing for many (but not all) expressions
+   in loops, so this has most benefit in non-loop code.  */
+
+/* Address lowering in gimple loses a small amount of information that
+   is useful in disambiguating memory references.  In alias.c, the
+   nonoverlapping_component_refs_p function examines the mem_exprs
+   associated with two component references and attempts to prove
+   they do not overlap.  This is done by walking the types of the
+   two mem_exprs looking for a common record type, and demonstrating
+   that the two references address non-overlapping fields of that
+   type.  This improves on TBAA by further disambiguating references
+   to fields that have the same type.
+
+   With addresses lowered in gimple, the mem_exprs no longer contain
+   field references, but instead contain MEM_REFs and TARGET_MEM_REFs
+   that cannot necessarily be proven to be field references.  In any
+   case, they no longer describe the full type hierarchy of the 
+   underlying reference, so in general we lose some opportunity to
+   disambiguate field references.
+
+   This has been observed to cause occasional small degradations in
+   generated code.  For example, a 2% degradation was measured for
+   188.ammp (SPEC cpu2000) on powerpc64 when the information loss led
+   to a suboptimal instruction schedule.  However, such occurrences
+   are believed to be infrequent and have minimal effects compared
+   with the overall benefit of address lowering.
+
+   To address this problem would require somehow carrying the lost
+   type information in the intermediate representation from address
+   lowering to the expand phase.  The cost and complexity of maintaining
+   this information does not currently seem justified.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tree.h"
+#include "basic-block.h"
+#include "tree-flow.h"
+#include "tree-dump.h"
+#include "tree-pass.h"
+#include "timevar.h"
+#include "cfgloop.h"
+#include "gimple.h"
+#include "tree-affine.h"
+#include "tree-pretty-print.h"
+#include "dbgcnt.h"
+#include "alloc-pool.h"
+
+/* Forward references.  */
+static bool avail_addr_cse (gimple);
+
+/* Hash table for remembering expressions that contribute to address
+   calculations in the current block and its dominators.  */
+static htab_t avail_addr_table;
+
+/* Expression entry.  Expressions are allocated from an allocation pool.
+   They are addressed both out of the hash table and from vectors of
+   locally available expressions for the current block and its
+   dominators.  Expressions are freed from the allocation pool when
+   processing for the block where they were first seen completes.  */
+typedef struct avail_addr_expr_d
+{
+  /* The expression consists of an operation and 1 or 2 operands.  */
+  enum tree_code operation;
+  tree operand1;
+  tree operand2;
+
+  /* Its representative SSA name.  */
+  tree repr;
+
+  /* Cached hash code.  */
+  hashval_t hashcode;
+} avail_addr_expr, *avail_addr_expr_t;
+
+typedef const struct avail_addr_expr_d *const_avail_addr_expr_t;
+
+/* Allocation pool for expression entries.  */
+static alloc_pool avail_addr_pool;
+
+/* Define a vector of expression entries.  */
+DEF_VEC_P (avail_addr_expr_t);
+DEF_VEC_ALLOC_P (avail_addr_expr_t, heap);
+
+/* Auxiliary information for each basic block for this pass.  */
+typedef struct tree_lower_addr_bb_aux_d
+{
+  /* Expressions made available by this block.  */
+  VEC (avail_addr_expr_t, heap) *avail_exprs;
+} tree_lower_addr_bb_aux, *tree_lower_addr_bb_aux_t;
+
+#define AVAIL_EXPRS(BB) ((tree_lower_addr_bb_aux_t) ((BB)->aux))->avail_exprs
+
+/* Basic block currently being processed.  */
+static basic_block curr_bb;
+
+/* Memory references in the current block that have zero offsets.  */
+static VEC (gimple, heap) *zero_offset_refs;
+
+/* Flag indicating that the CFG needs to be cleaned up afterwards.  */
+static bool clean_cfg;
+
+
+/* Callback to produce a hash value for an expression.  */
+
+static hashval_t
+avail_addr_hash (const void *p)
+{
+  const_avail_addr_expr_t const expr = (const_avail_addr_expr_t) p;
+  return expr->hashcode;
+}
+
+/* Callback when an element is removed from the hash table.  An available
+   expression is deleted from the hash table after its generating block
+   and its dominated blocks have been processed, resulting in a call to
+   avail_addr_free to remove it from the allocation pool.  */
+
+static void
+avail_addr_free (void *p)
+{
+  pool_free (avail_addr_pool, p);
+}
+
+/* Return true if two leaf nodes are equal.  */
+
+static int
+leaves_eq (tree opnd1, tree opnd2)
+{
+  /* Null nodes are vacuously equal.  */
+  if (!opnd1 && !opnd2)
+    return true;
+
+  if (!opnd1 || !opnd2)
+    return false;
+
+  if (TREE_CODE (opnd1) != TREE_CODE (opnd2))
+    return false;
+
+  /* Integer type nodes appear as the second RHS operand for a
+     CONVERT_EXPR in available expressions, indicating the type of 
+     the LHS they will be assigned to.  These must match exactly.  */
+  if (TYPE_P (opnd1))
+    return (TYPE_UNSIGNED (opnd1) == TYPE_UNSIGNED (opnd2)
+	    && TYPE_PRECISION (opnd1) == TYPE_PRECISION (opnd2));
+
+  return operand_equal_p (opnd1, opnd2, 0);
+}
+
+/* Callback for hashtab support.  Return true if two expressions are
+   equal.  We are only interested in simple expressions involving SSA
+   names, addresses of vars or parms, integer constants, and mathematical
+   operations that can reasonably appear in addressing expressions.  */
+
+static int
+avail_addr_eq (const void *p1, const void *p2)
+{
+  const_avail_addr_expr_t const entry1 = (const_avail_addr_expr_t) p1;
+  const_avail_addr_expr_t const entry2 = (const_avail_addr_expr_t) p2;
+  tree left_opnd1 = entry1->operand1;
+  tree left_opnd2 = entry2->operand1;
+  tree right_opnd1 = entry1->operand2; /* NULL_TREE for unary ops.  */
+  tree right_opnd2 = entry2->operand2; /* NULL_TREE for unary ops.  */
+
+  if (entry1->operation != entry2->operation)
+    return false;
+
+  if (TREE_CODE (left_opnd1) != TREE_CODE (left_opnd2))
+    return false;
+
+  if (entry1->operation == SSA_NAME || entry1->operation == NOP_EXPR)
+    return operand_equal_p (left_opnd1, left_opnd2, 0);
+
+  if (right_opnd1 && right_opnd2 &&
+      TREE_CODE (right_opnd1) != TREE_CODE (right_opnd2))
+    return false;
+
+  if (leaves_eq (left_opnd1, left_opnd2) &&
+      leaves_eq (right_opnd1, right_opnd2))
+    return true;
+
+  /* Handle commutative operations.  */
+  if (!commutative_tree_code (entry1->operation))
+    return false;
+
+  return (leaves_eq (left_opnd1, right_opnd2) &&
+	  leaves_eq (right_opnd1, left_opnd2));
+}
+
+/* Provide a hash code for SSA_NAME, INTEGER_CST, type, and ADDR_EXPR of
+   var/parm operands.  */
+
+static hashval_t
+hash_simple_op (tree operand)
+{
+  if (TREE_CODE (operand) == SSA_NAME)
+    return DECL_UID (SSA_NAME_VAR (operand)) * SSA_NAME_VERSION (operand);
+  else if (TREE_CODE (operand) == INTEGER_CST)
+    return (hashval_t)TREE_INT_CST_LOW (operand);
+  else if (TREE_CODE (operand) == ADDR_EXPR)
+    {
+      tree subopnd = TREE_OPERAND (operand, 0);
+      gcc_assert (TREE_CODE (subopnd) == VAR_DECL ||
+		  TREE_CODE (subopnd) == PARM_DECL);
+      return DECL_UID (subopnd);
+    }
+  else if (TYPE_P (operand))
+    return TYPE_PRECISION (operand);
+
+  gcc_unreachable ();
+  return 0;
+}
+
+/* Determine whether hash_simple_op can be called on OPERAND.  */
+
+static bool
+valid_simple_op (tree operand)
+{
+  if (TREE_CODE (operand) == SSA_NAME
+      || TREE_CODE (operand) == INTEGER_CST
+      || TYPE_P (operand))
+    return true;
+
+  if (TREE_CODE (operand) == ADDR_EXPR)
+    {
+      tree subopnd = TREE_OPERAND (operand, 0);
+      if (TREE_CODE (subopnd) == VAR_DECL || TREE_CODE (subopnd) == PARM_DECL)
+	return true;
+    }
+
+  return false;
+}
+
+/* Allocate a new avail_addr_expr from the pool and fill it in.  */
+
+static avail_addr_expr_t
+alloc_avail_addr_expr (enum tree_code operation_in, tree operand1_in,
+		       tree operand2_in, tree repr_in,
+		       hashval_t hashcode_in)
+{
+  avail_addr_expr_t entry = (avail_addr_expr_t)pool_alloc (avail_addr_pool);
+
+  entry->operation = operation_in;
+  entry->operand1  = operand1_in;
+  entry->operand2  = operand2_in;
+  entry->repr      = repr_in;
+  entry->hashcode  = hashcode_in;
+
+  return entry;
+}
+
+/* Free all available expressions generated locally in BB.  */
+
+static void
+free_avail_exprs (basic_block bb)
+{
+  unsigned i, length = VEC_length (avail_addr_expr_t, AVAIL_EXPRS (bb));
+  avail_addr_expr_t entry;
+  void **slot;
+
+  for (i = 0; i < length; i++)
+    {
+      entry = VEC_index (avail_addr_expr_t, AVAIL_EXPRS (bb), i);
+      slot = htab_find_slot_with_hash (avail_addr_table, entry,
+				       entry->hashcode, NO_INSERT);
+      /* This removes the entry from the hash table, and indirectly
+	 removes it from the allocation pool via the callback to
+	 avail_addr_free.  */
+      htab_clear_slot (avail_addr_table, slot);
+    }
+
+  /* Reclaim the available expressions vector for this block.  */
+  VEC_free (avail_addr_expr_t, heap, AVAIL_EXPRS (bb));
+}
+
+/* Look up NAME in the hash table and, if found, return the entry of its
+   representative.  Otherwise make it its own representative and return
+   that entry.  */
+
+static avail_addr_expr_t
+avail_addr_ssa_name_cse (tree name)
+{
+  avail_addr_expr_t new_entry =
+    alloc_avail_addr_expr (SSA_NAME, name, NULL_TREE, name,
+			   hash_simple_op (name));
+
+  void **slot = htab_find_slot_with_hash (avail_addr_table, new_entry,
+					  new_entry->hashcode, INSERT);
+  if (!*slot)
+    {
+      *slot = new_entry;
+      VEC_safe_push (avail_addr_expr_t, heap,
+		     AVAIL_EXPRS (curr_bb), new_entry);
+    }
+
+  return (avail_addr_expr_t)*slot;
+}
+
+/* The given RHS was found on a code-free type conversion.  Look it up
+   in the hash table; if it wasn't there previously, make LHS its
+   representative.  Otherwise keep the existing representative.  Return
+   a pointer to its hash table entry.  */
+
+static avail_addr_expr_t
+avail_addr_nop_expr_cse (tree lhs, tree rhs)
+{
+  avail_addr_expr_t new_entry =
+    alloc_avail_addr_expr (NOP_EXPR, rhs, NULL_TREE, lhs,
+			   hash_simple_op (rhs));
+  void **slot = htab_find_slot_with_hash (avail_addr_table, new_entry,
+					  new_entry->hashcode, INSERT);
+  if (!*slot)
+    {
+      *slot = new_entry;
+      VEC_safe_push (avail_addr_expr_t, heap, AVAIL_EXPRS (curr_bb), new_entry);
+    }
+
+  return (avail_addr_expr_t)*slot;
+}
+
+/* The given RHS was found on an integer type conversion that will generate
+   code.  Look it up in the hash table using the type of the LHS; if it
+   wasn't there previously, make LHS its representative.  Otherwise keep the
+   existing representative.  Return a pointer to its hash table entry.  */
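+
+/* Illustrative example (hypothetical SSA names):  given
+
+     D.2001_5 = (long unsigned int) i_3;
+     ...
+     D.2002_9 = (long unsigned int) i_3;
+
+   the first statement records <CONVERT_EXPR, i_3, long unsigned int>
+   with representative D.2001_5, so the caller can rewrite the second
+   statement as D.2002_9 = D.2001_5.  */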
+
+static avail_addr_expr_t
+avail_addr_convert_expr_cse (tree lhs, tree rhs)
+{
+  hashval_t hash = hash_simple_op (rhs) ^ hash_simple_op (TREE_TYPE (lhs));
+  avail_addr_expr_t new_entry =
+    alloc_avail_addr_expr (CONVERT_EXPR, rhs, TREE_TYPE (lhs), lhs, hash);
+  void **slot = htab_find_slot_with_hash (avail_addr_table, new_entry,
+					  hash, INSERT);
+  if (!*slot)
+    {
+      *slot = new_entry;
+      VEC_safe_push (avail_addr_expr_t, heap, AVAIL_EXPRS (curr_bb), new_entry);
+    }
+
+  return (avail_addr_expr_t)*slot;
+}
+
+/* Perform CSE on OPND, possibly replacing it with a representative
+   SSA name.  Returns true iff the tree rooted at *OPND has been
+   modified.  */
+
+static bool
+avail_addr_expr_cse (tree *opnd)
+{
+  enum tree_code code;
+  avail_addr_expr_t entry_ptr;
+  unsigned int opnd_idx;
+  bool changed = false;
+
+  if (!*opnd)
+    return false;
+
+  code = TREE_CODE (*opnd);
+
+  switch (code)
+    {
+    case SSA_NAME:
+      entry_ptr = avail_addr_ssa_name_cse (*opnd);
+      if (!operand_equal_p (*opnd, entry_ptr->repr, 0))
+	{
+	  *opnd = entry_ptr->repr;
+	  return true;
+	}
+      return false;
+
+    case INTEGER_CST:
+      return false;
+
+    default:
+      for (opnd_idx = 0; opnd_idx < TREE_CODE_LENGTH (code); opnd_idx++)
+	changed = avail_addr_expr_cse (&TREE_OPERAND (*opnd, opnd_idx))
+	  || changed;
+    }
+  return changed;
+}
+
+/* If STMT is some sort of copy operation, attempt CSE on it and return
+   TRUE; else return FALSE.  Set *CHANGED to TRUE iff STMT is modified
+   by performing CSE.  */
+
+static bool
+avail_addr_cse_copy (gimple stmt, tree lhs, bool *changed)
+{
+  tree rhs, *ops;
+  avail_addr_expr_t rhs_entry, new_entry;
+  void **slot;
+
+  /* If copying one SSA name to another, first find the representative
+     of the RHS, and then make that the representative of the LHS.  */
+  if (gimple_assign_ssa_name_copy_p (stmt))
+    {
+      rhs = gimple_assign_rhs1 (stmt);
+
+      /* We're only interested in simple SSA names, such as might have
+	 been generated during address expansion.  Ignore complex
+	 memory references.  */
+      if (TREE_CODE (rhs) == COMPONENT_REF
+	  || TREE_CODE (rhs) == ARRAY_REF
+	  || TREE_CODE (rhs) == MEM_REF
+	  || TREE_CODE (rhs) == TARGET_MEM_REF)
+	{
+	  *changed = false;
+	  return true;
+	}
+
+      rhs_entry = avail_addr_ssa_name_cse (rhs);
+      new_entry = alloc_avail_addr_expr (SSA_NAME, lhs, NULL_TREE,
+					 rhs_entry->repr,
+					 hash_simple_op (lhs));
+      slot = htab_find_slot_with_hash (avail_addr_table, new_entry,
+				       new_entry->hashcode, INSERT);
+      gcc_assert (!*slot);
+      *slot = new_entry;
+      VEC_safe_push (avail_addr_expr_t, heap,
+		     AVAIL_EXPRS (curr_bb), new_entry);
+      *changed = false;
+      return true;
+    }
+
+  /* Simple converts that generate no code are treated slightly
+     differently.  Even if the RHS is a simple variable, the conversion
+     means we can't propagate it directly.  Track these as NOP_EXPRs
+     in the hash table.  The first occurrence of a NOP_EXPR pattern
+     will receive its LHS SSA_NAME as its representative.  */
+  if (gimple_assign_unary_nop_p (stmt))
+    {
+      ops = gimple_ops (stmt);
+      rhs = ops[1];
+
+      if (TREE_CODE (lhs) != SSA_NAME || TREE_CODE (rhs) != SSA_NAME)
+	{
+	  *changed = false;
+	  return true;
+	}
+
+      /* First do any replacement on the SSA_NAME of the RHS.  */
+      *changed = avail_addr_expr_cse (&rhs);
+      
+      /* Now CSE the conversion, tracking it as a NOP_EXPR.  */
+      rhs_entry = avail_addr_nop_expr_cse (lhs, rhs);
+
+      if (rhs_entry->repr != lhs)
+	{
+	  gimple_set_op (stmt, 1, rhs_entry->repr);
+	  if (TREE_TYPE (lhs) == TREE_TYPE (rhs_entry->repr))
+	    gimple_assign_set_rhs_code (stmt, SSA_NAME);
+	  (void)avail_addr_cse (stmt);
+	  *changed = true;
+	}
+      return true;
+    }
+
+  /* Conversions that require code generation are recorded in the hash
+     table with two operands:  the SSA name of the RHS, and the tree
+     type of the LHS.  The first occurrence of a CONVERT_EXPR pattern
+     will receive its LHS SSA_NAME as its representative.  */
+  if (gimple_assign_conversion_p (stmt))
+    {
+      ops = gimple_ops (stmt);
+      rhs = ops[1];
+
+      if (TREE_CODE (lhs) != SSA_NAME
+	  || TREE_CODE (rhs) != SSA_NAME
+	  || TREE_CODE (TREE_TYPE (lhs)) != INTEGER_TYPE)
+	{
+	  *changed = false;
+	  return true;
+	}
+
+      /* First do any replacement on the SSA_NAME of the RHS.  */
+      *changed = avail_addr_expr_cse (&rhs);
+
+      /* Now CSE the conversion, tracking it as a CONVERT_EXPR.  */
+      rhs_entry = avail_addr_convert_expr_cse (lhs, rhs);
+
+      if (rhs_entry->repr != lhs)
+	{
+	  gimple_set_op (stmt, 1, rhs_entry->repr);
+	  gimple_assign_set_rhs_code (stmt, SSA_NAME);
+	  (void)avail_addr_cse (stmt);
+	  *changed = true;
+	}
+      return true;
+    }
+
+  /* If it's some other form of copy to a simple SSA_NAME, make the
+     LHS its own representative.  The RHS will be a constant or
+     something equally simple that doesn't need to be examined.
+     If the LHS is not simple, perform CSE substitution on its
+     suboperands.  */
+  if (gimple_assign_copy_p (stmt))
+    {
+      if (TREE_CODE (lhs) == SSA_NAME)
+	{
+	  (void)avail_addr_ssa_name_cse (lhs);
+	  *changed = false;
+	}
+      else
+	*changed = avail_addr_expr_cse (&lhs);
+      return true;
+    }
+
+  /* It's not a copy, so defer to avail_addr_cse_noncopy.  */
+  return false;
+}
+
+/* Return TRUE iff CODE may reasonably appear on the RHS of a non-copy
+   statement used to construct an address.  This includes codes that
+   may be used to calculate an array index.  */
+static bool
+is_relevant_rhs_code (enum tree_code code)
+{
+  switch (code)
+    {
+    case PLUS_EXPR:
+    case POINTER_PLUS_EXPR:
+    case MINUS_EXPR:
+    case MULT_EXPR:
+    case EXACT_DIV_EXPR:
+    case TRUNC_DIV_EXPR:
+    case CEIL_DIV_EXPR:
+    case FLOOR_DIV_EXPR:
+    case ROUND_DIV_EXPR:
+    case TRUNC_MOD_EXPR:
+    case CEIL_MOD_EXPR:
+    case FLOOR_MOD_EXPR:
+    case ROUND_MOD_EXPR:
+    case NEGATE_EXPR:
+    case MIN_EXPR:
+    case MAX_EXPR:
+    case ABS_EXPR:
+    case LSHIFT_EXPR:
+    case RSHIFT_EXPR:
+    case BIT_IOR_EXPR:
+    case BIT_XOR_EXPR:
+    case BIT_AND_EXPR:
+    case BIT_NOT_EXPR:
+      return true;
+    default:
+      break;
+    }
+  return false;
+}
+
+/* Attempt CSE on STMT, which is not some form of copy.  Return TRUE 
+   iff STMT has been modified as a result.  */
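+
+/* Illustrative example (hypothetical SSA names):  once
+   D.2154_8 = n_2(D) * 4 has been recorded, a later occurrence of
+   D.2199_9 = n_2(D) * 4 is rewritten as the copy D.2199_9 = D.2154_8
+   and reprocessed so that D.2199_9 takes D.2154_8 as its
+   representative.  */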
+
+static bool
+avail_addr_cse_noncopy (gimple stmt, tree lhs)
+{
+  tree *ops;
+  unsigned i, num_ops, initial_i = 0, imax;
+  bool changed = false;
+  enum tree_code subcode, code[2];
+  hashval_t hash;
+  avail_addr_expr_t new_entry;
+  void **slot;
+
+  /* Walk each tree on the RHS looking for operands that are eligible to
+     be in the CSE table, replacing them with their representatives.
+     If LHS is not an SSA_NAME, walk its operands, as well.  */
+  ops = gimple_ops (stmt);
+  num_ops = gimple_num_ops (stmt);
+
+  if (TREE_CODE (lhs) == SSA_NAME)
+    initial_i = 1;
+
+  for (i = initial_i; i < num_ops; i++)
+    changed = avail_addr_expr_cse (&ops[i]) || changed;
+
+  /* Now the operands are canonicalized, so try to match the expression
+     in the hash table.  */
+  subcode = gimple_assign_rhs_code (stmt);
+
+  if (is_relevant_rhs_code (subcode))
+    {
+      if (TREE_CODE_CLASS (subcode) == tcc_unary)
+	imax = 1;
+      else
+	imax = 2;
+
+      /* Take a quick exit if this is obviously not part of an address
+	 expression.  */
+      for (i = 0; i < imax; i++)
+	{
+	  code[i] = TREE_CODE (ops[i+1]);
+	  if (code[i] == REAL_CST || code[i] == FIXED_CST
+	      || code[i] == COMPLEX_CST || code[i] == VECTOR_CST)
+	    return changed;
+
+          /* Same for complex expressions that haven't been reduced.  */
+	  if (code[i] == ADDR_EXPR)
+	    {
+	      enum tree_code subcode1 = TREE_CODE (TREE_OPERAND (ops[i+1], 0));
+	      if (subcode1 != VAR_DECL && subcode1 != PARM_DECL)
+		return changed;
+	    }
+	}
+
+      /* If all suboperands are simple, look up the expression in the
+	 hash table.  */
+      if (!valid_simple_op (ops[1]))
+	return changed;
+
+      hash = subcode ^ hash_simple_op (ops[1]);
+
+      if (TREE_CODE_CLASS (subcode) == tcc_unary)
+	new_entry = alloc_avail_addr_expr (subcode, ops[1], NULL_TREE,
+					   lhs, hash);
+      else
+	{
+	  if (!valid_simple_op (ops[2]))
+	    return changed;
+
+	  hash ^= hash_simple_op (ops[2]);
+	  new_entry = alloc_avail_addr_expr (subcode, ops[1], ops[2],
+					     lhs, hash);
+	}
+
+      slot = htab_find_slot_with_hash (avail_addr_table, new_entry,
+				       new_entry->hashcode, INSERT);
+
+      /* If we've seen it before, attempt to change this statement to an
+	 SSA_NAME with the representative appearing as RHS1.  If the
+	 types of the LHS and the representative differ, attempt to use
+	 a NOP_EXPR for some simple cases.  Otherwise leave the RHS
+	 alone.  */
+      if (*slot)
+	{
+	  avail_addr_expr_t cached_entry_ptr = (avail_addr_expr_t)*slot;
+	  tree lhs_type, repr_type;
+	  enum tree_code new_code;
+
+	  lhs_type = TREE_TYPE (lhs);
+	  repr_type = TREE_TYPE (cached_entry_ptr->repr);
+
+	  if (lhs_type == repr_type)
+	    new_code = SSA_NAME;
+
+	  else if ((INTEGRAL_TYPE_P (lhs_type)
+		    || POINTER_TYPE_P (lhs_type))
+		   && (INTEGRAL_TYPE_P (repr_type)
+		       || POINTER_TYPE_P (repr_type))
+		   && (TYPE_SIZE (lhs_type) == TYPE_SIZE (repr_type)))
+	    new_code = NOP_EXPR;
+
+	  else
+	    return changed;
+
+	  gimple_assign_set_rhs_code (stmt, new_code);
+	  gimple_set_num_ops (stmt, 2);
+	  gimple_set_op (stmt, 1, cached_entry_ptr->repr);
+	  
+	  /* Reprocess this statement so the LHS picks up the
+	     new RHS as its representative.  */
+	  (void)avail_addr_cse (stmt);
+	  changed = true;
+	}
+      else
+	{
+	  *slot = new_entry;
+	  VEC_safe_push (avail_addr_expr_t, heap,
+			 AVAIL_EXPRS (curr_bb), new_entry);
+	}
+    }
+
+  return changed;
+}
+
+/* Remember expressions that may contribute to address calculation,
+   and replace repeated occurrences with a representative SSA name.
+   Return true iff STMT has been modified.  */
+
+static bool
+avail_addr_cse (gimple stmt)
+{
+  tree lhs;
+  bool changed = false;
+
+  /* We are only interested in assignments.  */
+  if (!is_gimple_assign (stmt))
+    return false;
+
+  lhs = gimple_assign_lhs (stmt);
+
+  if (avail_addr_cse_copy (stmt, lhs, &changed))
+    return changed;
+
+  return avail_addr_cse_noncopy (stmt, lhs);
+
+}
+
+/* Return TRUE iff EXPR contains a direct reference to a field of a small,
+   nonvolatile structure (i.e., something we hope to keep in registers).  */
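+
+/* For example (illustrative only), a direct reference v.hi, where v has
+   type struct s { short hi; short lo; } and that type has mode SImode,
+   qualifies; a reference through a pointer, such as p->hi, does not,
+   because a MEM_REF appears in its base expression.  */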
+static bool
+is_small_struct_reference (tree expr)
+{
+  bool ref_found = false;
+
+  while (expr) 
+    {
+      if (TREE_THIS_VOLATILE (expr))
+	return false;
+
+      switch (TREE_CODE (expr))
+	{
+	case MEM_REF:
+	case TARGET_MEM_REF:
+	  /* Indirection is present; not what we're looking for.  */
+	  return false;
+
+	case COMPONENT_REF:
+	  if (TREE_CODE (TREE_OPERAND (expr, 1)) != FIELD_DECL)
+	    return false;
+	  if (TYPE_MODE (TREE_TYPE (TREE_OPERAND (expr, 0))) == SImode)
+	    ref_found = true;
+	  /* We found a small struct ref, but look down the base expression
+	     for indirection before returning TRUE.  */
+	  break;
+
+	default:
+	  break;
+	}
+      if (TREE_OPERAND_LENGTH (expr) > 0)
+	expr = TREE_OPERAND (expr, 0);
+      else
+	expr = NULL_TREE;
+    }
+
+  return ref_found;
+}
+
+/* Return TRUE iff any of the elements of AFF contain an ADDR_EXPR of
+   a VAR_DECL, PARM_DECL, or RESULT_DECL that is not already marked
+   as addressable.  */
+static bool
+aff_comb_causes_addr_taken (aff_tree aff)
+{
+  unsigned i;
+
+  for (i = 0; i < aff.n; i++)
+    {
+      tree expr = aff.elts[i].val;
+
+      while (expr)
+	{
+	  if (TREE_CODE (expr) == ADDR_EXPR)
+	    {
+	      tree base = TREE_OPERAND (expr, 0);
+	      enum tree_code code = TREE_CODE (base);
+	      if ((code == VAR_DECL || code == PARM_DECL || code == RESULT_DECL)
+		  && (! TREE_ADDRESSABLE (base)))
+		return true;
+	    }
+	  if (TREE_OPERAND_LENGTH (expr) > 0)
+	    expr = TREE_OPERAND (expr, 0);
+	  else
+	    expr = NULL_TREE;
+	}
+    }
+
+  return false;
+}
+
+/* Return TRUE iff the affine combination of trees AFF relies on an update
+   of an induction variable.  If this is the case, AFF's offset will be
+   nonzero, and one of its elements will reference a variable that is
+   directly or indirectly defined by a phi expression.  That phi expression
+   will occur in a loop header.  */
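+
+/* Illustrative example:  for an access such as a[i_1 + 1] inside a loop
+   whose header defines i_1 = PHI <0(2), i_7(4)>, the expanded affine
+   combination has a nonzero offset and an element whose value is i_1,
+   so it is left alone; IVOPTS has already had its chance at it.  */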
+static bool
+aff_comb_relies_on_ivar_update (aff_tree aff)
+{
+  unsigned i;
+
+  if (double_int_zero_p (aff.offset))
+    return false;
+
+  for (i = 0; i < aff.n; i++)
+    {
+      tree expr = aff.elts[i].val;
+
+      if (TREE_CODE (expr) != SSA_NAME)
+	continue;
+
+      while (expr)
+	{
+	  gimple def_stmt = SSA_NAME_DEF_STMT (expr);
+
+	  if (gimple_code (def_stmt) == GIMPLE_PHI)
+	    {
+	      basic_block bb = gimple_bb (def_stmt);
+	      return (bb == bb->loop_father->header);
+	    }
+
+	  if (gimple_assign_ssa_name_copy_p (def_stmt) 
+	      || gimple_assign_unary_nop_p (def_stmt))
+	    {
+	      expr = gimple_assign_rhs1 (def_stmt);
+	      if (TREE_CODE (expr) != SSA_NAME)
+		expr = NULL_TREE;
+	    }
+	  else
+	    expr = NULL_TREE;
+	}
+    }
+
+  return false;
+}
+
+/* Examine AFF to pick a component to use as the base hint to
+   create_mem_ref.  */
+
+static tree
+choose_base_hint (aff_tree *aff)
+{
+  unsigned i;
+
+  /* Pick the first tree with pointer type.  If none exists,
+     pick the first tree that is either an SSA_NAME, NOP_EXPR, or
+     CONVERT_EXPR.  The latter two can exist because, when a 
+     cast loses precision, the affine tree expansion stops.  */
+  for (i = 0; i < aff->n; i++)
+    if (POINTER_TYPE_P (TREE_TYPE (aff->elts[i].val)))
+      return aff->elts[i].val;
+
+  for (i = 0; i < aff->n; i++)
+    if (TREE_CODE (aff->elts[i].val) == SSA_NAME
+	|| CONVERT_EXPR_CODE_P (TREE_CODE (aff->elts[i].val)))
+      return aff->elts[i].val;
+  
+  /* If all else fails, default to the first element.  */
+  return aff->elts[0].val;
+}
+
+/* Return TRUE iff EXPR might be a hidden global store as defined by
+   the dead code eliminator.  See tree-ssa-sink.c:is_hidden_global_store.  */
+
+static bool
+may_be_global_store (tree expr)
+{
+  if (REFERENCE_CLASS_P (expr))
+    expr = get_base_address (expr);
+
+  /* If the base address couldn't be obtained, don't perform the 
+     replacement.  This shouldn't be possible if we've gotten this far.  */
+  if (expr == NULL_TREE)
+    return false;
+
+  if (DECL_P (expr) && !is_global_var (expr))
+    return false;
+
+  if (INDIRECT_REF_P (expr)
+      || TREE_CODE (expr) == MEM_REF
+      || TREE_CODE (expr) == TARGET_MEM_REF)
+    return ptr_deref_may_alias_global_p (TREE_OPERAND (expr, 0));
+
+  if (CONSTANT_CLASS_P (expr))
+    return true;
+      
+  return false;
+}
+
+/* We are considering replacing ORIG with REPL.  Return TRUE iff
+   ORIG will be recognized as a hidden global store by the dead
+   code eliminator, but REPL will not.  */
+/* In principle this check should not be necessary, and it is harmless.
+   Without it, however, the test in libstdc++-v3/testsuite/23_containers/
+   vector/ext_pointer_modifiers/insert.cc fails.  This happens because
+   the affine expansion simplifies some complex addressing sufficiently
+   that the entire procedure uninitialized_fill_n_a is proven to have
+   no side effects.  Examining the gimple prior to this phase shows
+   that it indeed has no side effects at that point, so there may be
+   a front-end problem or an upstream optimization problem here.  */
+
+static bool
+global_store_may_be_lost (tree orig, tree repl)
+{
+  return (may_be_global_store (orig) && !may_be_global_store (repl));
+}
+
+/* Expose target addressing modes in EXPR.  Logic extracted and
+   simplified from tree-ssa-loop-ivopts.c to handle non-ivar
+   opportunities.  IS_VDEF_STORE is true iff EXPR is the lhs of
+   a virtual definition.  */
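+
+/* As a rough illustration (the exact form depends on the target and on
+   create_mem_ref), a load such as
+
+     x_5 = p_1(D)->a[i_2].f;
+
+   may have its address arithmetic exposed as
+
+     D.2101_7 = i_2 * 16;
+     x_5 = MEM[base: p_1(D), index: D.2101_7, offset: 8B];
+
+   where D.2101_7 and the constants are purely illustrative.  */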
+
+static bool
+tree_ssa_lower_addr_tree (tree *expr, gimple_stmt_iterator *gsi,
+			  bool speed, bool is_vdef_store)
+{
+  tree addr = *expr, mem_ref, base_hint;
+  aff_tree aff;
+  enum tree_code code = TREE_CODE (addr);
+  gimple_stmt_iterator save_gsi;
+  gimple save_stmt;
+
+  if (!REFERENCE_CLASS_P (addr))
+    return false;
+
+  /* TODO: Bitfields are not yet handled.  These will be lowered
+     separately to read-modify-write sequences in pass "memlower" in
+     gimple-low.c (work in progress by Richard Guenther).  It would
+     be preferable for that lowering to happen prior to this pass.
+     It is also possible that this pass should be merged with that one;
+     but the two passes may have different ideal positions in the phase
+     ordering.  */
+  if (code == BIT_FIELD_REF)
+    return false;
+  if (code == COMPONENT_REF && DECL_BIT_FIELD (TREE_OPERAND (addr, 1)))
+    return false;
+
+  /* If the expression contains a direct field reference from a small
+     nonvolatile structure, don't expose the addressing.  We assume such
+     things will remain in registers where possible.  */
+  if (is_small_struct_reference (*expr))
+    return false;
+
+  /* Express the address of the expression as an affine combination of
+     trees, looking through SSA defs.  If the expression may not be
+     addressable or is otherwise not a candidate, detect that and exit. */
+  if (!mem_tree_to_aff_combination_expand (&addr, &aff))
+    return false;
+
+  /* It is possible for aff to have no elements, in which case do
+     nothing.  Example: lhs = MEM[(struct foo *) 0B].bar does not
+     produce an affine combination.  */
+  if (!aff.n)
+    return false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Original expression = \n");
+      print_generic_expr (dump_file, *expr, TDF_VOPS|TDF_MEMSYMS);
+      fprintf (dump_file, "\n\naff_comb =\n");
+      print_aff (dump_file, &aff);
+      fprintf (dump_file, "\n");
+    }
+
+  /* If the affine combination contains any ADDR_EXPRs for decls that
+     are not yet marked as addressable, do not replace.  We don't want
+     to interfere with downstream optimizations, particularly on autos.  */
+  if (aff_comb_causes_addr_taken (aff))
+    return false;
+
+  /* We must take care when inside a loop to avoid calculating induction
+     expressions twice.  For example, an induction expression i = i + 1
+     may appear prior to an addressing expression based on i.  Assume
+     that IVOPTS will have already done the best possible job on such
+     addressing expressions, and leave them alone.  */
+  if (aff_comb_relies_on_ivar_update (aff))
+    return false;
+
+  /* Choose the best SSA name in the affine combination of trees
+     to coerce into the base position on the MEM_REF.  If this isn't
+     done, a base expression of zero may occur, which confuses the
+     may-aliasing in subsequent dead store elimination.  Furthermore,
+     if we choose badly, we can end up with a base expression that
+     points below the beginning of the object, which produces suboptimal
+     code.  */
+  base_hint = choose_base_hint (&aff);
+
+  /* Copy the current statement iterator, and memorize the position
+     one statement previous.  */
+  save_stmt = gsi_stmt (*gsi);
+  save_gsi = *gsi;
+  gsi_prev (&save_gsi);
+
+  /* Replace the input expression with an equivalent memory reference
+     legal on the target machine.  As a side effect, feeding expressions
+     may be generated prior to GSI.  */
+  mem_ref = create_mem_ref (gsi, TREE_TYPE (*expr), &aff, 
+			    reference_alias_ptr_type (*expr),
+			    NULL_TREE, base_hint, speed);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\nmem_ref = ");
+      print_generic_expr (dump_file, mem_ref, TDF_VOPS|TDF_MEMSYMS);
+      fprintf (dump_file, "\n\n");
+    }
+
+  /* Ensure the memory reference was not built with a base expression
+     of zero, which confuses the may-aliasing in subsequent dead
+     store elimination.  This can happen with questionable coding
+     practices, such as treating an integer parameter as a pointer.
+     TODO: create_mem_ref should perhaps be modified to avoid
+     producing a zero base.  */
+  if (integer_zerop (TREE_OPERAND (mem_ref, 0)))
+    return false;
+
+  /* If the original expression is a mem_ref that will be recognized
+     as a hidden global store (see tree-ssa-sink.c:is_hidden_global_store),
+     but the replacement will not be so recognized, the replacement
+     is unsafe.  The statement may not be marked as always-necessary
+     in the dead code eliminator.  */
+  if (is_vdef_store && global_store_may_be_lost (*expr, mem_ref))
+    return false;
+
+  /* Chicken switch for debug.  Use -fdbg-cnt=lower_addr:N to disable
+     after N replacements.  */
+  if (!dbg_cnt (lower_addr))
+    return false;
+
+  /* We're going to use the replacement.  CSE any new gimple stmts
+     that were added, since we are likely exposing redundant
+     computations.  */
+  if (gsi_end_p (save_gsi))
+    save_gsi = gsi_start (gsi->seq);
+  else
+    gsi_next (&save_gsi);
+  for (; gsi_stmt (save_gsi) != save_stmt; gsi_next (&save_gsi))
+    (void)avail_addr_cse (gsi_stmt (save_gsi));
+
+  /* Copy reference data from the replaced expression.  */
+  copy_ref_info (mem_ref, *expr);
+
+  /* Finish the replacement.  */
+  *expr = mem_ref;
+  
+  return true;
+}
+
+/* Return TRUE iff EXPR is a memory reference with a zero offset,
+   a zero step, a non-zero index, and a zero index2.  */
+static bool
+is_zero_offset_ref (tree expr)
+{
+  tree offset, index, step, index2;
+
+  /* A MEM_REF has only a base and an offset, so it doesn't apply.  */
+  if (TREE_CODE (expr) != TARGET_MEM_REF)
+    return false;
+
+  offset = TREE_OPERAND (expr, 1);
+
+  if (!offset
+      || TREE_CODE (offset) != INTEGER_CST
+      || !double_int_zero_p (TREE_INT_CST (offset)))
+    return false;
+
+  index = TREE_OPERAND (expr, 2);
+
+  if (!index || (TREE_CODE (index) == INTEGER_CST
+		 && double_int_zero_p (TREE_INT_CST (index))))
+    return false;
+
+  step = TREE_OPERAND (expr, 3);
+  
+  if (step && (TREE_CODE (step) != INTEGER_CST
+	       || !double_int_zero_p (TREE_INT_CST (step))))
+    return false;
+
+  index2 = TREE_OPERAND (expr, 4);
+
+  if (index2 && (TREE_CODE (index2) != INTEGER_CST
+		 || !double_int_zero_p (TREE_INT_CST (index2))))
+    return false;
+
+  return true;
+}
+
+/* Expose target addressing modes in STMT, which must be an assignment.  */
+
+static bool
+tree_ssa_lower_addr_stmt (gimple stmt, gimple_stmt_iterator *gsi, 
+			  basic_block bb, bool speed)
+{
+  tree *lhs, *rhs1;
+  bool changed, rhs_changed;
+
+  /* If this is the last statement of the block, it may have an associated
+     exception handling landing pad.  The original statement may appear
+     to possibly throw an exception, but after we simplify it this may no
+     longer be true.  */
+  bool orig_may_throw = stmt_could_throw_p (stmt)
+    && lookup_stmt_eh_lp (stmt) > 0;
+
+  lhs = gimple_assign_lhs_ptr (stmt);
+  changed = tree_ssa_lower_addr_tree (lhs, gsi, speed, 
+				      gimple_vdef (stmt) != NULL);
+
+  /* If the LHS now contains a simple target_mem_ref with zero offset,
+     record this.  */
+  if (changed && is_zero_offset_ref (gimple_assign_lhs (stmt)))
+    VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
+
+  rhs1 = gimple_assign_rhs1_ptr (stmt);
+  rhs_changed = tree_ssa_lower_addr_tree (rhs1, gsi, speed, false);
+
+  /* Record zero-offset mem_refs on the RHS.  */
+  if (rhs_changed && is_zero_offset_ref (gimple_assign_rhs1 (stmt)))
+    VEC_safe_push (gimple, heap, zero_offset_refs, stmt);
+
+  changed = changed || rhs_changed;
+
+  /* If the original statement was in the landing pad table, but the
+     revised one can't signal an exception, remove it from the table.
+     We may have to clean up the CFG afterwards.  */
+  if (changed
+      && orig_may_throw
+      && maybe_clean_eh_stmt (stmt)
+      && gimple_purge_dead_eh_edges (bb))
+    clean_cfg = true;
+
+  return changed;
+}
+
+/* Given that EXPR in STMT is a TARGET_MEM_REF with zero offset, zero
+   step, non-zero index, and zero index2, determine whether the sum of
+   EXPR's base and index components is available in the CSE table.  If
+   so, convert EXPR to use that sum as its base component and eliminate
+   the index component.  If the sum is defined following STMT, move the
+   defining statement ahead of STMT.  This is safe since the base and
+   index components must be available.  */
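+
+/* For instance, in the example shown in tree_ssa_lower_addr_block below,
+   with the sum D.2156_10 = p_1(D) + D.2154_8 available, the reference
+   MEM[base: p_1(D), index: D.2154_8, offset: 0B] can be rewritten as
+   MEM[(struct x *)D.2156_10 + 0B], reusing an add that is emitted
+   anyway.  */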
+
+static void
+process_zero_offset_ref (tree expr, gimple stmt)
+{
+  tree base = TREE_OPERAND (expr, 0);
+  tree index = TREE_OPERAND (expr, 2);
+  avail_addr_expr entry;
+  hashval_t hash;
+  void **slot;
+  avail_addr_expr_t sum_entry;
+  gimple_stmt_iterator gsi, gsi2;
+  gimple repr_def_stmt;
+  bool use_seen, def_seen;
+
+  /* Base and index need to take certain simple forms.  If they do not,
+     skip this ref.  */
+  if (!valid_simple_op (base) || !valid_simple_op (index))
+    return;
+
+  /* Look up the sum in the hash table.  Try first as a POINTER_PLUS_EXPR,
+     then as a PLUS_EXPR.  */
+  entry.operation = POINTER_PLUS_EXPR;
+  entry.operand1 = base;
+  entry.operand2 = index;
+  entry.repr = NULL_TREE;
+  hash = hash_simple_op (base) ^ hash_simple_op (index);
+  entry.hashcode = POINTER_PLUS_EXPR ^ hash;
+
+  slot = htab_find_slot_with_hash (avail_addr_table, &entry,
+				   entry.hashcode, NO_INSERT);
+
+  if (!slot)
+    {
+      entry.operation = PLUS_EXPR;
+      entry.hashcode = PLUS_EXPR ^ hash;
+      slot = htab_find_slot_with_hash (avail_addr_table, &entry,
+				       entry.hashcode, NO_INSERT);
+    }
+
+  /* If the sum is not available, the index form is preferable.  */
+  if (!slot)
+    return;
+
+  /* Otherwise, set base = sum and turn this into a MEM_REF.
+     This causes the index operand to no longer be visible.  */
+  sum_entry = (avail_addr_expr_t)*slot;
+  TREE_OPERAND (expr, 0) = sum_entry->repr;
+  TREE_SET_CODE (expr, MEM_REF);
+
+  /* Find the definition of the sum.  */
+  repr_def_stmt = SSA_NAME_DEF_STMT (sum_entry->repr);
+
+  /* If it's defined in a different block, the sum is available from
+     a dominator.  */
+  if (gimple_bb (repr_def_stmt) != curr_bb)
+    return;
+
+  /* Otherwise, determine whether the defining statement appears after
+     EXPR.  If so, move it.  */
+  /* ???  In the worst case, this requires a full block scan for each
+     MEM_REF in the block for which the optimization is appropriate.
+     This is unlikely to be too bad in practice, but something more
+     clever may be called for.  */
+  use_seen = def_seen = false;
+
+  for (gsi = gsi_start_bb (curr_bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple cand = gsi_stmt (gsi);
+
+      if (cand == stmt)
+	{
+	  if (def_seen)
+	    return;
+	  use_seen = true;
+	  gsi2 = gsi;
+	}
+      else if (cand == repr_def_stmt)
+	{
+	  if (use_seen)
+	    {
+	      gsi_move_before (&gsi, &gsi2);
+	      return;
+	    }
+	  else
+	    return;
+	}
+    }
+
+  /* If we don't see both the def and the use in this block, something
+     is badly wrong.  */
+  gcc_unreachable ();
+}
+
+/* Examine the current basic block for opportunities to profitably coerce
+   TARGET_MEM_REFs with zero offsets into non-indexed forms.  */
+
+static void
+process_zero_offset_refs (void)
+{
+  unsigned i, length = VEC_length (gimple, zero_offset_refs);
+  gimple stmt;
+  tree lhs, rhs;
+
+  for (i = 0; i < length; i++)
+    {
+      stmt = VEC_index (gimple, zero_offset_refs, i);
+      lhs = gimple_assign_lhs (stmt);
+
+      if (is_zero_offset_ref (lhs))
+	process_zero_offset_ref (lhs, stmt);
+      else
+	{
+	  rhs = gimple_assign_rhs1 (stmt);
+	  gcc_assert (is_zero_offset_ref (rhs));
+	  process_zero_offset_ref (rhs, stmt);
+	}
+    }
+}
+
+/* Process the statements in block BB.  */
+
+static void
+tree_ssa_lower_addr_block (basic_block bb)
+{
+  gimple_stmt_iterator gsi;
+  basic_block son;
+  bool changed = false;
+  bool speed = optimize_bb_for_speed_p (bb);
+
+  curr_bb = bb;
+  AVAIL_EXPRS (bb) = VEC_alloc (avail_addr_expr_t, heap, 32);
+  VEC_truncate (gimple, zero_offset_refs, 0);
+
+  /* Walk the statements in the block, gathering CSE data and making
+     address replacements as opportunities arise.  */
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple stmt = gsi_stmt (gsi);
+
+      if (gimple_vuse (stmt) && gimple_assign_single_p (stmt))
+	changed = tree_ssa_lower_addr_stmt (stmt, &gsi, bb, speed) || changed;
+
+      /* Whether changed or not, gather potential information for CSE.  */
+      changed = avail_addr_cse (stmt) || changed;
+    }
+
+  /* Following is a fairly common pattern that results from address
+     lowering with CSE:
+
+      <bb 2>:
+        D.2154_8 = n_2(D) * 4;
+        D.2107_3 = MEM[base: p_1(D), index: D.2154_8, offset: 0B];
+	D.2156_10 = p_1(D) + D.2154_8;
+	D.2108_4 = MEM[(struct x *)D.2156_10 + 128B];
+	D.2109_5 = MEM[(struct x *)D.2156_10 + 64B];
+	foo (D.2107_3, D.2108_4, D.2109_5);
+	return;
+
+     When the addressing expression consists of two tree addends and
+     a non-zero offset, the addends are pre-added and used as the base
+     expression, with the offset becoming the immediate operand in the
+     storage op.  When the offset is zero, the base and index are
+     added implicitly in an indexed-form storage op.
+
+     The bias to the indexed form with a zero offset is good with no
+     knowledge of the surrounding code; it saves an instruction
+     and a dependency with a possible small increase in register
+     pressure.  As can be seen here, though, we end up with a missed
+     commoning opportunity, since the second instruction could become:
+
+       D.2107_3 = MEM[(struct x *)D.2156_10 + 0B];
+
+     Since the add into D.2156_10 is going to occur anyway, the 
+     advantage of the bias to indexed form vanishes.  Instead, it
+     leads to an increase in register pressure.  This is more noticeable
+     as expressions are CSEd across blocks.
+
+     Detect zero_offset memory references that can profitably be
+     converted into non-indexed forms.  */
+  process_zero_offset_refs ();
+
+  /* If anything changed in the block, the SSA operand and immediate-use
+     information may be stale, so update every statement in the block.  */
+  if (changed)
+    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+      update_stmt (gsi_stmt (gsi));
+
+  if (dump_file && changed && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Changed bb %d:\n", bb->index);
+      dump_bb (bb, dump_file, 0);
+      fprintf (dump_file, "\n");
+    }
+
+  /* Process dominated blocks.  */
+  for (son = first_dom_son (CDI_DOMINATORS, bb);
+       son;
+       son = next_dom_son (CDI_DOMINATORS, son))
+    tree_ssa_lower_addr_block (son);
+
+  /* We're done with this block, so remove expressions local to this
+     block from the allocation pool.  */
+  free_avail_exprs (bb);
+}
+
+/* Main entry point for exposing target addressing modes not caught
+   by tree-ssa-loop-ivopts.c.  */
+
+static unsigned int
+tree_ssa_lower_addr (void)
+{
+  /* A hash table of subexpressions that contribute to addressing
+     computation is maintained for the current block and all of its
+     dominators.  The entries addressed out of the hash table are
+     stored in an allocation pool.  At any given time, the allocation
+     pool likewise contains entries for the current block and its
+     dominators.  */
+  avail_addr_table = htab_create (500, avail_addr_hash,
+				  avail_addr_eq, avail_addr_free);
+  avail_addr_pool = create_alloc_pool ("avail_addr_pool",
+				       sizeof (avail_addr_expr),
+				       500);
+
+  /* Allocate auxiliary CFG information for this pass.  */
+  alloc_aux_for_blocks (sizeof (tree_lower_addr_bb_aux));
+
+  /* Allocate a vector to track memory references with zero offsets.  */
+  zero_offset_refs = VEC_alloc (gimple, heap, 32);
+
+  /* Track whether we purged any exception handling edges.  */
+  clean_cfg = false;
+
+  /* We need loop information for heuristics.  IVOPTS has already lowered
+     some addressing expressions in loops, and we don't want to step on
+     what's already been done.  */
+  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+
+  /* Walk the CFG in dominator order.  */
+  calculate_dominance_info (CDI_DOMINATORS);
+  tree_ssa_lower_addr_block (ENTRY_BLOCK_PTR);
+
+  /* Free resources.  */
+  loop_optimizer_finalize ();
+  htab_delete (avail_addr_table);
+  free_alloc_pool (avail_addr_pool);
+  free_aux_for_blocks ();
+  VEC_free (gimple, heap, zero_offset_refs);
+
+  if (clean_cfg)
+    return TODO_cleanup_cfg;
+
+  return 0;
+}
+
+/* Gating function for this pass.  Currently we disable the phase
+   if mudflap instrumentation is to be done.  Exposing target addresses
+   before mudflap confuses mudflap about the base addresses of arrays
+   and so forth.  */
+
+static bool
+gate_lower_addr (void)
+{
+  return (optimize && flag_lower_addr && !flag_mudflap);
+}
+
+struct gimple_opt_pass pass_tree_lower_addr =
+{
+ {
+  GIMPLE_PASS,
+  "loweraddr",				/* name  */
+  gate_lower_addr,			/* gate  */
+  tree_ssa_lower_addr,			/* execute  */
+  NULL,					/* sub  */
+  NULL,					/* next  */
+  0,					/* static_pass_number  */
+  TV_TREE_LOWER_ADDR,			/* tv_id  */
+  PROP_cfg,				/* properties_required  */
+  0,					/* properties_provided  */
+  0,					/* properties_destroyed  */
+  0,					/* todo_flags_start  */
+  TODO_dump_func | TODO_update_ssa	/* todo_flags_finish  */
+ }
+};
+
Index: gcc/tree.c
===================================================================
--- gcc/tree.c	(revision 175664)
+++ gcc/tree.c	(working copy)
@@ -5753,6 +5753,53 @@  build_aligned_type (tree type, unsigned int align)
   return t;
 }
 
+/* Returns true if memory reference REF with step STEP may be unaligned.  */
+
+bool
+may_be_unaligned_p (tree ref, tree step)
+{
+  tree base;
+  tree base_type;
+  HOST_WIDE_INT bitsize;
+  HOST_WIDE_INT bitpos;
+  tree toffset;
+  enum machine_mode mode;
+  int unsignedp, volatilep;
+  unsigned base_align;
+
+  /* TARGET_MEM_REFs are translated directly to valid MEMs on the target,
+     thus they are not misaligned.  */
+  if (TREE_CODE (ref) == TARGET_MEM_REF)
+    return false;
+
+  /* The test below is basically copy of what expr.c:normal_inner_ref
+     does to check whether the object must be loaded by parts when
+     STRICT_ALIGNMENT is true.  */
+  base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
+			      &unsignedp, &volatilep, true);
+  base_type = TREE_TYPE (base);
+  base_align = TYPE_ALIGN (base_type);
+
+  if (mode != BLKmode)
+    {
+      unsigned mode_align = GET_MODE_ALIGNMENT (mode);
+
+      if (base_align < mode_align
+	  || (bitpos % mode_align) != 0
+	  || (bitpos % BITS_PER_UNIT) != 0)
+	return true;
+
+      if (toffset
+	  && (highest_pow2_factor (toffset) * BITS_PER_UNIT) < mode_align)
+	return true;
+
+      if ((highest_pow2_factor (step) * BITS_PER_UNIT) < mode_align)
+	return true;
+    }
+
+  return false;
+}
+
 /* Create a new distinct copy of TYPE.  The new type is made its own
    MAIN_VARIANT. If TYPE requires structural equality checks, the
    resulting type requires structural equality checks; otherwise, its
Index: gcc/tree.h
===================================================================
--- gcc/tree.h	(revision 175664)
+++ gcc/tree.h	(working copy)
@@ -5139,6 +5139,7 @@  extern tree tree_strip_sign_nop_conversions (tree)
 extern tree lhd_gcc_personality (void);
 extern void assign_assembler_name_if_neeeded (tree);
 extern void warn_deprecated_use (tree, tree);
+extern bool may_be_unaligned_p (tree, tree);
 
 
 /* In cgraph.c */
@@ -5763,6 +5764,7 @@  tree target_for_debug_bind (tree);
 /* In tree-ssa-address.c.  */
 extern tree tree_mem_ref_addr (tree, tree);
 extern void copy_mem_ref_info (tree, tree);
+extern void copy_ref_info (tree, tree);
 
 /* In tree-vrp.c */
 extern bool ssa_name_nonnegative_p (const_tree);
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	(revision 175664)
+++ gcc/tree-pass.h	(working copy)
@@ -386,6 +386,7 @@  extern struct gimple_opt_pass pass_complete_unroll
 extern struct gimple_opt_pass pass_parallelize_loops;
 extern struct gimple_opt_pass pass_loop_prefetch;
 extern struct gimple_opt_pass pass_iv_optimize;
+extern struct gimple_opt_pass pass_tree_lower_addr;
 extern struct gimple_opt_pass pass_tree_loop_done;
 extern struct gimple_opt_pass pass_ch;
 extern struct gimple_opt_pass pass_ccp;
Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 175664)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -1610,53 +1610,6 @@  constant_multiple_of (tree top, tree bot, double_i
     }
 }
 
-/* Returns true if memory reference REF with step STEP may be unaligned.  */
-
-static bool
-may_be_unaligned_p (tree ref, tree step)
-{
-  tree base;
-  tree base_type;
-  HOST_WIDE_INT bitsize;
-  HOST_WIDE_INT bitpos;
-  tree toffset;
-  enum machine_mode mode;
-  int unsignedp, volatilep;
-  unsigned base_align;
-
-  /* TARGET_MEM_REFs are translated directly to valid MEMs on the target,
-     thus they are not misaligned.  */
-  if (TREE_CODE (ref) == TARGET_MEM_REF)
-    return false;
-
-  /* The test below is basically copy of what expr.c:normal_inner_ref
-     does to check whether the object must be loaded by parts when
-     STRICT_ALIGNMENT is true.  */
-  base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
-			      &unsignedp, &volatilep, true);
-  base_type = TREE_TYPE (base);
-  base_align = TYPE_ALIGN (base_type);
-
-  if (mode != BLKmode)
-    {
-      unsigned mode_align = GET_MODE_ALIGNMENT (mode);
-
-      if (base_align < mode_align
-	  || (bitpos % mode_align) != 0
-	  || (bitpos % BITS_PER_UNIT) != 0)
-	return true;
-
-      if (toffset
-	  && (highest_pow2_factor (toffset) * BITS_PER_UNIT) < mode_align)
-	return true;
-
-      if ((highest_pow2_factor (step) * BITS_PER_UNIT) < mode_align)
-	return true;
-    }
-
-  return false;
-}
-
 /* Return true if EXPR may be non-addressable.   */
 
 bool
@@ -6031,64 +5984,6 @@  rewrite_use_nonlinear_expr (struct ivopts_data *da
     }
 }
 
-/* Copies the reference information from OLD_REF to NEW_REF.  */
-
-static void
-copy_ref_info (tree new_ref, tree old_ref)
-{
-  tree new_ptr_base = NULL_TREE;
-
-  TREE_SIDE_EFFECTS (new_ref) = TREE_SIDE_EFFECTS (old_ref);
-  TREE_THIS_VOLATILE (new_ref) = TREE_THIS_VOLATILE (old_ref);
-
-  new_ptr_base = TREE_OPERAND (new_ref, 0);
-
-  /* We can transfer points-to information from an old pointer
-     or decl base to the new one.  */
-  if (new_ptr_base
-      && TREE_CODE (new_ptr_base) == SSA_NAME
-      && !SSA_NAME_PTR_INFO (new_ptr_base))
-    {
-      tree base = get_base_address (old_ref);
-      if (!base)
-	;
-      else if ((TREE_CODE (base) == MEM_REF
-		|| TREE_CODE (base) == TARGET_MEM_REF)
-	       && TREE_CODE (TREE_OPERAND (base, 0)) == SSA_NAME
-	       && SSA_NAME_PTR_INFO (TREE_OPERAND (base, 0)))
-	{
-	  struct ptr_info_def *new_pi;
-	  duplicate_ssa_name_ptr_info
-	    (new_ptr_base, SSA_NAME_PTR_INFO (TREE_OPERAND (base, 0)));
-	  new_pi = SSA_NAME_PTR_INFO (new_ptr_base);
-	  /* We have to be careful about transfering alignment information.  */
-	  if (TREE_CODE (old_ref) == MEM_REF
-	      && !(TREE_CODE (new_ref) == TARGET_MEM_REF
-		   && (TMR_INDEX2 (new_ref)
-		       || (TMR_STEP (new_ref)
-			   && (TREE_INT_CST_LOW (TMR_STEP (new_ref))
-			       < new_pi->align)))))
-	    {
-	      new_pi->misalign += double_int_sub (mem_ref_offset (old_ref),
-						  mem_ref_offset (new_ref)).low;
-	      new_pi->misalign &= (new_pi->align - 1);
-	    }
-	  else
-	    {
-	      new_pi->align = 1;
-	      new_pi->misalign = 0;
-	    }
-	}
-      else if (TREE_CODE (base) == VAR_DECL
-	       || TREE_CODE (base) == PARM_DECL
-	       || TREE_CODE (base) == RESULT_DECL)
-	{
-	  struct ptr_info_def *pi = get_ptr_info (new_ptr_base);
-	  pt_solution_set_var (&pi->pt, base);
-	}
-    }
-}
-
 /* Performs a peephole optimization to reorder the iv update statement with
    a mem ref to enable instruction combining in later phases. The mem ref uses
    the iv value before the update, so the reordering transformation requires
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	(revision 175664)
+++ gcc/timevar.def	(working copy)
@@ -1,7 +1,7 @@ 
 /* This file contains the definitions for timing variables used to
    measure run-time performance of the compiler.
    Copyright (C) 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,
-   2009, 2010
+   2009, 2010, 2011
    Free Software Foundation, Inc.
    Contributed by Alex Samuel <samuel@codesourcery.com>
 
@@ -250,6 +250,7 @@  DEFTIMEVAR (TV_TREE_IFCOMBINE        , "tree if-co
 DEFTIMEVAR (TV_TREE_UNINIT           , "uninit var analysis")
 DEFTIMEVAR (TV_PLUGIN_INIT           , "plugin initialization")
 DEFTIMEVAR (TV_PLUGIN_RUN            , "plugin execution")
+DEFTIMEVAR (TV_TREE_LOWER_ADDR       , "expose addressing modes")
 
 /* Everything else in rest_of_compilation not included above.  */
 DEFTIMEVAR (TV_EARLY_LOCAL	     , "early local passes")
Index: gcc/tree-ssa-address.c
===================================================================
--- gcc/tree-ssa-address.c	(revision 175664)
+++ gcc/tree-ssa-address.c	(working copy)
@@ -612,7 +612,7 @@  most_expensive_mult_to_index (tree type, struct me
 static void
 addr_to_parts (tree type, aff_tree *addr, tree iv_cand,
 	       tree base_hint, struct mem_address *parts,
-               bool speed)
+	       bool speed)
 {
   tree part;
   unsigned i;
@@ -632,9 +632,12 @@  addr_to_parts (tree type, aff_tree *addr, tree iv_
 
   /* No need to do address parts reassociation if the number of parts
      is <= 2 -- in that case, no loop invariant code motion can be
-     exposed.  */
+     exposed.
 
-  if (!base_hint && (addr->n > 2))
+     Ensure iv_cand exists before attempting this, since this code
+     is also called from outside IVopts.  */
+
+  if (!base_hint && (addr->n > 2) && iv_cand)
     move_variant_to_index (parts, addr, iv_cand);
 
   /* First move the most expensive feasible multiplication
@@ -839,6 +842,64 @@  copy_mem_ref_info (tree to, tree from)
   TREE_THIS_VOLATILE (to) = TREE_THIS_VOLATILE (from);
 }
 
+/* Copies the reference information from OLD_REF to NEW_REF.  */
+
+void
+copy_ref_info (tree new_ref, tree old_ref)
+{
+  tree new_ptr_base = NULL_TREE;
+
+  TREE_SIDE_EFFECTS (new_ref) = TREE_SIDE_EFFECTS (old_ref);
+  TREE_THIS_VOLATILE (new_ref) = TREE_THIS_VOLATILE (old_ref);
+
+  new_ptr_base = TREE_OPERAND (new_ref, 0);
+
+  /* We can transfer points-to information from an old pointer
+     or decl base to the new one.  */
+  if (new_ptr_base
+      && TREE_CODE (new_ptr_base) == SSA_NAME
+      && !SSA_NAME_PTR_INFO (new_ptr_base))
+    {
+      tree base = get_base_address (old_ref);
+      if (!base)
+	;
+      else if ((TREE_CODE (base) == MEM_REF
+		|| TREE_CODE (base) == TARGET_MEM_REF)
+	       && TREE_CODE (TREE_OPERAND (base, 0)) == SSA_NAME
+	       && SSA_NAME_PTR_INFO (TREE_OPERAND (base, 0)))
+	{
+	  struct ptr_info_def *new_pi;
+	  duplicate_ssa_name_ptr_info
+	    (new_ptr_base, SSA_NAME_PTR_INFO (TREE_OPERAND (base, 0)));
+	  new_pi = SSA_NAME_PTR_INFO (new_ptr_base);
+	  /* We have to be careful about transferring alignment information.  */
+	  if (TREE_CODE (old_ref) == MEM_REF
+	      && !(TREE_CODE (new_ref) == TARGET_MEM_REF
+		   && (TMR_INDEX2 (new_ref)
+		       || (TMR_STEP (new_ref)
+			   && (TREE_INT_CST_LOW (TMR_STEP (new_ref))
+			       < new_pi->align)))))
+	    {
+	      new_pi->misalign += double_int_sub (mem_ref_offset (old_ref),
+						  mem_ref_offset (new_ref)).low;
+	      new_pi->misalign &= (new_pi->align - 1);
+	    }
+	  else
+	    {
+	      new_pi->align = 1;
+	      new_pi->misalign = 0;
+	    }
+	}
+      else if (TREE_CODE (base) == VAR_DECL
+	       || TREE_CODE (base) == PARM_DECL
+	       || TREE_CODE (base) == RESULT_DECL)
+	{
+	  struct ptr_info_def *pi = get_ptr_info (new_ptr_base);
+	  pt_solution_set_var (&pi->pt, base);
+	}
+    }
+}
+
 /* Move constants in target_mem_ref REF to offset.  Returns the new target
    mem ref if anything changes, NULL_TREE otherwise.  */
 
Index: gcc/tree-affine.c
===================================================================
--- gcc/tree-affine.c	(revision 175664)
+++ gcc/tree-affine.c	(working copy)
@@ -1,5 +1,5 @@ 
 /* Operations with affine combinations of trees.
-   Copyright (C) 2005, 2007, 2008, 2010 Free Software Foundation, Inc.
+   Copyright (C) 2005, 2007, 2008, 2010, 2011 Free Software Foundation, Inc.
 
 This file is part of GCC.
 
@@ -28,6 +28,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-affine.h"
 #include "gimple.h"
 #include "flags.h"
+#include "tree-flow.h"
 
 /* Extends CST as appropriate for the affine combinations COMB.  */
 
@@ -712,6 +713,45 @@  tree_to_aff_combination_expand (tree expr, tree ty
   aff_combination_expand (comb, cache);
 }
 
+/* Expand a memory reference into an affine combination, returning
+   false if EXPR is possibly not addressable, or if EXPR might be
+   misaligned on a STRICT_ALIGNMENT target.  EXPR is modified to
+   an addressing expression.  */
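+
+/* For example (illustrative; exact coefficients depend on the type
+   layout), a reference p_1->f[i_2], where field F sits at offset 8
+   and has 4-byte elements, expands to roughly
+
+     { offset = 8, elts = { 1 * p_1, 4 * (sizetype) i_2 } }
+
+   after looking through the SSA definitions of the feeding names.  */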
+bool
+mem_tree_to_aff_combination_expand (tree *expr, aff_tree *comb)
+{
+  struct pointer_map_t *cache;
+
+  if (TREE_CODE (*expr) == TARGET_MEM_REF)
+    {
+      tree type = build_pointer_type (TREE_TYPE (*expr));
+      *expr = tree_mem_ref_addr (type, *expr);
+    }
+  else
+    {
+      /* Check that the expression is addressable, since the access
+	 may include a VIEW_CONVERT_EXPR or a bitfield reference.  */
+      if (may_be_nonaddressable_p (*expr))
+	return false;
+
+      /* On strict alignment platforms, check that the base expression
+	 is sufficiently aligned.  There is no additional "step" 
+	 information required, so specify single-byte alignment for
+	 that parameter.  */
+      if (STRICT_ALIGNMENT && may_be_unaligned_p (*expr, size_one_node))
+	return false;
+
+      *expr = build_fold_addr_expr (*expr);
+    }
+
+  /* Expand to affine combination, looking through SSA defs.  */
+  cache = pointer_map_create ();
+  tree_to_aff_combination_expand (*expr, TREE_TYPE (*expr), comb, &cache);
+  pointer_map_destroy (cache);
+
+  return true;
+}
+
 /* Frees memory occupied by struct name_expansion in *VALUE.  Callback for
    pointer_map_traverse.  */
 
Index: gcc/tree-affine.h
===================================================================
--- gcc/tree-affine.h	(revision 175664)
+++ gcc/tree-affine.h	(working copy)
@@ -1,5 +1,5 @@ 
 /* Operations with affine combinations of trees.
-   Copyright (C) 2005, 2007, 2008 Free Software Foundation, Inc.
+   Copyright (C) 2005, 2007, 2008, 2011 Free Software Foundation, Inc.
 
 This file is part of GCC.
 
@@ -74,6 +74,7 @@  bool aff_combination_constant_multiple_p (aff_tree
 void aff_combination_expand (aff_tree *, struct pointer_map_t **);
 void tree_to_aff_combination_expand (tree, tree, aff_tree *,
 				     struct pointer_map_t **);
+bool mem_tree_to_aff_combination_expand (tree *, aff_tree *);
 void get_inner_reference_aff (tree, aff_tree *, double_int *);
 void free_affine_expand_cache (struct pointer_map_t **);
 
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 175664)
+++ gcc/common.opt	(working copy)
@@ -1369,6 +1369,10 @@  floop-optimize
 Common Ignore
 Does nothing.  Preserved for backward compatibility.
 
+flower-addr
+Common Report Var(flag_lower_addr) Init(1) Optimization
+Expose target addressing modes in gimple
+
 flto
 Common
 Enable link-time optimization.
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	(revision 175664)
+++ gcc/Makefile.in	(working copy)
@@ -1436,6 +1436,7 @@  OBJS = \
 	tree-sra.o \
 	tree-switch-conversion.o \
 	tree-ssa-address.o \
+	tree-ssa-lower-addr.o \
 	tree-ssa-alias.o \
 	tree-ssa-ccp.o \
 	tree-ssa-coalesce.o \
@@ -2657,7 +2658,7 @@  tree-ssa-loop-ivopts.o : tree-ssa-loop-ivopts.c $(
 tree-affine.o : tree-affine.c tree-affine.h $(CONFIG_H) pointer-set.h \
    $(SYSTEM_H) $(TREE_H) $(GIMPLE_H) \
    output.h $(DIAGNOSTIC_H) coretypes.h $(TREE_DUMP_H) $(FLAGS_H) \
-   tree-pretty-print.h
+   tree-pretty-print.h $(TREE_FLOW_H)
 tree-ssa-loop-manip.o : tree-ssa-loop-manip.c $(TREE_FLOW_H) $(CONFIG_H) \
    $(SYSTEM_H) coretypes.h $(TM_H) $(TREE_H) \
    $(BASIC_BLOCK_H) output.h $(DIAGNOSTIC_H) $(TREE_FLOW_H) $(TREE_DUMP_H) \
@@ -2669,6 +2670,10 @@  tree-ssa-loop-im.o : tree-ssa-loop-im.c $(TREE_FLO
    $(TREE_DUMP_H) $(TREE_PASS_H) $(FLAGS_H) $(BASIC_BLOCK_H) \
    pointer-set.h tree-affine.h tree-ssa-propagate.h gimple-pretty-print.h \
    tree-pretty-print.h
+tree-ssa-lower-addr.o : tree-ssa-lower-addr.c $(CONFIG_H) $(SYSTEM_H) \
+   coretypes.h $(TREE_H) $(BASIC_BLOCK_H) $(TREE_FLOW_H) $(TREE_DUMP_H) \
+   $(TREE_PASS_H) $(TIMEVAR_H) $(CFGLOOP_H) $(GIMPLE_H) $(TREE_AFFINE_H) \
+   tree-pretty-print.h $(DBGCNT_H) alloc-pool.h gt-tree-ssa-lower-addr.h
 tree-ssa-math-opts.o : tree-ssa-math-opts.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
    $(TM_H) $(FLAGS_H) $(TREE_H) $(TREE_FLOW_H) $(TIMEVAR_H) \
    $(TREE_PASS_H) alloc-pool.h $(BASIC_BLOCK_H) $(TARGET_H) \
@@ -3844,6 +3849,7 @@  GTFILES = $(CPP_ID_DATA_H) $(srcdir)/input.h $(src
   $(srcdir)/tree-iterator.c $(srcdir)/gimplify.c \
   $(srcdir)/tree-chrec.h \
   $(srcdir)/tree-scalar-evolution.c \
+  $(srcdir)/tree-ssa-lower-addr.c \
   $(srcdir)/tree-ssa-operands.h \
   $(srcdir)/tree-profile.c $(srcdir)/tree-nested.c \
   $(srcdir)/varpool.c \
Index: gcc/gimple.c
===================================================================
--- gcc/gimple.c	(revision 175664)
+++ gcc/gimple.c	(working copy)
@@ -2018,6 +2018,19 @@  gimple_assign_unary_nop_p (gimple gs)
               == TYPE_MODE (TREE_TYPE (gimple_assign_rhs1 (gs)))));
 }
 
+/* Return true if GS is an assignment that performs a true type
+   conversion, i.e., one that requires code.  */
+
+bool
+gimple_assign_conversion_p (gimple gs)
+{
+  return (is_gimple_assign (gs)
+	  && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (gs))
+	  && gimple_assign_rhs1 (gs) != error_mark_node
+	  && (TYPE_MODE (TREE_TYPE (gimple_assign_lhs (gs)))
+	      != TYPE_MODE (TREE_TYPE (gimple_assign_rhs1 (gs)))));
+}
+
 /* Set BB to be the basic block holding G.  */
 
 void
Index: gcc/gimple.h
===================================================================
--- gcc/gimple.h	(revision 175664)
+++ gcc/gimple.h	(working copy)
@@ -884,6 +884,7 @@  void gimple_call_reset_alias_info (gimple);
 bool gimple_assign_copy_p (gimple);
 bool gimple_assign_ssa_name_copy_p (gimple);
 bool gimple_assign_unary_nop_p (gimple);
+bool gimple_assign_conversion_p (gimple);
 void gimple_set_bb (gimple, struct basic_block_def *);
 void gimple_assign_set_rhs_from_tree (gimple_stmt_iterator *, tree);
 void gimple_assign_set_rhs_with_ops_1 (gimple_stmt_iterator *, enum tree_code,
Index: gcc/passes.c
===================================================================
--- gcc/passes.c	(revision 175664)
+++ gcc/passes.c	(working copy)
@@ -1,7 +1,7 @@ 
 /* Top level of GCC compilers (cc1, cc1plus, etc.)
    Copyright (C) 1987, 1988, 1989, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
-   1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010
-   Free Software Foundation, Inc.
+   1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
+   2011 Free Software Foundation, Inc.
 
 This file is part of GCC.
 
@@ -1375,6 +1375,7 @@  init_optimization_passes (void)
 	 only examines PHIs to discover const/copy propagation
 	 opportunities.  */
       NEXT_PASS (pass_phi_only_cprop);
+      NEXT_PASS (pass_tree_lower_addr);
       NEXT_PASS (pass_cd_dce);
       NEXT_PASS (pass_tracer);