
[PING] Optional alternative base_expr in finding basis for CAND_REFs

Message ID 529DD39F.9080405@arm.com
State New

Commit Message

Yufeng Zhang Dec. 3, 2013, 12:50 p.m. UTC
On 12/03/13 06:48, Jeff Law wrote:
> On 12/02/13 08:47, Yufeng Zhang wrote:
>> Ping~
>>
>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>
>>
>> Thanks,
>> Yufeng
>>
>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>> On 11/26/13 12:45, Richard Biener wrote:
>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>> Zhang<Yufeng.Zhang@arm.com>    wrote:
>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>> The second version of your original patch is ok with me with the
>>>>>> following changes.  Sorry for the little side adventure into the
>>>>>> next-interp logic; in the end that's going to hurt more than it
>>>>>> helps in
>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks also for
>>>>>> cleaning up this version to be less intrusive to common interfaces; I
>>>>>> appreciate it.
>>>>>
>>>>>
>>>>> Thanks a lot for the review.  I've attached an updated patch with the
>>>>> suggested changes incorporated.
>>>>>
>>>>> For the next-interp adventure, I was quite happy to do the
>>>>> experiment; it's
>>>>> a good chance of gaining insight into the pass.  Many thanks for
>>>>> your prompt
>>>>> replies and patience in guiding!
>>>>>
>>>>>
>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>> approval,
>>>>>> as I'm not a maintainer.
> First a note, I need to check on voting for Bill as the slsr maintainer
> from the steering committee.   Voting was in progress just before the
> close of stage1 development so I haven't tallied the results :-)

Looking forward to some good news! :)

>>>
>>> Yes, you are right about the non-trivial 'base' tree are rarely shared.
>>>     The cached is introduced mainly because get_alternative_base () may be
>>> called twice on the same 'base' tree, once in the
>>> find_basis_for_candidate () for look-up and the other time in
>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm happy
>>> to leave out the cache if you think the benefit is trivial.
> Without some sense of how expensive the lookups are vs how often the
> cache hits it's awful hard to know if the cache is worth it.
>
> I'd say take it out unless you have some sense it's really saving time.
>    It's a pretty minor implementation detail either way.

I think the affine tree routines are generally expensive; it is worth 
having a cache to avoid calling them too many times.  I ran the slsr-*.c 
tests under gcc.dg/tree-ssa/ and found that the cache hit rates range 
from 55.6% to 90%, with an average of 73.5%.  The samples may not 
represent real-world scenarios well, but they do show that the 'base' 
tree can be shared to some extent.  So I'd like to keep the cache 
in the patch.
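
The cached lookup itself is small.  A minimal sketch of the idea (not 
the actual patch; it only assumes the alt_base_map and name_expansions 
variables described in the ChangeLog below) would be:

  /* Sketch only: return an alternative base expression for BASE with
     its immediate offset stripped, caching the result in alt_base_map.  */
  static tree
  get_alternative_base (tree base)
  {
    tree *result = (tree *) pointer_map_contains (alt_base_map, base);

    if (result == NULL)
      {
        aff_tree aff;
        tree expanded;

        /* Expand BASE, following SSA_NAME definitions, into an affine
           combination, then drop the constant offset and convert it
           back to a tree.  */
        tree_to_aff_combination_expand (base, TREE_TYPE (base), &aff,
                                        &name_expansions);
        aff.offset = double_int_zero;
        expanded = aff_combination_to_tree (&aff);

        *pointer_map_insert (alt_base_map, base) = expanded;
        return expanded;
      }

    return *result;
  }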

>
>>>
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>> +
>>>> +typedef int arr_2[50][50];
>>>> +
>>>> +void foo (arr_2 a2, int v1)
>>>> +{
>>>> +  int i, j;
>>>> +
>>>> +  i = v1 + 5;
>>>> +  j = i;
>>>> +  a2 [i-10] [j] = 2;
>>>> +  a2 [i] [j++] = i;
>>>> +  a2 [i+20] [j++] = i;
>>>> +  a2 [i-3] [i-1] += 1;
>>>> +  return;
>>>> +}
>>>> +
>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>
>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>> checking which is bad, too.
>>>
>>> As the slsr optimizes CAND_REF candidates by simply lowering them to
>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of MEM_REFs
>>> is an effective check.  Alternatively, I can add a follow-up patch to
>>> add some dumping facility in replace_ref () to print out the replacing
>>> actions when -fdump-tree-slsr-details is on.
> I think adding some details to the dump and scanning for them would be
> better.  That's the only change that is required for this to move forward.

I've updated the patch to dump more details when -fdump-tree-slsr-details 
is on.  The tests have also been updated to scan for these new dump lines 
instead of MEMs.
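
For example, a check along the following lines replaces the old MEM 
count (the exact string is whatever replace_ref now prints under 
-fdump-tree-slsr-details; the one shown here is only illustrative):

  /* { dg-final { scan-tree-dump-times "Replacing reference: " 5 "slsr" } } */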

>
> I suggest doing it quickly.  We're well past stage1 close at this point.

The bootstrapping on x86_64 is still running.  OK to commit if it succeeds?

Thanks,
Yufeng

gcc/

	* gimple-ssa-strength-reduction.c: Include tree-affine.h.
	(name_expansions): New static variable.
	(alt_base_map): Ditto.
	(get_alternative_base): New function.
	(find_basis_for_candidate): For CAND_REF, optionally call
	find_basis_for_base_expr with the returned value from
	get_alternative_base.
	(record_potential_basis): Add new parameter 'base' of type 'tree';
	add an assertion of non-NULL base; use base to set node->base_expr.
	(alloc_cand_and_find_basis): Update; call record_potential_basis
	for CAND_REF with the returned value from get_alternative_base.
	(replace_refs): Dump details on the replacing.
	(execute_strength_reduction): Call pointer_map_create for
	alt_base_map; call free_affine_expand_cache with &name_expansions.

gcc/testsuite/

	* gcc.dg/tree-ssa/slsr-39.c: Update.
	* gcc.dg/tree-ssa/slsr-41.c: New test.

Comments

Richard Biener Dec. 3, 2013, 2:20 p.m. UTC | #1
On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> On 12/03/13 06:48, Jeff Law wrote:
>>
>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>
>>> Ping~
>>>
>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>
>>
>>>
>>> Thanks,
>>> Yufeng
>>>
>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>
>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>
>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>> Zhang<Yufeng.Zhang@arm.com>    wrote:
>>>>>>
>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>
>>>>>>> The second version of your original patch is ok with me with the
>>>>>>> following changes.  Sorry for the little side adventure into the
>>>>>>> next-interp logic; in the end that's going to hurt more than it
>>>>>>> helps in
>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks also for
>>>>>>> cleaning up this version to be less intrusive to common interfaces; I
>>>>>>> appreciate it.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks a lot for the review.  I've attached an updated patch with the
>>>>>> suggested changes incorporated.
>>>>>>
>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>> experiment; it's
>>>>>> a good chance of gaining insight into the pass.  Many thanks for
>>>>>> your prompt
>>>>>> replies and patience in guiding!
>>>>>>
>>>>>>
>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>> approval,
>>>>>>> as I'm not a maintainer.
>>
>> First a note, I need to check on voting for Bill as the slsr maintainer
>> from the steering committee.   Voting was in progress just before the
>> close of stage1 development so I haven't tallied the results :-)
>
>
> Looking forward to some good news! :)
>
>
>>>>
>>>> Yes, you are right about the non-trivial 'base' tree are rarely shared.
>>>>     The cached is introduced mainly because get_alternative_base () may
>>>> be
>>>> called twice on the same 'base' tree, once in the
>>>> find_basis_for_candidate () for look-up and the other time in
>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm happy
>>>> to leave out the cache if you think the benefit is trivial.
>>
>> Without some sense of how expensive the lookups are vs how often the
>> cache hits it's awful hard to know if the cache is worth it.
>>
>> I'd say take it out unless you have some sense it's really saving time.
>>    It's a pretty minor implementation detail either way.
>
>
> I think the affine tree routines are generally expensive; it is worth having
> a cache to avoid calling them too many times.  I run the slsr-*.c tests
> under gcc.dg/tree-ssa/ and find out that the cache hit rates range from
> 55.6% to 90%, with 73.5% as the average.  The samples may not well represent
> the real world scenario, but they do show the fact that the 'base' tree can
> be shared to some extent.  So I'd like to have the cache in the patch.
>
>
>>
>>>>
>>>>> +/* { dg-do compile } */
>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>> +
>>>>> +typedef int arr_2[50][50];
>>>>> +
>>>>> +void foo (arr_2 a2, int v1)
>>>>> +{
>>>>> +  int i, j;
>>>>> +
>>>>> +  i = v1 + 5;
>>>>> +  j = i;
>>>>> +  a2 [i-10] [j] = 2;
>>>>> +  a2 [i] [j++] = i;
>>>>> +  a2 [i+20] [j++] = i;
>>>>> +  a2 [i-3] [i-1] += 1;
>>>>> +  return;
>>>>> +}
>>>>> +
>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>
>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>> checking which is bad, too.
>>>>
>>>>
>>>> As the slsr optimizes CAND_REF candidates by simply lowering them to
>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of MEM_REFs
>>>> is an effective check.  Alternatively, I can add a follow-up patch to
>>>> add some dumping facility in replace_ref () to print out the replacing
>>>> actions when -fdump-tree-slsr-details is on.
>>
>> I think adding some details to the dump and scanning for them would be
>> better.  That's the only change that is required for this to move forward.
>
>
> I've updated to patch to dump more details when -fdump-tree-slsr-details is
> on.  The tests have also been updated to scan for these new dumps instead of
> MEMs.
>
>
>>
>> I suggest doing it quickly.  We're well past stage1 close at this point.
>
>
> The bootstrapping on x86_64 is still running.  OK to commit if it succeeds?

I still don't like it.  It's using the wrong and too expensive tools to do
stuff.  What kind of bases are we ultimately interested in?  Browsing
the code it looks like we're having

  /* Base expression for the chain of candidates:  often, but not
     always, an SSA name.  */
  tree base_expr;

which isn't really too informative but I suppose they are all
kind-of-gimple_val()s?  That said, I wonder if you can simply
use get_addr_base_and_unit_offset in place of get_alternative_base (),
ignoring the returned offset.

Richard.

> Thanks,
>
> Yufeng
>
> gcc/
>
>         * gimple-ssa-strength-reduction.c: Include tree-affine.h.
>         (name_expansions): New static variable.
>         (alt_base_map): Ditto.
>         (get_alternative_base): New function.
>         (find_basis_for_candidate): For CAND_REF, optionally call
>         find_basis_for_base_expr with the returned value from
>         get_alternative_base.
>         (record_potential_basis): Add new parameter 'base' of type 'tree';
>         add an assertion of non-NULL base; use base to set node->base_expr.
>         (alloc_cand_and_find_basis): Update; call record_potential_basis
>         for CAND_REF with the returned value from get_alternative_base.
>         (replace_refs): Dump details on the replacing.
>
>         (execute_strength_reduction): Call pointer_map_create for
>         alt_base_map; call free_affine_expand_cache with &name_expansions.
>
> gcc/testsuite/
>
>         * gcc.dg/tree-ssa/slsr-39.c: Update.
>         * gcc.dg/tree-ssa/slsr-41.c: New test.
Yufeng Zhang Dec. 3, 2013, 3:52 p.m. UTC | #2
On 12/03/13 14:20, Richard Biener wrote:
> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
>> On 12/03/13 06:48, Jeff Law wrote:
>>>
>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>>
>>>> Ping~
>>>>
>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>>
>>>
>>>>
>>>> Thanks,
>>>> Yufeng
>>>>
>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>>
>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>>
>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
>>>>>>>
>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>>
>>>>>>>> The second version of your original patch is ok with me with the
>>>>>>>> following changes.  Sorry for the little side adventure into the
>>>>>>>> next-interp logic; in the end that's going to hurt more than it
>>>>>>>> helps in
>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks also for
>>>>>>>> cleaning up this version to be less intrusive to common interfaces; I
>>>>>>>> appreciate it.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks a lot for the review.  I've attached an updated patch with the
>>>>>>> suggested changes incorporated.
>>>>>>>
>>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>>> experiment; it's
>>>>>>> a good chance of gaining insight into the pass.  Many thanks for
>>>>>>> your prompt
>>>>>>> replies and patience in guiding!
>>>>>>>
>>>>>>>
>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>>> approval,
>>>>>>>> as I'm not a maintainer.
>>>
>>> First a note, I need to check on voting for Bill as the slsr maintainer
>>> from the steering committee.   Voting was in progress just before the
>>> close of stage1 development so I haven't tallied the results :-)
>>
>>
>> Looking forward to some good news! :)
>>
>>
>>>>>
>>>>> Yes, you are right about the non-trivial 'base' tree are rarely shared.
>>>>>      The cached is introduced mainly because get_alternative_base () may
>>>>> be
>>>>> called twice on the same 'base' tree, once in the
>>>>> find_basis_for_candidate () for look-up and the other time in
>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm happy
>>>>> to leave out the cache if you think the benefit is trivial.
>>>
>>> Without some sense of how expensive the lookups are vs how often the
>>> cache hits it's awful hard to know if the cache is worth it.
>>>
>>> I'd say take it out unless you have some sense it's really saving time.
>>>     It's a pretty minor implementation detail either way.
>>
>>
>> I think the affine tree routines are generally expensive; it is worth having
>> a cache to avoid calling them too many times.  I run the slsr-*.c tests
>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range from
>> 55.6% to 90%, with 73.5% as the average.  The samples may not well represent
>> the real world scenario, but they do show the fact that the 'base' tree can
>> be shared to some extent.  So I'd like to have the cache in the patch.
>>
>>
>>>
>>>>>
>>>>>> +/* { dg-do compile } */
>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>>> +
>>>>>> +typedef int arr_2[50][50];
>>>>>> +
>>>>>> +void foo (arr_2 a2, int v1)
>>>>>> +{
>>>>>> +  int i, j;
>>>>>> +
>>>>>> +  i = v1 + 5;
>>>>>> +  j = i;
>>>>>> +  a2 [i-10] [j] = 2;
>>>>>> +  a2 [i] [j++] = i;
>>>>>> +  a2 [i+20] [j++] = i;
>>>>>> +  a2 [i-3] [i-1] += 1;
>>>>>> +  return;
>>>>>> +}
>>>>>> +
>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>>
>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>>> checking which is bad, too.
>>>>>
>>>>>
>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them to
>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of MEM_REFs
>>>>> is an effective check.  Alternatively, I can add a follow-up patch to
>>>>> add some dumping facility in replace_ref () to print out the replacing
>>>>> actions when -fdump-tree-slsr-details is on.
>>>
>>> I think adding some details to the dump and scanning for them would be
>>> better.  That's the only change that is required for this to move forward.
>>
>>
>> I've updated to patch to dump more details when -fdump-tree-slsr-details is
>> on.  The tests have also been updated to scan for these new dumps instead of
>> MEMs.
>>
>>
>>>
>>> I suggest doing it quickly.  We're well past stage1 close at this point.
>>
>>
>> The bootstrapping on x86_64 is still running.  OK to commit if it succeeds?
>
> I still don't like it.  It's using the wrong and too expensive tools to do
> stuff.  What kind of bases are we ultimately interested in?  Browsing
> the code it looks like we're having
>
>    /* Base expression for the chain of candidates:  often, but not
>       always, an SSA name.  */
>    tree base_expr;
>
> which isn't really too informative but I suppose they are all
> kind-of-gimple_val()s?  That said, I wonder if you can simply
> use get_addr_base_and_unit_offset in place of get_alternative_base (),
> ignoring the returned offset.

'base_expr' is essentially the base address of a handled_component_p, 
e.g. ARRAY_REF, COMPONENT_REF, etc.  In most cases, it is the address of 
the object returned by get_inner_reference ().

Given a test case like the following:

typedef int arr_2[20][20];

void foo (arr_2 a2, int i, int j)
{
   a2[i+10][j] = 1;
   a2[i+10][j+1] = 1;
   a2[i+20][j] = 1;
}

The IR before SLSR is (on x86_64):

   _2 = (long unsigned int) i_1(D);
   _3 = _2 * 80;
   _4 = _3 + 800;
   _6 = a2_5(D) + _4;
   *_6[j_8(D)] = 1;
   _10 = j_8(D) + 1;
   *_6[_10] = 1;
   _12 = _3 + 1600;
   _13 = a2_5(D) + _12;
   *_13[j_8(D)] = 1;

The base_expr for the 1st and 2nd memory references is the same, i.e. 
_6, while the base_expr for a2[i+20][j] is _13.

_13 is essentially (_6 + 800), so all three memory references 
essentially share the same base address.  As their strides are also the 
same (MULT_EXPR (j, 4)), the three references can all be lowered to 
MEM_REFs.  What this patch does is use the tree-affine tools to help 
recognize the underlying base address expression; since that requires 
looking into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset () 
won't help here.

Bill has helped me explore other ways of achieving this in SLSR, but so 
far we think this is the best way to proceed.  The use of the tree-affine 
routines is restricted to CAND_REFs only, and there is the 
aforementioned cache facility to help reduce the overhead.
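
Concretely, the changes to execute_strength_reduction listed in the 
ChangeLog amount to a setup/teardown pair around the existing pass body; 
a sketch (not the actual diff, and the pointer_map_destroy call is my 
assumption about the cleanup) looks like:

  /* Before gathering candidates: create the cache.  */
  alt_base_map = pointer_map_create ();

  /* ... existing candidate gathering, analysis and replacement ...  */

  /* Afterwards: release the cache and the affine expansion cache.  */
  pointer_map_destroy (alt_base_map);
  free_affine_expand_cache (&name_expansions);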

Thanks,
Yufeng

P.S. Some more details on what the patch does:

The CAND_REFs for the three memory references are:

  6  [2] *_6[j_8(D)] = 1;
      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
      basis: 0  dependent: 8  sibling: 0
      next-interp: 0  dead-savings: 0

   8  [2] *_6[_10] = 1;
      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
      basis: 6  dependent: 11  sibling: 0
      next-interp: 0  dead-savings: 0

  11  [2] *_13[j_8(D)] = 1;
      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
      basis: 8  dependent: 0  sibling: 0
      next-interp: 0  dead-savings: 0

Before the patch, the strength reduction candidate chains for the three 
CAND_REFs are:

   _6 -> 6 -> 8
   _13 -> 11

i.e. SLSR recognizes that the first two references share the same basis, 
while the last one is on its own.

With the patch, an extra candidate chain can be recognized:

   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8

i.e. all three references are found to have the same basis 
(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded _6 
or _13 with the immediate offset removed.  The pass is now able to 
lower all three references, instead of just the first two, to 
MEM_REFs.
Jeff Law Dec. 3, 2013, 7:20 p.m. UTC | #3
On 12/03/13 08:52, Yufeng Zhang wrote:
>>
>> I still don't like it.  It's using the wrong and too expensive tools
>> to do
>> stuff.  What kind of bases are we ultimately interested in?  Browsing
>> the code it looks like we're having
>>
>>    /* Base expression for the chain of candidates:  often, but not
>>       always, an SSA name.  */
>>    tree base_expr;
>>
>> which isn't really too informative but I suppose they are all
>> kind-of-gimple_val()s?  That said, I wonder if you can simply
>> use get_addr_base_and_unit_offset in place of get_alternative_base (),
>> ignoring the returned offset.
>
> 'base_expr' is essentially the base address of a handled_component_p,
> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> the object returned by get_inner_reference ().
>
> Given a test case like the following:
>
> typedef int arr_2[20][20];
>
> void foo (arr_2 a2, int i, int j)
> {
>    a2[i+10][j] = 1;
>    a2[i+10][j+1] = 1;
>    a2[i+20][j] = 1;
> }
>
> The IR before SLSR is (on x86_64):
>
>    _2 = (long unsigned int) i_1(D);
>    _3 = _2 * 80;
>    _4 = _3 + 800;
>    _6 = a2_5(D) + _4;
>    *_6[j_8(D)] = 1;
>    _10 = j_8(D) + 1;
>    *_6[_10] = 1;
>    _12 = _3 + 1600;
>    _13 = a2_5(D) + _12;
>    *_13[j_8(D)] = 1;
>
> The base_expr for the 1st and 2nd memory reference are the same, i.e.
> _6, while the base_expr for a2[i+20][j] is _13.
>
> _13 is essentially (_6 + 800), so all of the three memory references
> essentially share the same base address.  As their strides are also the
> same (MULT_EXPR (j, 4)), the three references can all be lowered to
> MEM_REFs.  What this patch does is to use the tree affine tools to help
> recognize the underlying base address expression; as it requires looking
> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> won't help here.
>
> Bill has helped me exploit other ways of achieving this in SLSR, but so
> far we think this is the best way to proceed.  The use of tree affine
> routines has been restricted to CAND_REFs only and there is the
> aforementioned cache facility to help reduce the overhead.
Right, and I think Bill's opinions should carry the weight here, since he 
wrote the SLSR code and will likely be its maintainer.

So let's keep the cache per your data.  OK for the trunk.  Thanks for 
your patience and understanding.

jeff
Richard Biener Dec. 3, 2013, 8:35 p.m. UTC | #4
Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
>On 12/03/13 14:20, Richard Biener wrote:
>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com> 
>wrote:
>>> On 12/03/13 06:48, Jeff Law wrote:
>>>>
>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>>>
>>>>> Ping~
>>>>>
>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Yufeng
>>>>>
>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>>>
>>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>>>
>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
>>>>>>>>
>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>>>
>>>>>>>>> The second version of your original patch is ok with me with
>the
>>>>>>>>> following changes.  Sorry for the little side adventure into
>the
>>>>>>>>> next-interp logic; in the end that's going to hurt more than
>it
>>>>>>>>> helps in
>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>also for
>>>>>>>>> cleaning up this version to be less intrusive to common
>interfaces; I
>>>>>>>>> appreciate it.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>with the
>>>>>>>> suggested changes incorporated.
>>>>>>>>
>>>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>>>> experiment; it's
>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>for
>>>>>>>> your prompt
>>>>>>>> replies and patience in guiding!
>>>>>>>>
>>>>>>>>
>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>>>> approval,
>>>>>>>>> as I'm not a maintainer.
>>>>
>>>> First a note, I need to check on voting for Bill as the slsr
>maintainer
>>>> from the steering committee.   Voting was in progress just before
>the
>>>> close of stage1 development so I haven't tallied the results :-)
>>>
>>>
>>> Looking forward to some good news! :)
>>>
>>>
>>>>>>
>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>shared.
>>>>>>      The cached is introduced mainly because get_alternative_base
>() may
>>>>>> be
>>>>>> called twice on the same 'base' tree, once in the
>>>>>> find_basis_for_candidate () for look-up and the other time in
>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>happy
>>>>>> to leave out the cache if you think the benefit is trivial.
>>>>
>>>> Without some sense of how expensive the lookups are vs how often
>the
>>>> cache hits it's awful hard to know if the cache is worth it.
>>>>
>>>> I'd say take it out unless you have some sense it's really saving
>time.
>>>>     It's a pretty minor implementation detail either way.
>>>
>>>
>>> I think the affine tree routines are generally expensive; it is
>worth having
>>> a cache to avoid calling them too many times.  I run the slsr-*.c
>tests
>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>from
>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>represent
>>> the real world scenario, but they do show the fact that the 'base'
>tree can
>>> be shared to some extent.  So I'd like to have the cache in the
>patch.
>>>
>>>
>>>>
>>>>>>
>>>>>>> +/* { dg-do compile } */
>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>>>> +
>>>>>>> +typedef int arr_2[50][50];
>>>>>>> +
>>>>>>> +void foo (arr_2 a2, int v1)
>>>>>>> +{
>>>>>>> +  int i, j;
>>>>>>> +
>>>>>>> +  i = v1 + 5;
>>>>>>> +  j = i;
>>>>>>> +  a2 [i-10] [j] = 2;
>>>>>>> +  a2 [i] [j++] = i;
>>>>>>> +  a2 [i+20] [j++] = i;
>>>>>>> +  a2 [i-3] [i-1] += 1;
>>>>>>> +  return;
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>>>
>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>>>> checking which is bad, too.
>>>>>>
>>>>>>
>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>to
>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>MEM_REFs
>>>>>> is an effective check.  Alternatively, I can add a follow-up
>patch to
>>>>>> add some dumping facility in replace_ref () to print out the
>replacing
>>>>>> actions when -fdump-tree-slsr-details is on.
>>>>
>>>> I think adding some details to the dump and scanning for them would
>be
>>>> better.  That's the only change that is required for this to move
>forward.
>>>
>>>
>>> I've updated to patch to dump more details when
>-fdump-tree-slsr-details is
>>> on.  The tests have also been updated to scan for these new dumps
>instead of
>>> MEMs.
>>>
>>>
>>>>
>>>> I suggest doing it quickly.  We're well past stage1 close at this
>point.
>>>
>>>
>>> The bootstrapping on x86_64 is still running.  OK to commit if it
>succeeds?
>>
>> I still don't like it.  It's using the wrong and too expensive tools
>to do
>> stuff.  What kind of bases are we ultimately interested in?  Browsing
>> the code it looks like we're having
>>
>>    /* Base expression for the chain of candidates:  often, but not
>>       always, an SSA name.  */
>>    tree base_expr;
>>
>> which isn't really too informative but I suppose they are all
>> kind-of-gimple_val()s?  That said, I wonder if you can simply
>> use get_addr_base_and_unit_offset in place of get_alternative_base
>(),
>> ignoring the returned offset.
>
>'base_expr' is essentially the base address of a handled_component_p, 
>e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>
>the object returned by get_inner_reference ().
>
>Given a test case like the following:
>
>typedef int arr_2[20][20];
>
>void foo (arr_2 a2, int i, int j)
>{
>   a2[i+10][j] = 1;
>   a2[i+10][j+1] = 1;
>   a2[i+20][j] = 1;
>}
>
>The IR before SLSR is (on x86_64):
>
>   _2 = (long unsigned int) i_1(D);
>   _3 = _2 * 80;
>   _4 = _3 + 800;
>   _6 = a2_5(D) + _4;
>   *_6[j_8(D)] = 1;
>   _10 = j_8(D) + 1;
>   *_6[_10] = 1;
>   _12 = _3 + 1600;
>   _13 = a2_5(D) + _12;
>   *_13[j_8(D)] = 1;
>
>The base_expr for the 1st and 2nd memory reference are the same, i.e. 
>_6, while the base_expr for a2[i+20][j] is _13.
>
>_13 is essentially (_6 + 800), so all of the three memory references 
>essentially share the same base address.  As their strides are also the
>
>same (MULT_EXPR (j, 4)), the three references can all be lowered to 
>MEM_REFs.  What this patch does is to use the tree affine tools to help
>
>recognize the underlying base address expression; as it requires
>looking 
>into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset () 
>won't help here.
>
>Bill has helped me exploit other ways of achieving this in SLSR, but so
>
>far we think this is the best way to proceed.  The use of tree affine 
>routines has been restricted to CAND_REFs only and there is the 
>aforementioned cache facility to help reduce the overhead.
>
>Thanks,
>Yufeng
>
>P.S. some more details what the patch does:
>
>The CAND_REF for the three memory references are:
>
>  6  [2] *_6[j_8(D)] = 1;
>      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>      basis: 0  dependent: 8  sibling: 0
>      next-interp: 0  dead-savings: 0
>
>   8  [2] *_6[_10] = 1;
>      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>      basis: 6  dependent: 11  sibling: 0
>      next-interp: 0  dead-savings: 0
>
>  11  [2] *_13[j_8(D)] = 1;
>      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>      basis: 8  dependent: 0  sibling: 0
>      next-interp: 0  dead-savings: 0
>
>Before the patch, the strength reduction candidate chains for the three
>
>CAND_REFs are:
>
>   _6 -> 6 -> 8
>   _13 -> 11
>
>i.e. SLSR recognizes the first two references share the same basis, 
>while the last one is on it own.
>
>With the patch, an extra candidate chain can be recognized:
>
>   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
>
>i.e. all of the three references are found to have the same basis 
>(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>_6 
>or _13, with the immediate offset removed.  The pass is now able to 
>lower all of the three references, instead of the first two only, to 
>MEM_REFs.

Ok, so slsr handles arbitrarily complex bases and figures out common components? If so, then why not just use get_inner_reference? After all, slsr does not use tree-affine as the representation for bases (which it could?)

Richard.
Yufeng Zhang Dec. 3, 2013, 9:57 p.m. UTC | #5
On 12/03/13 20:35, Richard Biener wrote:
> Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
>> On 12/03/13 14:20, Richard Biener wrote:
>>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>> wrote:
>>>> On 12/03/13 06:48, Jeff Law wrote:
>>>>>
>>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>>>>
>>>>>> Ping~
>>>>>>
>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Yufeng
>>>>>>
>>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>>>>
>>>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>>>>
>>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
>>>>>>>>>
>>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>>>>
>>>>>>>>>> The second version of your original patch is ok with me with
>> the
>>>>>>>>>> following changes.  Sorry for the little side adventure into
>> the
>>>>>>>>>> next-interp logic; in the end that's going to hurt more than
>> it
>>>>>>>>>> helps in
>>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>> also for
>>>>>>>>>> cleaning up this version to be less intrusive to common
>> interfaces; I
>>>>>>>>>> appreciate it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>> with the
>>>>>>>>> suggested changes incorporated.
>>>>>>>>>
>>>>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>>>>> experiment; it's
>>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>> for
>>>>>>>>> your prompt
>>>>>>>>> replies and patience in guiding!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>>>>> approval,
>>>>>>>>>> as I'm not a maintainer.
>>>>>
>>>>> First a note, I need to check on voting for Bill as the slsr
>> maintainer
>>>>> from the steering committee.   Voting was in progress just before
>> the
>>>>> close of stage1 development so I haven't tallied the results :-)
>>>>
>>>>
>>>> Looking forward to some good news! :)
>>>>
>>>>
>>>>>>>
>>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>> shared.
>>>>>>>       The cached is introduced mainly because get_alternative_base
>> () may
>>>>>>> be
>>>>>>> called twice on the same 'base' tree, once in the
>>>>>>> find_basis_for_candidate () for look-up and the other time in
>>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>> happy
>>>>>>> to leave out the cache if you think the benefit is trivial.
>>>>>
>>>>> Without some sense of how expensive the lookups are vs how often
>> the
>>>>> cache hits it's awful hard to know if the cache is worth it.
>>>>>
>>>>> I'd say take it out unless you have some sense it's really saving
>> time.
>>>>>      It's a pretty minor implementation detail either way.
>>>>
>>>>
>>>> I think the affine tree routines are generally expensive; it is
>> worth having
>>>> a cache to avoid calling them too many times.  I run the slsr-*.c
>> tests
>>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>> from
>>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>> represent
>>>> the real world scenario, but they do show the fact that the 'base'
>> tree can
>>>> be shared to some extent.  So I'd like to have the cache in the
>> patch.
>>>>
>>>>
>>>>>
>>>>>>>
>>>>>>>> +/* { dg-do compile } */
>>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>>>>> +
>>>>>>>> +typedef int arr_2[50][50];
>>>>>>>> +
>>>>>>>> +void foo (arr_2 a2, int v1)
>>>>>>>> +{
>>>>>>>> +  int i, j;
>>>>>>>> +
>>>>>>>> +  i = v1 + 5;
>>>>>>>> +  j = i;
>>>>>>>> +  a2 [i-10] [j] = 2;
>>>>>>>> +  a2 [i] [j++] = i;
>>>>>>>> +  a2 [i+20] [j++] = i;
>>>>>>>> +  a2 [i-3] [i-1] += 1;
>>>>>>>> +  return;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>>>>
>>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>>>>> checking which is bad, too.
>>>>>>>
>>>>>>>
>>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>> to
>>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>> MEM_REFs
>>>>>>> is an effective check.  Alternatively, I can add a follow-up
>> patch to
>>>>>>> add some dumping facility in replace_ref () to print out the
>> replacing
>>>>>>> actions when -fdump-tree-slsr-details is on.
>>>>>
>>>>> I think adding some details to the dump and scanning for them would
>> be
>>>>> better.  That's the only change that is required for this to move
>> forward.
>>>>
>>>>
>>>> I've updated to patch to dump more details when
>> -fdump-tree-slsr-details is
>>>> on.  The tests have also been updated to scan for these new dumps
>> instead of
>>>> MEMs.
>>>>
>>>>
>>>>>
>>>>> I suggest doing it quickly.  We're well past stage1 close at this
>> point.
>>>>
>>>>
>>>> The bootstrapping on x86_64 is still running.  OK to commit if it
>> succeeds?
>>>
>>> I still don't like it.  It's using the wrong and too expensive tools
>> to do
>>> stuff.  What kind of bases are we ultimately interested in?  Browsing
>>> the code it looks like we're having
>>>
>>>     /* Base expression for the chain of candidates:  often, but not
>>>        always, an SSA name.  */
>>>     tree base_expr;
>>>
>>> which isn't really too informative but I suppose they are all
>>> kind-of-gimple_val()s?  That said, I wonder if you can simply
>>> use get_addr_base_and_unit_offset in place of get_alternative_base
>> (),
>>> ignoring the returned offset.
>>
>> 'base_expr' is essentially the base address of a handled_component_p,
>> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>>
>> the object returned by get_inner_reference ().
>>
>> Given a test case like the following:
>>
>> typedef int arr_2[20][20];
>>
>> void foo (arr_2 a2, int i, int j)
>> {
>>    a2[i+10][j] = 1;
>>    a2[i+10][j+1] = 1;
>>    a2[i+20][j] = 1;
>> }
>>
>> The IR before SLSR is (on x86_64):
>>
>>    _2 = (long unsigned int) i_1(D);
>>    _3 = _2 * 80;
>>    _4 = _3 + 800;
>>    _6 = a2_5(D) + _4;
>>    *_6[j_8(D)] = 1;
>>    _10 = j_8(D) + 1;
>>    *_6[_10] = 1;
>>    _12 = _3 + 1600;
>>    _13 = a2_5(D) + _12;
>>    *_13[j_8(D)] = 1;
>>
>> The base_expr for the 1st and 2nd memory reference are the same, i.e.
>> _6, while the base_expr for a2[i+20][j] is _13.
>>
>> _13 is essentially (_6 + 800), so all of the three memory references
>> essentially share the same base address.  As their strides are also the
>>
>> same (MULT_EXPR (j, 4)), the three references can all be lowered to
>> MEM_REFs.  What this patch does is to use the tree affine tools to help
>>
>> recognize the underlying base address expression; as it requires
>> looking
>> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>> won't help here.
>>
>> Bill has helped me exploit other ways of achieving this in SLSR, but so
>>
>> far we think this is the best way to proceed.  The use of tree affine
>> routines has been restricted to CAND_REFs only and there is the
>> aforementioned cache facility to help reduce the overhead.
>>
>> Thanks,
>> Yufeng
>>
>> P.S. some more details what the patch does:
>>
>> The CAND_REF for the three memory references are:
>>
>>   6  [2] *_6[j_8(D)] = 1;
>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>       basis: 0  dependent: 8  sibling: 0
>>       next-interp: 0  dead-savings: 0
>>
>>    8  [2] *_6[_10] = 1;
>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>>       basis: 6  dependent: 11  sibling: 0
>>       next-interp: 0  dead-savings: 0
>>
>>   11  [2] *_13[j_8(D)] = 1;
>>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>       basis: 8  dependent: 0  sibling: 0
>>       next-interp: 0  dead-savings: 0
>>
>> Before the patch, the strength reduction candidate chains for the three
>>
>> CAND_REFs are:
>>
>>    _6 ->  6 ->  8
>>    _13 ->  11
>>
>> i.e. SLSR recognizes the first two references share the same basis,
>> while the last one is on it own.
>>
>> With the patch, an extra candidate chain can be recognized:
>>
>>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
>>
>> i.e. all of the three references are found to have the same basis
>> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>> _6
>> or _13, with the immediate offset removed.  The pass is now able to
>> lower all of the three references, instead of the first two only, to
>> MEM_REFs.
>
> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference?

slsr is indeed already using get_inner_reference () to figure out the 
common components; see restructure_reference () and the comment at the 
beginning of gimple-ssa-strength-reduction.c.  I'll quote some of that 
comment here for convenience:

    Specifically, we are interested in references for which
    get_inner_reference returns a base address, offset, and bitpos as
    follows:

      base:    MEM_REF (T1, C1)
      offset:  MULT_EXPR (PLUS_EXPR (T2, C2), C3)
      bitpos:  C4 * BITS_PER_UNIT

    Here T1 and T2 are arbitrary trees, and C1, C2, C3, C4 are
    arbitrary integer constants.  Note that C2 may be zero, in which
    case the offset will be MULT_EXPR (T2, C3).

    When this pattern is recognized, the original memory reference
    can be replaced with:

      MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
               C1 + (C2 * C3) + C4)

    which distributes the multiply to allow constant folding.  When
    two or more addressing expressions can be represented by MEM_REFs
    of this form, differing only in the constants C1, C2, and C4,
    making this substitution produces more efficient addressing during
    the RTL phases.  When there are not at least two expressions with
    the same values of T1, T2, and C3, there is nothing to be gained
    by the replacement.

    Strength reduction of CAND_REFs uses the same infrastructure as
    that used by CAND_MULTs and CAND_ADDs.  We record T1 in the base (B)
    field, MULT_EXPR (T2, C3) in the stride (S) field, and
    C1 + (C2 * C3) + C4 in the index (i) field.  A basis for a CAND_REF
    is thus another CAND_REF with the same B and S values.  When at
    least two CAND_REFs are chained together using the basis relation,
    each of them is replaced as above, resulting in improved code
    generation for addressing.

As the last paragraph says, a basis for a CAND_REF is another CAND_REF 
with the same B and S values.  This patch extends the definition of a 
basis by allowing the B's to differ by an immediate constant instead of 
requiring them to be identical.
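
To make that concrete with the earlier example, the recorded values for 
candidates 6 and 11 are:

   candidate 6 :  B = _6    S = (sizetype) j_8(D) * 4   index = 0
   candidate 11:  B = _13   S = (sizetype) j_8(D) * 4   index = 0

Under the original definition the B's differ, so 11 cannot have 6 as a 
basis; with the patch both B's expand to a2_5(D) + (sizetype) i_1(D) * 80 
(they differ only by the constant 800), so all three candidates end up 
in the same chain.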

>  After all slsr does not use tree-affine as representation for bases (which it could?)

In theory it could use tree-affine for bases, and I experimented with this 
approach as well, but I encountered some unexpected re-association issues 
when building spec2k: when a tree-affine combination is converted back to 
a tree, the association order can be different from, or worse than, what 
it was before.  I also didn't see any obvious benefit, so I didn't 
proceed further.

In the long run, the additional lowering of memory accesses you 
mentioned in http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03731.html may 
be a better solution to what I'm trying to tackle here.  I'll see if I 
can get time to work out something useful for 4.10. :)

Yufeng
Bill Schmidt Dec. 3, 2013, 10:04 p.m. UTC | #6
On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> >On 12/03/13 14:20, Richard Biener wrote:
> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com> 
> >wrote:
> >>> On 12/03/13 06:48, Jeff Law wrote:
> >>>>
> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>>>>
> >>>>> Ping~
> >>>>>
> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>>>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Yufeng
> >>>>>
> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>>>>>
> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>>>>>>
> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
> >>>>>>>>
> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>>>>>>>>
> >>>>>>>>> The second version of your original patch is ok with me with
> >the
> >>>>>>>>> following changes.  Sorry for the little side adventure into
> >the
> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >it
> >>>>>>>>> helps in
> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >also for
> >>>>>>>>> cleaning up this version to be less intrusive to common
> >interfaces; I
> >>>>>>>>> appreciate it.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >with the
> >>>>>>>> suggested changes incorporated.
> >>>>>>>>
> >>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>>>>>>> experiment; it's
> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >for
> >>>>>>>> your prompt
> >>>>>>>> replies and patience in guiding!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >>>>>>>>> approval,
> >>>>>>>>> as I'm not a maintainer.
> >>>>
> >>>> First a note, I need to check on voting for Bill as the slsr
> >maintainer
> >>>> from the steering committee.   Voting was in progress just before
> >the
> >>>> close of stage1 development so I haven't tallied the results :-)
> >>>
> >>>
> >>> Looking forward to some good news! :)
> >>>
> >>>
> >>>>>>
> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >shared.
> >>>>>>      The cached is introduced mainly because get_alternative_base
> >() may
> >>>>>> be
> >>>>>> called twice on the same 'base' tree, once in the
> >>>>>> find_basis_for_candidate () for look-up and the other time in
> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >happy
> >>>>>> to leave out the cache if you think the benefit is trivial.
> >>>>
> >>>> Without some sense of how expensive the lookups are vs how often
> >the
> >>>> cache hits it's awful hard to know if the cache is worth it.
> >>>>
> >>>> I'd say take it out unless you have some sense it's really saving
> >time.
> >>>>     It's a pretty minor implementation detail either way.
> >>>
> >>>
> >>> I think the affine tree routines are generally expensive; it is
> >worth having
> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >tests
> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >from
> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >represent
> >>> the real world scenario, but they do show the fact that the 'base'
> >tree can
> >>> be shared to some extent.  So I'd like to have the cache in the
> >patch.
> >>>
> >>>
> >>>>
> >>>>>>
> >>>>>>> +/* { dg-do compile } */
> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>>>>>> +
> >>>>>>> +typedef int arr_2[50][50];
> >>>>>>> +
> >>>>>>> +void foo (arr_2 a2, int v1)
> >>>>>>> +{
> >>>>>>> +  int i, j;
> >>>>>>> +
> >>>>>>> +  i = v1 + 5;
> >>>>>>> +  j = i;
> >>>>>>> +  a2 [i-10] [j] = 2;
> >>>>>>> +  a2 [i] [j++] = i;
> >>>>>>> +  a2 [i+20] [j++] = i;
> >>>>>>> +  a2 [i-3] [i-1] += 1;
> >>>>>>> +  return;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>>>>>>
> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >>>>>>> checking which is bad, too.
> >>>>>>
> >>>>>>
> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >to
> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >MEM_REFs
> >>>>>> is an effective check.  Alternatively, I can add a follow-up
> >patch to
> >>>>>> add some dumping facility in replace_ref () to print out the
> >replacing
> >>>>>> actions when -fdump-tree-slsr-details is on.
> >>>>
> >>>> I think adding some details to the dump and scanning for them would
> >be
> >>>> better.  That's the only change that is required for this to move
> >forward.
> >>>
> >>>
> >>> I've updated to patch to dump more details when
> >-fdump-tree-slsr-details is
> >>> on.  The tests have also been updated to scan for these new dumps
> >instead of
> >>> MEMs.
> >>>
> >>>
> >>>>
> >>>> I suggest doing it quickly.  We're well past stage1 close at this
> >point.
> >>>
> >>>
> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >succeeds?
> >>
> >> I still don't like it.  It's using the wrong and too expensive tools
> >to do
> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >> the code it looks like we're having
> >>
> >>    /* Base expression for the chain of candidates:  often, but not
> >>       always, an SSA name.  */
> >>    tree base_expr;
> >>
> >> which isn't really too informative but I suppose they are all
> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >> use get_addr_base_and_unit_offset in place of get_alternative_base
> >(),
> >> ignoring the returned offset.
> >
> >'base_expr' is essentially the base address of a handled_component_p, 
> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> >
> >the object returned by get_inner_reference ().
> >
> >Given a test case like the following:
> >
> >typedef int arr_2[20][20];
> >
> >void foo (arr_2 a2, int i, int j)
> >{
> >   a2[i+10][j] = 1;
> >   a2[i+10][j+1] = 1;
> >   a2[i+20][j] = 1;
> >}
> >
> >The IR before SLSR is (on x86_64):
> >
> >   _2 = (long unsigned int) i_1(D);
> >   _3 = _2 * 80;
> >   _4 = _3 + 800;
> >   _6 = a2_5(D) + _4;
> >   *_6[j_8(D)] = 1;
> >   _10 = j_8(D) + 1;
> >   *_6[_10] = 1;
> >   _12 = _3 + 1600;
> >   _13 = a2_5(D) + _12;
> >   *_13[j_8(D)] = 1;
> >
> >The base_expr for the 1st and 2nd memory reference are the same, i.e. 
> >_6, while the base_expr for a2[i+20][j] is _13.
> >
> >_13 is essentially (_6 + 800), so all of the three memory references 
> >essentially share the same base address.  As their strides are also the
> >
> >same (MULT_EXPR (j, 4)), the three references can all be lowered to 
> >MEM_REFs.  What this patch does is to use the tree affine tools to help
> >
> >recognize the underlying base address expression; as it requires
> >looking 
> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset () 
> >won't help here.
> >
> >Bill has helped me exploit other ways of achieving this in SLSR, but so
> >
> >far we think this is the best way to proceed.  The use of tree affine 
> >routines has been restricted to CAND_REFs only and there is the 
> >aforementioned cache facility to help reduce the overhead.
> >
> >Thanks,
> >Yufeng
> >
> >P.S. some more details what the patch does:
> >
> >The CAND_REF for the three memory references are:
> >
> >  6  [2] *_6[j_8(D)] = 1;
> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >      basis: 0  dependent: 8  sibling: 0
> >      next-interp: 0  dead-savings: 0
> >
> >   8  [2] *_6[_10] = 1;
> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >      basis: 6  dependent: 11  sibling: 0
> >      next-interp: 0  dead-savings: 0
> >
> >  11  [2] *_13[j_8(D)] = 1;
> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >      basis: 8  dependent: 0  sibling: 0
> >      next-interp: 0  dead-savings: 0
> >
> >Before the patch, the strength reduction candidate chains for the three
> >
> >CAND_REFs are:
> >
> >   _6 -> 6 -> 8
> >   _13 -> 11
> >
> >i.e. SLSR recognizes the first two references share the same basis, 
> >while the last one is on it own.
> >
> >With the patch, an extra candidate chain can be recognized:
> >
> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
> >
> >i.e. all of the three references are found to have the same basis 
> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >_6 
> >or _13, with the immediate offset removed.  The pass is now able to 
> >lower all of the three references, instead of the first two only, to 
> >MEM_REFs.
> 
> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)

I think that's overstating SLSR's current capabilities a bit. :)  We do
use get_inner_reference to come up with the base expression for
reference candidates (based on some of your suggestions a couple of
years back).  However, in the case of multiple levels of array
references, we miss opportunities because get_inner_reference stops at
an SSA name that could be further expanded by following its definition
back to a more fundamental base expression.

Part of the issue here is that reference candidates are the basis for a more
specific optimization than the mult and add candidates.  The latter have
a more general framework for building up a record of simple affine
expressions that can be strength-reduced.  Ultimately we ought to be
able to do something similar for reference candidates, building up
simple affine expressions from base expressions, so that everything is
done in forward order and the tree-affine interfaces aren't needed.
But that will take some more fundamental design changes, and since this
provides some good improvements for important cases, I feel it's
reasonable to get this into the release.

Thanks,
Bill

> 
> Richard.
> 
>
Bill Schmidt Dec. 3, 2013, 10:19 p.m. UTC | #7
On Tue, 2013-12-03 at 21:57 +0000, Yufeng Zhang wrote:
> On 12/03/13 20:35, Richard Biener wrote:
> > Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
> >> On 12/03/13 14:20, Richard Biener wrote:
> >>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> >> wrote:
> >>>> On 12/03/13 06:48, Jeff Law wrote:
> >>>>>
> >>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>>>>>
> >>>>>> Ping~
> >>>>>>
> >>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Yufeng
> >>>>>>
> >>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>>>>>>
> >>>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>>>>>>>
> >>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
> >>>>>>>>>
> >>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>>>>>>>>>
> >>>>>>>>>> The second version of your original patch is ok with me with
> >> the
> >>>>>>>>>> following changes.  Sorry for the little side adventure into
> >> the
> >>>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >> it
> >>>>>>>>>> helps in
> >>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >> also for
> >>>>>>>>>> cleaning up this version to be less intrusive to common
> >> interfaces; I
> >>>>>>>>>> appreciate it.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >> with the
> >>>>>>>>> suggested changes incorporated.
> >>>>>>>>>
> >>>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>>>>>>>> experiment; it's
> >>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >> for
> >>>>>>>>> your prompt
> >>>>>>>>> replies and patience in guiding!
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >>>>>>>>>> approval,
> >>>>>>>>>> as I'm not a maintainer.
> >>>>>
> >>>>> First a note, I need to check on voting for Bill as the slsr
> >> maintainer
> >>>>> from the steering committee.   Voting was in progress just before
> >> the
> >>>>> close of stage1 development so I haven't tallied the results :-)
> >>>>
> >>>>
> >>>> Looking forward to some good news! :)
> >>>>
> >>>>
> >>>>>>>
> >>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >> shared.
> >>>>>>>       The cached is introduced mainly because get_alternative_base
> >> () may
> >>>>>>> be
> >>>>>>> called twice on the same 'base' tree, once in the
> >>>>>>> find_basis_for_candidate () for look-up and the other time in
> >>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >> happy
> >>>>>>> to leave out the cache if you think the benefit is trivial.
> >>>>>
> >>>>> Without some sense of how expensive the lookups are vs how often
> >> the
> >>>>> cache hits it's awful hard to know if the cache is worth it.
> >>>>>
> >>>>> I'd say take it out unless you have some sense it's really saving
> >> time.
> >>>>>      It's a pretty minor implementation detail either way.
> >>>>
> >>>>
> >>>> I think the affine tree routines are generally expensive; it is
> >> worth having
> >>>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >> tests
> >>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >> from
> >>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >> represent
> >>>> the real world scenario, but they do show the fact that the 'base'
> >> tree can
> >>>> be shared to some extent.  So I'd like to have the cache in the
> >> patch.
> >>>>
> >>>>
> >>>>>
> >>>>>>>
> >>>>>>>> +/* { dg-do compile } */
> >>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>>>>>>> +
> >>>>>>>> +typedef int arr_2[50][50];
> >>>>>>>> +
> >>>>>>>> +void foo (arr_2 a2, int v1)
> >>>>>>>> +{
> >>>>>>>> +  int i, j;
> >>>>>>>> +
> >>>>>>>> +  i = v1 + 5;
> >>>>>>>> +  j = i;
> >>>>>>>> +  a2 [i-10] [j] = 2;
> >>>>>>>> +  a2 [i] [j++] = i;
> >>>>>>>> +  a2 [i+20] [j++] = i;
> >>>>>>>> +  a2 [i-3] [i-1] += 1;
> >>>>>>>> +  return;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>>>>>>>
> >>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >>>>>>>> checking which is bad, too.
> >>>>>>>
> >>>>>>>
> >>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >> to
> >>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >> MEM_REFs
> >>>>>>> is an effective check.  Alternatively, I can add a follow-up
> >> patch to
> >>>>>>> add some dumping facility in replace_ref () to print out the
> >> replacing
> >>>>>>> actions when -fdump-tree-slsr-details is on.
> >>>>>
> >>>>> I think adding some details to the dump and scanning for them would
> >> be
> >>>>> better.  That's the only change that is required for this to move
> >> forward.
> >>>>
> >>>>
> >>>> I've updated to patch to dump more details when
> >> -fdump-tree-slsr-details is
> >>>> on.  The tests have also been updated to scan for these new dumps
> >> instead of
> >>>> MEMs.
> >>>>
> >>>>
> >>>>>
> >>>>> I suggest doing it quickly.  We're well past stage1 close at this
> >> point.
> >>>>
> >>>>
> >>>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >> succeeds?
> >>>
> >>> I still don't like it.  It's using the wrong and too expensive tools
> >> to do
> >>> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >>> the code it looks like we're having
> >>>
> >>>     /* Base expression for the chain of candidates:  often, but not
> >>>        always, an SSA name.  */
> >>>     tree base_expr;
> >>>
> >>> which isn't really too informative but I suppose they are all
> >>> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >>> use get_addr_base_and_unit_offset in place of get_alternative_base
> >> (),
> >>> ignoring the returned offset.
> >>
> >> 'base_expr' is essentially the base address of a handled_component_p,
> >> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> >>
> >> the object returned by get_inner_reference ().
> >>
> >> Given a test case like the following:
> >>
> >> typedef int arr_2[20][20];
> >>
> >> void foo (arr_2 a2, int i, int j)
> >> {
> >>    a2[i+10][j] = 1;
> >>    a2[i+10][j+1] = 1;
> >>    a2[i+20][j] = 1;
> >> }
> >>
> >> The IR before SLSR is (on x86_64):
> >>
> >>    _2 = (long unsigned int) i_1(D);
> >>    _3 = _2 * 80;
> >>    _4 = _3 + 800;
> >>    _6 = a2_5(D) + _4;
> >>    *_6[j_8(D)] = 1;
> >>    _10 = j_8(D) + 1;
> >>    *_6[_10] = 1;
> >>    _12 = _3 + 1600;
> >>    _13 = a2_5(D) + _12;
> >>    *_13[j_8(D)] = 1;
> >>
> >> The base_expr for the 1st and 2nd memory reference are the same, i.e.
> >> _6, while the base_expr for a2[i+20][j] is _13.
> >>
> >> _13 is essentially (_6 + 800), so all of the three memory references
> >> essentially share the same base address.  As their strides are also the
> >>
> >> same (MULT_EXPR (j, 4)), the three references can all be lowered to
> >> MEM_REFs.  What this patch does is to use the tree affine tools to help
> >>
> >> recognize the underlying base address expression; as it requires
> >> looking
> >> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> >> won't help here.
> >>
> >> Bill has helped me exploit other ways of achieving this in SLSR, but so
> >>
> >> far we think this is the best way to proceed.  The use of tree affine
> >> routines has been restricted to CAND_REFs only and there is the
> >> aforementioned cache facility to help reduce the overhead.
> >>
> >> Thanks,
> >> Yufeng
> >>
> >> P.S. some more details what the patch does:
> >>
> >> The CAND_REF for the three memory references are:
> >>
> >>   6  [2] *_6[j_8(D)] = 1;
> >>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>       basis: 0  dependent: 8  sibling: 0
> >>       next-interp: 0  dead-savings: 0
> >>
> >>    8  [2] *_6[_10] = 1;
> >>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >>       basis: 6  dependent: 11  sibling: 0
> >>       next-interp: 0  dead-savings: 0
> >>
> >>   11  [2] *_13[j_8(D)] = 1;
> >>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>       basis: 8  dependent: 0  sibling: 0
> >>       next-interp: 0  dead-savings: 0
> >>
> >> Before the patch, the strength reduction candidate chains for the three
> >>
> >> CAND_REFs are:
> >>
> >>    _6 ->  6 ->  8
> >>    _13 ->  11
> >>
> >> i.e. SLSR recognizes the first two references share the same basis,
> >> while the last one is on it own.
> >>
> >> With the patch, an extra candidate chain can be recognized:
> >>
> >>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
> >>
> >> i.e. all of the three references are found to have the same basis
> >> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >> _6
> >> or _13, with the immediate offset removed.  The pass is now able to
> >> lower all of the three references, instead of the first two only, to
> >> MEM_REFs.
> >
> > Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference?
> 
> slsr is indeed already using get_inner_reference () to figure out the 
> common components; see restructure_reference () and the comment at the 
> beginning of gimple-ssa-strength-reduction.c.  Quote some of the comment 
> here for the convenience:
> 
>     Specifically, we are interested in references for which
>     get_inner_reference returns a base address, offset, and bitpos as
>     follows:
> 
>       base:    MEM_REF (T1, C1)
>       offset:  MULT_EXPR (PLUS_EXPR (T2, C2), C3)
>       bitpos:  C4 * BITS_PER_UNIT
> 
>     Here T1 and T2 are arbitrary trees, and C1, C2, C3, C4 are
>     arbitrary integer constants.  Note that C2 may be zero, in which
>     case the offset will be MULT_EXPR (T2, C3).
> 
>     When this pattern is recognized, the original memory reference
>     can be replaced with:
> 
>       MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
>                C1 + (C2 * C3) + C4)
> 
>     which distributes the multiply to allow constant folding.  When
>     two or more addressing expressions can be represented by MEM_REFs
>     of this form, differing only in the constants C1, C2, and C4,
>     making this substitution produces more efficient addressing during
>     the RTL phases.  When there are not at least two expressions with
>     the same values of T1, T2, and C3, there is nothing to be gained
>     by the replacement.
> 
>     Strength reduction of CAND_REFs uses the same infrastructure as
>     that used by CAND_MULTs and CAND_ADDs.  We record T1 in the base (B)
>     field, MULT_EXPR (T2, C3) in the stride (S) field, and
>     C1 + (C2 * C3) + C4 in the index (i) field.  A basis for a CAND_REF
>     is thus another CAND_REF with the same B and S values.  When at
>     least two CAND_REFs are chained together using the basis relation,
>     each of them is replaced as above, resulting in improved code
>     generation for addressing.
> 
> As the last paragraphs says, a basis for a CAND_REF is another CAND_REF 
> with the same B and S values.  This patch extends the definition of 
> basis by allowing B's not to be the same but only differ by an immediate 
> constant.
> 
> >  After all slsr does not use tree-affine as representation for bases (which it could?)
> 
> In theory it could use tree-affine for bases and I had experimented this 
> approach as well, but encountered some unexpected re-association issue 
> when building spec2k, as when tree-affine is combined to tree, the 
> association order can be different from, or worse than, what was before 
> tree-affine.  I also didn't see any obvious benefit, so didn't proceed 
> further.
> 
> In the long run, the additional lowering of memory accesses you 
> mentioned in http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03731.html may 
> be a better solution to what I'm trying to tackle here.  I'll see if I 
> can get time to work out something useful for 4.10. :)

This can be pretty dicey as well.  I originally tried to address the
reference candidate commoning by doing earlier lowering of memory
references.  The problem I ran into was a premature loss of aliasing
information, resulting in much worse code generation in a number of
places.  We couldn't think of a good way to carry the lost aliasing
information forward that wasn't a complete bloody hack.  So I backed off
from that.  Richard then made the astute observation that this was a
special case of straight-line strength reduction, which GCC didn't
handle yet.  And that's how we ended up where we are...

I don't want to discourage you from looking at further lowering of
memory reference components, but be aware that you need to think
carefully about the TBAA issue from the start.  Otherwise the RTL phases
can end up much more constrained (particularly scheduling).
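
To make the TBAA concern concrete, here's a small source-level sketch
(illustrative only, with made-up names; this is not GCC internals and
not from the patch under discussion):

struct s { int count; float weight; };

/* With the typed references kept, the alias oracle knows the two stores
   below access different types (float vs. int) and can never alias, so
   later passes are free to reorder them.  If the references were lowered
   early into address arithmetic with a weaker access type, that
   disambiguation would be lost and RTL scheduling would have to keep the
   stores ordered.  */
void update (struct s *p, int *q)
{
  p->weight = 1.0f;   /* float access */
  *q = 5;             /* int access: never aliases p->weight under TBAA */
}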

Bill

> 
> Yufeng
>
Richard Biener Dec. 4, 2013, 10:26 a.m. UTC | #8
On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
>> >On 12/03/13 14:20, Richard Biener wrote:
>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>> >wrote:
>> >>> On 12/03/13 06:48, Jeff Law wrote:
>> >>>>
>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>> >>>>>
>> >>>>> Ping~
>> >>>>>
>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>> >>>>
>> >>>>
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Yufeng
>> >>>>>
>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>> >>>>>>
>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
>> >>>>>>>
>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
>> >>>>>>>>
>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>> >>>>>>>>>
>> >>>>>>>>> The second version of your original patch is ok with me with
>> >the
>> >>>>>>>>> following changes.  Sorry for the little side adventure into
>> >the
>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
>> >it
>> >>>>>>>>> helps in
>> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>> >also for
>> >>>>>>>>> cleaning up this version to be less intrusive to common
>> >interfaces; I
>> >>>>>>>>> appreciate it.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>> >with the
>> >>>>>>>> suggested changes incorporated.
>> >>>>>>>>
>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
>> >>>>>>>> experiment; it's
>> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>> >for
>> >>>>>>>> your prompt
>> >>>>>>>> replies and patience in guiding!
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>> >>>>>>>>> approval,
>> >>>>>>>>> as I'm not a maintainer.
>> >>>>
>> >>>> First a note, I need to check on voting for Bill as the slsr
>> >maintainer
>> >>>> from the steering committee.   Voting was in progress just before
>> >the
>> >>>> close of stage1 development so I haven't tallied the results :-)
>> >>>
>> >>>
>> >>> Looking forward to some good news! :)
>> >>>
>> >>>
>> >>>>>>
>> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>> >shared.
>> >>>>>>      The cached is introduced mainly because get_alternative_base
>> >() may
>> >>>>>> be
>> >>>>>> called twice on the same 'base' tree, once in the
>> >>>>>> find_basis_for_candidate () for look-up and the other time in
>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>> >happy
>> >>>>>> to leave out the cache if you think the benefit is trivial.
>> >>>>
>> >>>> Without some sense of how expensive the lookups are vs how often
>> >the
>> >>>> cache hits it's awful hard to know if the cache is worth it.
>> >>>>
>> >>>> I'd say take it out unless you have some sense it's really saving
>> >time.
>> >>>>     It's a pretty minor implementation detail either way.
>> >>>
>> >>>
>> >>> I think the affine tree routines are generally expensive; it is
>> >worth having
>> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
>> >tests
>> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>> >from
>> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>> >represent
>> >>> the real world scenario, but they do show the fact that the 'base'
>> >tree can
>> >>> be shared to some extent.  So I'd like to have the cache in the
>> >patch.
>> >>>
>> >>>
>> >>>>
>> >>>>>>
>> >>>>>>> +/* { dg-do compile } */
>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>> >>>>>>> +
>> >>>>>>> +typedef int arr_2[50][50];
>> >>>>>>> +
>> >>>>>>> +void foo (arr_2 a2, int v1)
>> >>>>>>> +{
>> >>>>>>> +  int i, j;
>> >>>>>>> +
>> >>>>>>> +  i = v1 + 5;
>> >>>>>>> +  j = i;
>> >>>>>>> +  a2 [i-10] [j] = 2;
>> >>>>>>> +  a2 [i] [j++] = i;
>> >>>>>>> +  a2 [i+20] [j++] = i;
>> >>>>>>> +  a2 [i-3] [i-1] += 1;
>> >>>>>>> +  return;
>> >>>>>>> +}
>> >>>>>>> +
>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>> >>>>>>>
>> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>> >>>>>>> checking which is bad, too.
>> >>>>>>
>> >>>>>>
>> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>> >to
>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>> >MEM_REFs
>> >>>>>> is an effective check.  Alternatively, I can add a follow-up
>> >patch to
>> >>>>>> add some dumping facility in replace_ref () to print out the
>> >replacing
>> >>>>>> actions when -fdump-tree-slsr-details is on.
>> >>>>
>> >>>> I think adding some details to the dump and scanning for them would
>> >be
>> >>>> better.  That's the only change that is required for this to move
>> >forward.
>> >>>
>> >>>
>> >>> I've updated to patch to dump more details when
>> >-fdump-tree-slsr-details is
>> >>> on.  The tests have also been updated to scan for these new dumps
>> >instead of
>> >>> MEMs.
>> >>>
>> >>>
>> >>>>
>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
>> >point.
>> >>>
>> >>>
>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
>> >succeeds?
>> >>
>> >> I still don't like it.  It's using the wrong and too expensive tools
>> >to do
>> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
>> >> the code it looks like we're having
>> >>
>> >>    /* Base expression for the chain of candidates:  often, but not
>> >>       always, an SSA name.  */
>> >>    tree base_expr;
>> >>
>> >> which isn't really too informative but I suppose they are all
>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
>> >> use get_addr_base_and_unit_offset in place of get_alternative_base
>> >(),
>> >> ignoring the returned offset.
>> >
>> >'base_expr' is essentially the base address of a handled_component_p,
>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>> >
>> >the object returned by get_inner_reference ().
>> >
>> >Given a test case like the following:
>> >
>> >typedef int arr_2[20][20];
>> >
>> >void foo (arr_2 a2, int i, int j)
>> >{
>> >   a2[i+10][j] = 1;
>> >   a2[i+10][j+1] = 1;
>> >   a2[i+20][j] = 1;
>> >}
>> >
>> >The IR before SLSR is (on x86_64):
>> >
>> >   _2 = (long unsigned int) i_1(D);
>> >   _3 = _2 * 80;
>> >   _4 = _3 + 800;
>> >   _6 = a2_5(D) + _4;
>> >   *_6[j_8(D)] = 1;
>> >   _10 = j_8(D) + 1;
>> >   *_6[_10] = 1;
>> >   _12 = _3 + 1600;
>> >   _13 = a2_5(D) + _12;
>> >   *_13[j_8(D)] = 1;
>> >
>> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
>> >_6, while the base_expr for a2[i+20][j] is _13.
>> >
>> >_13 is essentially (_6 + 800), so all of the three memory references
>> >essentially share the same base address.  As their strides are also the
>> >
>> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
>> >MEM_REFs.  What this patch does is to use the tree affine tools to help
>> >
>> >recognize the underlying base address expression; as it requires
>> >looking
>> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>> >won't help here.
>> >
>> >Bill has helped me exploit other ways of achieving this in SLSR, but so
>> >
>> >far we think this is the best way to proceed.  The use of tree affine
>> >routines has been restricted to CAND_REFs only and there is the
>> >aforementioned cache facility to help reduce the overhead.
>> >
>> >Thanks,
>> >Yufeng
>> >
>> >P.S. some more details what the patch does:
>> >
>> >The CAND_REF for the three memory references are:
>> >
>> >  6  [2] *_6[j_8(D)] = 1;
>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> >      basis: 0  dependent: 8  sibling: 0
>> >      next-interp: 0  dead-savings: 0
>> >
>> >   8  [2] *_6[_10] = 1;
>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>> >      basis: 6  dependent: 11  sibling: 0
>> >      next-interp: 0  dead-savings: 0
>> >
>> >  11  [2] *_13[j_8(D)] = 1;
>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> >      basis: 8  dependent: 0  sibling: 0
>> >      next-interp: 0  dead-savings: 0
>> >
>> >Before the patch, the strength reduction candidate chains for the three
>> >
>> >CAND_REFs are:
>> >
>> >   _6 -> 6 -> 8
>> >   _13 -> 11
>> >
>> >i.e. SLSR recognizes the first two references share the same basis,
>> >while the last one is on it own.
>> >
>> >With the patch, an extra candidate chain can be recognized:
>> >
>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
>> >
>> >i.e. all of the three references are found to have the same basis
>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>> >_6
>> >or _13, with the immediate offset removed.  The pass is now able to
>> >lower all of the three references, instead of the first two only, to
>> >MEM_REFs.
>>
>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
>
> I think that's overstating SLSR's current capabilities a bit. :)  We do
> use get_inner_reference to come up with the base expression for
> reference candidates (based on some of your suggestions a couple of
> years back).  However, in the case of multiple levels of array
> references, we miss opportunities because get_inner_reference stops at
> an SSA name that could be further expanded by following its definition
> back to a more fundamental base expression.

Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
same problem.

> Part of the issue here is that reference candidates are basis for a more
> specific optimization than the mult and add candidates.  The latter have
> a more general framework for building up a recording of simple affine
> expressions that can be strength-reduced.  Ultimately we ought to be
> able to do something similar for reference candidates, building up
> simple affine expressions from base expressions, so that everything is
> done in a forward order and the tree-affine interfaces aren't needed.
> But that will take some more fundamental design changes, and since this
> provides some good improvements for important cases, I feel it's
> reasonable to get this into the release.

But I fail to see what is special about doing the dance to affine form
and back to trees just to drop the constant offset; get_inner_reference
would do that as well, and more cheaply, if you simply ignore bitpos.

?!

Richard.

> Thanks,
> Bill
>
>>
>> Richard.
>>
>>
>
Richard Biener Dec. 4, 2013, 10:30 a.m. UTC | #9
On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
>> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>>> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
>>> >On 12/03/13 14:20, Richard Biener wrote:
>>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>>> >wrote:
>>> >>> On 12/03/13 06:48, Jeff Law wrote:
>>> >>>>
>>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>> >>>>>
>>> >>>>> Ping~
>>> >>>>>
>>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Yufeng
>>> >>>>>
>>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>> >>>>>>
>>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>> >>>>>>>
>>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
>>> >>>>>>>>
>>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> The second version of your original patch is ok with me with
>>> >the
>>> >>>>>>>>> following changes.  Sorry for the little side adventure into
>>> >the
>>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
>>> >it
>>> >>>>>>>>> helps in
>>> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>>> >also for
>>> >>>>>>>>> cleaning up this version to be less intrusive to common
>>> >interfaces; I
>>> >>>>>>>>> appreciate it.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>>> >with the
>>> >>>>>>>> suggested changes incorporated.
>>> >>>>>>>>
>>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
>>> >>>>>>>> experiment; it's
>>> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>>> >for
>>> >>>>>>>> your prompt
>>> >>>>>>>> replies and patience in guiding!
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>> >>>>>>>>> approval,
>>> >>>>>>>>> as I'm not a maintainer.
>>> >>>>
>>> >>>> First a note, I need to check on voting for Bill as the slsr
>>> >maintainer
>>> >>>> from the steering committee.   Voting was in progress just before
>>> >the
>>> >>>> close of stage1 development so I haven't tallied the results :-)
>>> >>>
>>> >>>
>>> >>> Looking forward to some good news! :)
>>> >>>
>>> >>>
>>> >>>>>>
>>> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>>> >shared.
>>> >>>>>>      The cached is introduced mainly because get_alternative_base
>>> >() may
>>> >>>>>> be
>>> >>>>>> called twice on the same 'base' tree, once in the
>>> >>>>>> find_basis_for_candidate () for look-up and the other time in
>>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>>> >happy
>>> >>>>>> to leave out the cache if you think the benefit is trivial.
>>> >>>>
>>> >>>> Without some sense of how expensive the lookups are vs how often
>>> >the
>>> >>>> cache hits it's awful hard to know if the cache is worth it.
>>> >>>>
>>> >>>> I'd say take it out unless you have some sense it's really saving
>>> >time.
>>> >>>>     It's a pretty minor implementation detail either way.
>>> >>>
>>> >>>
>>> >>> I think the affine tree routines are generally expensive; it is
>>> >worth having
>>> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
>>> >tests
>>> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>>> >from
>>> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>>> >represent
>>> >>> the real world scenario, but they do show the fact that the 'base'
>>> >tree can
>>> >>> be shared to some extent.  So I'd like to have the cache in the
>>> >patch.
>>> >>>
>>> >>>
>>> >>>>
>>> >>>>>>
>>> >>>>>>> +/* { dg-do compile } */
>>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>> >>>>>>> +
>>> >>>>>>> +typedef int arr_2[50][50];
>>> >>>>>>> +
>>> >>>>>>> +void foo (arr_2 a2, int v1)
>>> >>>>>>> +{
>>> >>>>>>> +  int i, j;
>>> >>>>>>> +
>>> >>>>>>> +  i = v1 + 5;
>>> >>>>>>> +  j = i;
>>> >>>>>>> +  a2 [i-10] [j] = 2;
>>> >>>>>>> +  a2 [i] [j++] = i;
>>> >>>>>>> +  a2 [i+20] [j++] = i;
>>> >>>>>>> +  a2 [i-3] [i-1] += 1;
>>> >>>>>>> +  return;
>>> >>>>>>> +}
>>> >>>>>>> +
>>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>> >>>>>>>
>>> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>> >>>>>>> checking which is bad, too.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>>> >to
>>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>>> >MEM_REFs
>>> >>>>>> is an effective check.  Alternatively, I can add a follow-up
>>> >patch to
>>> >>>>>> add some dumping facility in replace_ref () to print out the
>>> >replacing
>>> >>>>>> actions when -fdump-tree-slsr-details is on.
>>> >>>>
>>> >>>> I think adding some details to the dump and scanning for them would
>>> >be
>>> >>>> better.  That's the only change that is required for this to move
>>> >forward.
>>> >>>
>>> >>>
>>> >>> I've updated to patch to dump more details when
>>> >-fdump-tree-slsr-details is
>>> >>> on.  The tests have also been updated to scan for these new dumps
>>> >instead of
>>> >>> MEMs.
>>> >>>
>>> >>>
>>> >>>>
>>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
>>> >point.
>>> >>>
>>> >>>
>>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
>>> >succeeds?
>>> >>
>>> >> I still don't like it.  It's using the wrong and too expensive tools
>>> >to do
>>> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
>>> >> the code it looks like we're having
>>> >>
>>> >>    /* Base expression for the chain of candidates:  often, but not
>>> >>       always, an SSA name.  */
>>> >>    tree base_expr;
>>> >>
>>> >> which isn't really too informative but I suppose they are all
>>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
>>> >> use get_addr_base_and_unit_offset in place of get_alternative_base
>>> >(),
>>> >> ignoring the returned offset.
>>> >
>>> >'base_expr' is essentially the base address of a handled_component_p,
>>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>>> >
>>> >the object returned by get_inner_reference ().
>>> >
>>> >Given a test case like the following:
>>> >
>>> >typedef int arr_2[20][20];
>>> >
>>> >void foo (arr_2 a2, int i, int j)
>>> >{
>>> >   a2[i+10][j] = 1;
>>> >   a2[i+10][j+1] = 1;
>>> >   a2[i+20][j] = 1;
>>> >}
>>> >
>>> >The IR before SLSR is (on x86_64):
>>> >
>>> >   _2 = (long unsigned int) i_1(D);
>>> >   _3 = _2 * 80;
>>> >   _4 = _3 + 800;
>>> >   _6 = a2_5(D) + _4;
>>> >   *_6[j_8(D)] = 1;
>>> >   _10 = j_8(D) + 1;
>>> >   *_6[_10] = 1;
>>> >   _12 = _3 + 1600;
>>> >   _13 = a2_5(D) + _12;
>>> >   *_13[j_8(D)] = 1;
>>> >
>>> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
>>> >_6, while the base_expr for a2[i+20][j] is _13.
>>> >
>>> >_13 is essentially (_6 + 800), so all of the three memory references
>>> >essentially share the same base address.  As their strides are also the
>>> >
>>> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
>>> >MEM_REFs.  What this patch does is to use the tree affine tools to help
>>> >
>>> >recognize the underlying base address expression; as it requires
>>> >looking
>>> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>>> >won't help here.
>>> >
>>> >Bill has helped me exploit other ways of achieving this in SLSR, but so
>>> >
>>> >far we think this is the best way to proceed.  The use of tree affine
>>> >routines has been restricted to CAND_REFs only and there is the
>>> >aforementioned cache facility to help reduce the overhead.
>>> >
>>> >Thanks,
>>> >Yufeng
>>> >
>>> >P.S. some more details what the patch does:
>>> >
>>> >The CAND_REF for the three memory references are:
>>> >
>>> >  6  [2] *_6[j_8(D)] = 1;
>>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>> >      basis: 0  dependent: 8  sibling: 0
>>> >      next-interp: 0  dead-savings: 0
>>> >
>>> >   8  [2] *_6[_10] = 1;
>>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>>> >      basis: 6  dependent: 11  sibling: 0
>>> >      next-interp: 0  dead-savings: 0
>>> >
>>> >  11  [2] *_13[j_8(D)] = 1;
>>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>> >      basis: 8  dependent: 0  sibling: 0
>>> >      next-interp: 0  dead-savings: 0
>>> >
>>> >Before the patch, the strength reduction candidate chains for the three
>>> >
>>> >CAND_REFs are:
>>> >
>>> >   _6 -> 6 -> 8
>>> >   _13 -> 11
>>> >
>>> >i.e. SLSR recognizes the first two references share the same basis,
>>> >while the last one is on it own.
>>> >
>>> >With the patch, an extra candidate chain can be recognized:
>>> >
>>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
>>> >
>>> >i.e. all of the three references are found to have the same basis
>>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>>> >_6
>>> >or _13, with the immediate offset removed.  The pass is now able to
>>> >lower all of the three references, instead of the first two only, to
>>> >MEM_REFs.
>>>
>>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
>>
>> I think that's overstating SLSR's current capabilities a bit. :)  We do
>> use get_inner_reference to come up with the base expression for
>> reference candidates (based on some of your suggestions a couple of
>> years back).  However, in the case of multiple levels of array
>> references, we miss opportunities because get_inner_reference stops at
>> an SSA name that could be further expanded by following its definition
>> back to a more fundamental base expression.
>
> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> same problem.

Oh, you're using affine combination expansion ... which is even more
expensive.  So why isn't that then done for all ref candidates?  That is,
why do two different things, the get_inner_reference _and_ the
affine-combination dances?  And why build trees back from that instead
of storing the affine combination?

I'll bet we come back with compile-time issues after this patch goes in.
I'll count on you two to fix them then.

Richard.

>> Part of the issue here is that reference candidates are basis for a more
>> specific optimization than the mult and add candidates.  The latter have
>> a more general framework for building up a recording of simple affine
>> expressions that can be strength-reduced.  Ultimately we ought to be
>> able to do something similar for reference candidates, building up
>> simple affine expressions from base expressions, so that everything is
>> done in a forward order and the tree-affine interfaces aren't needed.
>> But that will take some more fundamental design changes, and since this
>> provides some good improvements for important cases, I feel it's
>> reasonable to get this into the release.
>
> But I fail to see what is special about doing the dance to affine form
> and back to trees just to drop the constant offset; get_inner_reference
> would do that as well, and more cheaply, if you simply ignore bitpos.
>
> ?!
>
> Richard.
>
>> Thanks,
>> Bill
>>
>>>
>>> Richard.
>>>
>>>
>>
Yufeng Zhang Dec. 4, 2013, 11:32 a.m. UTC | #10
On 12/04/13 10:30, Richard Biener wrote:
> On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
> <richard.guenther@gmail.com>  wrote:
>> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
>> <wschmidt@linux.vnet.ibm.com>  wrote:
>>> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>>>> Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
>>>>> On 12/03/13 14:20, Richard Biener wrote:
>>>>>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>>>>> wrote:
>>>>>>> On 12/03/13 06:48, Jeff Law wrote:
>>>>>>>>
>>>>>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>>>>>>>
>>>>>>>>> Ping~
>>>>>>>>>
>>>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yufeng
>>>>>>>>>
>>>>>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second version of your original patch is ok with me with
>>>>> the
>>>>>>>>>>>>> following changes.  Sorry for the little side adventure into
>>>>> the
>>>>>>>>>>>>> next-interp logic; in the end that's going to hurt more than
>>>>> it
>>>>>>>>>>>>> helps in
>>>>>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>>>>> also for
>>>>>>>>>>>>> cleaning up this version to be less intrusive to common
>>>>> interfaces; I
>>>>>>>>>>>>> appreciate it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>>>>> with the
>>>>>>>>>>>> suggested changes incorporated.
>>>>>>>>>>>>
>>>>>>>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>>>>>>>> experiment; it's
>>>>>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>>>>> for
>>>>>>>>>>>> your prompt
>>>>>>>>>>>> replies and patience in guiding!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>>>>>>>> approval,
>>>>>>>>>>>>> as I'm not a maintainer.
>>>>>>>>
>>>>>>>> First a note, I need to check on voting for Bill as the slsr
>>>>> maintainer
>>>>>>>> from the steering committee.   Voting was in progress just before
>>>>> the
>>>>>>>> close of stage1 development so I haven't tallied the results :-)
>>>>>>>
>>>>>>>
>>>>>>> Looking forward to some good news! :)
>>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>>>>> shared.
>>>>>>>>>>       The cached is introduced mainly because get_alternative_base
>>>>> () may
>>>>>>>>>> be
>>>>>>>>>> called twice on the same 'base' tree, once in the
>>>>>>>>>> find_basis_for_candidate () for look-up and the other time in
>>>>>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>>>>> happy
>>>>>>>>>> to leave out the cache if you think the benefit is trivial.
>>>>>>>>
>>>>>>>> Without some sense of how expensive the lookups are vs how often
>>>>> the
>>>>>>>> cache hits it's awful hard to know if the cache is worth it.
>>>>>>>>
>>>>>>>> I'd say take it out unless you have some sense it's really saving
>>>>> time.
>>>>>>>>      It's a pretty minor implementation detail either way.
>>>>>>>
>>>>>>>
>>>>>>> I think the affine tree routines are generally expensive; it is
>>>>> worth having
>>>>>>> a cache to avoid calling them too many times.  I run the slsr-*.c
>>>>> tests
>>>>>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>>>>> from
>>>>>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>>>>> represent
>>>>>>> the real world scenario, but they do show the fact that the 'base'
>>>>> tree can
>>>>>>> be shared to some extent.  So I'd like to have the cache in the
>>>>> patch.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> +/* { dg-do compile } */
>>>>>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>>>>>>>> +
>>>>>>>>>>> +typedef int arr_2[50][50];
>>>>>>>>>>> +
>>>>>>>>>>> +void foo (arr_2 a2, int v1)
>>>>>>>>>>> +{
>>>>>>>>>>> +  int i, j;
>>>>>>>>>>> +
>>>>>>>>>>> +  i = v1 + 5;
>>>>>>>>>>> +  j = i;
>>>>>>>>>>> +  a2 [i-10] [j] = 2;
>>>>>>>>>>> +  a2 [i] [j++] = i;
>>>>>>>>>>> +  a2 [i+20] [j++] = i;
>>>>>>>>>>> +  a2 [i-3] [i-1] += 1;
>>>>>>>>>>> +  return;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>>>>>>>
>>>>>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>>>>>>>> checking which is bad, too.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>>>>> to
>>>>>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>>>>> MEM_REFs
>>>>>>>>>> is an effective check.  Alternatively, I can add a follow-up
>>>>> patch to
>>>>>>>>>> add some dumping facility in replace_ref () to print out the
>>>>> replacing
>>>>>>>>>> actions when -fdump-tree-slsr-details is on.
>>>>>>>>
>>>>>>>> I think adding some details to the dump and scanning for them would
>>>>> be
>>>>>>>> better.  That's the only change that is required for this to move
>>>>> forward.
>>>>>>>
>>>>>>>
>>>>>>> I've updated to patch to dump more details when
>>>>> -fdump-tree-slsr-details is
>>>>>>> on.  The tests have also been updated to scan for these new dumps
>>>>> instead of
>>>>>>> MEMs.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I suggest doing it quickly.  We're well past stage1 close at this
>>>>> point.
>>>>>>>
>>>>>>>
>>>>>>> The bootstrapping on x86_64 is still running.  OK to commit if it
>>>>> succeeds?
>>>>>>
>>>>>> I still don't like it.  It's using the wrong and too expensive tools
>>>>> to do
>>>>>> stuff.  What kind of bases are we ultimately interested in?  Browsing
>>>>>> the code it looks like we're having
>>>>>>
>>>>>>     /* Base expression for the chain of candidates:  often, but not
>>>>>>        always, an SSA name.  */
>>>>>>     tree base_expr;
>>>>>>
>>>>>> which isn't really too informative but I suppose they are all
>>>>>> kind-of-gimple_val()s?  That said, I wonder if you can simply
>>>>>> use get_addr_base_and_unit_offset in place of get_alternative_base
>>>>> (),
>>>>>> ignoring the returned offset.
>>>>>
>>>>> 'base_expr' is essentially the base address of a handled_component_p,
>>>>> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>>>>>
>>>>> the object returned by get_inner_reference ().
>>>>>
>>>>> Given a test case like the following:
>>>>>
>>>>> typedef int arr_2[20][20];
>>>>>
>>>>> void foo (arr_2 a2, int i, int j)
>>>>> {
>>>>>    a2[i+10][j] = 1;
>>>>>    a2[i+10][j+1] = 1;
>>>>>    a2[i+20][j] = 1;
>>>>> }
>>>>>
>>>>> The IR before SLSR is (on x86_64):
>>>>>
>>>>>    _2 = (long unsigned int) i_1(D);
>>>>>    _3 = _2 * 80;
>>>>>    _4 = _3 + 800;
>>>>>    _6 = a2_5(D) + _4;
>>>>>    *_6[j_8(D)] = 1;
>>>>>    _10 = j_8(D) + 1;
>>>>>    *_6[_10] = 1;
>>>>>    _12 = _3 + 1600;
>>>>>    _13 = a2_5(D) + _12;
>>>>>    *_13[j_8(D)] = 1;
>>>>>
>>>>> The base_expr for the 1st and 2nd memory reference are the same, i.e.
>>>>> _6, while the base_expr for a2[i+20][j] is _13.
>>>>>
>>>>> _13 is essentially (_6 + 800), so all of the three memory references
>>>>> essentially share the same base address.  As their strides are also the
>>>>>
>>>>> same (MULT_EXPR (j, 4)), the three references can all be lowered to
>>>>> MEM_REFs.  What this patch does is to use the tree affine tools to help
>>>>>
>>>>> recognize the underlying base address expression; as it requires
>>>>> looking
>>>>> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>>>>> won't help here.
>>>>>
>>>>> Bill has helped me exploit other ways of achieving this in SLSR, but so
>>>>>
>>>>> far we think this is the best way to proceed.  The use of tree affine
>>>>> routines has been restricted to CAND_REFs only and there is the
>>>>> aforementioned cache facility to help reduce the overhead.
>>>>>
>>>>> Thanks,
>>>>> Yufeng
>>>>>
>>>>> P.S. some more details what the patch does:
>>>>>
>>>>> The CAND_REF for the three memory references are:
>>>>>
>>>>>   6  [2] *_6[j_8(D)] = 1;
>>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>>>>       basis: 0  dependent: 8  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>>    8  [2] *_6[_10] = 1;
>>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>>>>>       basis: 6  dependent: 11  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>>   11  [2] *_13[j_8(D)] = 1;
>>>>>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>>>>       basis: 8  dependent: 0  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>> Before the patch, the strength reduction candidate chains for the three
>>>>>
>>>>> CAND_REFs are:
>>>>>
>>>>>    _6 ->  6 ->  8
>>>>>    _13 ->  11
>>>>>
>>>>> i.e. SLSR recognizes the first two references share the same basis,
>>>>> while the last one is on it own.
>>>>>
>>>>> With the patch, an extra candidate chain can be recognized:
>>>>>
>>>>>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
>>>>>
>>>>> i.e. all of the three references are found to have the same basis
>>>>> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>>>>> _6
>>>>> or _13, with the immediate offset removed.  The pass is now able to
>>>>> lower all of the three references, instead of the first two only, to
>>>>> MEM_REFs.
>>>>
>>>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
>>>
>>> I think that's overstating SLSR's current capabilities a bit. :)  We do
>>> use get_inner_reference to come up with the base expression for
>>> reference candidates (based on some of your suggestions a couple of
>>> years back).  However, in the case of multiple levels of array
>>> references, we miss opportunities because get_inner_reference stops at
>>> an SSA name that could be further expanded by following its definition
>>> back to a more fundamental base expression.
>>
>> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
>> same problem.
>
> Oh, you're using affine combination expansion ... which is even more
> expensive.  So why isn't that then done for all ref candidates?  That is,
> why do two different things, the get_inner_reference _and_ the
> affine-combination dances?

The affine-combination expansion is only applied where get_inner_reference
stops (at an SSA_NAME), rather than to the whole address.
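
As a rough stand-alone model of that division of labor (plain C for
illustration; the toy data structure and all names are invented, this
is not GCC code):

#include <stdio.h>

/* Toy definition table: name = defs[name].base + defs[name].offset,
   with base == -1 marking a leaf such as "a2 + i*80".  The reference
   itself has already been taken apart by get_inner_reference (); only
   the SSA-like name it returned as the base is expanded here.  */
struct def { int base; long offset; };

/* Peel plus-constant definitions and return the underlying leaf,
   ignoring the accumulated constant: the "alternative" base.  */
static int
alternative_base (const struct def *defs, int name)
{
  while (defs[name].base != -1)
    name = defs[name].base;
  return name;
}

int
main (void)
{
  /* Modelling the earlier example: 0 is the leaf "a2 + i*80",
     1 is _6 = leaf + 800, 2 is _13 = leaf + 1600.  */
  const struct def defs[] = { { -1, 0 }, { 0, 800 }, { 0, 1600 } };

  printf ("_6  reduces to leaf %d\n", alternative_base (defs, 1));
  printf ("_13 reduces to leaf %d\n", alternative_base (defs, 2));
  /* Both print leaf 0: the shared base that lets _6 and _13 end up in
     the same basis chain.  */
  return 0;
}

In the actual patch the expansion is done with the tree-affine machinery
and the result is cached, but the effect on the example above is the same.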

> And why build trees back from that instead of storing the
> affine combination?

I think we can store the affine combination instead, but it would need 
changes to the infrastructure in slsr.

>
> I'll bet we come back with compile-time issues after this patch
> goes in.  I'll count on you two to fix them then.

I'll do some timing this week; if the compile-time results are not good,
I'll revert the patch.

Yufeng
Bill Schmidt Dec. 4, 2013, 1:08 p.m. UTC | #11
On Wed, 2013-12-04 at 11:26 +0100, Richard Biener wrote:
> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
> > On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> >> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> >> >On 12/03/13 14:20, Richard Biener wrote:
> >> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> >> >wrote:
> >> >>> On 12/03/13 06:48, Jeff Law wrote:
> >> >>>>
> >> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >> >>>>>
> >> >>>>> Ping~
> >> >>>>>
> >> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >> >>>>
> >> >>>>
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>> Yufeng
> >> >>>>>
> >> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >> >>>>>>
> >> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >> >>>>>>>
> >> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
> >> >>>>>>>>
> >> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> The second version of your original patch is ok with me with
> >> >the
> >> >>>>>>>>> following changes.  Sorry for the little side adventure into
> >> >the
> >> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >> >it
> >> >>>>>>>>> helps in
> >> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >> >also for
> >> >>>>>>>>> cleaning up this version to be less intrusive to common
> >> >interfaces; I
> >> >>>>>>>>> appreciate it.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >> >with the
> >> >>>>>>>> suggested changes incorporated.
> >> >>>>>>>>
> >> >>>>>>>> For the next-interp adventure, I was quite happy to do the
> >> >>>>>>>> experiment; it's
> >> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >> >for
> >> >>>>>>>> your prompt
> >> >>>>>>>> replies and patience in guiding!
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >> >>>>>>>>> approval,
> >> >>>>>>>>> as I'm not a maintainer.
> >> >>>>
> >> >>>> First a note, I need to check on voting for Bill as the slsr
> >> >maintainer
> >> >>>> from the steering committee.   Voting was in progress just before
> >> >the
> >> >>>> close of stage1 development so I haven't tallied the results :-)
> >> >>>
> >> >>>
> >> >>> Looking forward to some good news! :)
> >> >>>
> >> >>>
> >> >>>>>>
> >> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >> >shared.
> >> >>>>>>      The cached is introduced mainly because get_alternative_base
> >> >() may
> >> >>>>>> be
> >> >>>>>> called twice on the same 'base' tree, once in the
> >> >>>>>> find_basis_for_candidate () for look-up and the other time in
> >> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >> >happy
> >> >>>>>> to leave out the cache if you think the benefit is trivial.
> >> >>>>
> >> >>>> Without some sense of how expensive the lookups are vs how often
> >> >the
> >> >>>> cache hits it's awful hard to know if the cache is worth it.
> >> >>>>
> >> >>>> I'd say take it out unless you have some sense it's really saving
> >> >time.
> >> >>>>     It's a pretty minor implementation detail either way.
> >> >>>
> >> >>>
> >> >>> I think the affine tree routines are generally expensive; it is
> >> >worth having
> >> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >> >tests
> >> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >> >from
> >> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >> >represent
> >> >>> the real world scenario, but they do show the fact that the 'base'
> >> >tree can
> >> >>> be shared to some extent.  So I'd like to have the cache in the
> >> >patch.
> >> >>>
> >> >>>
> >> >>>>
> >> >>>>>>
> >> >>>>>>> +/* { dg-do compile } */
> >> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >> >>>>>>> +
> >> >>>>>>> +typedef int arr_2[50][50];
> >> >>>>>>> +
> >> >>>>>>> +void foo (arr_2 a2, int v1)
> >> >>>>>>> +{
> >> >>>>>>> +  int i, j;
> >> >>>>>>> +
> >> >>>>>>> +  i = v1 + 5;
> >> >>>>>>> +  j = i;
> >> >>>>>>> +  a2 [i-10] [j] = 2;
> >> >>>>>>> +  a2 [i] [j++] = i;
> >> >>>>>>> +  a2 [i+20] [j++] = i;
> >> >>>>>>> +  a2 [i-3] [i-1] += 1;
> >> >>>>>>> +  return;
> >> >>>>>>> +}
> >> >>>>>>> +
> >> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >> >>>>>>>
> >> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >> >>>>>>> checking which is bad, too.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >> >to
> >> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >> >MEM_REFs
> >> >>>>>> is an effective check.  Alternatively, I can add a follow-up
> >> >patch to
> >> >>>>>> add some dumping facility in replace_ref () to print out the
> >> >replacing
> >> >>>>>> actions when -fdump-tree-slsr-details is on.
> >> >>>>
> >> >>>> I think adding some details to the dump and scanning for them would
> >> >be
> >> >>>> better.  That's the only change that is required for this to move
> >> >forward.
> >> >>>
> >> >>>
> >> >>> I've updated to patch to dump more details when
> >> >-fdump-tree-slsr-details is
> >> >>> on.  The tests have also been updated to scan for these new dumps
> >> >instead of
> >> >>> MEMs.
> >> >>>
> >> >>>
> >> >>>>
> >> >>>> I suggest doing it quickly.  We're well past stage1 close at this
> >> >point.
> >> >>>
> >> >>>
> >> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >> >succeeds?
> >> >>
> >> >> I still don't like it.  It's using the wrong and too expensive tools
> >> >to do
> >> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >> >> the code it looks like we're having
> >> >>
> >> >>    /* Base expression for the chain of candidates:  often, but not
> >> >>       always, an SSA name.  */
> >> >>    tree base_expr;
> >> >>
> >> >> which isn't really too informative but I suppose they are all
> >> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >> >> use get_addr_base_and_unit_offset in place of get_alternative_base
> >> >(),
> >> >> ignoring the returned offset.
> >> >
> >> >'base_expr' is essentially the base address of a handled_component_p,
> >> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> >> >
> >> >the object returned by get_inner_reference ().
> >> >
> >> >Given a test case like the following:
> >> >
> >> >typedef int arr_2[20][20];
> >> >
> >> >void foo (arr_2 a2, int i, int j)
> >> >{
> >> >   a2[i+10][j] = 1;
> >> >   a2[i+10][j+1] = 1;
> >> >   a2[i+20][j] = 1;
> >> >}
> >> >
> >> >The IR before SLSR is (on x86_64):
> >> >
> >> >   _2 = (long unsigned int) i_1(D);
> >> >   _3 = _2 * 80;
> >> >   _4 = _3 + 800;
> >> >   _6 = a2_5(D) + _4;
> >> >   *_6[j_8(D)] = 1;
> >> >   _10 = j_8(D) + 1;
> >> >   *_6[_10] = 1;
> >> >   _12 = _3 + 1600;
> >> >   _13 = a2_5(D) + _12;
> >> >   *_13[j_8(D)] = 1;
> >> >
> >> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
> >> >_6, while the base_expr for a2[i+20][j] is _13.
> >> >
> >> >_13 is essentially (_6 + 800), so all of the three memory references
> >> >essentially share the same base address.  As their strides are also the
> >> >
> >> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
> >> >MEM_REFs.  What this patch does is to use the tree affine tools to help
> >> >
> >> >recognize the underlying base address expression; as it requires
> >> >looking
> >> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> >> >won't help here.
> >> >
> >> >Bill has helped me exploit other ways of achieving this in SLSR, but so
> >> >
> >> >far we think this is the best way to proceed.  The use of tree affine
> >> >routines has been restricted to CAND_REFs only and there is the
> >> >aforementioned cache facility to help reduce the overhead.
> >> >
> >> >Thanks,
> >> >Yufeng
> >> >
> >> >P.S. some more details what the patch does:
> >> >
> >> >The CAND_REF for the three memory references are:
> >> >
> >> >  6  [2] *_6[j_8(D)] = 1;
> >> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >> >      basis: 0  dependent: 8  sibling: 0
> >> >      next-interp: 0  dead-savings: 0
> >> >
> >> >   8  [2] *_6[_10] = 1;
> >> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >> >      basis: 6  dependent: 11  sibling: 0
> >> >      next-interp: 0  dead-savings: 0
> >> >
> >> >  11  [2] *_13[j_8(D)] = 1;
> >> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >> >      basis: 8  dependent: 0  sibling: 0
> >> >      next-interp: 0  dead-savings: 0
> >> >
> >> >Before the patch, the strength reduction candidate chains for the three
> >> >
> >> >CAND_REFs are:
> >> >
> >> >   _6 -> 6 -> 8
> >> >   _13 -> 11
> >> >
> >> >i.e. SLSR recognizes the first two references share the same basis,
> >> >while the last one is on it own.
> >> >
> >> >With the patch, an extra candidate chain can be recognized:
> >> >
> >> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
> >> >
> >> >i.e. all of the three references are found to have the same basis
> >> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >> >_6
> >> >or _13, with the immediate offset removed.  The pass is now able to
> >> >lower all of the three references, instead of the first two only, to
> >> >MEM_REFs.
> >>
> >> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
> >
> > I think that's overstating SLSR's current capabilities a bit. :)  We do
> > use get_inner_reference to come up with the base expression for
> > reference candidates (based on some of your suggestions a couple of
> > years back).  However, in the case of multiple levels of array
> > references, we miss opportunities because get_inner_reference stops at
> > an SSA name that could be further expanded by following its definition
> > back to a more fundamental base expression.
> 
> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> same problem.
> 
> > Part of the issue here is that reference candidates are basis for a more
> > specific optimization than the mult and add candidates.  The latter have
> > a more general framework for building up a recording of simple affine
> > expressions that can be strength-reduced.  Ultimately we ought to be
> > able to do something similar for reference candidates, building up
> > simple affine expressions from base expressions, so that everything is
> > done in a forward order and the tree-affine interfaces aren't needed.
> > But that will take some more fundamental design changes, and since this
> > provides some good improvements for important cases, I feel it's
> > reasonable to get this into the release.
> 
> But I fail to see what is special about doing the dance to affine form
> and back to trees just to drop the constant offset; get_inner_reference
> would do that as well, and more cheaply, if you simply ignore bitpos.

I'm not sure what you're suggesting he use get_inner_reference on at
this point.  By the time the affine machinery is invoked, the memory
reference has already been expanded with get_inner_reference, and no
basis was found involving the SSA name produced as the base.  The affine
machinery is invoked on that SSA name to see if it is hiding another
base.  There's no additional memory reference to use get_inner_reference
on, just potentially some pointer arithmetic.
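
To make that concrete, here is the situation at the source level (an
illustrative sketch with invented names, patterned on the earlier
example; it does not describe any GCC interface):

void hidden_base (int (*a)[20], long i, int j)
{
  int (*p)[20] = a + i + 10;   /* like _6  = a2_5(D) + i*80 + 800  */
  int (*q)[20] = a + i + 20;   /* like _13 = a2_5(D) + i*80 + 1600 */
  (*p)[j] = 1;                 /* the reference get_inner_reference sees */
  (*q)[j] = 1;                 /* same underlying base: a + i            */
}

The references (*p)[j] and (*q)[j] are what get_inner_reference takes
apart; it stops at the names holding p and q.  The commonality is hidden
entirely in the pointer arithmetic that defines them, and that is what
the affine expansion follows.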

That said, if we have real compile-time issues, we should hold off on
this patch for this release.

Yufeng, please time some reasonably large benchmarks (some version of
SPECint or similar) and report back here before the patch goes in.

I will respond in a different part of the thread about the real
underlying problem that needs to be solved in a more general way.

Thanks,
Bill

> 
> ?!
> 
> Richard.
> 
> > Thanks,
> > Bill
> >
> >>
> >> Richard.
> >>
> >>
> >
>
Bill Schmidt Dec. 4, 2013, 1:13 p.m. UTC | #12
On Wed, 2013-12-04 at 11:30 +0100, Richard Biener wrote:
> On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
> > On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> > <wschmidt@linux.vnet.ibm.com> wrote:
> >> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> >>> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> >>> >On 12/03/13 14:20, Richard Biener wrote:
> >>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> >>> >wrote:
> >>> >>> On 12/03/13 06:48, Jeff Law wrote:
> >>> >>>>
> >>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>> >>>>>
> >>> >>>>> Ping~
> >>> >>>>>
> >>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>> >>>>
> >>> >>>>
> >>> >>>>>
> >>> >>>>> Thanks,
> >>> >>>>> Yufeng
> >>> >>>>>
> >>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>> >>>>>>
> >>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>> >>>>>>>
> >>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >>> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
> >>> >>>>>>>>
> >>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>> >>>>>>>>>
> >>> >>>>>>>>> The second version of your original patch is ok with me with
> >>> >the
> >>> >>>>>>>>> following changes.  Sorry for the little side adventure into
> >>> >the
> >>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >>> >it
> >>> >>>>>>>>> helps in
> >>> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >>> >also for
> >>> >>>>>>>>> cleaning up this version to be less intrusive to common
> >>> >interfaces; I
> >>> >>>>>>>>> appreciate it.
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >>> >with the
> >>> >>>>>>>> suggested changes incorporated.
> >>> >>>>>>>>
> >>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>> >>>>>>>> experiment; it's
> >>> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >>> >for
> >>> >>>>>>>> your prompt
> >>> >>>>>>>> replies and patience in guiding!
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >>> >>>>>>>>> approval,
> >>> >>>>>>>>> as I'm not a maintainer.
> >>> >>>>
> >>> >>>> First a note, I need to check on voting for Bill as the slsr
> >>> >maintainer
> >>> >>>> from the steering committee.   Voting was in progress just before
> >>> >the
> >>> >>>> close of stage1 development so I haven't tallied the results :-)
> >>> >>>
> >>> >>>
> >>> >>> Looking forward to some good news! :)
> >>> >>>
> >>> >>>
> >>> >>>>>>
> >>> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >>> >shared.
> >>> >>>>>>      The cached is introduced mainly because get_alternative_base
> >>> >() may
> >>> >>>>>> be
> >>> >>>>>> called twice on the same 'base' tree, once in the
> >>> >>>>>> find_basis_for_candidate () for look-up and the other time in
> >>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >>> >happy
> >>> >>>>>> to leave out the cache if you think the benefit is trivial.
> >>> >>>>
> >>> >>>> Without some sense of how expensive the lookups are vs how often
> >>> >the
> >>> >>>> cache hits it's awful hard to know if the cache is worth it.
> >>> >>>>
> >>> >>>> I'd say take it out unless you have some sense it's really saving
> >>> >time.
> >>> >>>>     It's a pretty minor implementation detail either way.
> >>> >>>
> >>> >>>
> >>> >>> I think the affine tree routines are generally expensive; it is
> >>> >worth having
> >>> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >>> >tests
> >>> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >>> >from
> >>> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >>> >represent
> >>> >>> the real world scenario, but they do show the fact that the 'base'
> >>> >tree can
> >>> >>> be shared to some extent.  So I'd like to have the cache in the
> >>> >patch.
> >>> >>>
> >>> >>>
> >>> >>>>
> >>> >>>>>>
> >>> >>>>>>> +/* { dg-do compile } */
> >>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>> >>>>>>> +
> >>> >>>>>>> +typedef int arr_2[50][50];
> >>> >>>>>>> +
> >>> >>>>>>> +void foo (arr_2 a2, int v1)
> >>> >>>>>>> +{
> >>> >>>>>>> +  int i, j;
> >>> >>>>>>> +
> >>> >>>>>>> +  i = v1 + 5;
> >>> >>>>>>> +  j = i;
> >>> >>>>>>> +  a2 [i-10] [j] = 2;
> >>> >>>>>>> +  a2 [i] [j++] = i;
> >>> >>>>>>> +  a2 [i+20] [j++] = i;
> >>> >>>>>>> +  a2 [i-3] [i-1] += 1;
> >>> >>>>>>> +  return;
> >>> >>>>>>> +}
> >>> >>>>>>> +
> >>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>> >>>>>>>
> >>> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >>> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >>> >>>>>>> checking which is bad, too.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >>> >to
> >>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >>> >MEM_REFs
> >>> >>>>>> is an effective check.  Alternatively, I can add a follow-up
> >>> >patch to
> >>> >>>>>> add some dumping facility in replace_ref () to print out the
> >>> >replacing
> >>> >>>>>> actions when -fdump-tree-slsr-details is on.
> >>> >>>>
> >>> >>>> I think adding some details to the dump and scanning for them would
> >>> >be
> >>> >>>> better.  That's the only change that is required for this to move
> >>> >forward.
> >>> >>>
> >>> >>>
> >>> >>> I've updated to patch to dump more details when
> >>> >-fdump-tree-slsr-details is
> >>> >>> on.  The tests have also been updated to scan for these new dumps
> >>> >instead of
> >>> >>> MEMs.
> >>> >>>
> >>> >>>
> >>> >>>>
> >>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
> >>> >point.
> >>> >>>
> >>> >>>
> >>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >>> >succeeds?
> >>> >>
> >>> >> I still don't like it.  It's using the wrong and too expensive tools
> >>> >to do
> >>> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >>> >> the code it looks like we're having
> >>> >>
> >>> >>    /* Base expression for the chain of candidates:  often, but not
> >>> >>       always, an SSA name.  */
> >>> >>    tree base_expr;
> >>> >>
> >>> >> which isn't really too informative but I suppose they are all
> >>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >>> >> use get_addr_base_and_unit_offset in place of get_alternative_base
> >>> >(),
> >>> >> ignoring the returned offset.
> >>> >
> >>> >'base_expr' is essentially the base address of a handled_component_p,
> >>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> >>> >
> >>> >the object returned by get_inner_reference ().
> >>> >
> >>> >Given a test case like the following:
> >>> >
> >>> >typedef int arr_2[20][20];
> >>> >
> >>> >void foo (arr_2 a2, int i, int j)
> >>> >{
> >>> >   a2[i+10][j] = 1;
> >>> >   a2[i+10][j+1] = 1;
> >>> >   a2[i+20][j] = 1;
> >>> >}
> >>> >
> >>> >The IR before SLSR is (on x86_64):
> >>> >
> >>> >   _2 = (long unsigned int) i_1(D);
> >>> >   _3 = _2 * 80;
> >>> >   _4 = _3 + 800;
> >>> >   _6 = a2_5(D) + _4;
> >>> >   *_6[j_8(D)] = 1;
> >>> >   _10 = j_8(D) + 1;
> >>> >   *_6[_10] = 1;
> >>> >   _12 = _3 + 1600;
> >>> >   _13 = a2_5(D) + _12;
> >>> >   *_13[j_8(D)] = 1;
> >>> >
> >>> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
> >>> >_6, while the base_expr for a2[i+20][j] is _13.
> >>> >
> >>> >_13 is essentially (_6 + 800), so all of the three memory references
> >>> >essentially share the same base address.  As their strides are also the
> >>> >
> >>> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
> >>> >MEM_REFs.  What this patch does is to use the tree affine tools to help
> >>> >
> >>> >recognize the underlying base address expression; as it requires
> >>> >looking
> >>> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> >>> >won't help here.
> >>> >
> >>> >Bill has helped me exploit other ways of achieving this in SLSR, but so
> >>> >
> >>> >far we think this is the best way to proceed.  The use of tree affine
> >>> >routines has been restricted to CAND_REFs only and there is the
> >>> >aforementioned cache facility to help reduce the overhead.
> >>> >
> >>> >Thanks,
> >>> >Yufeng
> >>> >
> >>> >P.S. some more details what the patch does:
> >>> >
> >>> >The CAND_REF for the three memory references are:
> >>> >
> >>> >  6  [2] *_6[j_8(D)] = 1;
> >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>> >      basis: 0  dependent: 8  sibling: 0
> >>> >      next-interp: 0  dead-savings: 0
> >>> >
> >>> >   8  [2] *_6[_10] = 1;
> >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >>> >      basis: 6  dependent: 11  sibling: 0
> >>> >      next-interp: 0  dead-savings: 0
> >>> >
> >>> >  11  [2] *_13[j_8(D)] = 1;
> >>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>> >      basis: 8  dependent: 0  sibling: 0
> >>> >      next-interp: 0  dead-savings: 0
> >>> >
> >>> >Before the patch, the strength reduction candidate chains for the three
> >>> >
> >>> >CAND_REFs are:
> >>> >
> >>> >   _6 -> 6 -> 8
> >>> >   _13 -> 11
> >>> >
> >>> >i.e. SLSR recognizes the first two references share the same basis,
> >>> >while the last one is on it own.
> >>> >
> >>> >With the patch, an extra candidate chain can be recognized:
> >>> >
> >>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
> >>> >
> >>> >i.e. all of the three references are found to have the same basis
> >>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >>> >_6
> >>> >or _13, with the immediate offset removed.  The pass is now able to
> >>> >lower all of the three references, instead of the first two only, to
> >>> >MEM_REFs.
> >>>
> >>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
> >>
> >> I think that's overstating SLSR's current capabilities a bit. :)  We do
> >> use get_inner_reference to come up with the base expression for
> >> reference candidates (based on some of your suggestions a couple of
> >> years back).  However, in the case of multiple levels of array
> >> references, we miss opportunities because get_inner_reference stops at
> >> an SSA name that could be further expanded by following its definition
> >> back to a more fundamental base expression.
> >
> > Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> > same problem.
> 
> Oh, you're using affine combination expansion ... which is even more
> expensive.  So why isn't that then done for all ref candidates?  That is,
> why do two different things, get_inner_reference _and_ affine-combination
> dances.  And why build back trees from that instead of storing the
> affine combination.

Well, the original design had no desire to use the expensive machinery
of affine combination expansion.  For what was envisioned, the simpler
mechanisms of get_inner_reference have been plenty.

My thought, and please correct me if I'm wrong, is that once we've
already reduced to an SSA name from get_inner_reference, the affine
machinery will terminate fairly quickly -- we shouldn't get into too
deep a search on underlying pointer arithmetic in most cases.  But
compile time testing will tell us whether this is reasonable.

Bill

> 
> I'll bet we come back with compile-time issues after this patch
> went in.  I'll count on you two to fix them then.
> 
> Richard.
> 
> >> Part of the issue here is that reference candidates are basis for a more
> >> specific optimization than the mult and add candidates.  The latter have
> >> a more general framework for building up a recording of simple affine
> >> expressions that can be strength-reduced.  Ultimately we ought to be
> >> able to do something similar for reference candidates, building up
> >> simple affine expressions from base expressions, so that everything is
> >> done in a forward order and the tree-affine interfaces aren't needed.
> >> But that will take some more fundamental design changes, and since this
> >> provides some good improvements for important cases, I feel it's
> >> reasonable to get this into the release.
> >
> > But I fail to see what is special about doing the dance to affine and
> > then back to trees just to drop the constant offset which would be
> > done by get_inner_reference as well and cheaper if you just ignore
> > bitpos.
> >
> > ?!
> >
> > Richard.
> >
> >> Thanks,
> >> Bill
> >>
> >>>
> >>> Richard.
> >>>
> >>>
> >>
>
Bill Schmidt Dec. 4, 2013, 1:23 p.m. UTC | #13
On Wed, 2013-12-04 at 11:32 +0000, Yufeng Zhang wrote:
> On 12/04/13 10:30, Richard Biener wrote:
> > On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
> > <richard.guenther@gmail.com>  wrote:
> >> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> >> <wschmidt@linux.vnet.ibm.com>  wrote:
> >>> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> >>>> Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
> >>>>> On 12/03/13 14:20, Richard Biener wrote:
> >>>>>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> >>>>> wrote:
> >>>>>>> On 12/03/13 06:48, Jeff Law wrote:
> >>>>>>>>
> >>>>>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>>>>>>>>
> >>>>>>>>> Ping~
> >>>>>>>>>
> >>>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Yufeng
> >>>>>>>>>
> >>>>>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >>>>>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The second version of your original patch is ok with me with
> >>>>> the
> >>>>>>>>>>>>> following changes.  Sorry for the little side adventure into
> >>>>> the
> >>>>>>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >>>>> it
> >>>>>>>>>>>>> helps in
> >>>>>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >>>>> also for
> >>>>>>>>>>>>> cleaning up this version to be less intrusive to common
> >>>>> interfaces; I
> >>>>>>>>>>>>> appreciate it.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >>>>> with the
> >>>>>>>>>>>> suggested changes incorporated.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>>>>>>>>>>> experiment; it's
> >>>>>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >>>>> for
> >>>>>>>>>>>> your prompt
> >>>>>>>>>>>> replies and patience in guiding!
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >>>>>>>>>>>>> approval,
> >>>>>>>>>>>>> as I'm not a maintainer.
> >>>>>>>>
> >>>>>>>> First a note, I need to check on voting for Bill as the slsr
> >>>>> maintainer
> >>>>>>>> from the steering committee.   Voting was in progress just before
> >>>>> the
> >>>>>>>> close of stage1 development so I haven't tallied the results :-)
> >>>>>>>
> >>>>>>>
> >>>>>>> Looking forward to some good news! :)
> >>>>>>>
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >>>>> shared.
> >>>>>>>>>>       The cached is introduced mainly because get_alternative_base
> >>>>> () may
> >>>>>>>>>> be
> >>>>>>>>>> called twice on the same 'base' tree, once in the
> >>>>>>>>>> find_basis_for_candidate () for look-up and the other time in
> >>>>>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >>>>> happy
> >>>>>>>>>> to leave out the cache if you think the benefit is trivial.
> >>>>>>>>
> >>>>>>>> Without some sense of how expensive the lookups are vs how often
> >>>>> the
> >>>>>>>> cache hits it's awful hard to know if the cache is worth it.
> >>>>>>>>
> >>>>>>>> I'd say take it out unless you have some sense it's really saving
> >>>>> time.
> >>>>>>>>      It's a pretty minor implementation detail either way.
> >>>>>>>
> >>>>>>>
> >>>>>>> I think the affine tree routines are generally expensive; it is
> >>>>> worth having
> >>>>>>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >>>>> tests
> >>>>>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >>>>> from
> >>>>>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >>>>> represent
> >>>>>>> the real world scenario, but they do show the fact that the 'base'
> >>>>> tree can
> >>>>>>> be shared to some extent.  So I'd like to have the cache in the
> >>>>> patch.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> +/* { dg-do compile } */
> >>>>>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>>>>>>>>>> +
> >>>>>>>>>>> +typedef int arr_2[50][50];
> >>>>>>>>>>> +
> >>>>>>>>>>> +void foo (arr_2 a2, int v1)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +  int i, j;
> >>>>>>>>>>> +
> >>>>>>>>>>> +  i = v1 + 5;
> >>>>>>>>>>> +  j = i;
> >>>>>>>>>>> +  a2 [i-10] [j] = 2;
> >>>>>>>>>>> +  a2 [i] [j++] = i;
> >>>>>>>>>>> +  a2 [i+20] [j++] = i;
> >>>>>>>>>>> +  a2 [i-3] [i-1] += 1;
> >>>>>>>>>>> +  return;
> >>>>>>>>>>> +}
> >>>>>>>>>>> +
> >>>>>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>>>>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>>>>>>>>>>
> >>>>>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >>>>>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >>>>>>>>>>> checking which is bad, too.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >>>>> to
> >>>>>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >>>>> MEM_REFs
> >>>>>>>>>> is an effective check.  Alternatively, I can add a follow-up
> >>>>> patch to
> >>>>>>>>>> add some dumping facility in replace_ref () to print out the
> >>>>> replacing
> >>>>>>>>>> actions when -fdump-tree-slsr-details is on.
> >>>>>>>>
> >>>>>>>> I think adding some details to the dump and scanning for them would
> >>>>> be
> >>>>>>>> better.  That's the only change that is required for this to move
> >>>>> forward.
> >>>>>>>
> >>>>>>>
> >>>>>>> I've updated to patch to dump more details when
> >>>>> -fdump-tree-slsr-details is
> >>>>>>> on.  The tests have also been updated to scan for these new dumps
> >>>>> instead of
> >>>>>>> MEMs.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I suggest doing it quickly.  We're well past stage1 close at this
> >>>>> point.
> >>>>>>>
> >>>>>>>
> >>>>>>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >>>>> succeeds?
> >>>>>>
> >>>>>> I still don't like it.  It's using the wrong and too expensive tools
> >>>>> to do
> >>>>>> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >>>>>> the code it looks like we're having
> >>>>>>
> >>>>>>     /* Base expression for the chain of candidates:  often, but not
> >>>>>>        always, an SSA name.  */
> >>>>>>     tree base_expr;
> >>>>>>
> >>>>>> which isn't really too informative but I suppose they are all
> >>>>>> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >>>>>> use get_addr_base_and_unit_offset in place of get_alternative_base
> >>>>> (),
> >>>>>> ignoring the returned offset.
> >>>>>
> >>>>> 'base_expr' is essentially the base address of a handled_component_p,
> >>>>> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> >>>>>
> >>>>> the object returned by get_inner_reference ().
> >>>>>
> >>>>> Given a test case like the following:
> >>>>>
> >>>>> typedef int arr_2[20][20];
> >>>>>
> >>>>> void foo (arr_2 a2, int i, int j)
> >>>>> {
> >>>>>    a2[i+10][j] = 1;
> >>>>>    a2[i+10][j+1] = 1;
> >>>>>    a2[i+20][j] = 1;
> >>>>> }
> >>>>>
> >>>>> The IR before SLSR is (on x86_64):
> >>>>>
> >>>>>    _2 = (long unsigned int) i_1(D);
> >>>>>    _3 = _2 * 80;
> >>>>>    _4 = _3 + 800;
> >>>>>    _6 = a2_5(D) + _4;
> >>>>>    *_6[j_8(D)] = 1;
> >>>>>    _10 = j_8(D) + 1;
> >>>>>    *_6[_10] = 1;
> >>>>>    _12 = _3 + 1600;
> >>>>>    _13 = a2_5(D) + _12;
> >>>>>    *_13[j_8(D)] = 1;
> >>>>>
> >>>>> The base_expr for the 1st and 2nd memory reference are the same, i.e.
> >>>>> _6, while the base_expr for a2[i+20][j] is _13.
> >>>>>
> >>>>> _13 is essentially (_6 + 800), so all of the three memory references
> >>>>> essentially share the same base address.  As their strides are also the
> >>>>>
> >>>>> same (MULT_EXPR (j, 4)), the three references can all be lowered to
> >>>>> MEM_REFs.  What this patch does is to use the tree affine tools to help
> >>>>>
> >>>>> recognize the underlying base address expression; as it requires
> >>>>> looking
> >>>>> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> >>>>> won't help here.
> >>>>>
> >>>>> Bill has helped me exploit other ways of achieving this in SLSR, but so
> >>>>>
> >>>>> far we think this is the best way to proceed.  The use of tree affine
> >>>>> routines has been restricted to CAND_REFs only and there is the
> >>>>> aforementioned cache facility to help reduce the overhead.
> >>>>>
> >>>>> Thanks,
> >>>>> Yufeng
> >>>>>
> >>>>> P.S. some more details what the patch does:
> >>>>>
> >>>>> The CAND_REF for the three memory references are:
> >>>>>
> >>>>>   6  [2] *_6[j_8(D)] = 1;
> >>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>>>>       basis: 0  dependent: 8  sibling: 0
> >>>>>       next-interp: 0  dead-savings: 0
> >>>>>
> >>>>>    8  [2] *_6[_10] = 1;
> >>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >>>>>       basis: 6  dependent: 11  sibling: 0
> >>>>>       next-interp: 0  dead-savings: 0
> >>>>>
> >>>>>   11  [2] *_13[j_8(D)] = 1;
> >>>>>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>>>>       basis: 8  dependent: 0  sibling: 0
> >>>>>       next-interp: 0  dead-savings: 0

The Real Problem here is that all of these references really need to be
recorded as based on a2_5(D) as the fundamental base.  Using
get_inner_reference is not giving us the full picture needed to fully
unwrap the arithmetic.

But the information needed is all present in the candidate table, or can
be added to the candidate table, to allow this to be done fully in a
forward order according to the pass's design philosophy.  That's what we
need to concentrate on in the long term.  In the end I do not want the
affine machinery to be how we do this, because this requires us to go
back over ground we should have already been able to cover with the
forward analysis.  But this will require recording some more complicated
expressions in the table than the current infrastructure envisions.
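
To make the forward-order idea a bit more concrete: below is a minimal
sketch of the kind of candidate-table walk that could eventually replace
the affine expansion.  It assumes a mapping from an SSA name to the
candidate of its defining statement (lookup_cand_for_ssa_name is a
hypothetical stand-in for that mapping), and it follows only a single step
of the definition chain, so it illustrates the direction rather than being
a drop-in replacement.

  /* Hedged sketch only -- not part of the posted patch.  If BASE is an
     SSA name whose defining statement was recorded as a CAND_ADD with a
     unit stride (BASE = B + INDEX * 1, INDEX constant), fold INDEX into
     the reference's index and key the CAND_REF on B instead.  Repeating
     this over the definition chain is what would let bases differing
     only by a constant (like _6 and _13 above, which differ by 800) end
     up on a common chain.  lookup_cand_for_ssa_name is hypothetical.  */

  static tree
  forward_canonicalize_base (tree base, double_int *index)
  {
    slsr_cand_t c;

    if (TREE_CODE (base) != SSA_NAME)
      return base;

    c = lookup_cand_for_ssa_name (base);
    if (c && c->kind == CAND_ADD && integer_onep (c->stride))
      {
        *index = *index + c->index;  /* Absorb the constant offset.  */
        return c->base_expr;         /* Continue from the earlier base.  */
      }

    return base;
  }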

Richard, if you're not comfortable with Yufeng's implementation as a
temporary solution for this release, then we'll have to hold it off.
I'd like to see if the compile-time issue is real or a chimera before we
dismiss it, though.

Thanks,
Bill

> >>>>>
> >>>>> Before the patch, the strength reduction candidate chains for the three
> >>>>>
> >>>>> CAND_REFs are:
> >>>>>
> >>>>>    _6 ->  6 ->  8
> >>>>>    _13 ->  11
> >>>>>
> >>>>> i.e. SLSR recognizes the first two references share the same basis,
> >>>>> while the last one is on it own.
> >>>>>
> >>>>> With the patch, an extra candidate chain can be recognized:
> >>>>>
> >>>>>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
> >>>>>
> >>>>> i.e. all of the three references are found to have the same basis
> >>>>> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >>>>> _6
> >>>>> or _13, with the immediate offset removed.  The pass is now able to
> >>>>> lower all of the three references, instead of the first two only, to
> >>>>> MEM_REFs.
> >>>>
> >>>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
> >>>
> >>> I think that's overstating SLSR's current capabilities a bit. :)  We do
> >>> use get_inner_reference to come up with the base expression for
> >>> reference candidates (based on some of your suggestions a couple of
> >>> years back).  However, in the case of multiple levels of array
> >>> references, we miss opportunities because get_inner_reference stops at
> >>> an SSA name that could be further expanded by following its definition
> >>> back to a more fundamental base expression.
> >>
> >> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> >> same problem.
> >
> > Oh, you're using affine combination expansion ... which is even more
> > expensive.  So why isn't that then done for all ref candidates?  That is,
> > why do two different things, get_inner_reference _and_ affine-combination
> > dances.
> 
> affine-combination is only called from where get_inner_reference stops 
> (an SSA_NAME), rather than on the whole address.
> 
> > And why build back trees from that instead of storing the
> > affine combination.
> 
> I think we can store the affine combination instead, but it would need 
> changes to the infrastructure in slsr.
> 
> >
> > I'll bet we come back with compile-time issues after this patch
> > went in.  I'll count on you two to fix them then.
> 
> I'll do some timing this week; if the result on the compilation time is 
> not good, I'll revert the patch.
> 
> Yufeng
>
Bill Schmidt Dec. 4, 2013, 1:27 p.m. UTC | #14
On Wed, 2013-12-04 at 07:13 -0600, Bill Schmidt wrote:
> On Wed, 2013-12-04 at 11:30 +0100, Richard Biener wrote:
> > On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > > On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> > > <wschmidt@linux.vnet.ibm.com> wrote:
> > >> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> > >>> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> > >>> >On 12/03/13 14:20, Richard Biener wrote:
> > >>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> > >>> >wrote:
> > >>> >>> On 12/03/13 06:48, Jeff Law wrote:
> > >>> >>>>
> > >>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> > >>> >>>>>
> > >>> >>>>> Ping~
> > >>> >>>>>
> > >>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> > >>> >>>>
> > >>> >>>>
> > >>> >>>>>
> > >>> >>>>> Thanks,
> > >>> >>>>> Yufeng
> > >>> >>>>>
> > >>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> > >>> >>>>>>
> > >>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
> > >>> >>>>>>>
> > >>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> > >>> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
> > >>> >>>>>>>>
> > >>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> The second version of your original patch is ok with me with
> > >>> >the
> > >>> >>>>>>>>> following changes.  Sorry for the little side adventure into
> > >>> >the
> > >>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
> > >>> >it
> > >>> >>>>>>>>> helps in
> > >>> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> > >>> >also for
> > >>> >>>>>>>>> cleaning up this version to be less intrusive to common
> > >>> >interfaces; I
> > >>> >>>>>>>>> appreciate it.
> > >>> >>>>>>>>
> > >>> >>>>>>>>
> > >>> >>>>>>>>
> > >>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> > >>> >with the
> > >>> >>>>>>>> suggested changes incorporated.
> > >>> >>>>>>>>
> > >>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
> > >>> >>>>>>>> experiment; it's
> > >>> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> > >>> >for
> > >>> >>>>>>>> your prompt
> > >>> >>>>>>>> replies and patience in guiding!
> > >>> >>>>>>>>
> > >>> >>>>>>>>
> > >>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> > >>> >>>>>>>>> approval,
> > >>> >>>>>>>>> as I'm not a maintainer.
> > >>> >>>>
> > >>> >>>> First a note, I need to check on voting for Bill as the slsr
> > >>> >maintainer
> > >>> >>>> from the steering committee.   Voting was in progress just before
> > >>> >the
> > >>> >>>> close of stage1 development so I haven't tallied the results :-)
> > >>> >>>
> > >>> >>>
> > >>> >>> Looking forward to some good news! :)
> > >>> >>>
> > >>> >>>
> > >>> >>>>>>
> > >>> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> > >>> >shared.
> > >>> >>>>>>      The cached is introduced mainly because get_alternative_base
> > >>> >() may
> > >>> >>>>>> be
> > >>> >>>>>> called twice on the same 'base' tree, once in the
> > >>> >>>>>> find_basis_for_candidate () for look-up and the other time in
> > >>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> > >>> >happy
> > >>> >>>>>> to leave out the cache if you think the benefit is trivial.
> > >>> >>>>
> > >>> >>>> Without some sense of how expensive the lookups are vs how often
> > >>> >the
> > >>> >>>> cache hits it's awful hard to know if the cache is worth it.
> > >>> >>>>
> > >>> >>>> I'd say take it out unless you have some sense it's really saving
> > >>> >time.
> > >>> >>>>     It's a pretty minor implementation detail either way.
> > >>> >>>
> > >>> >>>
> > >>> >>> I think the affine tree routines are generally expensive; it is
> > >>> >worth having
> > >>> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
> > >>> >tests
> > >>> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> > >>> >from
> > >>> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> > >>> >represent
> > >>> >>> the real world scenario, but they do show the fact that the 'base'
> > >>> >tree can
> > >>> >>> be shared to some extent.  So I'd like to have the cache in the
> > >>> >patch.
> > >>> >>>
> > >>> >>>
> > >>> >>>>
> > >>> >>>>>>
> > >>> >>>>>>> +/* { dg-do compile } */
> > >>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> > >>> >>>>>>> +
> > >>> >>>>>>> +typedef int arr_2[50][50];
> > >>> >>>>>>> +
> > >>> >>>>>>> +void foo (arr_2 a2, int v1)
> > >>> >>>>>>> +{
> > >>> >>>>>>> +  int i, j;
> > >>> >>>>>>> +
> > >>> >>>>>>> +  i = v1 + 5;
> > >>> >>>>>>> +  j = i;
> > >>> >>>>>>> +  a2 [i-10] [j] = 2;
> > >>> >>>>>>> +  a2 [i] [j++] = i;
> > >>> >>>>>>> +  a2 [i+20] [j++] = i;
> > >>> >>>>>>> +  a2 [i-3] [i-1] += 1;
> > >>> >>>>>>> +  return;
> > >>> >>>>>>> +}
> > >>> >>>>>>> +
> > >>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> > >>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> > >>> >>>>>>>
> > >>> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> > >>> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> > >>> >>>>>>> checking which is bad, too.
> > >>> >>>>>>
> > >>> >>>>>>
> > >>> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> > >>> >to
> > >>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> > >>> >MEM_REFs
> > >>> >>>>>> is an effective check.  Alternatively, I can add a follow-up
> > >>> >patch to
> > >>> >>>>>> add some dumping facility in replace_ref () to print out the
> > >>> >replacing
> > >>> >>>>>> actions when -fdump-tree-slsr-details is on.
> > >>> >>>>
> > >>> >>>> I think adding some details to the dump and scanning for them would
> > >>> >be
> > >>> >>>> better.  That's the only change that is required for this to move
> > >>> >forward.
> > >>> >>>
> > >>> >>>
> > >>> >>> I've updated to patch to dump more details when
> > >>> >-fdump-tree-slsr-details is
> > >>> >>> on.  The tests have also been updated to scan for these new dumps
> > >>> >instead of
> > >>> >>> MEMs.
> > >>> >>>
> > >>> >>>
> > >>> >>>>
> > >>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
> > >>> >point.
> > >>> >>>
> > >>> >>>
> > >>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
> > >>> >succeeds?
> > >>> >>
> > >>> >> I still don't like it.  It's using the wrong and too expensive tools
> > >>> >to do
> > >>> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
> > >>> >> the code it looks like we're having
> > >>> >>
> > >>> >>    /* Base expression for the chain of candidates:  often, but not
> > >>> >>       always, an SSA name.  */
> > >>> >>    tree base_expr;
> > >>> >>
> > >>> >> which isn't really too informative but I suppose they are all
> > >>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
> > >>> >> use get_addr_base_and_unit_offset in place of get_alternative_base
> > >>> >(),
> > >>> >> ignoring the returned offset.
> > >>> >
> > >>> >'base_expr' is essentially the base address of a handled_component_p,
> > >>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> > >>> >
> > >>> >the object returned by get_inner_reference ().
> > >>> >
> > >>> >Given a test case like the following:
> > >>> >
> > >>> >typedef int arr_2[20][20];
> > >>> >
> > >>> >void foo (arr_2 a2, int i, int j)
> > >>> >{
> > >>> >   a2[i+10][j] = 1;
> > >>> >   a2[i+10][j+1] = 1;
> > >>> >   a2[i+20][j] = 1;
> > >>> >}
> > >>> >
> > >>> >The IR before SLSR is (on x86_64):
> > >>> >
> > >>> >   _2 = (long unsigned int) i_1(D);
> > >>> >   _3 = _2 * 80;
> > >>> >   _4 = _3 + 800;
> > >>> >   _6 = a2_5(D) + _4;
> > >>> >   *_6[j_8(D)] = 1;
> > >>> >   _10 = j_8(D) + 1;
> > >>> >   *_6[_10] = 1;
> > >>> >   _12 = _3 + 1600;
> > >>> >   _13 = a2_5(D) + _12;
> > >>> >   *_13[j_8(D)] = 1;
> > >>> >
> > >>> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
> > >>> >_6, while the base_expr for a2[i+20][j] is _13.
> > >>> >
> > >>> >_13 is essentially (_6 + 800), so all of the three memory references
> > >>> >essentially share the same base address.  As their strides are also the
> > >>> >
> > >>> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
> > >>> >MEM_REFs.  What this patch does is to use the tree affine tools to help
> > >>> >
> > >>> >recognize the underlying base address expression; as it requires
> > >>> >looking
> > >>> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> > >>> >won't help here.
> > >>> >
> > >>> >Bill has helped me exploit other ways of achieving this in SLSR, but so
> > >>> >
> > >>> >far we think this is the best way to proceed.  The use of tree affine
> > >>> >routines has been restricted to CAND_REFs only and there is the
> > >>> >aforementioned cache facility to help reduce the overhead.
> > >>> >
> > >>> >Thanks,
> > >>> >Yufeng
> > >>> >
> > >>> >P.S. some more details what the patch does:
> > >>> >
> > >>> >The CAND_REF for the three memory references are:
> > >>> >
> > >>> >  6  [2] *_6[j_8(D)] = 1;
> > >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> > >>> >      basis: 0  dependent: 8  sibling: 0
> > >>> >      next-interp: 0  dead-savings: 0
> > >>> >
> > >>> >   8  [2] *_6[_10] = 1;
> > >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> > >>> >      basis: 6  dependent: 11  sibling: 0
> > >>> >      next-interp: 0  dead-savings: 0
> > >>> >
> > >>> >  11  [2] *_13[j_8(D)] = 1;
> > >>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> > >>> >      basis: 8  dependent: 0  sibling: 0
> > >>> >      next-interp: 0  dead-savings: 0
> > >>> >
> > >>> >Before the patch, the strength reduction candidate chains for the three
> > >>> >
> > >>> >CAND_REFs are:
> > >>> >
> > >>> >   _6 -> 6 -> 8
> > >>> >   _13 -> 11
> > >>> >
> > >>> >i.e. SLSR recognizes the first two references share the same basis,
> > >>> >while the last one is on it own.
> > >>> >
> > >>> >With the patch, an extra candidate chain can be recognized:
> > >>> >
> > >>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
> > >>> >
> > >>> >i.e. all of the three references are found to have the same basis
> > >>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> > >>> >_6
> > >>> >or _13, with the immediate offset removed.  The pass is now able to
> > >>> >lower all of the three references, instead of the first two only, to
> > >>> >MEM_REFs.
> > >>>
> > >>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
> > >>
> > >> I think that's overstating SLSR's current capabilities a bit. :)  We do
> > >> use get_inner_reference to come up with the base expression for
> > >> reference candidates (based on some of your suggestions a couple of
> > >> years back).  However, in the case of multiple levels of array
> > >> references, we miss opportunities because get_inner_reference stops at
> > >> an SSA name that could be further expanded by following its definition
> > >> back to a more fundamental base expression.
> > >
> > > Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> > > same problem.
> > 
> > Oh, you're using affine combination expansion ... which is even more
> > expensive.  So why isn't that then done for all ref candidates?  That is,
> > why do two different things, get_inner_reference _and_ affine-combination
> > dances.  And why build back trees from that instead of storing the
> > affine combination.
> 
> Well, the original design had no desire to use the expensive machinery
> of affine combination expansion.  For what was envisioned, the simpler
> mechanisms of get_inner_reference have been plenty.
> 
> My thought, and please correct me if I'm wrong, is that once we've
> already reduced to an SSA name from get_inner_reference, the affine
> machinery will terminate fairly quickly -- we shouldn't get into too
> deep a search on underlying pointer arithmetic in most cases.  But
> compile time testing will tell us whether this is reasonable.

As a middle ground, may I suggest that we only do the extra tree_affine
expansion at -O2 and above?  Any extra compile time should be a blip at
those levels.  At -O1 there could be legitimate issues, though.

Bill

> 
> Bill
> 
> > 
> > I'll bet we come back with compile-time issues after this patch
> > went in.  I'll count on you two to fix them then.
> > 
> > Richard.
> > 
> > >> Part of the issue here is that reference candidates are basis for a more
> > >> specific optimization than the mult and add candidates.  The latter have
> > >> a more general framework for building up a recording of simple affine
> > >> expressions that can be strength-reduced.  Ultimately we ought to be
> > >> able to do something similar for reference candidates, building up
> > >> simple affine expressions from base expressions, so that everything is
> > >> done in a forward order and the tree-affine interfaces aren't needed.
> > >> But that will take some more fundamental design changes, and since this
> > >> provides some good improvements for important cases, I feel it's
> > >> reasonable to get this into the release.
> > >
> > > But I fail to see what is special about doing the dance to affine and
> > > then back to trees just to drop the constant offset which would be
> > > done by get_inner_reference as well and cheaper if you just ignore
> > > bitpos.
> > >
> > > ?!
> > >
> > > Richard.
> > >
> > >> Thanks,
> > >> Bill
> > >>
> > >>>
> > >>> Richard.
> > >>>
> > >>>
> > >>
> > 
>
Richard Biener Dec. 5, 2013, 8:49 a.m. UTC | #15
On Wed, Dec 4, 2013 at 2:27 PM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Wed, 2013-12-04 at 07:13 -0600, Bill Schmidt wrote:
>> On Wed, 2013-12-04 at 11:30 +0100, Richard Biener wrote:
>> > On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
>> > <richard.guenther@gmail.com> wrote:
>> > > On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
>> > > <wschmidt@linux.vnet.ibm.com> wrote:
>> > >> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>> > >>> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
>> > >>> >On 12/03/13 14:20, Richard Biener wrote:
>> > >>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>> > >>> >wrote:
>> > >>> >>> On 12/03/13 06:48, Jeff Law wrote:
>> > >>> >>>>
>> > >>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>> > >>> >>>>>
>> > >>> >>>>> Ping~
>> > >>> >>>>>
>> > >>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>> > >>> >>>>
>> > >>> >>>>
>> > >>> >>>>>
>> > >>> >>>>> Thanks,
>> > >>> >>>>> Yufeng
>> > >>> >>>>>
>> > >>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>> > >>> >>>>>>
>> > >>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
>> > >>> >>>>>>>
>> > >>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>> > >>> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>> > >>> >>>>>>>>>
>> > >>> >>>>>>>>> The second version of your original patch is ok with me with
>> > >>> >the
>> > >>> >>>>>>>>> following changes.  Sorry for the little side adventure into
>> > >>> >the
>> > >>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
>> > >>> >it
>> > >>> >>>>>>>>> helps in
>> > >>> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>> > >>> >also for
>> > >>> >>>>>>>>> cleaning up this version to be less intrusive to common
>> > >>> >interfaces; I
>> > >>> >>>>>>>>> appreciate it.
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>> > >>> >with the
>> > >>> >>>>>>>> suggested changes incorporated.
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
>> > >>> >>>>>>>> experiment; it's
>> > >>> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>> > >>> >for
>> > >>> >>>>>>>> your prompt
>> > >>> >>>>>>>> replies and patience in guiding!
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>> > >>> >>>>>>>>> approval,
>> > >>> >>>>>>>>> as I'm not a maintainer.
>> > >>> >>>>
>> > >>> >>>> First a note, I need to check on voting for Bill as the slsr
>> > >>> >maintainer
>> > >>> >>>> from the steering committee.   Voting was in progress just before
>> > >>> >the
>> > >>> >>>> close of stage1 development so I haven't tallied the results :-)
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> Looking forward to some good news! :)
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>>>>
>> > >>> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>> > >>> >shared.
>> > >>> >>>>>>      The cached is introduced mainly because get_alternative_base
>> > >>> >() may
>> > >>> >>>>>> be
>> > >>> >>>>>> called twice on the same 'base' tree, once in the
>> > >>> >>>>>> find_basis_for_candidate () for look-up and the other time in
>> > >>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>> > >>> >happy
>> > >>> >>>>>> to leave out the cache if you think the benefit is trivial.
>> > >>> >>>>
>> > >>> >>>> Without some sense of how expensive the lookups are vs how often
>> > >>> >the
>> > >>> >>>> cache hits it's awful hard to know if the cache is worth it.
>> > >>> >>>>
>> > >>> >>>> I'd say take it out unless you have some sense it's really saving
>> > >>> >time.
>> > >>> >>>>     It's a pretty minor implementation detail either way.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> I think the affine tree routines are generally expensive; it is
>> > >>> >worth having
>> > >>> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
>> > >>> >tests
>> > >>> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>> > >>> >from
>> > >>> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>> > >>> >represent
>> > >>> >>> the real world scenario, but they do show the fact that the 'base'
>> > >>> >tree can
>> > >>> >>> be shared to some extent.  So I'd like to have the cache in the
>> > >>> >patch.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>>
>> > >>> >>>>>>
>> > >>> >>>>>>> +/* { dg-do compile } */
>> > >>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +typedef int arr_2[50][50];
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +void foo (arr_2 a2, int v1)
>> > >>> >>>>>>> +{
>> > >>> >>>>>>> +  int i, j;
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +  i = v1 + 5;
>> > >>> >>>>>>> +  j = i;
>> > >>> >>>>>>> +  a2 [i-10] [j] = 2;
>> > >>> >>>>>>> +  a2 [i] [j++] = i;
>> > >>> >>>>>>> +  a2 [i+20] [j++] = i;
>> > >>> >>>>>>> +  a2 [i-3] [i-1] += 1;
>> > >>> >>>>>>> +  return;
>> > >>> >>>>>>> +}
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>> > >>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>> > >>> >>>>>>>
>> > >>> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>> > >>> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>> > >>> >>>>>>> checking which is bad, too.
>> > >>> >>>>>>
>> > >>> >>>>>>
>> > >>> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>> > >>> >to
>> > >>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>> > >>> >MEM_REFs
>> > >>> >>>>>> is an effective check.  Alternatively, I can add a follow-up
>> > >>> >patch to
>> > >>> >>>>>> add some dumping facility in replace_ref () to print out the
>> > >>> >replacing
>> > >>> >>>>>> actions when -fdump-tree-slsr-details is on.
>> > >>> >>>>
>> > >>> >>>> I think adding some details to the dump and scanning for them would
>> > >>> >be
>> > >>> >>>> better.  That's the only change that is required for this to move
>> > >>> >forward.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> I've updated to patch to dump more details when
>> > >>> >-fdump-tree-slsr-details is
>> > >>> >>> on.  The tests have also been updated to scan for these new dumps
>> > >>> >instead of
>> > >>> >>> MEMs.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>>
>> > >>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
>> > >>> >point.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
>> > >>> >succeeds?
>> > >>> >>
>> > >>> >> I still don't like it.  It's using the wrong and too expensive tools
>> > >>> >to do
>> > >>> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
>> > >>> >> the code it looks like we're having
>> > >>> >>
>> > >>> >>    /* Base expression for the chain of candidates:  often, but not
>> > >>> >>       always, an SSA name.  */
>> > >>> >>    tree base_expr;
>> > >>> >>
>> > >>> >> which isn't really too informative but I suppose they are all
>> > >>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
>> > >>> >> use get_addr_base_and_unit_offset in place of get_alternative_base
>> > >>> >(),
>> > >>> >> ignoring the returned offset.
>> > >>> >
>> > >>> >'base_expr' is essentially the base address of a handled_component_p,
>> > >>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>> > >>> >
>> > >>> >the object returned by get_inner_reference ().
>> > >>> >
>> > >>> >Given a test case like the following:
>> > >>> >
>> > >>> >typedef int arr_2[20][20];
>> > >>> >
>> > >>> >void foo (arr_2 a2, int i, int j)
>> > >>> >{
>> > >>> >   a2[i+10][j] = 1;
>> > >>> >   a2[i+10][j+1] = 1;
>> > >>> >   a2[i+20][j] = 1;
>> > >>> >}
>> > >>> >
>> > >>> >The IR before SLSR is (on x86_64):
>> > >>> >
>> > >>> >   _2 = (long unsigned int) i_1(D);
>> > >>> >   _3 = _2 * 80;
>> > >>> >   _4 = _3 + 800;
>> > >>> >   _6 = a2_5(D) + _4;
>> > >>> >   *_6[j_8(D)] = 1;
>> > >>> >   _10 = j_8(D) + 1;
>> > >>> >   *_6[_10] = 1;
>> > >>> >   _12 = _3 + 1600;
>> > >>> >   _13 = a2_5(D) + _12;
>> > >>> >   *_13[j_8(D)] = 1;
>> > >>> >
>> > >>> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
>> > >>> >_6, while the base_expr for a2[i+20][j] is _13.
>> > >>> >
>> > >>> >_13 is essentially (_6 + 800), so all of the three memory references
>> > >>> >essentially share the same base address.  As their strides are also the
>> > >>> >
>> > >>> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
>> > >>> >MEM_REFs.  What this patch does is to use the tree affine tools to help
>> > >>> >
>> > >>> >recognize the underlying base address expression; as it requires
>> > >>> >looking
>> > >>> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>> > >>> >won't help here.
>> > >>> >
>> > >>> >Bill has helped me exploit other ways of achieving this in SLSR, but so
>> > >>> >
>> > >>> >far we think this is the best way to proceed.  The use of tree affine
>> > >>> >routines has been restricted to CAND_REFs only and there is the
>> > >>> >aforementioned cache facility to help reduce the overhead.
>> > >>> >
>> > >>> >Thanks,
>> > >>> >Yufeng
>> > >>> >
>> > >>> >P.S. some more details what the patch does:
>> > >>> >
>> > >>> >The CAND_REF for the three memory references are:
>> > >>> >
>> > >>> >  6  [2] *_6[j_8(D)] = 1;
>> > >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> > >>> >      basis: 0  dependent: 8  sibling: 0
>> > >>> >      next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >   8  [2] *_6[_10] = 1;
>> > >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>> > >>> >      basis: 6  dependent: 11  sibling: 0
>> > >>> >      next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >  11  [2] *_13[j_8(D)] = 1;
>> > >>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> > >>> >      basis: 8  dependent: 0  sibling: 0
>> > >>> >      next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >Before the patch, the strength reduction candidate chains for the three
>> > >>> >
>> > >>> >CAND_REFs are:
>> > >>> >
>> > >>> >   _6 -> 6 -> 8
>> > >>> >   _13 -> 11
>> > >>> >
>> > >>> >i.e. SLSR recognizes the first two references share the same basis,
>> > >>> >while the last one is on it own.
>> > >>> >
>> > >>> >With the patch, an extra candidate chain can be recognized:
>> > >>> >
>> > >>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
>> > >>> >
>> > >>> >i.e. all of the three references are found to have the same basis
>> > >>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>> > >>> >_6
>> > >>> >or _13, with the immediate offset removed.  The pass is now able to
>> > >>> >lower all of the three references, instead of the first two only, to
>> > >>> >MEM_REFs.
>> > >>>
>> > >>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
>> > >>
>> > >> I think that's overstating SLSR's current capabilities a bit. :)  We do
>> > >> use get_inner_reference to come up with the base expression for
>> > >> reference candidates (based on some of your suggestions a couple of
>> > >> years back).  However, in the case of multiple levels of array
>> > >> references, we miss opportunities because get_inner_reference stops at
>> > >> an SSA name that could be further expanded by following its definition
>> > >> back to a more fundamental base expression.
>> > >
>> > > Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
>> > > same problem.
>> >
>> > Oh, you're using affine combination expansion ... which is even more
>> > expensive.  So why isn't that then done for all ref candidates?  That is,
>> > why do two different things, get_inner_reference _and_ affine-combination
>> > dances.  And why build back trees from that instead of storing the
>> > affine combination.
>>
>> Well, the original design had no desire to use the expensive machinery
>> of affine combination expansion.  For what was envisioned, the simpler
>> mechanisms of get_inner_reference have been plenty.
>>
>> My thought, and please correct me if I'm wrong, is that once we've
>> already reduced to an SSA name from get_inner_reference, the affine
>> machinery will terminate fairly quickly -- we shouldn't get into too
>> deep a search on underlying pointer arithmetic in most cases.  But
>> compile time testing will tell us whether this is reasonable.

Indeed.  Doing this still feels backward and odd - the pass should be
able to determine this globally instead of repeating local analysis
(even with a cache).

> As a middle ground, may I suggest that we only do the extra tree_affine
> expansion at -O2 and above?  Any extra compile time should be a blip at
> those levels.  At -O1 there could be legitimate issues, though.

You should check flag_expensive_optimizations here I think.

Richard.
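
For illustration, a minimal sketch of that gating applied to the patch's
get_alternative_base (shown in full in the diff below).  Since
flag_expensive_optimizations is enabled by default at -O2 and above, this
would also cover the -O2 cutoff suggested earlier.  This is a sketch of
the suggestion, not the committed form:

  /* Sketch: skip the affine expansion entirely when expensive
     optimizations are disabled (-O1, -Og); the rest of the function is
     unchanged from the posted patch.  */

  static tree
  get_alternative_base (tree base)
  {
    tree *result;

    if (!flag_expensive_optimizations)
      return NULL_TREE;

    result = (tree *) pointer_map_contains (alt_base_map, base);

    if (result == NULL)
      {
        tree expr;
        aff_tree aff;

        tree_to_aff_combination_expand (base, TREE_TYPE (base),
                                        &aff, &name_expansions);
        aff.offset = tree_to_double_int (integer_zero_node);
        expr = aff_combination_to_tree (&aff);

        result = (tree *) pointer_map_insert (alt_base_map, base);
        gcc_assert (!*result);

        if (expr == base)
          *result = NULL;
        else
          *result = expr;
      }

    return *result;
  }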

> Bill
>
>>
>> Bill
>>
>> >
>> > I'll bet we come back with compile-time issues after this patch
>> > went in.  I'll count on you two to fix them then.
>> >
>> > Richard.
>> >
>> > >> Part of the issue here is that reference candidates are basis for a more
>> > >> specific optimization than the mult and add candidates.  The latter have
>> > >> a more general framework for building up a recording of simple affine
>> > >> expressions that can be strength-reduced.  Ultimately we ought to be
>> > >> able to do something similar for reference candidates, building up
>> > >> simple affine expressions from base expressions, so that everything is
>> > >> done in a forward order and the tree-affine interfaces aren't needed.
>> > >> But that will take some more fundamental design changes, and since this
>> > >> provides some good improvements for important cases, I feel it's
>> > >> reasonable to get this into the release.
>> > >
>> > > But I fail to see what is special about doing the dance to affine and
>> > > then back to trees just to drop the constant offset which would be
>> > > done by get_inner_reference as well and cheaper if you just ignore
>> > > bitpos.
>> > >
>> > > ?!
>> > >
>> > > Richard.
>> > >
>> > >> Thanks,
>> > >> Bill
>> > >>
>> > >>>
>> > >>> Richard.
>> > >>>
>> > >>>
>> > >>
>> >
>>
>
>

Patch

diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 88afc91..bf3362f 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -53,6 +53,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "params.h"
 #include "hash-table.h"
 #include "tree-ssa-address.h"
+#include "tree-affine.h"
 
 /* Information about a strength reduction candidate.  Each statement
    in the candidate table represents an expression of one of the
@@ -420,6 +421,42 @@  cand_chain_hasher::equal (const value_type *chain1, const compare_type *chain2)
 /* Hash table embodying a mapping from base exprs to chains of candidates.  */
 static hash_table <cand_chain_hasher> base_cand_map;
 
+/* Pointer map used by tree_to_aff_combination_expand.  */
+static struct pointer_map_t *name_expansions;
+/* Pointer map embodying a mapping from bases to alternative bases.  */
+static struct pointer_map_t *alt_base_map;
+
+/* Given BASE, use the tree affine combination facilities to
+   find the underlying tree expression for BASE, with any
+   immediate offset excluded.  */
+
+static tree
+get_alternative_base (tree base)
+{
+  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
+
+  if (result == NULL)
+    {
+      tree expr;
+      aff_tree aff;
+
+      tree_to_aff_combination_expand (base, TREE_TYPE (base),
+				      &aff, &name_expansions);
+      aff.offset = tree_to_double_int (integer_zero_node);
+      expr = aff_combination_to_tree (&aff);
+
+      result = (tree *) pointer_map_insert (alt_base_map, base);
+      gcc_assert (!*result);
+
+      if (expr == base)
+	*result = NULL;
+      else
+	*result = expr;
+    }
+
+  return *result;
+}
+
 /* Look in the candidate table for a CAND_PHI that defines BASE and
    return it if found; otherwise return NULL.  */
 
@@ -440,8 +477,9 @@  find_phi_def (tree base)
 }
 
 /* Helper routine for find_basis_for_candidate.  May be called twice:
-   once for the candidate's base expr, and optionally again for the
-   candidate's phi definition.  */
+   once for the candidate's base expr, and optionally again either for
+   the candidate's phi definition or for a CAND_REF's alternative base
+   expression.  */
 
 static slsr_cand_t
 find_basis_for_base_expr (slsr_cand_t c, tree base_expr)
@@ -518,6 +556,13 @@  find_basis_for_candidate (slsr_cand_t c)
 	}
     }
 
+  if (!basis && c->kind == CAND_REF)
+    {
+      tree alt_base_expr = get_alternative_base (c->base_expr);
+      if (alt_base_expr)
+	basis = find_basis_for_base_expr (c, alt_base_expr);
+    }
+
   if (basis)
     {
       c->sibling = basis->dependent;
@@ -528,17 +573,21 @@  find_basis_for_candidate (slsr_cand_t c)
   return 0;
 }
 
-/* Record a mapping from the base expression of C to C itself, indicating that
-   C may potentially serve as a basis using that base expression.  */
+/* Record a mapping from BASE to C, indicating that C may potentially serve
+   as a basis using that base expression.  BASE may be the same as
+   C->BASE_EXPR; alternatively, BASE can be a different tree that shares the
+   underlying expression of C->BASE_EXPR.  */
 
 static void
-record_potential_basis (slsr_cand_t c)
+record_potential_basis (slsr_cand_t c, tree base)
 {
   cand_chain_t node;
   cand_chain **slot;
 
+  gcc_assert (base);
+
   node = (cand_chain_t) obstack_alloc (&chain_obstack, sizeof (cand_chain));
-  node->base_expr = c->base_expr;
+  node->base_expr = base;
   node->cand = c;
   node->next = NULL;
   slot = base_cand_map.find_slot (node, INSERT);
@@ -554,10 +603,18 @@  record_potential_basis (slsr_cand_t c)
 }
 
 /* Allocate storage for a new candidate and initialize its fields.
-   Attempt to find a basis for the candidate.  */
+   Attempt to find a basis for the candidate.
+
+   For CAND_REF, an alternative base may also be recorded and used
+   to find a basis.  This helps cases where the expression hidden
+   behind BASE (which is usually an SSA_NAME) has an immediate offset,
+   e.g.
+
+     a2[i][j] = 1;
+     a2[i + 20][j] = 2;  */
 
 static slsr_cand_t
-alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base, 
+alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
 			   double_int index, tree stride, tree ctype,
 			   unsigned savings)
 {
@@ -583,7 +640,13 @@  alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
   else
     c->basis = find_basis_for_candidate (c);
 
-  record_potential_basis (c);
+  record_potential_basis (c, base);
+  if (kind == CAND_REF)
+    {
+      tree alt_base = get_alternative_base (base);
+      if (alt_base)
+	record_potential_basis (c, alt_base);
+    }
 
   return c;
 }
@@ -1843,6 +1906,12 @@  replace_ref (tree *expr, slsr_cand_t c)
 static void
 replace_refs (slsr_cand_t c)
 {
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fputs ("Replacing reference: ", dump_file);
+      print_gimple_stmt (dump_file, c->cand_stmt, 0, 0);
+    }
+
   if (gimple_vdef (c->cand_stmt))
     {
       tree *lhs = gimple_assign_lhs_ptr (c->cand_stmt);
@@ -1854,6 +1923,13 @@  replace_refs (slsr_cand_t c)
       replace_ref (rhs, c);
     }
 
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fputs ("With: ", dump_file);
+      print_gimple_stmt (dump_file, c->cand_stmt, 0, 0);
+      fputs ("\n", dump_file);
+    }
+
   if (c->sibling)
     replace_refs (lookup_cand (c->sibling));
 
@@ -3524,6 +3600,9 @@  execute_strength_reduction (void)
   /* Allocate the mapping from base expressions to candidate chains.  */
   base_cand_map.create (500);
 
+  /* Allocate the mapping from bases to alternative bases.  */
+  alt_base_map = pointer_map_create ();
+
   /* Initialize the loop optimizer.  We need to detect flow across
      back edges, and this gives us dominator information as well.  */
   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
@@ -3539,6 +3618,9 @@  execute_strength_reduction (void)
       dump_cand_chains ();
     }
 
+  pointer_map_destroy (alt_base_map);
+  free_affine_expand_cache (&name_expansions);
+
   /* Analyze costs and make appropriate replacements.  */
   analyze_candidates_and_replace ();
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c
index 8cc2798..c146219 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c
@@ -6,7 +6,7 @@ 
     *PINDEX:   C1 + (C2 * C3) + C4  */
 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-slsr" } */
+/* { dg-options "-O2 -fdump-tree-slsr-details" } */
 
 typedef int arr_2[50][50];
 
@@ -22,5 +22,5 @@  void foo (arr_2 a2, int v1)
   return;
 }
 
-/* { dg-final { scan-tree-dump-times "MEM" 4 "slsr" } } */
+/* { dg-final { scan-tree-dump-times "Replacing reference: " 4 "slsr" } } */
 /* { dg-final { cleanup-tree-dump "slsr" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
new file mode 100644
index 0000000..2c9d908
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
@@ -0,0 +1,24 @@ 
+/* Verify straight-line strength reduction in using
+   alternative base expr to record and look for the
+   potential candidate.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr-details" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i-10] [j] = 2;
+  a2 [i] [j++] = i;
+  a2 [i+20] [j++] = i;
+  a2 [i-3] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "Replacing reference: " 5 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */