
[PING] Optional alternative base_expr in finding basis for CAND_REFs

Message ID 52A06B62.3050508@arm.com
State New

Commit Message

Yufeng Zhang Dec. 5, 2013, 12:02 p.m. UTC
On 12/04/13 13:08, Bill Schmidt wrote:
> On Wed, 2013-12-04 at 11:26 +0100, Richard Biener wrote:
>> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
>> <wschmidt@linux.vnet.ibm.com>  wrote:
>>> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>>>> Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
>>>>> On 12/03/13 14:20, Richard Biener wrote:
>>>>>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>>>>> wrote:
>>>>>>> On 12/03/13 06:48, Jeff Law wrote:
>>>>>>>>
>>>>>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>>>>>>>
>>>>>>>>> Ping~
>>>>>>>>>
>>>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yufeng
>>>>>>>>>
>>>>>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second version of your original patch is ok with me with
>>>>>>>>>>>>> the following changes.  Sorry for the little side adventure
>>>>>>>>>>>>> into the next-interp logic; in the end that's going to hurt
>>>>>>>>>>>>> more than it helps in this case.  Thanks for having a look at
>>>>>>>>>>>>> it, anyway.  Thanks also for cleaning up this version to be
>>>>>>>>>>>>> less intrusive to common interfaces; I appreciate it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>>>>>>>>>>>> with the suggested changes incorporated.
>>>>>>>>>>>>
>>>>>>>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>>>>>>>> experiment; it was a good chance to gain insight into the pass.
>>>>>>>>>>>> Many thanks for your prompt replies and patience in guiding me!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>>>>>>>> approval,
>>>>>>>>>>>>> as I'm not a maintainer.
>>>>>>>>
>>>>>>>> First a note: I need to check on the voting for Bill as the slsr
>>>>>>>> maintainer with the steering committee.  Voting was in progress
>>>>>>>> just before the close of stage1 development, so I haven't tallied
>>>>>>>> the results :-)
>>>>>>>
>>>>>>>
>>>>>>> Looking forward to some good news! :)
>>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, you are right that non-trivial 'base' trees are rarely
>>>>>>>>>> shared.  The cache is introduced mainly because
>>>>>>>>>> get_alternative_base () may be called twice on the same 'base'
>>>>>>>>>> tree, once in find_basis_for_candidate () for look-up and the
>>>>>>>>>> other time in alloc_cand_and_find_basis () for
>>>>>>>>>> record_potential_basis ().  I'm happy to leave out the cache if
>>>>>>>>>> you think the benefit is trivial.
>>>>>>>>
>>>>>>>> Without some sense of how expensive the lookups are vs. how often
>>>>>>>> the cache hits, it's awfully hard to know if the cache is worth it.
>>>>>>>>
>>>>>>>> I'd say take it out unless you have some sense it's really saving
>>>>>>>> time.  It's a pretty minor implementation detail either way.
>>>>>>>
>>>>>>>
>>>>>>> I think the affine tree routines are generally expensive; it is
>>>>>>> worth having a cache to avoid calling them too many times.  I ran
>>>>>>> the slsr-*.c tests under gcc.dg/tree-ssa/ and found that the cache
>>>>>>> hit rates range from 55.6% to 90%, with 73.5% as the average.  The
>>>>>>> samples may not represent real-world scenarios well, but they do
>>>>>>> show that the 'base' tree can be shared to some extent.  So I'd
>>>>>>> like to keep the cache in the patch.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> +/* { dg-do compile } */
>>>>>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>>>>>>>> +
>>>>>>>>>>> +typedef int arr_2[50][50];
>>>>>>>>>>> +
>>>>>>>>>>> +void foo (arr_2 a2, int v1)
>>>>>>>>>>> +{
>>>>>>>>>>> +  int i, j;
>>>>>>>>>>> +
>>>>>>>>>>> +  i = v1 + 5;
>>>>>>>>>>> +  j = i;
>>>>>>>>>>> +  a2 [i-10] [j] = 2;
>>>>>>>>>>> +  a2 [i] [j++] = i;
>>>>>>>>>>> +  a2 [i+20] [j++] = i;
>>>>>>>>>>> +  a2 [i-3] [i-1] += 1;
>>>>>>>>>>> +  return;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>>>>>>>
>>>>>>>>>>> Scanning for 5 MEMs looks nonsensical.  What transform do you
>>>>>>>>>>> expect?  I see other slsr testcases doing similar nonsensical
>>>>>>>>>>> checking, which is bad, too.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As slsr optimizes CAND_REF candidates by simply lowering them
>>>>>>>>>> from e.g. ARRAY_REF to MEM_REF, I think scanning for the number
>>>>>>>>>> of MEM_REFs is an effective check.  Alternatively, I can add a
>>>>>>>>>> follow-up patch to add a dumping facility in replace_ref () to
>>>>>>>>>> print out the replacement actions when -fdump-tree-slsr-details
>>>>>>>>>> is on.
>>>>>>>>
>>>>>>>> I think adding some details to the dump and scanning for them
>>>>>>>> would be better.  That's the only change that is required for this
>>>>>>>> to move forward.
>>>>>>>
>>>>>>>
>>>>>>> I've updated the patch to dump more details when
>>>>>>> -fdump-tree-slsr-details is on.  The tests have also been updated
>>>>>>> to scan for these new dumps instead of MEMs.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I suggest doing it quickly.  We're well past the stage1 close at
>>>>>>>> this point.
>>>>>>>
>>>>>>>
>>>>>>> The bootstrapping on x86_64 is still running.  OK to commit if it
>>>>>>> succeeds?
>>>>>>
>>>>>> I still don't like it.  It's using the wrong, and too expensive,
>>>>>> tools to do stuff.  What kind of bases are we ultimately interested
>>>>>> in?  Browsing the code it looks like we have
>>>>>>
>>>>>>     /* Base expression for the chain of candidates:  often, but not
>>>>>>        always, an SSA name.  */
>>>>>>     tree base_expr;
>>>>>>
>>>>>> which isn't really too informative, but I suppose they are all
>>>>>> kind-of-gimple_val()s?  That said, I wonder if you can simply use
>>>>>> get_addr_base_and_unit_offset in place of get_alternative_base (),
>>>>>> ignoring the returned offset.
>>>>>
>>>>> 'base_expr' is essentially the base address of a handled_component_p,
>>>>> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most cases, it is the address
>>>>> of the object returned by get_inner_reference ().
>>>>>
>>>>> Given a test case like the following:
>>>>>
>>>>> typedef int arr_2[20][20];
>>>>>
>>>>> void foo (arr_2 a2, int i, int j)
>>>>> {
>>>>>    a2[i+10][j] = 1;
>>>>>    a2[i+10][j+1] = 1;
>>>>>    a2[i+20][j] = 1;
>>>>> }
>>>>>
>>>>> The IR before SLSR is (on x86_64):
>>>>>
>>>>>    _2 = (long unsigned int) i_1(D);
>>>>>    _3 = _2 * 80;
>>>>>    _4 = _3 + 800;
>>>>>    _6 = a2_5(D) + _4;
>>>>>    *_6[j_8(D)] = 1;
>>>>>    _10 = j_8(D) + 1;
>>>>>    *_6[_10] = 1;
>>>>>    _12 = _3 + 1600;
>>>>>    _13 = a2_5(D) + _12;
>>>>>    *_13[j_8(D)] = 1;
>>>>>
>>>>> The base_expr for the 1st and 2nd memory references is the same, i.e.
>>>>> _6, while the base_expr for a2[i+20][j] is _13.
>>>>>
>>>>> _13 is essentially (_6 + 800), so all three memory references
>>>>> essentially share the same base address.  As their strides are also
>>>>> the same (MULT_EXPR (j, 4)), all three references can be lowered to
>>>>> MEM_REFs.  What this patch does is use the tree affine tools to help
>>>>> recognize the underlying base address expression; as this requires
>>>>> looking into the definitions of SSA_NAMEs,
>>>>> get_addr_base_and_unit_offset () won't help here.
>>>>>
>>>>> Bill has helped me explore other ways of achieving this in SLSR, but
>>>>> so far we think this is the best way to proceed.  The use of the tree
>>>>> affine routines has been restricted to CAND_REFs only, and there is
>>>>> the aforementioned cache facility to help reduce the overhead.
>>>>>
>>>>> Thanks,
>>>>> Yufeng
>>>>>
>>>>> P.S. Some more details on what the patch does:
>>>>>
>>>>> The CAND_REFs for the three memory references are:
>>>>>
>>>>>   6  [2] *_6[j_8(D)] = 1;
>>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>>>>       basis: 0  dependent: 8  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>>    8  [2] *_6[_10] = 1;
>>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>>>>>       basis: 6  dependent: 11  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>>   11  [2] *_13[j_8(D)] = 1;
>>>>>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>>>>       basis: 8  dependent: 0  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>> Before the patch, the strength reduction candidate chains for the
>>>>> three CAND_REFs are:
>>>>>
>>>>>    _6 ->  6 ->  8
>>>>>    _13 ->  11
>>>>>
>>>>> i.e. SLSR recognizes that the first two references share the same
>>>>> basis, while the last one is on its own.
>>>>>
>>>>> With the patch, an extra candidate chain can be recognized:
>>>>>
>>>>>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
>>>>>
>>>>> i.e. all of the three references are found to have the same basis
>>>>> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>>>>> _6
>>>>> or _13, with the immediate offset removed.  The pass is now able to
>>>>> lower all of the three references, instead of the first two only, to
>>>>> MEM_REFs.
>>>>
>>>> Ok, so slsr handles arbitrarily complex bases and figures out common components?  If so, then why not just use get_inner_reference?  After all, slsr does not use tree-affine as the representation for bases (which it could?)
>>>
>>> I think that's overstating SLSR's current capabilities a bit. :)  We do
>>> use get_inner_reference to come up with the base expression for
>>> reference candidates (based on some of your suggestions a couple of
>>> years back).  However, in the case of multiple levels of array
>>> references, we miss opportunities because get_inner_reference stops at
>>> an SSA name that could be further expanded by following its definition
>>> back to a more fundamental base expression.
>>
>> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
>> same problem.
>>
>>> Part of the issue here is that reference candidates are the basis for
>>> a more specific optimization than the mult and add candidates.  The
>>> latter have a more general framework for building up a record of
>>> simple affine expressions that can be strength-reduced.  Ultimately we
>>> ought to be able to do something similar for reference candidates,
>>> building up simple affine expressions from base expressions, so that
>>> everything is done in forward order and the tree-affine interfaces
>>> aren't needed.  But that will take some more fundamental design
>>> changes, and since this provides some good improvements for important
>>> cases, I feel it's reasonable to get this into the release.
>>
>> But I fail to see what is special about doing the dance to affine and
>> then back to trees just to drop the constant offset, which would be
>> done by get_inner_reference as well, and more cheaply, if you just
>> ignore bitpos.
>
> I'm not sure what you're suggesting that he use get_inner_reference on
> at this point.  At the point where the affine machinery is invoked, the
> memory reference was already expanded with get_inner_reference, and
> there was no basis involving the SSA name produced as the base.  The
> affine machinery is invoked on that SSA name to see if it is hiding
> another base.  There's no additional memory reference to use
> get_inner_reference on, just potentially some pointer arithmetic.
>
> That said, if we have real compile-time issues, we should hold off on
> this patch for this release.
>
> Yufeng, please time some reasonably large benchmarks (some version of
> SPECint or similar) and report back here before the patch goes in.

I've got some build time data for SPEC2Kint.

On x86_64 the -O3 builds take about 4m7.5s with or without the patch 
(consistent over 3 samples).  The difference in the -O3 build time on 
ARM Cortex-A15 is also within 2 seconds.

The bootstrapping time on x86_64 is 134m48.040s without the patch and 
134m46.889s with the patch; this data is preliminary as I only sampled 
once, but the difference in the bootstrapping time on ARM Cortex-A15 is 
also within 5 seconds.

I can further time SPEC2006int if necessary.

I've also prepared a patch to further reduce the number of calls to 
tree-affine expansion by checking whether the passed-in BASE in 
get_alternative_base () is simply an SSA_NAME of a declared variable. 
Please see the inlined patch below.

Thanks,
Yufeng



Comments

Bill Schmidt Dec. 5, 2013, 1:21 p.m. UTC | #1
On Thu, 2013-12-05 at 12:02 +0000, Yufeng Zhang wrote:
> [...]
>
> I've also prepared a patch to further reduce the number of calls to 
> tree-affine expansion by checking whether the passed-in BASE in 
> get_alternative_base () is simply an SSA_NAME of a declared variable. 
> Please see the inlined patch below.
> 
> Thanks,
> Yufeng
> 
> 
> diff --git a/gcc/gimple-ssa-strength-reduction.c 
> b/gcc/gimple-ssa-strength-reduction.c
> index 26502c3..2984f06 100644
> --- a/gcc/gimple-ssa-strength-reduction.c
> +++ b/gcc/gimple-ssa-strength-reduction.c
> @@ -437,13 +437,22 @@ get_alternative_base (tree base)
> 
>     if (result == NULL)
>       {
> -      tree expr;
> -      aff_tree aff;
> +      tree expr = NULL;
> +      gimple def = NULL;
> 
> -      tree_to_aff_combination_expand (base, TREE_TYPE (base),
> -				      &aff, &name_expansions);
> -      aff.offset = tree_to_double_int (integer_zero_node);
> -      expr = aff_combination_to_tree (&aff);
> +      if (TREE_CODE (base) == SSA_NAME)
> +	def = SSA_NAME_DEF_STMT (base);
> +
> +      /* Avoid calling the expensive tree-affine expansion if BASE
> +         is just an SSA_NAME of, e.g. a PARM_DECL.  */
> +      if (!def || (is_gimple_assign (def) && gimple_assign_lhs (def) == base))

Well, that just isn't right.  !def indicates you have a parameter, so
why call tree_to_aff_combination_expand in that case?  Just forget this
addition and check for flag_expensive_optimizations as Richard suggested
in another branch of this thread.

Previous version of the patch is ok with this change, and with a comment
added that we should eliminate this backtracking with better forward
analysis in a future release.

Thanks,
Bill

> +	{
> +	  aff_tree aff;
> +	  tree_to_aff_combination_expand (base, TREE_TYPE (base),
> +					  &aff, &name_expansions);
> +	  aff.offset = tree_to_double_int (integer_zero_node);
> +	  expr = aff_combination_to_tree (&aff);
> +	}
> 
>         result = (tree *) pointer_map_insert (alt_base_map, base);
>         gcc_assert (!*result);
>

Patch

diff --git a/gcc/gimple-ssa-strength-reduction.c 
b/gcc/gimple-ssa-strength-reduction.c
index 26502c3..2984f06 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -437,13 +437,22 @@  get_alternative_base (tree base)

    if (result == NULL)
      {
-      tree expr;
-      aff_tree aff;
+      tree expr = NULL;
+      gimple def = NULL;

-      tree_to_aff_combination_expand (base, TREE_TYPE (base),
-				      &aff, &name_expansions);
-      aff.offset = tree_to_double_int (integer_zero_node);
-      expr = aff_combination_to_tree (&aff);
+      if (TREE_CODE (base) == SSA_NAME)
+	def = SSA_NAME_DEF_STMT (base);
+
+      /* Avoid calling the expensive tree-affine expansion if BASE
+         is just an SSA_NAME of, e.g. a PARM_DECL.  */
+      if (!def || (is_gimple_assign (def) && gimple_assign_lhs (def) == base))
+	{
+	  aff_tree aff;
+	  tree_to_aff_combination_expand (base, TREE_TYPE (base),
+					  &aff, &name_expansions);
+	  aff.offset = tree_to_double_int (integer_zero_node);
+	  expr = aff_combination_to_tree (&aff);
+	}

        result = (tree *) pointer_map_insert (alt_base_map, base);
        gcc_assert (!*result);