diff mbox

Optional alternative base_expr in finding basis for CAND_REFs

Message ID 5282ACE5.8020304@arm.com
State New
Headers show

Commit Message

Yufeng Zhang Nov. 12, 2013, 10:34 p.m. UTC
Hi Bill,

Many thanks for the review.

I find your suggestion on using the next_interp field quite 
enlightening.  I prepared a patch which adds changes without modifying 
the framework.  With the patch, the slsr pass now tries to create a 
second candidate for each memory accessing gimple statement, and chain 
it to the first one via the next_interp field.

There are two implications in this approach though:

1) For each memory accessing gimple statement, there can be two 
candidates, and these two candidates can be part of different dependency 
graphs respectively (based on different base expr).  Only one of the 
dependency graph should be traversed to do replace_refs.  Most of the 
changes in the patch is to handle this implication.

I am aware that you suggest to follow the next-interp chain only when 
the searching fails for the first interpretation.  However, that doesn't 
work very well, as it can result in worse code-gen.  Taking a varied 
form of the added test slsr-41.c for example:

i1:  a2 [i] [j] = 1;
i2:  a2 [i] [j+1] = 2;
i3:  a2 [i+20] [j] = i;

With the 2nd interpretation created conditionally, the following two 
dependency chains will be established:

   i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
   i1 --> i3  (base expr is a tree expression of (a2 + i * 200))

the result is that three gimples will be lowered to MEM_REFs differently 
(as the candidates have different base_exprs); the later passes can get 
confused, generating worse code.

What this patch does is to create two interpretations where possible (if 
different base exprs exist); the following dependency chains will be 
produced:

   i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
   i1 --> i2 --> i3  (base expr is a tree expression of (a2 + i * 200))

In analyze_candidates_and_replace, a new function preferred_ref_cand is 
called to analyze a root CAND_REF and replace_refs is only called if 
this root CAND_REF is found to be part of a larger dependency graph (or 
longer dependency chain in simple cases).  In the example above, the 2nd 
dependency chain will be picked up to do replace_refs.

2) The 2nd implication is that the alternative candidate may expose the 
underlying tree expression of a base expr, which can cause more 
aggressive extraction and folding of immediate offsets.  Taking the new 
test slsr-41 for example, the code-gen difference on x86_64 with the 
original patch and this patch is (-O2):

-       leal    5(%rsi), %edx
+       leal    5(%rsi), %eax
         movslq  %esi, %rsi
-       salq    $2, %rsi
-       movslq  %edx, %rax
-       leaq    (%rax,%rax,4), %rax
-       leaq    (%rax,%rax,4), %rcx
-       salq    $3, %rcx
-       leaq    (%rdi,%rcx), %rax
-       addq    %rsi, %rax
-       movl    $2, -1980(%rax)
-       movl    %edx, 20(%rax)
-       movl    %edx, 4024(%rax)
-       leaq    -600(%rdi,%rcx), %rax
-       addl    $1, 16(%rsi,%rax)
+       imulq   $204, %rsi, %rsi
+       addq    %rsi, %rdi
+       movl    $2, -980(%rdi)
+       movl    %eax, 1020(%rdi)
+       movl    %eax, 5024(%rdi)
+       addl    $1, 416(%rdi)
         ret

As you can see, the larger offsets are produced as the affine expander 
is able to look deep into the tree expression.  This raises concern that 
larger immediates can cause worse code-gen when the immediates are out 
of the supported range on a target.  On x86_64 it is not obvious (as it 
allows larger ranges), on arm cortex-a15 the load with the immediate 
5024 will be done by

         movw    r2, #5024
         str     r3, [r0, r2]

which is not optimal.  Things can get worse when there are multiple 
loads/stores with large immediates as each one may require an extra mov 
immediate instruction.  One thing can potentially be done is to reduce 
the strength of multiple large immediates later on in some RTL pass by 
doing an initial offsetting first?  What do you think?  Are you 
particularly concerned about this issue?

The patch passes the bootstrapping on arm and x86_64; the regtest is 
still running.

Here is the changelog:

gcc/

         * gimple-ssa-strength-reduction.c: Include tree-affine.h.
         (name_expansions): New static variable.
         (get_alternative_base): New function.
         (restructure_reference): Add new local variables 'alt_base' and
         'delta'; call get_alternative_base and alloc_cand_and_find_basis
         to create an alternative interpretation.
         (num_of_dependents): New function.
         (preferred_ref_cand): Ditto.
         (analyze_candidates_and_replace): Call preferred_ref_cand for
         CAND_REF and skip replace_refs if the returned value is 
differerent.
         (execute_strength_reduction): call free_affine_expand_cache with
         &name_expansions.

gcc/testsuite/

         * gcc.dg/tree-ssa/slsr-41.c: New test.


For your consideration, I've also attached another patch which is an 
improvement to the original patch.  This patch improves the original one 
by reducing the number of changes to the existing framework, e.g. 
leaving find_basis_for_base_expr unchanged.  While it still slightly 
modifies the interfaces (find_basis_for_candidate and 
record_potential_basis), it has advantage over the 1st patch attached 
here: its impact on the code-gen is much smaller, as it enables more 
ARRAY_REFs to be lowered without handing over the underlying tree 
expression to replace_ref.  It creates the following dependency chains 
for the aforementioned example:

   i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
   i1 --> i2 --> i3  (base expr is a tree expression of (a2 + i * 200))

While they look the same as what the 1st patch does, only one candidate 
is generated for each memory accessing gimple statement; some candiates 
are chained twice, once to a cand_chain with a base_expr of an SSA_NAME 
and the other to a cand_chain with the underlying tree expression as its 
base_expr.  In other words, it produces two different dependency graphs 
without creating different interpretations, by utilizing the existing 
framework of cand_chain and find_basis_for_base_expr.

The patch passes the bootstrapping on arm and x86_64, as well as regtest 
on x86_64.  The following is the changelog entry:

gcc/

         * gimple-ssa-strength-reduction.c: Include tree-affine.h.
         (name_expansions): New static variable.
         (alt_base_map): Ditto.
         (get_alternative_base): New function.
         (find_basis_for_candidate): For CAND_REF, optionally call
         find_basis_for_base_expr with the returned value from
         get_alternative_base.
         (record_potential_basis): Add new parameter 'base' of type 'tree';
         return if base == NULL; use base to set node->base_expr.
         (alloc_cand_and_find_basis): Update; call 
record_potential_basis for
         CAND_REF with the returned value from get_alternative_base.
         (execute_strength_reduction): Call pointer_map_create for 
alt_base_map;
         call free_affine_expand_cache with &name_expansions.

gcc/testsuite/

         * gcc.dg/tree-ssa/slsr-41.c: New test.


Which patch do you like more?

If you have any question on either of the patch, please let me know.

Regards,
Yufeng


On 11/11/13 17:09, Bill Schmidt wrote:
> Hi Yufeng,
>
> The idea is a good one but I don't like your implementation of adding an
> extra expression parameter to look at on the find_basis_for_candidate
> lookup.  This goes against the design of the pass and may not be
> sufficiently general (might there be situations where a third possible
> basis could exist?).
>
> The overall design is set up to have alternate interpretations of
> candidates in the candidate table to handle this sort of ambiguity.  The
> goal for your example is create a second candidate (chained to the first
> one by way of the next_interp field) so that the candidate table looks
> like this:
>
>     8  [2] *_10[j_7(D)] = 2;
>        REF  : _10 + ((sizetype) j_7(D) * 4) + 0 : int[20] *
>        basis: 0  dependent: 0  sibling: 0
>        next-interp: 9  dead-savings: 0
>
>     9  [2] *_10[j_7(D)] = 2;
>        REF  : _5 + ((sizetype) j_7(D) * 4) + 800 : int[20] *
>        basis: 5  dependent: 0  sibling: 0
>        next-interp: 0  dead-savings: 0
>
> This will in turn allow subsequent candidates to be seen in terms of
> either _5 or _10, which may be necessary to avoid missed opportunities.
> There may be a subsequent REF _15 +... that can be an affine expression
> of either of these, for example.
>
> If you fail to find a basis for a candidate with its first
> interpretation, you can then follow the next-interp chain to look for a
> basis for the next one, without the messy passing of extra possibilities
> to the find-basis routine.
>
> I haven't read the patch in detail, but I think this should give you
> enough to work with to re-design the idea to fit better with the
> existing framework.  Please let me know if you need more information, or
> if you feel I've misunderstood something.
>
> Thanks,
> Bill
>
> On Mon, 2013-11-04 at 18:41 +0000, Yufeng Zhang wrote:
>> Hi,
>>
>> This patch extends the slsr pass to optionally use an alternative base
>> expression in finding basis for CAND_REFs.  Currently the pass uses
>> hash-based algorithm to match the base_expr in a candidate.  Given a
>> test case like the following, slsr will not be able to recognize the two
>> CAND_REFs have the same basis, as their base_expr are of different
>> SSA_NAMEs:
>>
>> typedef int arr_2[20][20];
>>
>> void foo (arr_2 a2, int i, int j)
>> {
>>     a2[i][j] = 1;
>>     a2[i + 10][j] = 2;
>> }
>>
>> The gimple dump before slsr is like the following (using an
>> arm-none-eabi gcc):
>>
>>     i.0_2 = (unsigned int) i_1(D);
>>     _3 = i.0_2 * 80;
>>     _5 = a2_4(D) + _3;
>>     *_5[j_7(D)] = 1;<----
>>     _9 = _3 + 800;
>>     _10 = a2_4(D) + _9;
>>     *_10[j_7(D)] = 2;<----
>>
>> Here are the dumps for the two CAND_REFs generated for the two
>> statements pointed by the arrows:
>>
>>
>>     4  [2] _5 = a2_4(D) + _3;
>>        ADD  : a2_4(D) + (80 * i_1(D)) : int[20] *
>>        basis: 0  dependent: 0  sibling: 0
>>        next-interp: 0  dead-savings: 0
>>
>>     8  [2] *_10[j_7(D)] = 2;
>>        REF  : _10 + ((sizetype) j_7(D) * 4) + 0 : int[20] *
>>        basis: 5  dependent: 0  sibling: 0
>>        next-interp: 0  dead-savings: 0
>>
>> As mentioned previously, slsr cannot establish that candidate 4 is the
>> basis for the candidate 8, as they have different base_exprs: a2_4(D)
>> and _10, respectively.  However, the two references actually only differ
>> by an immediate offset (800).
>>
>> This patch uses the tree affine combination facilities to create an
>> optional alternative base expression to be used in finding (as well as
>> recording) the basis.  It calls tree_to_aff_combination_expand on
>> base_expr, reset the offset field of the generated aff_tree to 0 and
>> generate a tree from it by calling aff_combination_to_tree.
>>
>> The new tree is recorded as a potential basis, and when
>> find_basis_for_candidate fails to find a basis for a CAND_REF in its
>> normal approach, it searches again using a tree expanded in such way.
>> Such an expanded tree usually discloses the expression behind an
>> SSA_NAME.  In the example above, instead of seeing the strength
>> reduction candidate chains like this:
>>
>>     _5 ->  5
>>     _10 ->  8
>>
>> we are now having:
>>
>>     _5 ->  5
>>     _10 ->  8
>>     a2_4(D) + (sizetype) i_1(D) * 80 ->  5 ->  8
>>
>> With the candidates 5 and 8 linked to the same tree expression (a2_4(D)
>> + (sizetype) i_1(D) * 80), slsr is now able to establish that 5 is the
>> basis of 8.
>>
>> The patch doesn't attempt to change the content of any CAND_REF though.
>>    It only enables CAND_REFs which (1) have the same stride and (2) have
>> the underlying expressions of their base_expr only differ in immediate
>> offsets,  to be recognized to have the same basis.  The statements with
>> such CAND_REFs will be lowered to MEM_REFs, and later on the RTL
>> expander shall be able to fold and re-associate the immediate offsets to
>> the rightmost side of the addressing expression, and therefore exposes
>> the common sub-expression successfully.
>>
>> The code-gen difference of the example code on arm with -O2
>> -mcpu=cortex-15 is:
>>
>>           mov     r3, r1, asl #6
>> -       add     ip, r0, r2, asl #2
>>           str     lr, [sp, #-4]!
>> +       mov     ip, #1
>> +       mov     lr, #2
>>           add     r1, r3, r1, asl #4
>> -       mov     lr, #1
>> -       mov     r3, #2
>>           add     r0, r0, r1
>> -       add     r0, r0, #800
>> -       str     lr, [ip, r1]
>> -       str     r3, [r0, r2, asl #2]
>> +       add     r3, r0, r2, asl #2
>> +       str     ip, [r0, r2, asl #2]
>> +       str     lr, [r3, #800]
>>           ldr     pc, [sp], #4
>>
>> One fewer instruction in this simple case.
>>
>> The example used in illustration is too simple to show code-gen
>> difference on x86_64, but the included test case will show the benefit
>> of the patch quite obviously.
>>
>> The patch has passed
>>
>> * bootstrapping on arm and x86_64
>> * regtest on arm-none-eabi,  aarch64-none-elf and x86_64
>>
>> There is no regression in SPEC2K on arm or x86_64.
>>
>> OK to commit to the trunk?
>>
>> Any comment is welcomed!
>>
>> Thanks,
>> Yufeng
>>
>>
>> gcc/
>>
>>           * gimple-ssa-strength-reduction.c: Include tree-affine.h.
>>           (find_basis_for_base_expr): Update comment.
>>           (find_basis_for_candidate): Add new parameter 'alt_base_expr' of
>>           type 'tree'.  Optionally call find_basis_for_base_expr with
>>           'alt_base_expr'.
>>           (record_potential_basis): Add new parameter 'alt_base_expr' of
>>           type 'tree'; set node->base_expr with 'alt_base_expr' if it is
>>           not NULL.
>>           (name_expansions): New static variable.
>>           (get_alternative_base): New function.
>>           (alloc_cand_and_find_basis): Call get_alternative_base for
>> CAND_REF.
>>           Update calls to find_basis_for_candidate and
>> record_potential_basis.
>>           (execute_strength_reduction): Call free_affine_expand_cache with
>>           &name_expansions.
>>
>> gcc/testsuite/
>>
>>           * gcc.dg/tree-ssa/slsr-41.c: New test.
>
>

Comments

Bill Schmidt Nov. 13, 2013, 6:04 p.m. UTC | #1
Hi Yufeng,

On Tue, 2013-11-12 at 22:34 +0000, Yufeng Zhang wrote:
> Hi Bill,
> 
> Many thanks for the review.
> 
> I find your suggestion on using the next_interp field quite 
> enlightening.  I prepared a patch which adds changes without modifying 
> the framework.  With the patch, the slsr pass now tries to create a 
> second candidate for each memory accessing gimple statement, and chain 
> it to the first one via the next_interp field.
> 
> There are two implications in this approach though:
> 
> 1) For each memory accessing gimple statement, there can be two 
> candidates, and these two candidates can be part of different dependency 
> graphs respectively (based on different base expr).  Only one of the 
> dependency graph should be traversed to do replace_refs.  Most of the 
> changes in the patch is to handle this implication.
> 
> I am aware that you suggest to follow the next-interp chain only when 
> the searching fails for the first interpretation.  However, that doesn't 
> work very well, as it can result in worse code-gen.  Taking a varied 
> form of the added test slsr-41.c for example:
> 
> i1:  a2 [i] [j] = 1;
> i2:  a2 [i] [j+1] = 2;
> i3:  a2 [i+20] [j] = i;
> 
> With the 2nd interpretation created conditionally, the following two 
> dependency chains will be established:
> 
>    i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
>    i1 --> i3  (base expr is a tree expression of (a2 + i * 200))

So it seems to me that really what needs to happen is to unify those two
base_exprs.  We don't currently have logic in this pass to look up an
SSA name based on {base, index, stride, cand_type}, but that could be
done with a hash table.  For now to save processing time it would make
sense to only do that for MEM candidates, though the cand_type should be
included in the hash to allow this to be used for other candidate types
if necessary.  Of course, the SSA name definition must dominate the
candidate to be eligible as a basis, and that should be checked, but
this should generally be the case.

The goal should be for all of these references to have the same base
expr so that i3 can choose either i1 or i2 as a basis.  (For now the
logic in the pass chooses the most dominating basis, but eventually I
would like to add heuristics to make better choices.)

If all three of these use the same base expr, that should eliminate your
concerns, right?

> 
> the result is that three gimples will be lowered to MEM_REFs differently 
> (as the candidates have different base_exprs); the later passes can get 
> confused, generating worse code.
> 
> What this patch does is to create two interpretations where possible (if 
> different base exprs exist); the following dependency chains will be 
> produced:
> 
>    i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
>    i1 --> i2 --> i3  (base expr is a tree expression of (a2 + i * 200))
> 
> In analyze_candidates_and_replace, a new function preferred_ref_cand is 
> called to analyze a root CAND_REF and replace_refs is only called if 
> this root CAND_REF is found to be part of a larger dependency graph (or 
> longer dependency chain in simple cases).  In the example above, the 2nd 
> dependency chain will be picked up to do replace_refs.
> 
> 2) The 2nd implication is that the alternative candidate may expose the 
> underlying tree expression of a base expr, which can cause more 
> aggressive extraction and folding of immediate offsets.  Taking the new 
> test slsr-41 for example, the code-gen difference on x86_64 with the 
> original patch and this patch is (-O2):
> 
> -       leal    5(%rsi), %edx
> +       leal    5(%rsi), %eax
>          movslq  %esi, %rsi
> -       salq    $2, %rsi
> -       movslq  %edx, %rax
> -       leaq    (%rax,%rax,4), %rax
> -       leaq    (%rax,%rax,4), %rcx
> -       salq    $3, %rcx
> -       leaq    (%rdi,%rcx), %rax
> -       addq    %rsi, %rax
> -       movl    $2, -1980(%rax)
> -       movl    %edx, 20(%rax)
> -       movl    %edx, 4024(%rax)
> -       leaq    -600(%rdi,%rcx), %rax
> -       addl    $1, 16(%rsi,%rax)
> +       imulq   $204, %rsi, %rsi
> +       addq    %rsi, %rdi
> +       movl    $2, -980(%rdi)
> +       movl    %eax, 1020(%rdi)
> +       movl    %eax, 5024(%rdi)
> +       addl    $1, 416(%rdi)
>          ret
> 
> As you can see, the larger offsets are produced as the affine expander 
> is able to look deep into the tree expression.  This raises concern that 
> larger immediates can cause worse code-gen when the immediates are out 
> of the supported range on a target.  On x86_64 it is not obvious (as it 
> allows larger ranges), on arm cortex-a15 the load with the immediate 
> 5024 will be done by
> 
>          movw    r2, #5024
>          str     r3, [r0, r2]
> 
> which is not optimal.  Things can get worse when there are multiple 
> loads/stores with large immediates as each one may require an extra mov 
> immediate instruction.  One thing can potentially be done is to reduce 
> the strength of multiple large immediates later on in some RTL pass by 
> doing an initial offsetting first?  What do you think?  Are you 
> particularly concerned about this issue?

To me, this seems like something that the middle end should not concern
itself about, but that may be oversimplifying.  I would think this is
not the only pass that can create such issues, and the overall code
generation should usually be improved anyway.  Richard, would you care
to weigh in here?

A couple of quick comments on the next_interp patch:

 * You don't need num_of_dependents ().  You should be able to add a
forward declaration for count_candidates () and use it.
 * Your new test case is missing a final newline, so your patch doesn't
apply cleanly.

Please look into unifying the base expressions, as I believe you should
not need the preferred_ref_cand logic if you do that.

I still prefer the approach of using next_interp for its generality and
expandibility.

Thanks,
Bill
Yufeng Zhang Nov. 13, 2013, 7:32 p.m. UTC | #2
Hi Bill,

On 11/13/13 18:04, Bill Schmidt wrote:
> Hi Yufeng,
>
> On Tue, 2013-11-12 at 22:34 +0000, Yufeng Zhang wrote:
>> Hi Bill,
>>
>> Many thanks for the review.
>>
>> I find your suggestion on using the next_interp field quite
>> enlightening.  I prepared a patch which adds changes without modifying
>> the framework.  With the patch, the slsr pass now tries to create a
>> second candidate for each memory accessing gimple statement, and chain
>> it to the first one via the next_interp field.
>>
>> There are two implications in this approach though:
>>
>> 1) For each memory accessing gimple statement, there can be two
>> candidates, and these two candidates can be part of different dependency
>> graphs respectively (based on different base expr).  Only one of the
>> dependency graph should be traversed to do replace_refs.  Most of the
>> changes in the patch is to handle this implication.
>>
>> I am aware that you suggest to follow the next-interp chain only when
>> the searching fails for the first interpretation.  However, that doesn't
>> work very well, as it can result in worse code-gen.  Taking a varied
>> form of the added test slsr-41.c for example:
>>
>> i1:  a2 [i] [j] = 1;
>> i2:  a2 [i] [j+1] = 2;
>> i3:  a2 [i+20] [j] = i;
>>
>> With the 2nd interpretation created conditionally, the following two
>> dependency chains will be established:
>>
>>     i1 -->  i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
>>     i1 -->  i3  (base expr is a tree expression of (a2 + i * 200))
>
> So it seems to me that really what needs to happen is to unify those two
> base_exprs.  We don't currently have logic in this pass to look up an
> SSA name based on {base, index, stride, cand_type}, but that could be
> done with a hash table.  For now to save processing time it would make
> sense to only do that for MEM candidates, though the cand_type should be
> included in the hash to allow this to be used for other candidate types
> if necessary.  Of course, the SSA name definition must dominate the
> candidate to be eligible as a basis, and that should be checked, but
> this should generally be the case.

I'm not quite sure if the SSA_NAME look-up works; maybe I haven't fully 
understood what you suggest.

For i1 --> i3, the base_expr is the tree expression (a2 + i * 200), 
which is the result of a sequence of operations (conversion to affine, 
immediate offset removal and conversion to tree), with another SSA_NAME 
as the input.  In other words, there are two SSA_NAMEs involved in the 
example:

   _s1: (a2 + i * 200).
   _s2: (a2 + (i * 200 + 4000))

their strides and indexes are different.

I guess what you suggest is that given the tree expression (a2 + i * 
200), look up an SSA_NAME and return _s1.  If that is the case, the 
challenge will be how to analyze the tree expression and get the 
information on its {base, index, stride, cand_type}.  While it would be 
too specific and narrative to check for a POINTER_PLUS_EXPR expression, 
the existing framework (e.g. create_add_ssa_cand) seems to assume that 
the analyzed tree represent a genuine gimple statement.

Moreover, there may not be an SSA_NAME exists, for example in the 
following case:

   i1:  a2 [i+1] [j] = 1;
   i2:  a2 [i+1] [j+1] = 2;
   i3:  a2 [i+20] [j] = i;

you wouldn't be able to find an SSA_NAME for (a2 + i * 200).

[snip]
> A couple of quick comments on the next_interp patch:
>
>   * You don't need num_of_dependents ().  You should be able to add a
> forward declaration for count_candidates () and use it.

Missed count_candidates (); thanks!

>   * Your new test case is missing a final newline, so your patch doesn't
> apply cleanly.

I'll fix it.

> Please look into unifying the base expressions, as I believe you should
> not need the preferred_ref_cand logic if you do that.

I would also like to live without preferred_ref_cand if feasible . :)

> I still prefer the approach of using next_interp for its generality and
> expandibility.

Sure; this approach indeed fit the framework better.


Regards,
Yufeng
Bill Schmidt Nov. 13, 2013, 8:11 p.m. UTC | #3
Hi Yufeng,

On Wed, 2013-11-13 at 19:32 +0000, Yufeng Zhang wrote:
> Hi Bill,
> 
> On 11/13/13 18:04, Bill Schmidt wrote:
> > Hi Yufeng,
> >
> > On Tue, 2013-11-12 at 22:34 +0000, Yufeng Zhang wrote:
> >> Hi Bill,
> >>
> >> Many thanks for the review.
> >>
> >> I find your suggestion on using the next_interp field quite
> >> enlightening.  I prepared a patch which adds changes without modifying
> >> the framework.  With the patch, the slsr pass now tries to create a
> >> second candidate for each memory accessing gimple statement, and chain
> >> it to the first one via the next_interp field.
> >>
> >> There are two implications in this approach though:
> >>
> >> 1) For each memory accessing gimple statement, there can be two
> >> candidates, and these two candidates can be part of different dependency
> >> graphs respectively (based on different base expr).  Only one of the
> >> dependency graph should be traversed to do replace_refs.  Most of the
> >> changes in the patch is to handle this implication.
> >>
> >> I am aware that you suggest to follow the next-interp chain only when
> >> the searching fails for the first interpretation.  However, that doesn't
> >> work very well, as it can result in worse code-gen.  Taking a varied
> >> form of the added test slsr-41.c for example:
> >>
> >> i1:  a2 [i] [j] = 1;
> >> i2:  a2 [i] [j+1] = 2;
> >> i3:  a2 [i+20] [j] = i;
> >>
> >> With the 2nd interpretation created conditionally, the following two
> >> dependency chains will be established:
> >>
> >>     i1 -->  i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
> >>     i1 -->  i3  (base expr is a tree expression of (a2 + i * 200))
> >
> > So it seems to me that really what needs to happen is to unify those two
> > base_exprs.  We don't currently have logic in this pass to look up an
> > SSA name based on {base, index, stride, cand_type}, but that could be
> > done with a hash table.  For now to save processing time it would make
> > sense to only do that for MEM candidates, though the cand_type should be
> > included in the hash to allow this to be used for other candidate types
> > if necessary.  Of course, the SSA name definition must dominate the
> > candidate to be eligible as a basis, and that should be checked, but
> > this should generally be the case.
> 
> I'm not quite sure if the SSA_NAME look-up works; maybe I haven't fully 
> understood what you suggest.
> 
> For i1 --> i3, the base_expr is the tree expression (a2 + i * 200), 
> which is the result of a sequence of operations (conversion to affine, 
> immediate offset removal and conversion to tree), with another SSA_NAME 
> as the input.  In other words, there are two SSA_NAMEs involved in the 
> example:
> 
>    _s1: (a2 + i * 200).
>    _s2: (a2 + (i * 200 + 4000))
> 
> their strides and indexes are different.
> 
> I guess what you suggest is that given the tree expression (a2 + i * 
> 200), look up an SSA_NAME and return _s1.  If that is the case, the 
> challenge will be how to analyze the tree expression and get the 
> information on its {base, index, stride, cand_type}.  While it would be 
> too specific and narrative to check for a POINTER_PLUS_EXPR expression, 
> the existing framework (e.g. create_add_ssa_cand) seems to assume that 
> the analyzed tree represent a genuine gimple statement.
> 
> Moreover, there may not be an SSA_NAME exists, for example in the 
> following case:
> 
>    i1:  a2 [i+1] [j] = 1;
>    i2:  a2 [i+1] [j+1] = 2;
>    i3:  a2 [i+20] [j] = i;
> 
> you wouldn't be able to find an SSA_NAME for (a2 + i * 200).

Ok.  It is probably too much to hope for to get a sufficiently general
approach to handle all of these cases cleanly.

Bleah.  The whole preferred_ref_cand business seems very ad hoc to me,
and to some extent is closing the barn door after the cows have escaped.
Perhaps we can't use the next-interpretation infrastructure to solve
this problem ideally, in which case I apologize for leading you down
this path.  The alternate patch at least keeps the candidate tree in a
straightforward state, and the new version is less intrusive than the
original.

Let me look that version over more carefully and I'll get back to you.
Thanks for your patience.

Bill

> 
> [snip]
> > A couple of quick comments on the next_interp patch:
> >
> >   * You don't need num_of_dependents ().  You should be able to add a
> > forward declaration for count_candidates () and use it.
> 
> Missed count_candidates (); thanks!
> 
> >   * Your new test case is missing a final newline, so your patch doesn't
> > apply cleanly.
> 
> I'll fix it.
> 
> > Please look into unifying the base expressions, as I believe you should
> > not need the preferred_ref_cand logic if you do that.
> 
> I would also like to live without preferred_ref_cand if feasible . :)
> 
> > I still prefer the approach of using next_interp for its generality and
> > expandibility.
> 
> Sure; this approach indeed fit the framework better.
> 
> 
> Regards,
> Yufeng
>
Bill Schmidt Nov. 13, 2013, 8:54 p.m. UTC | #4
Hi Yufeng,

The second version of your original patch is ok with me with the
following changes.  Sorry for the little side adventure into the
next-interp logic; in the end that's going to hurt more than it helps in
this case.  Thanks for having a look at it, anyway.  Thanks also for
cleaning up this version to be less intrusive to common interfaces; I
appreciate it.


>diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
>index 88afc91..d069246 100644
>--- a/gcc/gimple-ssa-strength-reduction.c
>+++ b/gcc/gimple-ssa-strength-reduction.c
>@@ -53,6 +53,7 @@ along with GCC; see the file COPYING3.  If not see
> #include "params.h"
> #include "hash-table.h"
> #include "tree-ssa-address.h"
>+#include "tree-affine.h"
> 
> /* Information about a strength reduction candidate.  Each statement
>    in the candidate table represents an expression of one of the
>@@ -420,6 +421,42 @@ cand_chain_hasher::equal (const value_type *chain1, const compare_type *chain2)
> /* Hash table embodying a mapping from base exprs to chains of candidates.  */
> static hash_table <cand_chain_hasher> base_cand_map;
> 
>+/* Pointer map used by tree_to_aff_combination_expand.  */
>+static struct pointer_map_t *name_expansions;
>+/* Pointer map embodying a mapping from bases to alternative bases.  */
>+static struct pointer_map_t *alt_base_map;
>+
>+/* Given BASE, use the tree affine combiniation facilities to
>+   find the underlying tree expression for BASE, with any
>+   immediate offset excluded.  */
>+
>+static tree
>+get_alternative_base (tree base)
>+{
>+  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
>+
>+  if (result == NULL)
>+    {
>+      tree expr;
>+      aff_tree aff;
>+
>+      tree_to_aff_combination_expand (base, TREE_TYPE (base),
>+				      &aff, &name_expansions);
>+      aff.offset = tree_to_double_int (integer_zero_node);
>+      expr = aff_combination_to_tree (&aff);
>+
>+      result = (tree *) pointer_map_insert (alt_base_map, base);
>+      gcc_assert (!*result);
>+
>+      if (expr == base)
>+	*result = NULL;
>+      else
>+	*result = expr;
>+    }
>+
>+  return *result;
>+}
>+
> /* Look in the candidate table for a CAND_PHI that defines BASE and
>    return it if found; otherwise return NULL.  */
> 
>@@ -439,9 +476,10 @@ find_phi_def (tree base)
>   return c->cand_num;
> }
> 
>-/* Helper routine for find_basis_for_candidate.  May be called twice:
>+/* Helper routine for find_basis_for_candidate.  May be called three times:
>    once for the candidate's base expr, and optionally again for the
>-   candidate's phi definition.  */
>+   candidate's phi definition, as well as for an alternative base expr
>+   in the case of CAND_REF.  */

Technically this will never be called three times.  It will be called
once for the candidate's base expression, and optionally either for the
candidate's phi definition or for a CAND_REF's alternative base
expression.  (There is no phi processing for a CAND_REF.)

> 
> static slsr_cand_t
> find_basis_for_base_expr (slsr_cand_t c, tree base_expr)
>@@ -518,6 +556,13 @@ find_basis_for_candidate (slsr_cand_t c)
> 	}
>     }
> 
>+  if (!basis && c->kind == CAND_REF)
>+    {
>+      tree alt_base_expr = get_alternative_base (c->base_expr);
>+      if (alt_base_expr)
>+	basis = find_basis_for_base_expr (c, alt_base_expr);
>+    }
>+
>   if (basis)
>     {
>       c->sibling = basis->dependent;
>@@ -528,17 +573,22 @@ find_basis_for_candidate (slsr_cand_t c)
>   return 0;
> }
> 
>-/* Record a mapping from the base expression of C to C itself, indicating that
>-   C may potentially serve as a basis using that base expression.  */
>+/* Record a mapping from BASE to C, indicating that C may potentially serve
>+   as a basis using that base expression.  BASE may be the same as
>+   C->BASE_EXPR; alternatively BASE can be a different tree that share the
>+   underlining expression of C->BASE_EXPR.  */
> 
> static void
>-record_potential_basis (slsr_cand_t c)
>+record_potential_basis (slsr_cand_t c, tree base)
> {
>   cand_chain_t node;
>   cand_chain **slot;
> 
>+  if (base == NULL)
>+    return;

Please do this check outside the common code; it's not necessary except
for CAND_REFs.  Replace with:

  gcc_assert (base);

>+
>   node = (cand_chain_t) obstack_alloc (&chain_obstack, sizeof (cand_chain));
>-  node->base_expr = c->base_expr;
>+  node->base_expr = base;
>   node->cand = c;
>   node->next = NULL;
>   slot = base_cand_map.find_slot (node, INSERT);
>@@ -554,10 +604,18 @@ record_potential_basis (slsr_cand_t c)
> }
> 
> /* Allocate storage for a new candidate and initialize its fields.
>-   Attempt to find a basis for the candidate.  */
>+   Attempt to find a basis for the candidate.
>+
>+   For CAND_REF, an alternative base may also be recorded and used
>+   to find a basis.  This helps cases where the expression hidden
>+   behind BASE (which is usually an SSA_NAME) has immediate offset,
>+   e.g.
>+
>+     a2[i][j] = 1;
>+     a2[i + 20][j] = 2;  */
> 
> static slsr_cand_t
>-alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base, 
>+alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
> 			   double_int index, tree stride, tree ctype,
> 			   unsigned savings)
> {
>@@ -583,7 +641,9 @@ alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
>   else
>     c->basis = find_basis_for_candidate (c);
> 
>-  record_potential_basis (c);
>+  record_potential_basis (c, base);
>+  if (kind == CAND_REF)
>+    record_potential_basis (c, get_alternative_base (base));

Tied to the above change:

if (kind == CAND_REF)
  {
    tree alt_base = get_alternative_base (base);
    if (alt_base)
      record_potential_basis (c, alt_base);
  }

> 
>   return c;
> }
>@@ -3524,6 +3584,9 @@ execute_strength_reduction (void)
>   /* Allocate the mapping from base expressions to candidate chains.  */
>   base_cand_map.create (500);
> 
>+  /* Allocate the mapping from bases to alternative bases.  */
>+  alt_base_map = pointer_map_create ();
>+
>   /* Initialize the loop optimizer.  We need to detect flow across
>      back edges, and this gives us dominator information as well.  */
>   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
>@@ -3539,6 +3602,9 @@ execute_strength_reduction (void)
>       dump_cand_chains ();
>     }
> 
>+  pointer_map_destroy (alt_base_map);
>+  free_affine_expand_cache (&name_expansions);
>+
>   /* Analyze costs and make appropriate replacements.  */
>   analyze_candidates_and_replace ();
> 
>diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
>new file mode 100644
>index 0000000..870d714
>--- /dev/null
>+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
>@@ -0,0 +1,24 @@
>+/* Verify straight-line strength reduction in using
>+   alternative base expr to record and look for the
>+   potential candidate.  */
>+
>+/* { dg-do compile } */
>+/* { dg-options "-O2 -fdump-tree-slsr" } */
>+
>+typedef int arr_2[50][50];
>+
>+void foo (arr_2 a2, int v1)
>+{
>+  int i, j;
>+
>+  i = v1 + 5;
>+  j = i;
>+  a2 [i-10] [j] = 2;
>+  a2 [i] [j++] = i;
>+  a2 [i+20] [j++] = i;
>+  a2 [i-3] [i-1] += 1;
>+  return;
>+}
>+
>+/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>+/* { dg-final { cleanup-tree-dump "slsr" } } */

As mentioned previously, please add the missing newline at EOF.

Everything else looks OK to me.  Please ask Richard for final approval,
as I'm not a maintainer.

Thanks,
Bill

(P.S. I prefer inline patches rather than attachments; it makes it
easier to reply with markup.)
diff mbox

Patch

diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 88afc91..d069246 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -53,6 +53,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "params.h"
 #include "hash-table.h"
 #include "tree-ssa-address.h"
+#include "tree-affine.h"
 
 /* Information about a strength reduction candidate.  Each statement
    in the candidate table represents an expression of one of the
@@ -420,6 +421,42 @@  cand_chain_hasher::equal (const value_type *chain1, const compare_type *chain2)
 /* Hash table embodying a mapping from base exprs to chains of candidates.  */
 static hash_table <cand_chain_hasher> base_cand_map;
 
+/* Pointer map used by tree_to_aff_combination_expand.  */
+static struct pointer_map_t *name_expansions;
+/* Pointer map embodying a mapping from bases to alternative bases.  */
+static struct pointer_map_t *alt_base_map;
+
+/* Given BASE, use the tree affine combiniation facilities to
+   find the underlying tree expression for BASE, with any
+   immediate offset excluded.  */
+
+static tree
+get_alternative_base (tree base)
+{
+  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
+
+  if (result == NULL)
+    {
+      tree expr;
+      aff_tree aff;
+
+      tree_to_aff_combination_expand (base, TREE_TYPE (base),
+				      &aff, &name_expansions);
+      aff.offset = tree_to_double_int (integer_zero_node);
+      expr = aff_combination_to_tree (&aff);
+
+      result = (tree *) pointer_map_insert (alt_base_map, base);
+      gcc_assert (!*result);
+
+      if (expr == base)
+	*result = NULL;
+      else
+	*result = expr;
+    }
+
+  return *result;
+}
+
 /* Look in the candidate table for a CAND_PHI that defines BASE and
    return it if found; otherwise return NULL.  */
 
@@ -439,9 +476,10 @@  find_phi_def (tree base)
   return c->cand_num;
 }
 
-/* Helper routine for find_basis_for_candidate.  May be called twice:
+/* Helper routine for find_basis_for_candidate.  May be called three times:
    once for the candidate's base expr, and optionally again for the
-   candidate's phi definition.  */
+   candidate's phi definition, as well as for an alternative base expr
+   in the case of CAND_REF.  */
 
 static slsr_cand_t
 find_basis_for_base_expr (slsr_cand_t c, tree base_expr)
@@ -518,6 +556,13 @@  find_basis_for_candidate (slsr_cand_t c)
 	}
     }
 
+  if (!basis && c->kind == CAND_REF)
+    {
+      tree alt_base_expr = get_alternative_base (c->base_expr);
+      if (alt_base_expr)
+	basis = find_basis_for_base_expr (c, alt_base_expr);
+    }
+
   if (basis)
     {
       c->sibling = basis->dependent;
@@ -528,17 +573,22 @@  find_basis_for_candidate (slsr_cand_t c)
   return 0;
 }
 
-/* Record a mapping from the base expression of C to C itself, indicating that
-   C may potentially serve as a basis using that base expression.  */
+/* Record a mapping from BASE to C, indicating that C may potentially serve
+   as a basis using that base expression.  BASE may be the same as
+   C->BASE_EXPR; alternatively BASE can be a different tree that share the
+   underlining expression of C->BASE_EXPR.  */
 
 static void
-record_potential_basis (slsr_cand_t c)
+record_potential_basis (slsr_cand_t c, tree base)
 {
   cand_chain_t node;
   cand_chain **slot;
 
+  if (base == NULL)
+    return;
+
   node = (cand_chain_t) obstack_alloc (&chain_obstack, sizeof (cand_chain));
-  node->base_expr = c->base_expr;
+  node->base_expr = base;
   node->cand = c;
   node->next = NULL;
   slot = base_cand_map.find_slot (node, INSERT);
@@ -554,10 +604,18 @@  record_potential_basis (slsr_cand_t c)
 }
 
 /* Allocate storage for a new candidate and initialize its fields.
-   Attempt to find a basis for the candidate.  */
+   Attempt to find a basis for the candidate.
+
+   For CAND_REF, an alternative base may also be recorded and used
+   to find a basis.  This helps cases where the expression hidden
+   behind BASE (which is usually an SSA_NAME) has immediate offset,
+   e.g.
+
+     a2[i][j] = 1;
+     a2[i + 20][j] = 2;  */
 
 static slsr_cand_t
-alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base, 
+alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
 			   double_int index, tree stride, tree ctype,
 			   unsigned savings)
 {
@@ -583,7 +641,9 @@  alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
   else
     c->basis = find_basis_for_candidate (c);
 
-  record_potential_basis (c);
+  record_potential_basis (c, base);
+  if (kind == CAND_REF)
+    record_potential_basis (c, get_alternative_base (base));
 
   return c;
 }
@@ -3524,6 +3584,9 @@  execute_strength_reduction (void)
   /* Allocate the mapping from base expressions to candidate chains.  */
   base_cand_map.create (500);
 
+  /* Allocate the mapping from bases to alternative bases.  */
+  alt_base_map = pointer_map_create ();
+
   /* Initialize the loop optimizer.  We need to detect flow across
      back edges, and this gives us dominator information as well.  */
   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
@@ -3539,6 +3602,9 @@  execute_strength_reduction (void)
       dump_cand_chains ();
     }
 
+  pointer_map_destroy (alt_base_map);
+  free_affine_expand_cache (&name_expansions);
+
   /* Analyze costs and make appropriate replacements.  */
   analyze_candidates_and_replace ();
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
new file mode 100644
index 0000000..870d714
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
@@ -0,0 +1,24 @@ 
+/* Verify straight-line strength reduction in using
+   alternative base expr to record and look for the
+   potential candidate.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i-10] [j] = 2;
+  a2 [i] [j++] = i;
+  a2 [i+20] [j++] = i;
+  a2 [i-3] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */