Optional alternative base_expr in finding basis for CAND_REFs

Hi Bill,

Many thanks for the review.

I find your suggestion on using the next_interp field quite 
enlightening.  I prepared a patch which adds changes without modifying 
the framework.  With the patch, the slsr pass now tries to create a 
second candidate for each memory accessing gimple statement, and chain 
it to the first one via the next_interp field.

There are two implications in this approach though:

1) For each memory accessing gimple statement, there can be two
candidates, and these two candidates can be part of different dependency
graphs (each based on a different base expr).  Only one of the
dependency graphs should be traversed to do replace_refs.  Most of the
changes in the patch are there to handle this implication.

I am aware that you suggested following the next-interp chain only when
the search fails for the first interpretation.  However, that doesn't
work very well, as it can result in worse code-gen.  Take a variant of
the added test slsr-41.c as an example:

i1:  a2 [i] [j] = 1;
i2:  a2 [i] [j+1] = 2;
i3:  a2 [i+20] [j] = i;

With the 2nd interpretation created conditionally, the following two 
dependency chains will be established:

   i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
   i1 --> i3  (base expr is a tree expression of (a2 + i * 200))

The result is that the three gimple statements will be lowered to
MEM_REFs differently (as the candidates have different base_exprs); the
later passes can get confused, generating worse code.

What this patch does is to create two interpretations where possible (if 
different base exprs exist); the following dependency chains will be 
produced:

   i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
   i1 --> i2 --> i3  (base expr is a tree expression of (a2 + i * 200))

In analyze_candidates_and_replace, a new function preferred_ref_cand is 
called to analyze a root CAND_REF and replace_refs is only called if 
this root CAND_REF is found to be part of a larger dependency graph (or 
longer dependency chain in simple cases).  In the example above, the 2nd 
dependency chain will be picked up to do replace_refs.
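
To make the selection concrete, here is a minimal sketch of the idea in
plain C.  It is not the code from the patch: the struct, the function
names (count_dependents, pick_interp, example_preferred_root) and the
candidate numbers are all made up for illustration, and the pass's own
candidate record has more fields and refers to other candidates by
number rather than by pointer.  The sketch only shows "count the
dependents of each interpretation and prefer the larger graph".

```c
#include <stddef.h>

/* Hypothetical, simplified candidate record (an assumption made for
   this sketch, not the structure used by the pass).  */
struct cand
{
  int num;                  /* candidate number */
  struct cand *dependent;   /* first candidate using this one as basis */
  struct cand *sibling;     /* next candidate sharing the same basis */
  struct cand *next_interp; /* alternative interpretation of the stmt */
};

/* Size of the dependency graph below C: every candidate reachable
   through the dependent/sibling links.  */
int
count_dependents (const struct cand *c)
{
  int n = 0;
  for (const struct cand *d = c->dependent; d; d = d->sibling)
    n += 1 + count_dependents (d);
  return n;
}

/* Walk the next_interp chain starting at FIRST and return the
   interpretation with the largest dependency graph.  Ties keep the
   earlier interpretation.  */
const struct cand *
pick_interp (const struct cand *first)
{
  const struct cand *best = first;
  int best_n = count_dependents (first);
  for (const struct cand *c = first->next_interp; c; c = c->next_interp)
    {
      int n = count_dependents (c);
      if (n > best_n)
        {
          best = c;
          best_n = n;
        }
    }
  return best;
}

/* Build the two chains from the example above and return the number of
   the preferred root.  Candidates 1-2 model i1 --> i2 (SSA_NAME base);
   candidates 3-5 model i1 --> i2 --> i3 (expanded base).  */
int
example_preferred_root (void)
{
  struct cand i3_b = { 5, NULL, NULL, NULL };
  struct cand i2_b = { 4, &i3_b, NULL, NULL };
  struct cand i1_b = { 3, &i2_b, NULL, NULL };
  struct cand i2_a = { 2, NULL, NULL, NULL };
  struct cand i1_a = { 1, &i2_a, NULL, &i1_b };
  return pick_interp (&i1_a)->num;
}
```

In this toy setup the second interpretation of i1 has two dependents
against one for the first, so it is the one that would be handed to
replace_refs.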

2) The 2nd implication is that the alternative candidate may expose the 
underlying tree expression of a base expr, which can cause more 
aggressive extraction and folding of immediate offsets.  Taking the new 
test slsr-41 for example, the code-gen difference on x86_64 with the 
original patch and this patch is (-O2):

-       leal    5(%rsi), %edx
+       leal    5(%rsi), %eax
         movslq  %esi, %rsi
-       salq    $2, %rsi
-       movslq  %edx, %rax
-       leaq    (%rax,%rax,4), %rax
-       leaq    (%rax,%rax,4), %rcx
-       salq    $3, %rcx
-       leaq    (%rdi,%rcx), %rax
-       addq    %rsi, %rax
-       movl    $2, -1980(%rax)
-       movl    %edx, 20(%rax)
-       movl    %edx, 4024(%rax)
-       leaq    -600(%rdi,%rcx), %rax
-       addl    $1, 16(%rsi,%rax)
+       imulq   $204, %rsi, %rsi
+       addq    %rsi, %rdi
+       movl    $2, -980(%rdi)
+       movl    %eax, 1020(%rdi)
+       movl    %eax, 5024(%rdi)
+       addl    $1, 416(%rdi)
         ret

As you can see, larger offsets are produced because the affine expander
is able to look deep into the tree expression.  This raises the concern
that larger immediates can cause worse code-gen when they are out of the
supported range on a target.  On x86_64 the effect is not obvious (as it
allows larger ranges), but on arm cortex-a15 the store with the
immediate 5024 will be done by

         movw    r2, #5024
         str     r3, [r0, r2]

which is not optimal.  Things can get worse when there are multiple
loads/stores with large immediates, as each one may require an extra mov
immediate instruction.  One thing that could potentially be done is to
reduce the strength of multiple large immediates later on in some RTL
pass by doing an initial offsetting first.  What do you think?  Are you
particularly concerned about this issue?
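
As a rough illustration of what "initial offsetting" could mean, the
following C sketch rebases a group of offsets around a common anchor so
that each access needs only a small immediate.  Everything here is an
assumption made for the sake of the example: the +/-4095 range (the real
limit depends on the target, the instruction set and the access size),
the function names, the midpoint heuristic and the offsets themselves.
It is not taken from either patch or from any existing RTL pass.

```c
#include <stddef.h>

/* Assumed immediate range for a word load/store.  */
#define MAX_IMM 4095

/* Count how many OFFSETS fall outside [-MAX_IMM, MAX_IMM] once ANCHOR
   has been subtracted from each of them.  ANCHOR == 0 gives the count
   for the offsets as they are.  */
size_t
count_out_of_range (const long *offsets, size_t n, long anchor)
{
  size_t bad = 0;
  for (size_t i = 0; i < n; i++)
    {
      long off = offsets[i] - anchor;
      if (off < -MAX_IMM || off > MAX_IMM)
        bad++;
    }
  return bad;
}

/* Use the midpoint of the smallest and largest offset as the anchor.
   Adding the anchor to the base register once lets every access use
   offsets[i] - anchor instead of offsets[i].  */
long
choose_anchor (const long *offsets, size_t n)
{
  long lo = offsets[0], hi = offsets[0];
  for (size_t i = 1; i < n; i++)
    {
      if (offsets[i] < lo)
        lo = offsets[i];
      if (offsets[i] > hi)
        hi = offsets[i];
    }
  return lo + (hi - lo) / 2;
}

/* Made-up byte offsets, all beyond the assumed range.  Returns the
   number of offsets still out of range after rebasing.  */
size_t
example_rebased_bad_count (void)
{
  static const long offs[] = { 5024, 5028, 6000, 7000 };
  size_t n = sizeof offs / sizeof offs[0];
  return count_out_of_range (offs, n, choose_anchor (offs, n));
}
```

With these four offsets every access is out of range as written, while
after adding the anchor to the base once, all four fit.  With only a
single large immediate the rebasing buys nothing, since the extra add
simply replaces the extra mov.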

The patch passes the bootstrapping on arm and x86_64; the regtest is 
still running.

Here is the changelog:

gcc/

         * gimple-ssa-strength-reduction.c: Include tree-affine.h.
         (name_expansions): New static variable.
         (get_alternative_base): New function.
         (restructure_reference): Add new local variables 'alt_base' and
         'delta'; call get_alternative_base and alloc_cand_and_find_basis
         to create an alternative interpretation.
         (num_of_dependents): New function.
         (preferred_ref_cand): Ditto.
         (analyze_candidates_and_replace): Call preferred_ref_cand for
         CAND_REF and skip replace_refs if the returned value is
         different.
         (execute_strength_reduction): Call free_affine_expand_cache with
         &name_expansions.

gcc/testsuite/

         * gcc.dg/tree-ssa/slsr-41.c: New test.

For your consideration, I've also attached another patch which is an 
improvement to the original patch.  This patch improves the original one 
by reducing the number of changes to the existing framework, e.g. 
leaving find_basis_for_base_expr unchanged.  While it still slightly 
modifies the interfaces (find_basis_for_candidate and 
record_potential_basis), it has an advantage over the 1st patch attached
here: its impact on the code-gen is much smaller, as it enables more
ARRAY_REFs to be lowered without handing over the underlying tree 
expression to replace_ref.  It creates the following dependency chains 
for the aforementioned example:

   i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
   i1 --> i2 --> i3  (base expr is a tree expression of (a2 + i * 200))

While they look the same as what the 1st patch produces, only one
candidate is generated for each memory accessing gimple statement; some
candidates are chained twice, once to a cand_chain with an SSA_NAME as
its base_expr and once to a cand_chain with the underlying tree
expression as its base_expr.  In other words, it produces two different
dependency graphs without creating different interpretations, by
utilizing the existing framework of cand_chain and
find_basis_for_base_expr.
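
The "chained twice" idea can be sketched in a few lines of C.  This is
only a model under simplifying assumptions: base expressions are plain
strings (the pass keys on trees), the basis is simply the most recently
recorded candidate on a chain (the real lookup also checks stride, type
and dominance), and the names ssa_1 and ssa_2 as well as all function
names are hypothetical.

```c
#include <string.h>
#include <stddef.h>

#define MAX_CHAINS 8
#define MAX_CANDS  8

/* Hypothetical stand-in for a cand_chain: one list of candidate
   numbers per base expression.  */
struct chain
{
  const char *base;
  int cands[MAX_CANDS];
  int n;
};

static struct chain chains[MAX_CHAINS];
static int n_chains;

/* Find the chain keyed by BASE; optionally create it if missing.  */
struct chain *
lookup_chain (const char *base, int create)
{
  for (int i = 0; i < n_chains; i++)
    if (strcmp (chains[i].base, base) == 0)
      return &chains[i];
  if (!create || n_chains == MAX_CHAINS)
    return NULL;
  chains[n_chains].base = base;
  chains[n_chains].n = 0;
  return &chains[n_chains++];
}

/* Most recently recorded candidate for BASE, or 0 if there is none.  */
int
find_basis (const char *base)
{
  struct chain *c = base ? lookup_chain (base, 0) : NULL;
  return (c && c->n > 0) ? c->cands[c->n - 1] : 0;
}

/* Append CAND to the chain keyed by BASE; a NULL base is ignored.  */
void
record (const char *base, int cand)
{
  if (!base)
    return;
  struct chain *c = lookup_chain (base, 1);
  if (c && c->n < MAX_CANDS)
    c->cands[c->n++] = cand;
}

/* Look for a basis under BASE first and under ALT_BASE second, then
   chain CAND under both keys.  Returns the basis found, or 0.  */
int
add_candidate (int cand, const char *base, const char *alt_base)
{
  int basis = find_basis (base);
  if (basis == 0)
    basis = find_basis (alt_base);
  record (base, cand);
  record (alt_base, cand);
  return basis;
}

/* Replay i1, i2, i3 from the example and return the basis of i3.  */
int
example_basis_of_i3 (void)
{
  const char *alt = "a2 + i * 200";
  n_chains = 0;
  add_candidate (1, "ssa_1", alt);        /* i1: no basis yet */
  add_candidate (2, "ssa_1", alt);        /* i2: basis i1 via ssa_1 */
  return add_candidate (3, "ssa_2", alt); /* i3: basis i2 via alt base */
}
```

After the three calls the ssa_1 chain holds candidates 1 and 2, and the
chain keyed by the expanded expression holds 1, 2 and 3, which mirrors
the two dependency chains shown above while keeping a single candidate
per statement.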

The patch passes the bootstrapping on arm and x86_64, as well as regtest 
on x86_64.  The following is the changelog entry:

gcc/

         * gimple-ssa-strength-reduction.c: Include tree-affine.h.
         (name_expansions): New static variable.
         (alt_base_map): Ditto.
         (get_alternative_base): New function.
         (find_basis_for_candidate): For CAND_REF, optionally call
         find_basis_for_base_expr with the returned value from
         get_alternative_base.
         (record_potential_basis): Add new parameter 'base' of type 'tree';
         return if base == NULL; use base to set node->base_expr.
         (alloc_cand_and_find_basis): Update; call record_potential_basis
         for CAND_REF with the returned value from get_alternative_base.
         (execute_strength_reduction): Call pointer_map_create for
         alt_base_map; call free_affine_expand_cache with &name_expansions.

gcc/testsuite/

         * gcc.dg/tree-ssa/slsr-41.c: New test.

Which patch do you like more?

If you have any questions on either of the patches, please let me know.

Regards,
Yufeng

On 11/11/13 17:09, Bill Schmidt wrote:
> Hi Yufeng,
>
> The idea is a good one but I don't like your implementation of adding an
> extra expression parameter to look at on the find_basis_for_candidate
> lookup.  This goes against the design of the pass and may not be
> sufficiently general (might there be situations where a third possible
> basis could exist?).
>
> The overall design is set up to have alternate interpretations of
> candidates in the candidate table to handle this sort of ambiguity.  The
> goal for your example is create a second candidate (chained to the first
> one by way of the next_interp field) so that the candidate table looks
> like this:
>
>     8  [2] *_10[j_7(D)] = 2;
>        REF  : _10 + ((sizetype) j_7(D) * 4) + 0 : int[20] *
>        basis: 0  dependent: 0  sibling: 0
>        next-interp: 9  dead-savings: 0
>
>     9  [2] *_10[j_7(D)] = 2;
>        REF  : _5 + ((sizetype) j_7(D) * 4) + 800 : int[20] *
>        basis: 5  dependent: 0  sibling: 0
>        next-interp: 0  dead-savings: 0
>
> This will in turn allow subsequent candidates to be seen in terms of
> either _5 or _10, which may be necessary to avoid missed opportunities.
> There may be a subsequent REF _15 +... that can be an affine expression
> of either of these, for example.
>
> If you fail to find a basis for a candidate with its first
> interpretation, you can then follow the next-interp chain to look for a
> basis for the next one, without the messy passing of extra possibilities
> to the find-basis routine.
>
> I haven't read the patch in detail, but I think this should give you
> enough to work with to re-design the idea to fit better with the
> existing framework.  Please let me know if you need more information, or
> if you feel I've misunderstood something.
>
> Thanks,
> Bill
>
> On Mon, 2013-11-04 at 18:41 +0000, Yufeng Zhang wrote:
>> Hi,
>>
>> This patch extends the slsr pass to optionally use an alternative base
>> expression in finding basis for CAND_REFs.  Currently the pass uses a
>> hash-based algorithm to match the base_expr in a candidate.  Given a
>> test case like the following, slsr will not be able to recognize that
>> the two CAND_REFs have the same basis, as their base_exprs are
>> different SSA_NAMEs:
>>
>> typedef int arr_2[20][20];
>>
>> void foo (arr_2 a2, int i, int j)
>> {
>>     a2[i][j] = 1;
>>     a2[i + 10][j] = 2;
>> }
>>
>> The gimple dump before slsr is like the following (using an
>> arm-none-eabi gcc):
>>
>>     i.0_2 = (unsigned int) i_1(D);
>>     _3 = i.0_2 * 80;
>>     _5 = a2_4(D) + _3;
>>     *_5[j_7(D)] = 1;    <----
>>     _9 = _3 + 800;
>>     _10 = a2_4(D) + _9;
>>     *_10[j_7(D)] = 2;   <----
>>
>> Here are the dumps for the two CAND_REFs generated for the two
>> statements pointed to by the arrows:
>>
>>
>>     4  [2] _5 = a2_4(D) + _3;
>>        ADD  : a2_4(D) + (80 * i_1(D)) : int[20] *
>>        basis: 0  dependent: 0  sibling: 0
>>        next-interp: 0  dead-savings: 0
>>
>>     8  [2] *_10[j_7(D)] = 2;
>>        REF  : _10 + ((sizetype) j_7(D) * 4) + 0 : int[20] *
>>        basis: 5  dependent: 0  sibling: 0
>>        next-interp: 0  dead-savings: 0
>>
>> As mentioned previously, slsr cannot establish that candidate 4 is the
>> basis for the candidate 8, as they have different base_exprs: a2_4(D)
>> and _10, respectively.  However, the two references actually only differ
>> by an immediate offset (800).
>>
>> This patch uses the tree affine combination facilities to create an
>> optional alternative base expression to be used in finding (as well as
>> recording) the basis.  It calls tree_to_aff_combination_expand on
>> base_expr, resets the offset field of the generated aff_tree to 0 and
>> generates a tree from it by calling aff_combination_to_tree.
>>
>> The new tree is recorded as a potential basis, and when
>> find_basis_for_candidate fails to find a basis for a CAND_REF in its
>> normal approach, it searches again using a tree expanded in such a way.
>> Such an expanded tree usually discloses the expression behind an
>> SSA_NAME.  In the example above, instead of seeing the strength
>> reduction candidate chains like this:
>>
>>     _5 ->  5
>>     _10 ->  8
>>
>> we are now having:
>>
>>     _5 ->  5
>>     _10 ->  8
>>     a2_4(D) + (sizetype) i_1(D) * 80 ->  5 ->  8
>>
>> With the candidates 5 and 8 linked to the same tree expression (a2_4(D)
>> + (sizetype) i_1(D) * 80), slsr is now able to establish that 5 is the
>> basis of 8.
>>
>> The patch doesn't attempt to change the content of any CAND_REF though.
>> It only enables CAND_REFs which (1) have the same stride and (2) have
>> underlying base_expr expressions that differ only in immediate offsets
>> to be recognized as having the same basis.  The statements with such
>> CAND_REFs will be lowered to MEM_REFs, and later on the RTL expander
>> should be able to fold and re-associate the immediate offsets to the
>> rightmost side of the addressing expression, and therefore expose the
>> common sub-expression successfully.
>>
>> The code-gen difference of the example code on arm with -O2
>> -mcpu=cortex-a15 is:
>>
>>           mov     r3, r1, asl #6
>> -       add     ip, r0, r2, asl #2
>>           str     lr, [sp, #-4]!
>> +       mov     ip, #1
>> +       mov     lr, #2
>>           add     r1, r3, r1, asl #4
>> -       mov     lr, #1
>> -       mov     r3, #2
>>           add     r0, r0, r1
>> -       add     r0, r0, #800
>> -       str     lr, [ip, r1]
>> -       str     r3, [r0, r2, asl #2]
>> +       add     r3, r0, r2, asl #2
>> +       str     ip, [r0, r2, asl #2]
>> +       str     lr, [r3, #800]
>>           ldr     pc, [sp], #4
>>
>> One fewer instruction in this simple case.
>>
>> The example used in illustration is too simple to show code-gen
>> difference on x86_64, but the included test case will show the benefit
>> of the patch quite obviously.
>>
>> The patch has passed
>>
>> * bootstrapping on arm and x86_64
>> * regtest on arm-none-eabi,  aarch64-none-elf and x86_64
>>
>> There is no regression in SPEC2K on arm or x86_64.
>>
>> OK to commit to the trunk?
>>
>> Any comment is welcomed!
>>
>> Thanks,
>> Yufeng
>>
>>
>> gcc/
>>
>>           * gimple-ssa-strength-reduction.c: Include tree-affine.h.
>>           (find_basis_for_base_expr): Update comment.
>>           (find_basis_for_candidate): Add new parameter 'alt_base_expr' of
>>           type 'tree'.  Optionally call find_basis_for_base_expr with
>>           'alt_base_expr'.
>>           (record_potential_basis): Add new parameter 'alt_base_expr' of
>>           type 'tree'; set node->base_expr with 'alt_base_expr' if it is
>>           not NULL.
>>           (name_expansions): New static variable.
>>           (get_alternative_base): New function.
>>           (alloc_cand_and_find_basis): Call get_alternative_base for
>>           CAND_REF.  Update calls to find_basis_for_candidate and
>>           record_potential_basis.
>>           (execute_strength_reduction): Call free_affine_expand_cache with
>>           &name_expansions.
>>
>> gcc/testsuite/
>>
>>           * gcc.dg/tree-ssa/slsr-41.c: New test.
>
>
