[PR68030/PR69710,RFC] Introduce a simple local CSE interface and use it in vectorizer

Message ID	DB5PR08MB1144AA1F16901D9CDDF918E4E7400@DB5PR08MB1144.eurprd08.prod.outlook.com
State	New
Headers	show Return-Path: <gcc-patches-return-428262-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:mime-version:content-type; q=dns; s=default; b=bIwotR932nZQC14dE0OP/UF79O6TWhofJwuvpz9tcQ9EuFQXOF LFFkLENiLq0qn9CZybkWAHe6wMJFQeDeTCBbTvyOLk2KdRm63zbwfSCLwtQ5vSep C8OVL/2OFdsy/dEMXgnCMiFCYrTQi8Mtstfc99y6keSrg3WKp/W6xj+Xg= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org From: Bin Cheng <Bin.Cheng@arm.com> To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org> CC: nd <nd@arm.com> Subject: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer Date: Wed, 25 May 2016 11:22:23 +0000 Message-ID: <DB5PR08MB1144AA1F16901D9CDDF918E4E7400@DB5PR08MB1144.eurprd08.prod.outlook.com> x-ms-office365-filtering-correlation-id: 8118ac21-c9b1-41a1-6477-08d3848ed865 x-microsoft-exchange-diagnostics: 1; DB5PR08MB1144; 5:eccxhGSOmzpP4s3srgvvGkoVHO/AUoAzmc11/yk53DL1gOufLHYKlHanMHXlVslK/Fxa7cBwrTy+ajg5j7IRSvbXiAwGmNjuoSLLb61Tc9IOr5eiDFzZ7VJ98vjVLyCNiiJS9PKj5ZIzPymPCZdiGw==; 24:/DxMZNdAsEKkqZiTRqkAnps8H7vUPQ5T47AHdZVFwO9Vpg0QG15/k0RVHP/lbVjo3YXhRyEuV8UBDBNwPUiFV7qVlgJGq47PlpvrMKaX6zc=; 7:+MDFtkEa1DMJ/3nxPDG7P9do2MqJkhsR0+hHabxK1SRoSWZ2LXGyvC0hBcfwp2lwJ2Kxa+m6FRG1ISYs1UsFPyXy56N/ciIgiO13wX0KzxEHRBOlt+YTwQr+HIFAbCaL3BUulClOoZGrXaLkeqk5hCxFF6LTYRDvwa6HcwjXnGi5+/MZB2sXWpGzmQu9ZHix; 20:B2oTlN5LR7sNpY/utt1XC7f9Qi/1bm1zJEulBR4HpYHwBpJGgPkuF+Lwm37xL8BpCoZ6kV2SSIf83JZA0ZHmxsaldNL7t0E47xFCaliuemRv3lbUAH70/7RYjs5xHob7U63SG1Cq+bQWdN0GAvD2/Ef91O7mrBV+TNSlDawDm54= nodisclaimer: True spamdiagnosticoutput: 1:23 spamdiagnosticmetadata: NSPM MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="_002_DB5PR08MB1144AA1F16901D9CDDF918E4E7400DB5PR08MB1144eurp_"

Bin Cheng May 25, 2016, 11:22 a.m. UTC

Hi,
As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object. Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure. After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables). The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary. Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR. This is necessary because GCC only handles re-association for commutative operators in CSE.

I checked its impact on various test cases.
With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified. But,
1) It doesn't fix all the problem on x86_64. Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case. Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
2) It causes regression for PR68030 on AArch64. I think the regression is caused by IVOPT issues which are exposed by this patch. Checks on offset validity in get_address_cost is wrong/inaccurate now. It considers an offset as valid if it's within the maximum offset range that backend supports. This is not true, for example, AArch64 requires aligned offset additionally. For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range. Another issue is also in get_address_cost. Inaccurate cost is computed for "base + offset + INDEX" address expression. When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.

Bootstrap and test on x86_64 and AArch64. Any comments appreciated.

Thanks,
bin

2016-05-17 Bin Cheng <bin.cheng@arm.com>

PR tree-optimization/68030
PR tree-optimization/69710
* tree-ssa-dom.c (cse_bbs): New function.
* tree-ssa-dom.h (cse_bbs): New declaration.
* tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
Re-associate address by splitting constant offset.
(vect_create_data_ref_ptr, vect_setup_realignment): Record changed
basic block.
* tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
changed basic block.
* tree-vectorizer.c (tree-ssa-dom.h): Include header file.
(changed_bbs): New variable.
(vectorize_loops): Allocate and free CHANGED_BBS. Call cse_bbs.
* tree-vectorizer.h (changed_bbs): New declaration.

Richard Biener May 27, 2016, 10:45 a.m. UTC | #1

On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
> It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
> Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.
>
> I checked its impact on various test cases.
> With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
> 2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.
>
> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.

It looks quite straight-forward with the caveat that it has one
obvious piece that is not in the order
of the complexity of a basic-block.  threadedge_initialize_values
creates the SSA value array
which is zero-initialized (upon use).  That's probably a non-issue for
the use you propose for
the vectorizer (call cse_bbs once per function).  As Ideally I would
like this facility to replace
the loop unrollers own propagate_constants_for_unrolling it might
become an issue though.
In that regard the unroller facility is also more powerful because it
handles regions (including
PHIs).

Note that in particular for SLP vectorization the vectorizer itself
may end up creating quite
some redundancies so I wonder if it's worth to CSE the vectorized loop
body as well
(and given we have PHIs eventually CSE the whole vectorized loop with
pre-header as a region...)

Thanks,
Richard.

> Thanks,
> bin
>
> 2016-05-17 Bin Cheng  <bin.cheng@arm.com>
>
>         PR tree-optimization/68030
>         PR tree-optimization/69710
>         * tree-ssa-dom.c (cse_bbs): New function.
>         * tree-ssa-dom.h (cse_bbs): New declaration.
>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>         Re-associate address by splitting constant offset.
>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>         basic block.
>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>         changed basic block.
>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>         (changed_bbs): New variable.
>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>         * tree-vectorizer.h (changed_bbs): New declaration.

Bin.Cheng May 27, 2016, 11:11 a.m. UTC | #2

On Fri, May 27, 2016 at 11:45 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
>> It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
>> Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.
>>
>> I checked its impact on various test cases.
>> With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
>> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
>> 2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.
>>
>> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>
> It looks quite straight-forward with the caveat that it has one
> obvious piece that is not in the order
> of the complexity of a basic-block.  threadedge_initialize_values
> creates the SSA value array
I noticed this too, and think it's better to get rid of this init/fini
functions by some kind re-design.  I found it's quite weird to call
threadege_X in tree-vrp.c.  I will keep investigating this.
> which is zero-initialized (upon use).  That's probably a non-issue for
> the use you propose for
> the vectorizer (call cse_bbs once per function).  As Ideally I would
> like this facility to replace
> the loop unrollers own propagate_constants_for_unrolling it might
> become an issue though.
> In that regard the unroller facility is also more powerful because it
> handles regions (including
> PHIs).
With the current data-structure, I think it's not very hard to extend
the interface to regions.  I will keep investigating this too.  BTW,
if it's okay, I tend to do that in following patches.
>
> Note that in particular for SLP vectorization the vectorizer itself
> may end up creating quite
> some redundancies so I wonder if it's worth to CSE the vectorized loop
> body as well
Maybe.  The next step is condition block created by vectorizer.  Both
prune_runtime_alias_test_list and generated alias checks are
sub-optimal now, even worse, somehow computations in alias checks can
be propagated to loop pre-header.  With help of this interface, alias
checks (and later code) can be simplified.

> (and given we have PHIs eventually CSE the whole vectorized loop with
> pre-header as a region...)
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>>
>> 2016-05-17 Bin Cheng  <bin.cheng@arm.com>
>>
>>         PR tree-optimization/68030
>>         PR tree-optimization/69710
>>         * tree-ssa-dom.c (cse_bbs): New function.
>>         * tree-ssa-dom.h (cse_bbs): New declaration.
>>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>         Re-associate address by splitting constant offset.
>>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>>         basic block.
>>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>>         changed basic block.
>>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>         (changed_bbs): New variable.
>>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>         * tree-vectorizer.h (changed_bbs): New declaration.

Richard Biener May 27, 2016, 11:56 a.m. UTC | #3

On Fri, May 27, 2016 at 1:11 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, May 27, 2016 at 11:45 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
>>> It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
>>> Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.
>>>
>>> I checked its impact on various test cases.
>>> With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
>>> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
>>> 2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.
>>>
>>> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>>
>> It looks quite straight-forward with the caveat that it has one
>> obvious piece that is not in the order
>> of the complexity of a basic-block.  threadedge_initialize_values
>> creates the SSA value array
> I noticed this too, and think it's better to get rid of this init/fini
> functions by some kind re-design.  I found it's quite weird to call
> threadege_X in tree-vrp.c.  I will keep investigating this.
>> which is zero-initialized (upon use).  That's probably a non-issue for
>> the use you propose for
>> the vectorizer (call cse_bbs once per function).  As Ideally I would
>> like this facility to replace
>> the loop unrollers own propagate_constants_for_unrolling it might
>> become an issue though.
>> In that regard the unroller facility is also more powerful because it
>> handles regions (including
>> PHIs).
> With the current data-structure, I think it's not very hard to extend
> the interface to regions.  I will keep investigating this too.  BTW,
> if it's okay, I tend to do that in following patches.

I'm fine with doing enhancements to this in followup patches (adding Jeff
to CC for his opinions).

>>
>> Note that in particular for SLP vectorization the vectorizer itself
>> may end up creating quite
>> some redundancies so I wonder if it's worth to CSE the vectorized loop
>> body as well
> Maybe.  The next step is condition block created by vectorizer.  Both
> prune_runtime_alias_test_list and generated alias checks are
> sub-optimal now, even worse, somehow computations in alias checks can
> be propagated to loop pre-header.  With help of this interface, alias
> checks (and later code) can be simplified.

Yeah.

Thanks,
Richard.

>> (and given we have PHIs eventually CSE the whole vectorized loop with
>> pre-header as a region...)
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> bin
>>>
>>> 2016-05-17 Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         PR tree-optimization/68030
>>>         PR tree-optimization/69710
>>>         * tree-ssa-dom.c (cse_bbs): New function.
>>>         * tree-ssa-dom.h (cse_bbs): New declaration.
>>>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>>         Re-associate address by splitting constant offset.
>>>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>>>         basic block.
>>>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>>>         changed basic block.
>>>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>>         (changed_bbs): New variable.
>>>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>>         * tree-vectorizer.h (changed_bbs): New declaration.

Bin.Cheng June 6, 2016, 11:23 a.m. UTC | #4

On Fri, May 27, 2016 at 12:56 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, May 27, 2016 at 1:11 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Fri, May 27, 2016 at 11:45 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
>>>> It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
>>>> Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.
>>>>
>>>> I checked its impact on various test cases.
>>>> With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
>>>> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
>>>> 2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.
>>>>
>>>> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>>>
>>> It looks quite straight-forward with the caveat that it has one
>>> obvious piece that is not in the order
>>> of the complexity of a basic-block.  threadedge_initialize_values
>>> creates the SSA value array
>> I noticed this too, and think it's better to get rid of this init/fini
>> functions by some kind re-design.  I found it's quite weird to call
>> threadege_X in tree-vrp.c.  I will keep investigating this.
>>> which is zero-initialized (upon use).  That's probably a non-issue for
>>> the use you propose for
>>> the vectorizer (call cse_bbs once per function).  As Ideally I would
>>> like this facility to replace
>>> the loop unrollers own propagate_constants_for_unrolling it might
>>> become an issue though.
>>> In that regard the unroller facility is also more powerful because it
>>> handles regions (including
>>> PHIs).
>> With the current data-structure, I think it's not very hard to extend
>> the interface to regions.  I will keep investigating this too.  BTW,
>> if it's okay, I tend to do that in following patches.
>
> I'm fine with doing enhancements to this in followup patches (adding Jeff
> to CC for his opinions).
Hi,
I further investigated the impact of this patch, and believe my
previous conclusion still holds.  On targets that don't support [base
+ index + offset] addressing mode, it may causes IVO generating worse
code than now.  I found and investigated a new case showing exactly
the same problem.  Take AArch64 as an example, before this patch, IVO
computes cost and chooses candidate as below:

Loop preheader:
    vect_p_1 = ...
    vect_p_2 = ...
Loop header:
    MEM[vect_p_1 + VAR]
    MEM[vect_p_2 + VAR]

After this patch, cse opportunities are realized between vect_p_1 and
vect_p_2, and IVO computes cost for the second MEM_REF differently as
below:
Loop preheader:
    vect_p_1 = ....
    vect_p_2 = vect_1_p + offset
Loop header:
    MEM[vect_p_1 + VAR]
    MEM[vect_p_1 + VAR + offset]

On x86, it's optimal because the addressing mode is supported.  On
AArch64, cost computed for [vect_p_1 + VAR + offset] is too large.  As
a result, VAR is not chosen, and more candidates are chosen.  The
inaccurate cost is computed because IVO relies on LEGITIMIZE_ADDRESS
hook, as well as use buffering when constructing rtx address
expression.  IVO should be aware of that "vect_p_1 + offset" can be
computed as an invariant just like before this patch.
I am doing experiments to rewrite address computation in IVO, in the
meantime, I collected various benchmarks data on AArch64 and there is
no big regression caused by the mentioned issue.

Hi Jeff,
What's your opinion on this (and how to extend it to a region based
interface)?  I will update this patch for further review if you are
okay.

Thanks,
bin
>
>>>
>>> Note that in particular for SLP vectorization the vectorizer itself
>>> may end up creating quite
>>> some redundancies so I wonder if it's worth to CSE the vectorized loop
>>> body as well
>> Maybe.  The next step is condition block created by vectorizer.  Both
>> prune_runtime_alias_test_list and generated alias checks are
>> sub-optimal now, even worse, somehow computations in alias checks can
>> be propagated to loop pre-header.  With help of this interface, alias
>> checks (and later code) can be simplified.
>
> Yeah.
>
> Thanks,
> Richard.
>
>>> (and given we have PHIs eventually CSE the whole vectorized loop with
>>> pre-header as a region...)
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>>
>>>> 2016-05-17 Bin Cheng  <bin.cheng@arm.com>
>>>>
>>>>         PR tree-optimization/68030
>>>>         PR tree-optimization/69710
>>>>         * tree-ssa-dom.c (cse_bbs): New function.
>>>>         * tree-ssa-dom.h (cse_bbs): New declaration.
>>>>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>>>         Re-associate address by splitting constant offset.
>>>>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>>>>         basic block.
>>>>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>>>>         changed basic block.
>>>>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>>>         (changed_bbs): New variable.
>>>>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>>>         * tree-vectorizer.h (changed_bbs): New declaration.

Jeff Law June 21, 2016, 9:54 p.m. UTC | #5

On 05/27/2016 05:56 AM, Richard Biener wrote:
> On Fri, May 27, 2016 at 1:11 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Fri, May 27, 2016 at 11:45 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
>>>> It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
>>>> Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.
>>>>
>>>> I checked its impact on various test cases.
>>>> With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
>>>> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
>>>> 2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.
>>>>
>>>> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>>>
>>> It looks quite straight-forward with the caveat that it has one
>>> obvious piece that is not in the order
>>> of the complexity of a basic-block.  threadedge_initialize_values
>>> creates the SSA value array
>> I noticed this too, and think it's better to get rid of this init/fini
>> functions by some kind re-design.  I found it's quite weird to call
>> threadege_X in tree-vrp.c.  I will keep investigating this.
>>> which is zero-initialized (upon use).  That's probably a non-issue for
>>> the use you propose for
>>> the vectorizer (call cse_bbs once per function).  As Ideally I would
>>> like this facility to replace
>>> the loop unrollers own propagate_constants_for_unrolling it might
>>> become an issue though.
>>> In that regard the unroller facility is also more powerful because it
>>> handles regions (including
>>> PHIs).
>> With the current data-structure, I think it's not very hard to extend
>> the interface to regions.  I will keep investigating this too.  BTW,
>> if it's okay, I tend to do that in following patches.
>
> I'm fine with doing enhancements to this in followup patches (adding Jeff
> to CC for his opinions).
Likewise.  Ideally we'll get all the threading stuff out of DOM/VRP. 
VRP is probably the easiest as I think it'll fall-out of some work 
Andrew is doing.  Getting rid of the threading from DOM is more work, 
but nothing that (IMHO) is terribly complex, its just time consuming.

Jeff

Jeff Law June 21, 2016, 9:58 p.m. UTC | #6

On 06/06/2016 05:23 AM, Bin.Cheng wrote:

> Hi Jeff,
> What's your opinion on this (and how to extend it to a region based
> interface)?  I will update this patch for further review if you are
> okay.
I've never pondered how to revamp DOM into a regional interface (though 
I have pondered how to revamp the threading bits into a regional 
interface -- without any real success).

Conceptually I think it's just a matter of setting an entry block/edge, 
computing all blocks dominated by that entry block/edge and limiting the 
walker to only visiting blocks in that set.  So the regional aspect 
would live in domwalk.[ch].

We'd probably want to avoid all the threading bits in a regional walk -- 
not because threading isn't useful, but sequencing the updates is a PITA 
and trying to do it in the middle of some other pass would just be a mess.

jeff

Jeff Law July 13, 2016, 8:59 p.m. UTC | #7

On 05/25/2016 05:22 AM, Bin Cheng wrote:
> Hi, As analyzed in PR68303 and PR69710, vectorizer generates
> duplicated computations in loop's pre-header basic block when
> creating base address for vector reference to the same memory object.
Not a huge surprise.  Loop optimizations generally have a tendency to 
create and/or expose CSE opportunities.   Unrolling is a common culprit, 
there's certainly the possibility for header duplication, code motions 
and IV rewriting to also expose/create redundant code.

We haven't really had great success squashing these redundancies within 
the loop optimizer itself.  This has lead to running CSE like passes 
after the loop optimizers or mini-CSE passes like we see here.


> Because the duplicated code is out of loop, IVOPT fails to track base
> object for these vector references, resulting in missed strength
> reduction. It's agreed that vectorizer should be improved to generate
> optimal (IVOPT-friendly) code, the difficult part is we want a
> generic infrastructure.  After investigation, I tried to introduce a
> generic/simple local CSE interface by reusing existing
> algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).
I believe you're the only person that's ever tried to re-use that 
infrastructure -- were there any API/usage patterns that would have 
worked better for your needs?

ISTM that conceptually if we broke out the basic hashing bits that's 
what you're most interested in re-using since you're really just 
building a local CSE.



> The interface runs local CSE for each basic block in a bitmap,
> customers of this interface only need to record basic blocks in the
> bitmap when necessary.  Note we don't need scopedtables' unwinding
> facility since the interface runs only for single basic block, this
> should be good in terms of compilation time.
Right.


>
> I checked its impact on various test cases. With this patch,
> PR68030's generated assembly is reduced from ~750 lines to ~580 lines
> on x86_64, with both pre-header and loop body simplified.
That's a good result :-)



  But, 1) It
> doesn't fix all the problem on x86_64.  Root cause is computation for
> base address of the first reference is somehow moved outside of
> loop's pre-header, local CSE can't help in this case.
That's a bid odd -- have you investigated why this is outside the loop 
header?

  Though
> split_constant_offset can back track ssa def chain, it causes
> possible redundant when there is no CSE opportunities in pre-header.
Is the problem that when you split you don't know if doing so will be 
profitable until you you're actually CSE-ing the pre-header?

> 2) It causes regression for PR68030 on AArch64.  I think the
> regression is caused by IVOPT issues which are exposed by this patch.
> Checks on offset validity in get_address_cost is wrong/inaccurate
> now.  It considers an offset as valid if it's within the maximum
> offset range that backend supports.  This is not true, for example,
> AArch64 requires aligned offset additionally.  For example, LDR [base
> + 2060] is invalid for V4SFmode, although "2060" is within the
> maximum offset range.  Another issue is also in get_address_cost.
> Inaccurate cost is computed for "base + offset + INDEX" address
> expression.  When register pressure is low, "base+offset" can be
> hoisted out and we can use [base + INDEX] addressing mode, whichhis
> is current behavior.
>

 > Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
 >
 > Thanks,
 > bin
 >
 > 2016-05-17 Bin Cheng  <bin.cheng@arm.com>
 >
 >         PR tree-optimization/68030
 >         PR tree-optimization/69710
 >         * tree-ssa-dom.c (cse_bbs): New function.
 >         * tree-ssa-dom.h (cse_bbs): New declaration.
 >         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
 >         Re-associate address by splitting constant offset.
 >         (vect_create_data_ref_ptr, vect_setup_realignment): Record 
changed
 >         basic block.
 >         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): 
Record
 >         changed basic block.
 >         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
 >         (changed_bbs): New variable.
 >         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
 >         * tree-vectorizer.h (changed_bbs): New declaration.
 >
 >
 > pr69710-20160523.txt
 >
 >
 > diff --git a/gcc/tree-ssa-dom.c b/gcc/tree-ssa-dom.c
 > index 8bf5b3c..b2a0b0c 100644
 > --- a/gcc/tree-ssa-dom.c
 > +++ b/gcc/tree-ssa-dom.c
 > @@ -2090,3 +2090,50 @@ lookup_avail_expr (gimple *stmt, bool insert,
 >
 >    return lhs;
 >  }
 > +
 > +/* A local CSE interface which runs CSE for basic blocks recorded in
 > +   CHANGED_BBS.  */
 > +
 > +void
 > +cse_bbs (bitmap changed_bbs)
 > +{
 > +  unsigned index;
 > +  bitmap_iterator bi;
 > +  gimple_stmt_iterator gsi;
 > +
 > +  hash_table<expr_elt_hasher> *avail_exprs;
 > +  class avail_exprs_stack *avail_exprs_stack;
 > +  class const_and_copies *const_and_copies;
 > +
 > +  avail_exprs = new hash_table<expr_elt_hasher> (1024);
 > +  avail_exprs_stack = new class avail_exprs_stack (avail_exprs);
 > +  const_and_copies = new class const_and_copies ();
 > +
 > +  threadedge_initialize_values ();
 > +  /* Push a marker on the stacks of local information so that we 
know how
 > +     far to unwind when we finalize this block.  */
 > +  avail_exprs_stack->push_marker ();
 > +  const_and_copies->push_marker ();
 > +
 > +  EXECUTE_IF_SET_IN_BITMAP (changed_bbs, 0, index, bi)
 > +    {
 > +      basic_block bb = BASIC_BLOCK_FOR_FN (cfun, index);
 > +
 > +      if (dump_file && (dump_flags & TDF_DETAILS))
 > +	fprintf (dump_file, "\n\nRun local cse on block #%d\n\n", bb->index);
 > +
 > +      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
 > +	optimize_stmt (bb, gsi, const_and_copies, avail_exprs_stack);
 > +
 > +      /* Pop stacks to keep it small.  */
 > +      avail_exprs_stack->pop_to_marker ();
 > +      const_and_copies->pop_to_marker ();
 > +    }
 > +
 > +  delete avail_exprs;
 > +  avail_exprs = NULL;
 > +
 > +  delete avail_exprs_stack;
 > +  delete const_and_copies;
 > +  threadedge_finalize_values ();
 > +}
No real objections here or how it's used in tree-vectorizer.c.  I'd like 
to get to a place where you didn't have to muck with the stacks or 
threadedge_* at all, but that to me is a cleanup item and not a 
requirement to move forward.

I'll assume the reassociations you do in tree-vect-data-refs are reasonable.

So how do you want to move forward on this?  Are the issues you noted 
above serious enough to warrant holding this up?

jeff

[PR68030/PR69710,RFC] Introduce a simple local CSE interface and use it in vectorizer

Commit Message

Comments

Patch