diff mbox

[2/4,AArch64] Increase the loop peeling limit

Message ID 563BC15D.3080608@samsung.com
State New
Headers show

Commit Message

Evandro Menezes Nov. 5, 2015, 8:51 p.m. UTC
2015-11-05  Evandro Menezes <e.menezes@samsung.com>

    gcc/

        * config/aarch64/aarch64.c (aarch64_override_options_internal):
        Increase loop peeling limit.

This patch increases the limit for the number of peeled insns.  With 
this change, I noticed no major regression in either Geekbench v3 or 
SPEC CPU2000 while some benchmarks, typically FP ones, improved 
significantly.

I tested this tuning on Exynos M1 and on A57.  ThunderX seems to benefit 
from this tuning too.  However, I'd appreciate comments from other 
stakeholders.

Thank you,

Comments

Evandro Menezes Nov. 19, 2015, 10:04 p.m. UTC | #1
On 11/05/2015 02:51 PM, Evandro Menezes wrote:
> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>
>    gcc/
>
>        * config/aarch64/aarch64.c (aarch64_override_options_internal):
>        Increase loop peeling limit.
>
> This patch increases the limit for the number of peeled insns. With 
> this change, I noticed no major regression in either Geekbench v3 or 
> SPEC CPU2000 while some benchmarks, typically FP ones, improved 
> significantly.
>
> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to 
> benefit from this tuning too.  However, I'd appreciate comments from 
> other stakeholders.

Ping.
James Greenhalgh Nov. 20, 2015, 11:53 a.m. UTC | #2
On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
> >2015-11-05  Evandro Menezes <e.menezes@samsung.com>
> >
> >   gcc/
> >
> >       * config/aarch64/aarch64.c (aarch64_override_options_internal):
> >       Increase loop peeling limit.
> >
> >This patch increases the limit for the number of peeled insns.
> >With this change, I noticed no major regression in either
> >Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
> >ones, improved significantly.
> >
> >I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
> >benefit from this tuning too.  However, I'd appreciate comments
> >from other stakeholders.
> 
> Ping.

I'd like to leave this for a call from the port maintainers. I can see why
this leads to more opportunities for vectorization, but I'm concerned about
the wider impact on code size. Certainly I wouldn't expect this to be our
default at -O2 and below.

My gut feeling is that this doesn't really belong in the back-end (there are
presumably good reasons why the default for this parameter across GCC has
fluctuated from 400 to 100 to 200 over recent years), but as I say, I'd
like Marcus or Richard to make the call as to whether or not we take this
patch.

For now, I'd drop it from the series (it stands alone anyway).

Thanks,
James
Evandro Menezes Dec. 3, 2015, 9:07 p.m. UTC | #3
On 11/20/2015 05:53 AM, James Greenhalgh wrote:
> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>>>
>>>    gcc/
>>>
>>>        * config/aarch64/aarch64.c (aarch64_override_options_internal):
>>>        Increase loop peeling limit.
>>>
>>> This patch increases the limit for the number of peeled insns.
>>> With this change, I noticed no major regression in either
>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>> ones, improved significantly.
>>>
>>> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
>>> benefit from this tuning too.  However, I'd appreciate comments
>> >from other stakeholders.
>>
>> Ping.
> I'd like to leave this for a call from the port maintainers. I can see why
> this leads to more opportunities for vectorization, but I'm concerned about
> the wider impact on code size. Certainly I wouldn't expect this to be our
> default at -O2 and below.
>
> My gut feeling is that this doesn't really belong in the back-end (there are
> presumably good reasons why the default for this parameter across GCC has
> fluctuated from 400 to 100 to 200 over recent years), but as I say, I'd
> like Marcus or Richard to make the call as to whether or not we take this
> patch.

Please, correct me if I'm wrong, but loop peeling is enabled only with 
loop unrolling (and with PGO).  If so, then extra code size is not a 
concern, for this heuristic is only active when unrolling loops, when 
code size is already of secondary importance.

Thank you,
James Greenhalgh Dec. 14, 2015, 11:26 a.m. UTC | #4
On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
> >On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
> >>On 11/05/2015 02:51 PM, Evandro Menezes wrote:
> >>>2015-11-05  Evandro Menezes <e.menezes@samsung.com>
> >>>
> >>>   gcc/
> >>>
> >>>       * config/aarch64/aarch64.c (aarch64_override_options_internal):
> >>>       Increase loop peeling limit.
> >>>
> >>>This patch increases the limit for the number of peeled insns.
> >>>With this change, I noticed no major regression in either
> >>>Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
> >>>ones, improved significantly.
> >>>
> >>>I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
> >>>benefit from this tuning too.  However, I'd appreciate comments
> >>>from other stakeholders.
> >>
> >>Ping.
> >I'd like to leave this for a call from the port maintainers. I can see why
> >this leads to more opportunities for vectorization, but I'm concerned about
> >the wider impact on code size. Certainly I wouldn't expect this to be our
> >default at -O2 and below.
> >
> >My gut feeling is that this doesn't really belong in the back-end (there are
> >presumably good reasons why the default for this parameter across GCC has
> >fluctuated from 400 to 100 to 200 over recent years), but as I say, I'd
> >like Marcus or Richard to make the call as to whether or not we take this
> >patch.
> 
> Please, correct me if I'm wrong, but loop peeling is enabled only
> with loop unrolling (and with PGO).  If so, then extra code size is
> not a concern, for this heuristic is only active when unrolling
> loops, when code size is already of secondary importance.

My understanding was that loop peeling is enabled from -O2 upwards, and
is also used to partially peel unaligned loops for vectorization (allowing
the vector code to be well aligned), or to completely peel inner loops which
may then become amenable to SLP vectorization.

If I'm wrong then I take back these objections. But I was sure this
parameter was used in a number of situations outside of just
-funroll-loops/-funroll-all-loops . Certainly I remember seeing performance
sensitivities to this parameter at -O3 in some internal workloads I was
analysing.

Thanks,
James
Evandro Menezes Dec. 15, 2015, 11:34 p.m. UTC | #5
On 12/14/2015 05:26 AM, James Greenhalgh wrote:
> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>>>>>
>>>>>    gcc/
>>>>>
>>>>>        * config/aarch64/aarch64.c (aarch64_override_options_internal):
>>>>>        Increase loop peeling limit.
>>>>>
>>>>> This patch increases the limit for the number of peeled insns.
>>>>> With this change, I noticed no major regression in either
>>>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>>>> ones, improved significantly.
>>>>>
>>>>> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>> >from other stakeholders.
>>>>
>>>> Ping.
>>> I'd like to leave this for a call from the port maintainers. I can see why
>>> this leads to more opportunities for vectorization, but I'm concerned about
>>> the wider impact on code size. Certainly I wouldn't expect this to be our
>>> default at -O2 and below.
>>>
>>> My gut feeling is that this doesn't really belong in the back-end (there are
>>> presumably good reasons why the default for this parameter across GCC has
>>> fluctuated from 400 to 100 to 200 over recent years), but as I say, I'd
>>> like Marcus or Richard to make the call as to whether or not we take this
>>> patch.
>> Please, correct me if I'm wrong, but loop peeling is enabled only
>> with loop unrolling (and with PGO).  If so, then extra code size is
>> not a concern, for this heuristic is only active when unrolling
>> loops, when code size is already of secondary importance.
> My understanding was that loop peeling is enabled from -O2 upwards, and
> is also used to partially peel unaligned loops for vectorization (allowing
> the vector code to be well aligned), or to completely peel inner loops which
> may then become amenable to SLP vectorization.
>
> If I'm wrong then I take back these objections. But I was sure this
> parameter was used in a number of situations outside of just
> -funroll-loops/-funroll-all-loops . Certainly I remember seeing performance
> sensitivities to this parameter at -O3 in some internal workloads I was
> analysing.

Vectorization, including SLP, is only enabled at -O3, isn't it?  It 
seems to me that peeling is only used by optimizations which already 
lead to potential increase in code size.

For instance, with "-Ofast -funroll-all-loops", the total text size for 
the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB 
without it; with just "-O2", it is the same at 23.1MB regardless of this 
setting.

So it seems to me that this proposal should be neutral for up to -O2.

Thank you,
Richard Earnshaw (lists) Dec. 16, 2015, 11:24 a.m. UTC | #6
On 15/12/15 23:34, Evandro Menezes wrote:
> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>>>>>>
>>>>>>    gcc/
>>>>>>
>>>>>>        * config/aarch64/aarch64.c
>>>>>> (aarch64_override_options_internal):
>>>>>>        Increase loop peeling limit.
>>>>>>
>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>> With this change, I noticed no major regression in either
>>>>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>>>>> ones, improved significantly.
>>>>>>
>>>>>> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>> >from other stakeholders.
>>>>>
>>>>> Ping.
>>>> I'd like to leave this for a call from the port maintainers. I can
>>>> see why
>>>> this leads to more opportunities for vectorization, but I'm
>>>> concerned about
>>>> the wider impact on code size. Certainly I wouldn't expect this to
>>>> be our
>>>> default at -O2 and below.
>>>>
>>>> My gut feeling is that this doesn't really belong in the back-end
>>>> (there are
>>>> presumably good reasons why the default for this parameter across
>>>> GCC has
>>>> fluctuated from 400 to 100 to 200 over recent years), but as I say, I'd
>>>> like Marcus or Richard to make the call as to whether or not we take
>>>> this
>>>> patch.
>>> Please, correct me if I'm wrong, but loop peeling is enabled only
>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>> not a concern, for this heuristic is only active when unrolling
>>> loops, when code size is already of secondary importance.
>> My understanding was that loop peeling is enabled from -O2 upwards, and
>> is also used to partially peel unaligned loops for vectorization
>> (allowing
>> the vector code to be well aligned), or to completely peel inner loops
>> which
>> may then become amenable to SLP vectorization.
>>
>> If I'm wrong then I take back these objections. But I was sure this
>> parameter was used in a number of situations outside of just
>> -funroll-loops/-funroll-all-loops . Certainly I remember seeing
>> performance
>> sensitivities to this parameter at -O3 in some internal workloads I was
>> analysing.
> 
> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
> seems to me that peeling is only used by optimizations which already
> lead to potential increase in code size.
> 
> For instance, with "-Ofast -funroll-all-loops", the total text size for
> the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
> without it; with just "-O2", it is the same at 23.1MB regardless of this
> setting.
> 
> So it seems to me that this proposal should be neutral for up to -O2.
> 
> Thank you,
> 

My preference would be to not diverge from the global parameter
settings.  I haven't looked in detail at this parameter but it seems to
me there are two possible paths:

1) We could get agreement globally that the parameter should be increased.
2) We could agree that this specific use of the parameter is distinct
from some other uses and deserves a new param in its own right with a
higher value.

R.
Richard Biener Dec. 16, 2015, 12:42 p.m. UTC | #7
On Wed, Dec 16, 2015 at 12:24 PM, Richard Earnshaw (lists)
<Richard.Earnshaw@arm.com> wrote:
> On 15/12/15 23:34, Evandro Menezes wrote:
>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>>> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>>>>>>>
>>>>>>>    gcc/
>>>>>>>
>>>>>>>        * config/aarch64/aarch64.c
>>>>>>> (aarch64_override_options_internal):
>>>>>>>        Increase loop peeling limit.
>>>>>>>
>>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>>> With this change, I noticed no major regression in either
>>>>>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>>>>>> ones, improved significantly.
>>>>>>>
>>>>>>> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
>>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>>> >from other stakeholders.
>>>>>>
>>>>>> Ping.
>>>>> I'd like to leave this for a call from the port maintainers. I can
>>>>> see why
>>>>> this leads to more opportunities for vectorization, but I'm
>>>>> concerned about
>>>>> the wider impact on code size. Certainly I wouldn't expect this to
>>>>> be our
>>>>> default at -O2 and below.
>>>>>
>>>>> My gut feeling is that this doesn't really belong in the back-end
>>>>> (there are
>>>>> presumably good reasons why the default for this parameter across
>>>>> GCC has
>>>>> fluctuated from 400 to 100 to 200 over recent years), but as I say, I'd
>>>>> like Marcus or Richard to make the call as to whether or not we take
>>>>> this
>>>>> patch.
>>>> Please, correct me if I'm wrong, but loop peeling is enabled only
>>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>>> not a concern, for this heuristic is only active when unrolling
>>>> loops, when code size is already of secondary importance.
>>> My understanding was that loop peeling is enabled from -O2 upwards, and
>>> is also used to partially peel unaligned loops for vectorization
>>> (allowing
>>> the vector code to be well aligned), or to completely peel inner loops
>>> which
>>> may then become amenable to SLP vectorization.
>>>
>>> If I'm wrong then I take back these objections. But I was sure this
>>> parameter was used in a number of situations outside of just
>>> -funroll-loops/-funroll-all-loops . Certainly I remember seeing
>>> performance
>>> sensitivities to this parameter at -O3 in some internal workloads I was
>>> analysing.
>>
>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>> seems to me that peeling is only used by optimizations which already
>> lead to potential increase in code size.
>>
>> For instance, with "-Ofast -funroll-all-loops", the total text size for
>> the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
>> without it; with just "-O2", it is the same at 23.1MB regardless of this
>> setting.
>>
>> So it seems to me that this proposal should be neutral for up to -O2.
>>
>> Thank you,
>>
>
> My preference would be to not diverge from the global parameter
> settings.  I haven't looked in detail at this parameter but it seems to
> me there are two possible paths:
>
> 1) We could get agreement globally that the parameter should be increased.
> 2) We could agree that this specific use of the parameter is distinct
> from some other uses and deserves a new param in its own right with a
> higher value.

I think the fix is to improve the unrolled size estimates by better taking into
account constant propagation and CSE opportunities.  I have some ideas
here but not sure if I have enough free cycles to implement this for GCC 7.

Richard.

> R.
Evandro Menezes Dec. 16, 2015, 8:11 p.m. UTC | #8
On 12/16/2015 05:24 AM, Richard Earnshaw (lists) wrote:
> On 15/12/15 23:34, Evandro Menezes wrote:
>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>>> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>>>>>>>
>>>>>>>     gcc/
>>>>>>>
>>>>>>>         * config/aarch64/aarch64.c
>>>>>>> (aarch64_override_options_internal):
>>>>>>>         Increase loop peeling limit.
>>>>>>>
>>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>>> With this change, I noticed no major regression in either
>>>>>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>>>>>> ones, improved significantly.
>>>>>>>
>>>>>>> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
>>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>>> >from other stakeholders.
>>>>>>
>>>>>> Ping.
>>>>> I'd like to leave this for a call from the port maintainers. I can
>>>>> see why
>>>>> this leads to more opportunities for vectorization, but I'm
>>>>> concerned about
>>>>> the wider impact on code size. Certainly I wouldn't expect this to
>>>>> be our
>>>>> default at -O2 and below.
>>>>>
>>>>> My gut feeling is that this doesn't really belong in the back-end
>>>>> (there are
>>>>> presumably good reasons why the default for this parameter across
>>>>> GCC has
>>>>> fluctuated from 400 to 100 to 200 over recent years), but as I say, I'd
>>>>> like Marcus or Richard to make the call as to whether or not we take
>>>>> this
>>>>> patch.
>>>> Please, correct me if I'm wrong, but loop peeling is enabled only
>>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>>> not a concern, for this heuristic is only active when unrolling
>>>> loops, when code size is already of secondary importance.
>>> My understanding was that loop peeling is enabled from -O2 upwards, and
>>> is also used to partially peel unaligned loops for vectorization
>>> (allowing
>>> the vector code to be well aligned), or to completely peel inner loops
>>> which
>>> may then become amenable to SLP vectorization.
>>>
>>> If I'm wrong then I take back these objections. But I was sure this
>>> parameter was used in a number of situations outside of just
>>> -funroll-loops/-funroll-all-loops . Certainly I remember seeing
>>> performance
>>> sensitivities to this parameter at -O3 in some internal workloads I was
>>> analysing.
>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>> seems to me that peeling is only used by optimizations which already
>> lead to potential increase in code size.
>>
>> For instance, with "-Ofast -funroll-all-loops", the total text size for
>> the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
>> without it; with just "-O2", it is the same at 23.1MB regardless of this
>> setting.
>>
>> So it seems to me that this proposal should be neutral for up to -O2.
>>
>> Thank you,
>>
> My preference would be to not diverge from the global parameter
> settings.  I haven't looked in detail at this parameter but it seems to
> me there are two possible paths:
>
> 1) We could get agreement globally that the parameter should be increased.
> 2) We could agree that this specific use of the parameter is distinct
> from some other uses and deserves a new param in its own right with a
> higher value.
>

Here's what I have observed, not only in AArch64: architectures benefit 
differently from certain loop optimizations, especially those dealing 
with vectorization.  Be it because some have plenty of registers of more 
aggressive loop unrolling, or because some have lower costs to 
vectorize.  With this, I'm trying to imply that there may be the case to 
wiggle this parameter to suit loop optimizations better to specific 
targets.  While it is not the only parameter related to loop 
optimizations, it seems to be the one with the desired effects, as 
exemplified by PPC, S390 and x86 (AOSP).  Though there is the 
possibility that they are actually side-effects, as Richard Biener 
perhaps implied in another reply.

Cheers,
Evandro Menezes Jan. 8, 2016, 10:55 p.m. UTC | #9
On 12/16/2015 02:11 PM, Evandro Menezes wrote:
> On 12/16/2015 05:24 AM, Richard Earnshaw (lists) wrote:
>> On 15/12/15 23:34, Evandro Menezes wrote:
>>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>>>> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>>>>>>>>
>>>>>>>>     gcc/
>>>>>>>>
>>>>>>>>         * config/aarch64/aarch64.c
>>>>>>>> (aarch64_override_options_internal):
>>>>>>>>         Increase loop peeling limit.
>>>>>>>>
>>>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>>>> With this change, I noticed no major regression in either
>>>>>>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>>>>>>> ones, improved significantly.
>>>>>>>>
>>>>>>>> I tested this tuning on Exynos M1 and on A57. ThunderX seems to
>>>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>>>> >from other stakeholders.
>>>>>>>
>>>>>>> Ping.
>>>>>> I'd like to leave this for a call from the port maintainers. I can
>>>>>> see why
>>>>>> this leads to more opportunities for vectorization, but I'm
>>>>>> concerned about
>>>>>> the wider impact on code size. Certainly I wouldn't expect this to
>>>>>> be our
>>>>>> default at -O2 and below.
>>>>>>
>>>>>> My gut feeling is that this doesn't really belong in the back-end
>>>>>> (there are
>>>>>> presumably good reasons why the default for this parameter across
>>>>>> GCC has
>>>>>> fluctuated from 400 to 100 to 200 over recent years), but as I 
>>>>>> say, I'd
>>>>>> like Marcus or Richard to make the call as to whether or not we take
>>>>>> this
>>>>>> patch.
>>>>> Please, correct me if I'm wrong, but loop peeling is enabled only
>>>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>>>> not a concern, for this heuristic is only active when unrolling
>>>>> loops, when code size is already of secondary importance.
>>>> My understanding was that loop peeling is enabled from -O2 upwards, 
>>>> and
>>>> is also used to partially peel unaligned loops for vectorization
>>>> (allowing
>>>> the vector code to be well aligned), or to completely peel inner loops
>>>> which
>>>> may then become amenable to SLP vectorization.
>>>>
>>>> If I'm wrong then I take back these objections. But I was sure this
>>>> parameter was used in a number of situations outside of just
>>>> -funroll-loops/-funroll-all-loops . Certainly I remember seeing
>>>> performance
>>>> sensitivities to this parameter at -O3 in some internal workloads I 
>>>> was
>>>> analysing.
>>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>>> seems to me that peeling is only used by optimizations which already
>>> lead to potential increase in code size.
>>>
>>> For instance, with "-Ofast -funroll-all-loops", the total text size for
>>> the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
>>> without it; with just "-O2", it is the same at 23.1MB regardless of 
>>> this
>>> setting.
>>>
>>> So it seems to me that this proposal should be neutral for up to -O2.
>>>
>>> Thank you,
>>>
>> My preference would be to not diverge from the global parameter
>> settings.  I haven't looked in detail at this parameter but it seems to
>> me there are two possible paths:
>>
>> 1) We could get agreement globally that the parameter should be 
>> increased.
>> 2) We could agree that this specific use of the parameter is distinct
>> from some other uses and deserves a new param in its own right with a
>> higher value.
>>
>
> Here's what I have observed, not only in AArch64: architectures 
> benefit differently from certain loop optimizations, especially those 
> dealing with vectorization.  Be it because some have plenty of 
> registers of more aggressive loop unrolling, or because some have 
> lower costs to vectorize.  With this, I'm trying to imply that there 
> may be the case to wiggle this parameter to suit loop optimizations 
> better to specific targets.  While it is not the only parameter 
> related to loop optimizations, it seems to be the one with the desired 
> effects, as exemplified by PPC, S390 and x86 (AOSP).  Though there is 
> the possibility that they are actually side-effects, as Richard Biener 
> perhaps implied in another reply.
>


Gents,

Any new thoughts on this proposal?

Thank you,
Evandro Menezes Feb. 3, 2016, 7:46 p.m. UTC | #10
On 01/08/16 16:55, Evandro Menezes wrote:
> On 12/16/2015 02:11 PM, Evandro Menezes wrote:
>> On 12/16/2015 05:24 AM, Richard Earnshaw (lists) wrote:
>>> On 15/12/15 23:34, Evandro Menezes wrote:
>>>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>>>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>>>>> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>>>>>>>>>
>>>>>>>>>     gcc/
>>>>>>>>>
>>>>>>>>>         * config/aarch64/aarch64.c
>>>>>>>>> (aarch64_override_options_internal):
>>>>>>>>>         Increase loop peeling limit.
>>>>>>>>>
>>>>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>>>>> With this change, I noticed no major regression in either
>>>>>>>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>>>>>>>> ones, improved significantly.
>>>>>>>>>
>>>>>>>>> I tested this tuning on Exynos M1 and on A57. ThunderX seems to
>>>>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>>>>> >from other stakeholders.
>>>>>>>>
>>>>>>>> Ping.
>>>>>>> I'd like to leave this for a call from the port maintainers. I can
>>>>>>> see why
>>>>>>> this leads to more opportunities for vectorization, but I'm
>>>>>>> concerned about
>>>>>>> the wider impact on code size. Certainly I wouldn't expect this to
>>>>>>> be our
>>>>>>> default at -O2 and below.
>>>>>>>
>>>>>>> My gut feeling is that this doesn't really belong in the back-end
>>>>>>> (there are
>>>>>>> presumably good reasons why the default for this parameter across
>>>>>>> GCC has
>>>>>>> fluctuated from 400 to 100 to 200 over recent years), but as I 
>>>>>>> say, I'd
>>>>>>> like Marcus or Richard to make the call as to whether or not we 
>>>>>>> take
>>>>>>> this
>>>>>>> patch.
>>>>>> Please, correct me if I'm wrong, but loop peeling is enabled only
>>>>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>>>>> not a concern, for this heuristic is only active when unrolling
>>>>>> loops, when code size is already of secondary importance.
>>>>> My understanding was that loop peeling is enabled from -O2 
>>>>> upwards, and
>>>>> is also used to partially peel unaligned loops for vectorization
>>>>> (allowing
>>>>> the vector code to be well aligned), or to completely peel inner 
>>>>> loops
>>>>> which
>>>>> may then become amenable to SLP vectorization.
>>>>>
>>>>> If I'm wrong then I take back these objections. But I was sure this
>>>>> parameter was used in a number of situations outside of just
>>>>> -funroll-loops/-funroll-all-loops . Certainly I remember seeing
>>>>> performance
>>>>> sensitivities to this parameter at -O3 in some internal workloads 
>>>>> I was
>>>>> analysing.
>>>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>>>> seems to me that peeling is only used by optimizations which already
>>>> lead to potential increase in code size.
>>>>
>>>> For instance, with "-Ofast -funroll-all-loops", the total text size 
>>>> for
>>>> the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
>>>> without it; with just "-O2", it is the same at 23.1MB regardless of 
>>>> this
>>>> setting.
>>>>
>>>> So it seems to me that this proposal should be neutral for up to -O2.
>>>>
>>>> Thank you,
>>>>
>>> My preference would be to not diverge from the global parameter
>>> settings.  I haven't looked in detail at this parameter but it seems to
>>> me there are two possible paths:
>>>
>>> 1) We could get agreement globally that the parameter should be 
>>> increased.
>>> 2) We could agree that this specific use of the parameter is distinct
>>> from some other uses and deserves a new param in its own right with a
>>> higher value.
>>>
>>
>> Here's what I have observed, not only in AArch64: architectures 
>> benefit differently from certain loop optimizations, especially those 
>> dealing with vectorization.  Be it because some have plenty of 
>> registers of more aggressive loop unrolling, or because some have 
>> lower costs to vectorize.  With this, I'm trying to imply that there 
>> may be the case to wiggle this parameter to suit loop optimizations 
>> better to specific targets.  While it is not the only parameter 
>> related to loop optimizations, it seems to be the one with the 
>> desired effects, as exemplified by PPC, S390 and x86 (AOSP).  Though 
>> there is the possibility that they are actually side-effects, as 
>> Richard Biener perhaps implied in another reply.
>>
>
>
> Gents,
>
> Any new thoughts on this proposal?
>

Ping?
Evandro Menezes March 16, 2016, 7:48 p.m. UTC | #11
On 02/03/16 13:46, Evandro Menezes wrote:
> On 01/08/16 16:55, Evandro Menezes wrote:
>> On 12/16/2015 02:11 PM, Evandro Menezes wrote:
>>> On 12/16/2015 05:24 AM, Richard Earnshaw (lists) wrote:
>>>> On 15/12/15 23:34, Evandro Menezes wrote:
>>>>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>>>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>>>>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>>>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>>>>>> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>>>>>>>>>>
>>>>>>>>>>     gcc/
>>>>>>>>>>
>>>>>>>>>>         * config/aarch64/aarch64.c
>>>>>>>>>> (aarch64_override_options_internal):
>>>>>>>>>>         Increase loop peeling limit.
>>>>>>>>>>
>>>>>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>>>>>> With this change, I noticed no major regression in either
>>>>>>>>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>>>>>>>>> ones, improved significantly.
>>>>>>>>>>
>>>>>>>>>> I tested this tuning on Exynos M1 and on A57. ThunderX seems to
>>>>>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>>>>>> >from other stakeholders.
>>>>>>>>>
>>>>>>>>> Ping.
>>>>>>>>
>>>>>>>> I'd like to leave this for a call from the port maintainers. I can
>>>>>>>> see why
>>>>>>>> this leads to more opportunities for vectorization, but I'm
>>>>>>>> concerned about
>>>>>>>> the wider impact on code size. Certainly I wouldn't expect this to
>>>>>>>> be our
>>>>>>>> default at -O2 and below.
>>>>>>>>
>>>>>>>> My gut feeling is that this doesn't really belong in the back-end
>>>>>>>> (there are
>>>>>>>> presumably good reasons why the default for this parameter across
>>>>>>>> GCC has
>>>>>>>> fluctuated from 400 to 100 to 200 over recent years), but as I 
>>>>>>>> say, I'd
>>>>>>>> like Marcus or Richard to make the call as to whether or not we 
>>>>>>>> take
>>>>>>>> this
>>>>>>>> patch.
>>>>>>>
>>>>>>> Please, correct me if I'm wrong, but loop peeling is enabled only
>>>>>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>>>>>> not a concern, for this heuristic is only active when unrolling
>>>>>>> loops, when code size is already of secondary importance.
>>>>>>
>>>>>> My understanding was that loop peeling is enabled from -O2 
>>>>>> upwards, and
>>>>>> is also used to partially peel unaligned loops for vectorization
>>>>>> (allowing
>>>>>> the vector code to be well aligned), or to completely peel inner 
>>>>>> loops
>>>>>> which
>>>>>> may then become amenable to SLP vectorization.
>>>>>>
>>>>>> If I'm wrong then I take back these objections. But I was sure this
>>>>>> parameter was used in a number of situations outside of just
>>>>>> -funroll-loops/-funroll-all-loops . Certainly I remember seeing
>>>>>> performance
>>>>>> sensitivities to this parameter at -O3 in some internal workloads 
>>>>>> I was
>>>>>> analysing.
>>>>>
>>>>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>>>>> seems to me that peeling is only used by optimizations which already
>>>>> lead to potential increase in code size.
>>>>>
>>>>> For instance, with "-Ofast -funroll-all-loops", the total text 
>>>>> size for
>>>>> the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
>>>>> without it; with just "-O2", it is the same at 23.1MB regardless 
>>>>> of this
>>>>> setting.
>>>>>
>>>>> So it seems to me that this proposal should be neutral for up to -O2.
>>>>
>>>> My preference would be to not diverge from the global parameter
>>>> settings.  I haven't looked in detail at this parameter but it 
>>>> seems to
>>>> me there are two possible paths:
>>>>
>>>> 1) We could get agreement globally that the parameter should be 
>>>> increased.
>>>> 2) We could agree that this specific use of the parameter is distinct
>>>> from some other uses and deserves a new param in its own right with a
>>>> higher value.
>>>
>>> Here's what I have observed, not only in AArch64: architectures 
>>> benefit differently from certain loop optimizations, especially 
>>> those dealing with vectorization. Be it because some have plenty of 
>>> registers of more aggressive loop unrolling, or because some have 
>>> lower costs to vectorize.  With this, I'm trying to imply that there 
>>> may be the case to wiggle this parameter to suit loop optimizations 
>>> better to specific targets.  While it is not the only parameter 
>>> related to loop optimizations, it seems to be the one with the 
>>> desired effects, as exemplified by PPC, S390 and x86 (AOSP).  Though 
>>> there is the possibility that they are actually side-effects, as 
>>> Richard Biener perhaps implied in another reply.
>>
>> Gents,
>>
>> Any new thoughts on this proposal?
>
> Ping?

Ping^2
Evandro Menezes March 31, 2016, 10:23 p.m. UTC | #12
On 03/16/16 14:48, Evandro Menezes wrote:
> On 02/03/16 13:46, Evandro Menezes wrote:
>> On 01/08/16 16:55, Evandro Menezes wrote:
>>> On 12/16/2015 02:11 PM, Evandro Menezes wrote:
>>>> On 12/16/2015 05:24 AM, Richard Earnshaw (lists) wrote:
>>>>> On 15/12/15 23:34, Evandro Menezes wrote:
>>>>>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>>>>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>>>>>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>>>>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>>>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>>>>>>> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>>>>>>>>>>>
>>>>>>>>>>>     gcc/
>>>>>>>>>>>
>>>>>>>>>>>         * config/aarch64/aarch64.c
>>>>>>>>>>> (aarch64_override_options_internal):
>>>>>>>>>>>         Increase loop peeling limit.
>>>>>>>>>>>
>>>>>>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>>>>>>> With this change, I noticed no major regression in either
>>>>>>>>>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, 
>>>>>>>>>>> typically FP
>>>>>>>>>>> ones, improved significantly.
>>>>>>>>>>>
>>>>>>>>>>> I tested this tuning on Exynos M1 and on A57. ThunderX seems to
>>>>>>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>>>>>>> >from other stakeholders.
>>>>>>>>>>
>>>>>>>>>> Ping.
>>>>>>>>>
>>>>>>>>> I'd like to leave this for a call from the port maintainers. I 
>>>>>>>>> can
>>>>>>>>> see why
>>>>>>>>> this leads to more opportunities for vectorization, but I'm
>>>>>>>>> concerned about
>>>>>>>>> the wider impact on code size. Certainly I wouldn't expect 
>>>>>>>>> this to
>>>>>>>>> be our
>>>>>>>>> default at -O2 and below.
>>>>>>>>>
>>>>>>>>> My gut feeling is that this doesn't really belong in the back-end
>>>>>>>>> (there are
>>>>>>>>> presumably good reasons why the default for this parameter across
>>>>>>>>> GCC has
>>>>>>>>> fluctuated from 400 to 100 to 200 over recent years), but as I 
>>>>>>>>> say, I'd
>>>>>>>>> like Marcus or Richard to make the call as to whether or not 
>>>>>>>>> we take
>>>>>>>>> this
>>>>>>>>> patch.
>>>>>>>>
>>>>>>>> Please, correct me if I'm wrong, but loop peeling is enabled only
>>>>>>>> with loop unrolling (and with PGO).  If so, then extra code 
>>>>>>>> size is
>>>>>>>> not a concern, for this heuristic is only active when unrolling
>>>>>>>> loops, when code size is already of secondary importance.
>>>>>>>
>>>>>>> My understanding was that loop peeling is enabled from -O2 
>>>>>>> upwards, and
>>>>>>> is also used to partially peel unaligned loops for vectorization
>>>>>>> (allowing
>>>>>>> the vector code to be well aligned), or to completely peel inner 
>>>>>>> loops
>>>>>>> which
>>>>>>> may then become amenable to SLP vectorization.
>>>>>>>
>>>>>>> If I'm wrong then I take back these objections. But I was sure this
>>>>>>> parameter was used in a number of situations outside of just
>>>>>>> -funroll-loops/-funroll-all-loops . Certainly I remember seeing
>>>>>>> performance
>>>>>>> sensitivities to this parameter at -O3 in some internal 
>>>>>>> workloads I was
>>>>>>> analysing.
>>>>>>
>>>>>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>>>>>> seems to me that peeling is only used by optimizations which already
>>>>>> lead to potential increase in code size.
>>>>>>
>>>>>> For instance, with "-Ofast -funroll-all-loops", the total text 
>>>>>> size for
>>>>>> the SPEC CPU2000 suite is 26.9MB with this proposed change and 
>>>>>> 26.8MB
>>>>>> without it; with just "-O2", it is the same at 23.1MB regardless 
>>>>>> of this
>>>>>> setting.
>>>>>>
>>>>>> So it seems to me that this proposal should be neutral for up to 
>>>>>> -O2.
>>>>>
>>>>> My preference would be to not diverge from the global parameter
>>>>> settings.  I haven't looked in detail at this parameter but it 
>>>>> seems to
>>>>> me there are two possible paths:
>>>>>
>>>>> 1) We could get agreement globally that the parameter should be 
>>>>> increased.
>>>>> 2) We could agree that this specific use of the parameter is distinct
>>>>> from some other uses and deserves a new param in its own right with a
>>>>> higher value.
>>>>
>>>> Here's what I have observed, not only in AArch64: architectures 
>>>> benefit differently from certain loop optimizations, especially 
>>>> those dealing with vectorization. Be it because some have plenty of 
>>>> registers of more aggressive loop unrolling, or because some have 
>>>> lower costs to vectorize.  With this, I'm trying to imply that 
>>>> there may be the case to wiggle this parameter to suit loop 
>>>> optimizations better to specific targets.  While it is not the only 
>>>> parameter related to loop optimizations, it seems to be the one 
>>>> with the desired effects, as exemplified by PPC, S390 and x86 
>>>> (AOSP).  Though there is the possibility that they are actually 
>>>> side-effects, as Richard Biener perhaps implied in another reply.
>>>
>>> Gents,
>>>
>>> Any new thoughts on this proposal?
>>
>> Ping?
>
> Ping^2
>

Ping^3
diff mbox

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 5c8604f..66122e7 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -7672,6 +7672,12 @@  aarch64_override_options_internal (struct gcc_options *opts)
 			 opts->x_param_values,
 			 global_options_set.x_param_values);
 
+  /* Increase the maximum peeling limit.  */
+  maybe_set_param_value (PARAM_MAX_COMPLETELY_PEELED_INSNS,
+			 400,
+			 opts->x_param_values,
+			 global_options_set.x_param_values);
+
   aarch64_override_options_after_change_1 (opts);
 }