diff mbox

[ARM] Update max_cond_insns settings

Message ID AM5PR0802MB261045DA4C8A4F9CF8E5D21F83030@AM5PR0802MB2610.eurprd08.prod.outlook.com
State New
Headers show

Commit Message

Wilco Dijkstra April 12, 2017, 1:02 p.m. UTC
The existing setting of max_cond_insns for most cores is non-optimal.
Thumb-2 IT has a maximum limit of 4, so 5 means emitting 2 IT sequences.
Also such long sequences of conditional instructions can increase the number
of executed instructions significantly, so using 5 for max_cond_insns is
non-optimal.

Previous benchmarking showed that setting max_cond_insn to 2 was the best value
for Cortex-A15 and Cortex-A57.  All ARMv8-A cores use 2 - apart from Cortex-A35
and Cortex-A53.  Given that using 5 is worse, set it to 2.  This also has the
advantage of producing more uniform code.

Bootstrap and regress OK on arm-none-linux-gnueabihf.

OK for stage 1?

ChangeLog:
2017-04-12  Wilco Dijkstra  <wdijkstr@arm.com>

        * gcc/config/arm/arm.c (arm_cortex_a53_tune): Set max_cond_insns to 2.
        (arm_cortex_a35_tune): Likewise.
---

Comments

Richard Earnshaw (lists) May 4, 2017, 10:23 a.m. UTC | #1
On 12/04/17 14:02, Wilco Dijkstra wrote:
> The existing setting of max_cond_insns for most cores is non-optimal.
> Thumb-2 IT has a maximum limit of 4, so 5 means emitting 2 IT sequences.
> Also such long sequences of conditional instructions can increase the number
> of executed instructions significantly, so using 5 for max_cond_insns is
> non-optimal.
> 
> Previous benchmarking showed that setting max_cond_insn to 2 was the best value
> for Cortex-A15 and Cortex-A57.  All ARMv8-A cores use 2 - apart from Cortex-A35
> and Cortex-A53.  Given that using 5 is worse, set it to 2.  This also has the
> advantage of producing more uniform code.
> 
> Bootstrap and regress OK on arm-none-linux-gnueabihf.
> 
> OK for stage 1?
> 
> ChangeLog:
> 2017-04-12  Wilco Dijkstra  <wdijkstr@arm.com>
> 
>         * gcc/config/arm/arm.c (arm_cortex_a53_tune): Set max_cond_insns to 2.
>         (arm_cortex_a35_tune): Likewise.
> ---
> 
> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> index 29e8d1d07d918fbb2a627a653510dfc8587ee01a..1a6d552aa322114795acbb3667c6ea36963bf193 100644
> --- a/gcc/config/arm/arm.c
> +++ b/gcc/config/arm/arm.c
> @@ -1967,7 +1967,7 @@ const struct tune_params arm_cortex_a35_tune =
>    arm_default_branch_cost,
>    &arm_default_vec_cost,
>    1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  2,						/* Max cond insns.  */
>    8,						/* Memset max inline.  */
>    1,						/* Issue rate.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> @@ -1990,7 +1990,7 @@ const struct tune_params arm_cortex_a53_tune =
>    arm_default_branch_cost,
>    &arm_default_vec_cost,
>    1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  2,						/* Max cond insns.  */
>    8,						/* Memset max inline.  */
>    2,						/* Issue rate.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> 


This parameter is also used for A32 code.  Is that really the right
number there as well?

I do wonder if the code in arm_option_params_internal should be tweaked
to hard-limit the number of skipped insns for Thumb2 to one IT block.  So

    /* When -mrestrict-it is in use tone down the if-conversion.  */
    max_insns_skipped = (TARGET_THUMB2 && arm_restrict_it)
      ? 1
      : (TARGET_THUMB2 ? MIN (current_tune->max_insns_skipped, 4)
         | current_tune->max_insns_skipped);
Wilco Dijkstra May 4, 2017, 5:38 p.m. UTC | #2
Richard Earnshaw wrote:

> -  5,                                         /* Max cond insns.  */
> +  2,                                         /* Max cond insns.  */

> This parameter is also used for A32 code.  Is that really the right
> number there as well?

Yes, this parameter has always been the same for ARM and Thumb-2.

> I do wonder if the code in arm_option_params_internal should be tweaked
> to hard-limit the number of skipped insns for Thumb2 to one IT block.  So

You mean https://gcc.gnu.org/ml/gcc-patches/2017-01/msg01191.html ? :-)

Wilco
Richard Earnshaw (lists) May 5, 2017, 12:19 p.m. UTC | #3
On 04/05/17 18:38, Wilco Dijkstra wrote:
> Richard Earnshaw wrote:
> 
>> -  5,                                         /* Max cond insns.  */
>> +  2,                                         /* Max cond insns.  */
> 
>> This parameter is also used for A32 code.  Is that really the right
>> number there as well?
> 
> Yes, this parameter has always been the same for ARM and Thumb-2.

I know that.  I'm questioning whether that number (2) is right when on
ARM.  It seems very low to me, especially when branches are unpredictable.

> 
>> I do wonder if the code in arm_option_params_internal should be tweaked
>> to hard-limit the number of skipped insns for Thumb2 to one IT block.  So
> 
> You mean https://gcc.gnu.org/ml/gcc-patches/2017-01/msg01191.html ? :-)
> 

Haven't got as far as that one yet.

R.

> Wilco
>
Wilco Dijkstra May 5, 2017, 12:42 p.m. UTC | #4
Richard Earnshaw (lists) wrote:
> On 04/05/17 18:38, Wilco Dijkstra wrote:
> > Richard Earnshaw wrote:
> > 
>>> -  5,                                         /* Max cond insns.  */
>>> +  2,                                         /* Max cond insns.  */
>> 
>>> This parameter is also used for A32 code.  Is that really the right
>>> number there as well?
>> 
>> Yes, this parameter has always been the same for ARM and Thumb-2.
>
> I know that.  I'm questioning whether that number (2) is right when on
> ARM.  It seems very low to me, especially when branches are unpredictable.

Why does it seem low? Benchmarking showed 2 was the best value for modern
cores. The same branch predictor is used, so the same settings should be used
for ARM and Thumb-2.

Wilco
Richard Earnshaw (lists) May 5, 2017, 12:49 p.m. UTC | #5
On 05/05/17 13:42, Wilco Dijkstra wrote:
> Richard Earnshaw (lists) wrote:
>> On 04/05/17 18:38, Wilco Dijkstra wrote:
>> > Richard Earnshaw wrote:
>> > 
>>>> -  5,                                         /* Max cond insns.  */
>>>> +  2,                                         /* Max cond insns.  */
>>> 
>>>> This parameter is also used for A32 code.  Is that really the right
>>>> number there as well?
>>> 
>>> Yes, this parameter has always been the same for ARM and Thumb-2.
>>
>> I know that.  I'm questioning whether that number (2) is right when on
>> ARM.  It seems very low to me, especially when branches are unpredictable.
> 
> Why does it seem low? Benchmarking showed 2 was the best value for modern
> cores. The same branch predictor is used, so the same settings should be
> used
> for ARM and Thumb-2.
> 
> Wilco
> 
>    

Thumb2 code has to execute an additional instruction to start an IT
sequence.  It might therefore seem reasonable for the ARM sequence to be
one instruction longer.

R.
Wilco Dijkstra May 5, 2017, 4:29 p.m. UTC | #6
Richard Earnshaw (lists) wrote:
> On 05/05/17 13:42, Wilco Dijkstra wrote:
>> Richard Earnshaw (lists) wrote:
>>> On 04/05/17 18:38, Wilco Dijkstra wrote:
>>> > Richard Earnshaw wrote:
>>> > 
>>>>> -  5,                                         /* Max cond insns.  */
>>>>> +  2,                                         /* Max cond insns.  */
>>>> 
>>>>> This parameter is also used for A32 code.  Is that really the right
>>>>> number there as well?
>>>> 
>>>> Yes, this parameter has always been the same for ARM and Thumb-2.
>>>
>>> I know that.  I'm questioning whether that number (2) is right when on
>>> ARM.  It seems very low to me, especially when branches are unpredictable.
>> 
>> Why does it seem low? Benchmarking showed 2 was the best value for modern
>> cores. The same branch predictor is used, so the same settings should be
>> used
>> for ARM and Thumb-2.
>
> Thumb2 code has to execute an additional instruction to start an IT
> sequence.  It might therefore seem reasonable for the ARM sequence to be
> one instruction longer.

The IT instruction has no inputs/outputs and thus behaves like a NOP - unlike
conditional instructions which have real latencies and additional dependencies due
to being conditional. So the overhead of IT itself is small.

Wilco
Wilco Dijkstra June 13, 2017, 1:56 p.m. UTC | #7
ping
    
Richard Earnshaw (lists) wrote:
> On 05/05/17 13:42, Wilco Dijkstra wrote:
>> Richard Earnshaw (lists) wrote:
>>> On 04/05/17 18:38, Wilco Dijkstra wrote:
>>> > Richard Earnshaw wrote:
>>> > 
>>>>> -  5,                                         /* Max cond insns.  */
>>>>> +  2,                                         /* Max cond insns.  */
>>>> 
>>>>> This parameter is also used for A32 code.  Is that really the right
>>>>> number there as well?
>>>> 
>>>> Yes, this parameter has always been the same for ARM and Thumb-2.
>>>
>>> I know that.  I'm questioning whether that number (2) is right when on
>>> ARM.  It seems very low to me, especially when branches are unpredictable.
>> 
>> Why does it seem low? Benchmarking showed 2 was the best value for modern
>> cores. The same branch predictor is used, so the same settings should be
>> used
>> for ARM and Thumb-2.
>
> Thumb2 code has to execute an additional instruction to start an IT
> sequence.  It might therefore seem reasonable for the ARM sequence to be
> one instruction longer.

The IT instruction has no inputs/outputs and thus behaves like a NOP - unlike
conditional instructions which have real latencies and additional dependencies due
to being conditional. So the overhead of IT itself is small.

Wilco
Wilco Dijkstra June 27, 2017, 3:41 p.m. UTC | #8
ping
    
Richard Earnshaw (lists) wrote:
> On 05/05/17 13:42, Wilco Dijkstra wrote:
>> Richard Earnshaw (lists) wrote:
>>> On 04/05/17 18:38, Wilco Dijkstra wrote:
>>> > Richard Earnshaw wrote:
>>> > 
>>>>> -  5,                                         /* Max cond insns.  */
>>>>> +  2,                                         /* Max cond insns.  */
>>>> 
>>>>> This parameter is also used for A32 code.  Is that really the right
>>>>> number there as well?
>>>> 
>>>> Yes, this parameter has always been the same for ARM and Thumb-2.
>>>
>>> I know that.  I'm questioning whether that number (2) is right when on
>>> ARM.  It seems very low to me, especially when branches are unpredictable.
>> 
>> Why does it seem low? Benchmarking showed 2 was the best value for modern
>> cores. The same branch predictor is used, so the same settings should be
>> used
>> for ARM and Thumb-2.
>
> Thumb2 code has to execute an additional instruction to start an IT
> sequence.  It might therefore seem reasonable for the ARM sequence to be
> one instruction longer.

The IT instruction has no inputs/outputs and thus behaves like a NOP - unlike
conditional instructions which have real latencies and additional dependencies due
to being conditional. So the overhead of IT itself is small.

Wilco
diff mbox

Patch

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 29e8d1d07d918fbb2a627a653510dfc8587ee01a..1a6d552aa322114795acbb3667c6ea36963bf193 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -1967,7 +1967,7 @@  const struct tune_params arm_cortex_a35_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  2,						/* Max cond insns.  */
   8,						/* Memset max inline.  */
   1,						/* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -1990,7 +1990,7 @@  const struct tune_params arm_cortex_a53_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  2,						/* Max cond insns.  */
   8,						/* Memset max inline.  */
   2,						/* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,