Message ID | AM5PR0802MB261045DA4C8A4F9CF8E5D21F83030@AM5PR0802MB2610.eurprd08.prod.outlook.com |
---|---|
State | New |
Headers | show |
On 12/04/17 14:02, Wilco Dijkstra wrote: > The existing setting of max_cond_insns for most cores is non-optimal. > Thumb-2 IT has a maximum limit of 4, so 5 means emitting 2 IT sequences. > Also such long sequences of conditional instructions can increase the number > of executed instructions significantly, so using 5 for max_cond_insns is > non-optimal. > > Previous benchmarking showed that setting max_cond_insn to 2 was the best value > for Cortex-A15 and Cortex-A57. All ARMv8-A cores use 2 - apart from Cortex-A35 > and Cortex-A53. Given that using 5 is worse, set it to 2. This also has the > advantage of producing more uniform code. > > Bootstrap and regress OK on arm-none-linux-gnueabihf. > > OK for stage 1? > > ChangeLog: > 2017-04-12 Wilco Dijkstra <wdijkstr@arm.com> > > * gcc/config/arm/arm.c (arm_cortex_a53_tune): Set max_cond_insns to 2. > (arm_cortex_a35_tune): Likewise. > --- > > diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c > index 29e8d1d07d918fbb2a627a653510dfc8587ee01a..1a6d552aa322114795acbb3667c6ea36963bf193 100644 > --- a/gcc/config/arm/arm.c > +++ b/gcc/config/arm/arm.c > @@ -1967,7 +1967,7 @@ const struct tune_params arm_cortex_a35_tune = > arm_default_branch_cost, > &arm_default_vec_cost, > 1, /* Constant limit. */ > - 5, /* Max cond insns. */ > + 2, /* Max cond insns. */ > 8, /* Memset max inline. */ > 1, /* Issue rate. */ > ARM_PREFETCH_NOT_BENEFICIAL, > @@ -1990,7 +1990,7 @@ const struct tune_params arm_cortex_a53_tune = > arm_default_branch_cost, > &arm_default_vec_cost, > 1, /* Constant limit. */ > - 5, /* Max cond insns. */ > + 2, /* Max cond insns. */ > 8, /* Memset max inline. */ > 2, /* Issue rate. */ > ARM_PREFETCH_NOT_BENEFICIAL, > This parameter is also used for A32 code. Is that really the right number there as well? I do wonder if the code in arm_option_params_internal should be tweaked to hard-limit the number of skipped insns for Thumb2 to one IT block. So /* When -mrestrict-it is in use tone down the if-conversion. */ max_insns_skipped = (TARGET_THUMB2 && arm_restrict_it) ? 1 : (TARGET_THUMB2 ? MIN (current_tune->max_insns_skipped, 4) | current_tune->max_insns_skipped);
Richard Earnshaw wrote: > - 5, /* Max cond insns. */ > + 2, /* Max cond insns. */ > This parameter is also used for A32 code. Is that really the right > number there as well? Yes, this parameter has always been the same for ARM and Thumb-2. > I do wonder if the code in arm_option_params_internal should be tweaked > to hard-limit the number of skipped insns for Thumb2 to one IT block. So You mean https://gcc.gnu.org/ml/gcc-patches/2017-01/msg01191.html ? :-) Wilco
On 04/05/17 18:38, Wilco Dijkstra wrote: > Richard Earnshaw wrote: > >> - 5, /* Max cond insns. */ >> + 2, /* Max cond insns. */ > >> This parameter is also used for A32 code. Is that really the right >> number there as well? > > Yes, this parameter has always been the same for ARM and Thumb-2. I know that. I'm questioning whether that number (2) is right when on ARM. It seems very low to me, especially when branches are unpredictable. > >> I do wonder if the code in arm_option_params_internal should be tweaked >> to hard-limit the number of skipped insns for Thumb2 to one IT block. So > > You mean https://gcc.gnu.org/ml/gcc-patches/2017-01/msg01191.html ? :-) > Haven't got as far as that one yet. R. > Wilco >
Richard Earnshaw (lists) wrote: > On 04/05/17 18:38, Wilco Dijkstra wrote: > > Richard Earnshaw wrote: > > >>> - 5, /* Max cond insns. */ >>> + 2, /* Max cond insns. */ >> >>> This parameter is also used for A32 code. Is that really the right >>> number there as well? >> >> Yes, this parameter has always been the same for ARM and Thumb-2. > > I know that. I'm questioning whether that number (2) is right when on > ARM. It seems very low to me, especially when branches are unpredictable. Why does it seem low? Benchmarking showed 2 was the best value for modern cores. The same branch predictor is used, so the same settings should be used for ARM and Thumb-2. Wilco
On 05/05/17 13:42, Wilco Dijkstra wrote: > Richard Earnshaw (lists) wrote: >> On 04/05/17 18:38, Wilco Dijkstra wrote: >> > Richard Earnshaw wrote: >> > >>>> - 5, /* Max cond insns. */ >>>> + 2, /* Max cond insns. */ >>> >>>> This parameter is also used for A32 code. Is that really the right >>>> number there as well? >>> >>> Yes, this parameter has always been the same for ARM and Thumb-2. >> >> I know that. I'm questioning whether that number (2) is right when on >> ARM. It seems very low to me, especially when branches are unpredictable. > > Why does it seem low? Benchmarking showed 2 was the best value for modern > cores. The same branch predictor is used, so the same settings should be > used > for ARM and Thumb-2. > > Wilco > > Thumb2 code has to execute an additional instruction to start an IT sequence. It might therefore seem reasonable for the ARM sequence to be one instruction longer. R.
Richard Earnshaw (lists) wrote: > On 05/05/17 13:42, Wilco Dijkstra wrote: >> Richard Earnshaw (lists) wrote: >>> On 04/05/17 18:38, Wilco Dijkstra wrote: >>> > Richard Earnshaw wrote: >>> > >>>>> - 5, /* Max cond insns. */ >>>>> + 2, /* Max cond insns. */ >>>> >>>>> This parameter is also used for A32 code. Is that really the right >>>>> number there as well? >>>> >>>> Yes, this parameter has always been the same for ARM and Thumb-2. >>> >>> I know that. I'm questioning whether that number (2) is right when on >>> ARM. It seems very low to me, especially when branches are unpredictable. >> >> Why does it seem low? Benchmarking showed 2 was the best value for modern >> cores. The same branch predictor is used, so the same settings should be >> used >> for ARM and Thumb-2. > > Thumb2 code has to execute an additional instruction to start an IT > sequence. It might therefore seem reasonable for the ARM sequence to be > one instruction longer. The IT instruction has no inputs/outputs and thus behaves like a NOP - unlike conditional instructions which have real latencies and additional dependencies due to being conditional. So the overhead of IT itself is small. Wilco
ping Richard Earnshaw (lists) wrote: > On 05/05/17 13:42, Wilco Dijkstra wrote: >> Richard Earnshaw (lists) wrote: >>> On 04/05/17 18:38, Wilco Dijkstra wrote: >>> > Richard Earnshaw wrote: >>> > >>>>> - 5, /* Max cond insns. */ >>>>> + 2, /* Max cond insns. */ >>>> >>>>> This parameter is also used for A32 code. Is that really the right >>>>> number there as well? >>>> >>>> Yes, this parameter has always been the same for ARM and Thumb-2. >>> >>> I know that. I'm questioning whether that number (2) is right when on >>> ARM. It seems very low to me, especially when branches are unpredictable. >> >> Why does it seem low? Benchmarking showed 2 was the best value for modern >> cores. The same branch predictor is used, so the same settings should be >> used >> for ARM and Thumb-2. > > Thumb2 code has to execute an additional instruction to start an IT > sequence. It might therefore seem reasonable for the ARM sequence to be > one instruction longer. The IT instruction has no inputs/outputs and thus behaves like a NOP - unlike conditional instructions which have real latencies and additional dependencies due to being conditional. So the overhead of IT itself is small. Wilco
ping Richard Earnshaw (lists) wrote: > On 05/05/17 13:42, Wilco Dijkstra wrote: >> Richard Earnshaw (lists) wrote: >>> On 04/05/17 18:38, Wilco Dijkstra wrote: >>> > Richard Earnshaw wrote: >>> > >>>>> - 5, /* Max cond insns. */ >>>>> + 2, /* Max cond insns. */ >>>> >>>>> This parameter is also used for A32 code. Is that really the right >>>>> number there as well? >>>> >>>> Yes, this parameter has always been the same for ARM and Thumb-2. >>> >>> I know that. I'm questioning whether that number (2) is right when on >>> ARM. It seems very low to me, especially when branches are unpredictable. >> >> Why does it seem low? Benchmarking showed 2 was the best value for modern >> cores. The same branch predictor is used, so the same settings should be >> used >> for ARM and Thumb-2. > > Thumb2 code has to execute an additional instruction to start an IT > sequence. It might therefore seem reasonable for the ARM sequence to be > one instruction longer. The IT instruction has no inputs/outputs and thus behaves like a NOP - unlike conditional instructions which have real latencies and additional dependencies due to being conditional. So the overhead of IT itself is small. Wilco
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 29e8d1d07d918fbb2a627a653510dfc8587ee01a..1a6d552aa322114795acbb3667c6ea36963bf193 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -1967,7 +1967,7 @@ const struct tune_params arm_cortex_a35_tune = arm_default_branch_cost, &arm_default_vec_cost, 1, /* Constant limit. */ - 5, /* Max cond insns. */ + 2, /* Max cond insns. */ 8, /* Memset max inline. */ 1, /* Issue rate. */ ARM_PREFETCH_NOT_BENEFICIAL, @@ -1990,7 +1990,7 @@ const struct tune_params arm_cortex_a53_tune = arm_default_branch_cost, &arm_default_vec_cost, 1, /* Constant limit. */ - 5, /* Max cond insns. */ + 2, /* Max cond insns. */ 8, /* Memset max inline. */ 2, /* Issue rate. */ ARM_PREFETCH_NOT_BENEFICIAL,