diff mbox

[RFC,AArch64] Improve generic branch cost

Message ID VI1PR0802MB262138504A5473B8286763C783210@VI1PR0802MB2621.eurprd08.prod.outlook.com
State New
Headers show

Commit Message

Wilco Dijkstra March 9, 2017, 2:42 p.m. UTC
Hi,

Recently we've put a lot of effort into improving ifcvt to use CSEL on AArch64.
In  https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01639.html James determined
the best value for AArch64 code generation.  Although this setting is used when
explicitly targeting Cortex cores, it is not otherwise used.  This means by
default GCC will not use (F)CSEL in many common cases.  Most code is built
without -mcpu= and thus doesn't use CSEL like this example from GLIBC:

strtok:
    stp    x29, x30, [sp, -48]!
    add    x29, sp, 0
    stp    x21, x22, [sp, 32]
    mov    x21, x1
    stp    x19, x20, [sp, 16]
    adrp    x22, .LANCHOR0
    mov    x19, x0
    cbz    x0, .L12
.L2:    ldrb    w0, [x19]

.L12:
    ldr    x19, [x22, #:lo12:.LANCHOR0]
    b    .L2

With -mcpu=cortex-a57 GCC generates:

    stp    x29, x30, [sp, -48]!
    cmp    x0, 0
    add    x29, sp, 0
    stp    x21, x22, [sp, 32]
    adrp    x21, .LANCHOR0
    stp    x19, x20, [sp, 16]
    mov    x19, x0
    ldr    x0, [x21, #:lo12:.LANCHOR0]
    csel    x19, x0, x19, eq
    ldrb    w0, [x19]

This is generally faster and smaller.  On one benchmark the new setting fixes a 
regression since GCC6 and improves performance by 49%.  So I propose to change
generic_branch_cost to be the same as cortexa57_branch_cost so that all supported
cores benefit equally from CSEL.  Are there any objections to this?

Wilco


ChangeLog:
2017-03-09  Wilco Dijkstra  <wdijkstr@arm.com>

	* config/aarch64/aarch64.c (generic_branch_cost): Copy cortexa57_branch_cost.
--

Comments

Andrew Pinski March 9, 2017, 10:06 p.m. UTC | #1
On Thu, Mar 9, 2017 at 6:42 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Hi,
>
> Recently we've put a lot of effort into improving ifcvt to use CSEL on AArch64.
> In  https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01639.html James determined
> the best value for AArch64 code generation.  Although this setting is used when
> explicitly targeting Cortex cores, it is not otherwise used.  This means by
> default GCC will not use (F)CSEL in many common cases.  Most code is built
> without -mcpu= and thus doesn't use CSEL like this example from GLIBC:
>
> strtok:
>     stp    x29, x30, [sp, -48]!
>     add    x29, sp, 0
>     stp    x21, x22, [sp, 32]
>     mov    x21, x1
>     stp    x19, x20, [sp, 16]
>     adrp    x22, .LANCHOR0
>     mov    x19, x0
>     cbz    x0, .L12
> .L2:    ldrb    w0, [x19]
>
> .L12:
>     ldr    x19, [x22, #:lo12:.LANCHOR0]
>     b    .L2
>
> With -mcpu=cortex-a57 GCC generates:
>
>     stp    x29, x30, [sp, -48]!
>     cmp    x0, 0
>     add    x29, sp, 0
>     stp    x21, x22, [sp, 32]
>     adrp    x21, .LANCHOR0
>     stp    x19, x20, [sp, 16]
>     mov    x19, x0
>     ldr    x0, [x21, #:lo12:.LANCHOR0]
>     csel    x19, x0, x19, eq
>     ldrb    w0, [x19]
>
> This is generally faster and smaller.  On one benchmark the new setting fixes a
> regression since GCC6 and improves performance by 49%.  So I propose to change
> generic_branch_cost to be the same as cortexa57_branch_cost so that all supported
> cores benefit equally from CSEL.  Are there any objections to this?

I have no objections.  In fact thunderx2t99's branch_cost is 1,3.  I
had not looked into improving thunderx branch cost yet but that might
be because I have local patches that improve phiopt for doing ifcvt
earlier.  Also my phiopt change does not have a cost model either so
using csel more is good for thunderx 1 and ThunderX 2.

Thanks,
Andrew

>
> Wilco
>
>
> ChangeLog:
> 2017-03-09  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         * config/aarch64/aarch64.c (generic_branch_cost): Copy cortexa57_branch_cost.
> --
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 5870b5e5d7e8e48cf925b3a62030346f041a7fd6..ea16074af86087a6200d9895583e05acf43d90e2 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -377,8 +377,8 @@ static const struct cpu_vector_cost xgene1_vector_cost =
>  /* Generic costs for branch instructions.  */
>  static const struct cpu_branch_cost generic_branch_cost =
>  {
> -  2,  /* Predictable.  */
> -  2   /* Unpredictable.  */
> +  1,  /* Predictable.  */
> +  3   /* Unpredictable.  */
>  };
>
>  /* Branch costs for Cortex-A57.  */
James Greenhalgh March 14, 2017, 9:37 a.m. UTC | #2
On Thu, Mar 09, 2017 at 02:06:16PM -0800, Andrew Pinski wrote:
> On Thu, Mar 9, 2017 at 6:42 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> > Hi,
> >
> > Recently we've put a lot of effort into improving ifcvt to use CSEL on
> > AArch64.  In  https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01639.html
> > James determined the best value for AArch64 code generation.

This was before the rewrite to the ifcvt costs which I made earlier in the
GCC 7 release cycle. But I think 1,3 is about right, and I'd be happy
to see us take that direction for "generic".

I'd like to hear comments from the Exynos-M1, Falkor and
xgene-1 subtarget contributors, particularly as these targets use
generic_branch_costs for their subtarget-sepcific tuning. It may be that
your patch needs to preserve the 2,2 setting for such cores even if the
generic target does move to 1,3.

At this stage in the release, this patch will have to wait for GCC 8
regardless of any comments received. I'd suggest that when we do think
about this for GCC 8, we might want to take a wider look at the "generic"
tunings, any opinions from other subtarget contributors, or the other
AArch64 maintainers as to further changes they would advocate for would
be welcome.

> I have no objections.  In fact thunderx2t99's branch_cost is 1,3.  I
> had not looked into improving thunderx branch cost yet but that might
> be because I have local patches that improve phiopt for doing ifcvt
> earlier.  Also my phiopt change does not have a cost model either so
> using csel more is good for thunderx 1 and ThunderX 2.

Thanks for the comments,
James
Jim Wilson March 17, 2017, 3:19 a.m. UTC | #3
On Tue, Mar 14, 2017 at 2:37 AM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> I'd like to hear comments from the Exynos-M1, Falkor and
> xgene-1 subtarget contributors, particularly as these targets use
> generic_branch_costs for their subtarget-sepcific tuning. It may be that
> your patch needs to preserve the 2,2 setting for such cores even if the
> generic target does move to 1,3.

I was at Linaro Connect last week of course.  I took a look at this
issue this week.  I don't see any measurable performance change on
SPEC CPU2006 for falkor, so the change looks OK to me.

In general, I'm not too concerned about changes like this, as I'm
watching the FSF GCC tree, and will make appropriate changes to the
falkor tuning structure as necessary to maintain good performance.

Jim
Richard Earnshaw (lists) March 17, 2017, 10:15 a.m. UTC | #4
On 09/03/17 14:42, Wilco Dijkstra wrote:
> Hi,
> 
> Recently we've put a lot of effort into improving ifcvt to use CSEL on AArch64.
> In  https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01639.html James determined
> the best value for AArch64 code generation.  Although this setting is used when
> explicitly targeting Cortex cores, it is not otherwise used.  This means by
> default GCC will not use (F)CSEL in many common cases.  Most code is built
> without -mcpu= and thus doesn't use CSEL like this example from GLIBC:
> 
> strtok:
>     stp    x29, x30, [sp, -48]!
>     add    x29, sp, 0
>     stp    x21, x22, [sp, 32]
>     mov    x21, x1
>     stp    x19, x20, [sp, 16]
>     adrp    x22, .LANCHOR0
>     mov    x19, x0
>     cbz    x0, .L12
> .L2:    ldrb    w0, [x19]
> 
> .L12:
>     ldr    x19, [x22, #:lo12:.LANCHOR0]
>     b    .L2
> 
> With -mcpu=cortex-a57 GCC generates:
> 
>     stp    x29, x30, [sp, -48]!
>     cmp    x0, 0
>     add    x29, sp, 0
>     stp    x21, x22, [sp, 32]
>     adrp    x21, .LANCHOR0
>     stp    x19, x20, [sp, 16]
>     mov    x19, x0
>     ldr    x0, [x21, #:lo12:.LANCHOR0]
>     csel    x19, x0, x19, eq
>     ldrb    w0, [x19]
> 
> This is generally faster and smaller.  On one benchmark the new setting fixes a 
> regression since GCC6 and improves performance by 49%.  So I propose to change
> generic_branch_cost to be the same as cortexa57_branch_cost so that all supported
> cores benefit equally from CSEL.  Are there any objections to this?
> 
> Wilco
> 
> 
> ChangeLog:
> 2017-03-09  Wilco Dijkstra  <wdijkstr@arm.com>
> 
> 	* config/aarch64/aarch64.c (generic_branch_cost): Copy cortexa57_branch_cost.

This is OK.  We already have a number of cores using these values so I
don't think this is likely to be a risky change even in stage 4.

R.

> --
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 5870b5e5d7e8e48cf925b3a62030346f041a7fd6..ea16074af86087a6200d9895583e05acf43d90e2 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -377,8 +377,8 @@ static const struct cpu_vector_cost xgene1_vector_cost =
>  /* Generic costs for branch instructions.  */
>  static const struct cpu_branch_cost generic_branch_cost =
>  {
> -  2,  /* Predictable.  */
> -  2   /* Unpredictable.  */
> +  1,  /* Predictable.  */
> +  3   /* Unpredictable.  */
>  };
>  
>  /* Branch costs for Cortex-A57.  */
>
diff mbox

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 5870b5e5d7e8e48cf925b3a62030346f041a7fd6..ea16074af86087a6200d9895583e05acf43d90e2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -377,8 +377,8 @@  static const struct cpu_vector_cost xgene1_vector_cost =
 /* Generic costs for branch instructions.  */
 static const struct cpu_branch_cost generic_branch_cost =
 {
-  2,  /* Predictable.  */
-  2   /* Unpredictable.  */
+  1,  /* Predictable.  */
+  3   /* Unpredictable.  */
 };
 
 /* Branch costs for Cortex-A57.  */