
[AArch64] Add vector permute cost

Message ID AM3PR08MB00884CDE10C132CD92DB97AD83EE0@AM3PR08MB0088.eurprd08.prod.outlook.com
State New

Commit Message

Wilco Dijkstra Dec. 15, 2015, 11:35 a.m. UTC
Add support for vector permute cost since various permutes can expand into a complex
sequence of instructions.  This fixes major performance regressions due to recent changes
in the SLP vectorizer (which now vectorizes more aggressively and emits many complex 
permutes).

Set the cost to > 1 for all microarchitectures so that the number of permutes is usually zero
and regressions disappear.  An example of the kind of code that might be emitted for
VEC_PERM_EXPR {0, 3} where registers happen to be in the wrong order:

        adrp    x4, .LC16
        ldr     q5, [x4, #:lo12:.LC16]
        eor     v1.16b, v1.16b, v0.16b
        eor     v0.16b, v1.16b, v0.16b
        eor     v1.16b, v1.16b, v0.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v5.16b

Regress passes. This fixes regressions that were introduced recently, so OK for commit?


ChangeLog:
2015-12-15  Wilco Dijkstra  <wdijkstr@arm.com>

	* gcc/config/aarch64/aarch64.c (generic_vector_cost):
	Set vec_permute_cost.
	(cortexa57_vector_cost): Likewise.
	(exynosm1_vector_cost): Likewise.
	(xgene1_vector_cost): Likewise.
	(aarch64_builtin_vectorization_cost): Use vec_permute_cost.
	* gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
	Add vec_permute_cost entry.

Comments

James Greenhalgh Dec. 16, 2015, 9:32 a.m. UTC | #1
On Tue, Dec 15, 2015 at 11:35:45AM +0000, Wilco Dijkstra wrote:
> 
> Add support for vector permute cost since various permutes can expand into a complex
> sequence of instructions.  This fixes major performance regressions due to recent changes
> in the SLP vectorizer (which now vectorizes more aggressively and emits many complex 
> permutes).
> 
> Set the cost to > 1 for all microarchitectures so that the number of permutes is usually zero
> and regressions disappear.  An example of the kind of code that might be emitted for
> VEC_PERM_EXPR {0, 3} where registers happen to be in the wrong order:
> 
>         adrp    x4, .LC16
>         ldr     q5, [x4, #:lo12:.LC16]
>         eor     v1.16b, v1.16b, v0.16b
>         eor     v0.16b, v1.16b, v0.16b
>         eor     v1.16b, v1.16b, v0.16b
>         tbl     v0.16b, {v0.16b - v1.16b}, v5.16b
> 
> Regress passes. This fixes regressions that were introduced recently, so OK for commit?
> 
> 
> ChangeLog:
> 2015-12-15  Wilco Dijkstra  <wdijkstr@arm.com>
> 
> 	* gcc/config/aarch64/aarch64.c (generic_vector_cost):
> 	Set vec_permute_cost.
> 	(cortexa57_vector_cost): Likewise.
> 	(exynosm1_vector_cost): Likewise.
> 	(xgene1_vector_cost): Likewise.
> 	(aarch64_builtin_vectorization_cost): Use vec_permute_cost.
> 	* gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
> 	Add vec_permute_cost entry.
> 
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 10754c88c0973d8ef3c847195b727f02b193bbd8..2584f16d345b3d015d577dd28c08a73ee3e0b0fb 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -314,6 +314,7 @@ static const struct cpu_vector_cost generic_vector_cost =
>    1, /* scalar_load_cost  */
>    1, /* scalar_store_cost  */
>    1, /* vec_stmt_cost  */
> +  2, /* vec_permute_cost  */
>    1, /* vec_to_scalar_cost  */
>    1, /* scalar_to_vec_cost  */
>    1, /* vec_align_load_cost  */

Is there any reasoning behind making this 2? Do we now miss vectorization
for some of the cheaper permutes? Across the cost models/pipeline
descriptions that have been contributed to GCC I think that this is a
sensible change to the generic costs, but I just want to check there
was some reasoning/experimentation behind the number you picked.

As permutes can have such wildly different costs, this all seems like a good
candidate for some future much more involved hook from the vectorizer to the
back-end specifying the candidate permute operation and requesting a cost
(part of the bigger gimple costs framework?).

Thanks,
James
Richard Biener Dec. 16, 2015, 10:05 a.m. UTC | #2
On Wed, Dec 16, 2015 at 10:32 AM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> On Tue, Dec 15, 2015 at 11:35:45AM +0000, Wilco Dijkstra wrote:
>>
>> Add support for vector permute cost since various permutes can expand into a complex
>> sequence of instructions.  This fixes major performance regressions due to recent changes
>> in the SLP vectorizer (which now vectorizes more aggressively and emits many complex
>> permutes).
>>
>> Set the cost to > 1 for all microarchitectures so that the number of permutes is usually zero
>> and regressions disappear.  An example of the kind of code that might be emitted for
>> VEC_PERM_EXPR {0, 3} where registers happen to be in the wrong order:
>>
>>         adrp    x4, .LC16
>>         ldr     q5, [x4, #:lo12:.LC16]
>>         eor     v1.16b, v1.16b, v0.16b
>>         eor     v0.16b, v1.16b, v0.16b
>>         eor     v1.16b, v1.16b, v0.16b
>>         tbl     v0.16b, {v0.16b - v1.16b}, v5.16b
>>
>> Regress passes. This fixes regressions that were introduced recently, so OK for commit?
>>
>>
>> ChangeLog:
>> 2015-12-15  Wilco Dijkstra  <wdijkstr@arm.com>
>>
>>       * gcc/config/aarch64/aarch64.c (generic_vector_cost):
>>       Set vec_permute_cost.
>>       (cortexa57_vector_cost): Likewise.
>>       (exynosm1_vector_cost): Likewise.
>>       (xgene1_vector_cost): Likewise.
>>       (aarch64_builtin_vectorization_cost): Use vec_permute_cost.
>>       * gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
>>       Add vec_permute_cost entry.
>>
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index 10754c88c0973d8ef3c847195b727f02b193bbd8..2584f16d345b3d015d577dd28c08a73ee3e0b0fb 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -314,6 +314,7 @@ static const struct cpu_vector_cost generic_vector_cost =
>>    1, /* scalar_load_cost  */
>>    1, /* scalar_store_cost  */
>>    1, /* vec_stmt_cost  */
>> +  2, /* vec_permute_cost  */
>>    1, /* vec_to_scalar_cost  */
>>    1, /* scalar_to_vec_cost  */
>>    1, /* vec_align_load_cost  */
>
> Is there any reasoning behind making this 2? Do we now miss vectorization
> for some of the cheaper permutes? Across the cost models/pipeline
> descriptions that have been contributed to GCC I think that this is a
> sensible change to the generic costs, but I just want to check there
> was some reasoning/experimentation behind the number you picked.
>
> As permutes can have such wildly different costs, this all seems like a good
> candidate for some future much more involved hook from the vectorizer to the
> back-end specifying the candidate permute operation and requesting a cost
> (part of the bigger gimple costs framework?).

Yes, the vectorizer side also needs to improve here.  Not sure if it is possible
to represent this kind of complex cost query with a single gimple cost hook.
After all we don't really want to generate the full gimple stmt just to query
its cost ...

To better represent permute cost in the short term we'd need another
vectorizer-specific hook - not something for stage3 unless we face some serious
regressions on real-world code (thus not microbenchmarks only).

Richard.

> Thanks,
> James
>
Wilco Dijkstra Dec. 16, 2015, 11:37 a.m. UTC | #3
Richard Biener wrote:
> On Wed, Dec 16, 2015 at 10:32 AM, James Greenhalgh
> <james.greenhalgh@arm.com> wrote:
> > On Tue, Dec 15, 2015 at 11:35:45AM +0000, Wilco Dijkstra wrote:
> >>
> >> Add support for vector permute cost since various permutes can expand into a complex
> >> sequence of instructions.  This fixes major performance regressions due to recent changes
> >> in the SLP vectorizer (which now vectorizes more aggressively and emits many complex
> >> permutes).
> >>
> >> Set the cost to > 1 for all microarchitectures so that the number of permutes is usually zero
> >> and regressions disappear.  An example of the kind of code that might be emitted for
> >> VEC_PERM_EXPR {0, 3} where registers happen to be in the wrong order:
> >>
> >>         adrp    x4, .LC16
> >>         ldr     q5, [x4, #:lo12:.LC16]
> >>         eor     v1.16b, v1.16b, v0.16b
> >>         eor     v0.16b, v1.16b, v0.16b
> >>         eor     v1.16b, v1.16b, v0.16b
> >>         tbl     v0.16b, {v0.16b - v1.16b}, v5.16b
> >>
> >> Regress passes. This fixes regressions that were introduced recently, so OK for commit?
> >>
> >>
> >> ChangeLog:
> >> 2015-12-15  Wilco Dijkstra  <wdijkstr@arm.com>
> >>
> >>       * gcc/config/aarch64/aarch64.c (generic_vector_cost):
> >>       Set vec_permute_cost.
> >>       (cortexa57_vector_cost): Likewise.
> >>       (exynosm1_vector_cost): Likewise.
> >>       (xgene1_vector_cost): Likewise.
> >>       (aarch64_builtin_vectorization_cost): Use vec_permute_cost.
> >>       * gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
> >>       Add vec_permute_cost entry.
> >>
> >>
> >> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >> index 10754c88c0973d8ef3c847195b727f02b193bbd8..2584f16d345b3d015d577dd28c08a73ee3e0b0fb 100644
> >> --- a/gcc/config/aarch64/aarch64.c
> >> +++ b/gcc/config/aarch64/aarch64.c
> >> @@ -314,6 +314,7 @@ static const struct cpu_vector_cost generic_vector_cost =
> >>    1, /* scalar_load_cost  */
> >>    1, /* scalar_store_cost  */
> >>    1, /* vec_stmt_cost  */
> >> +  2, /* vec_permute_cost  */
> >>    1, /* vec_to_scalar_cost  */
> >>    1, /* scalar_to_vec_cost  */
> >>    1, /* vec_align_load_cost  */
> >
> > Is there any reasoning behind making this 2? Do we now miss vectorization
> > for some of the cheaper permutes? Across the cost models/pipeline
> > descriptions that have been contributed to GCC I think that this is a
> > sensible change to the generic costs, but I just want to check there
> > was some reasoning/experimentation behind the number you picked.

Yes, it blocks vectorization of loops with a high percentage of permutes, even if
the permutes themselves are trivial. But that is exactly the goal - when we
vectorize we want to minimize data shuffling that just adds latency and does not
contribute to actual work.

The value 2 was the minimal value that avoided the regressions I noticed due to the
improved SLP vectorization. I tried other possibilities like increasing the statement
cost, but that affects other cases, so this is the simplest change.

Note that I'm not too convinced about any of the existing vector costs - they seem
like fairly arbitrary numbers. The A57 vector costs, for example, block many simple
vectorization cases that are beneficial. The goal of this patch is simply to fix the
regressions rather than to tune the vector costs (which is only possible once we
have a better cost model).

> > As permutes can have such wildly different costs, this all seems like a good
> > candidate for some future much more involved hook from the vectorizer to the
> > back-end specifying the candidate permute operation and requesting a cost
> > (part of the bigger gimple costs framework?).
>
> Yes, the vectorizer side also needs to improve here.  Not sure if it is possible
> to represent this kind of complex cost query with a single gimple cost hook.
> After all we don't really want to generate the full gimple stmt just to query
> its cost ...
>
> To better represent permute cost in the short term we'd need another
> vectorizer-specific hook - not something for stage3 unless we face some serious
> regressions on real-world code (thus not microbenchmarks only).

Yes, we need a better cost model for the vectorizer. The sort of things that are
important are the vector mode (so we can differentiate between 2x, 4x, 8x, 16x
vectorization etc.), the specific permute mask for permutes, and the size and type
of load/store (as some of those do permutes too).

Wilco
Wilco Dijkstra Jan. 26, 2016, 2:38 p.m. UTC | #4
ping
James Greenhalgh Jan. 26, 2016, 2:50 p.m. UTC | #5
On Tue, Dec 15, 2015 at 11:35:45AM +0000, Wilco Dijkstra wrote:
> 
> Add support for vector permute cost since various permutes can expand into a complex
> sequence of instructions.  This fixes major performance regressions due to recent changes
> in the SLP vectorizer (which now vectorizes more aggressively and emits many complex 
> permutes).
> 
> Set the cost to > 1 for all microarchitectures so that the number of permutes is usually zero
> and regressions disappear.  An example of the kind of code that might be emitted for
> VEC_PERM_EXPR {0, 3} where registers happen to be in the wrong order:
> 
>         adrp    x4, .LC16
>         ldr     q5, [x4, #:lo12:.LC16]
>         eor     v1.16b, v1.16b, v0.16b
>         eor     v0.16b, v1.16b, v0.16b
>         eor     v1.16b, v1.16b, v0.16b
>         tbl     v0.16b, {v0.16b - v1.16b}, v5.16b
> 
> Regress passes. This fixes regressions that were introduced recently, so OK for commit?

OK.

Thanks,
James

> ChangeLog:
> 2015-12-15  Wilco Dijkstra  <wdijkstr@arm.com>
> 
> 	* gcc/config/aarch64/aarch64.c (generic_vector_cost):
> 	Set vec_permute_cost.
> 	(cortexa57_vector_cost): Likewise.
> 	(exynosm1_vector_cost): Likewise.
> 	(xgene1_vector_cost): Likewise.
> 	(aarch64_builtin_vectorization_cost): Use vec_permute_cost.
> 	* gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
> 	Add vec_permute_cost entry.

Patch

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 549a89d1f691b32efbc74359f045b5df74765f0e..1bc812a4d01e8b9895c11cefde3148429397e95a 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -156,9 +156,10 @@  struct cpu_vector_cost
   const int scalar_load_cost;		 /* Cost of scalar load.  */
   const int scalar_store_cost;		 /* Cost of scalar store.  */
   const int vec_stmt_cost;		 /* Cost of any vector operation,
-					    excluding load, store,
+					    excluding load, store, permute,
 					    vector-to-scalar and
 					    scalar-to-vector operation.  */
+  const int vec_permute_cost;		 /* Cost of permute operation.  */
   const int vec_to_scalar_cost;		 /* Cost of vec-to-scalar operation.  */
   const int scalar_to_vec_cost;		 /* Cost of scalar-to-vector
 					    operation.  */
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 10754c88c0973d8ef3c847195b727f02b193bbd8..2584f16d345b3d015d577dd28c08a73ee3e0b0fb 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -314,6 +314,7 @@  static const struct cpu_vector_cost generic_vector_cost =
   1, /* scalar_load_cost  */
   1, /* scalar_store_cost  */
   1, /* vec_stmt_cost  */
+  2, /* vec_permute_cost  */
   1, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* vec_align_load_cost  */
@@ -331,6 +332,7 @@  static const struct cpu_vector_cost cortexa57_vector_cost =
   4, /* scalar_load_cost  */
   1, /* scalar_store_cost  */
   3, /* vec_stmt_cost  */
+  3, /* vec_permute_cost  */
   8, /* vec_to_scalar_cost  */
   8, /* scalar_to_vec_cost  */
   5, /* vec_align_load_cost  */
@@ -347,6 +349,7 @@  static const struct cpu_vector_cost exynosm1_vector_cost =
   5, /* scalar_load_cost  */
   1, /* scalar_store_cost  */
   3, /* vec_stmt_cost  */
+  3, /* vec_permute_cost  */
   3, /* vec_to_scalar_cost  */
   3, /* scalar_to_vec_cost  */
   5, /* vec_align_load_cost  */
@@ -364,6 +367,7 @@  static const struct cpu_vector_cost xgene1_vector_cost =
   5, /* scalar_load_cost  */
   1, /* scalar_store_cost  */
   2, /* vec_stmt_cost  */
+  2, /* vec_permute_cost  */
   4, /* vec_to_scalar_cost  */
   4, /* scalar_to_vec_cost  */
   10, /* vec_align_load_cost  */
@@ -7555,6 +7559,8 @@  aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 	return aarch64_tune_params.vec_costs->cond_not_taken_branch_cost;
 
       case vec_perm:
+	return aarch64_tune_params.vec_costs->vec_permute_cost;
+
       case vec_promote_demote:
 	return aarch64_tune_params.vec_costs->vec_stmt_cost;