diff mbox series

[rs6000] Make load cost more in vectorization cost for P8/P9

Message ID b01729eb-8826-ef57-f436-bb018d9909c9@linux.ibm.com
State New
Headers show
Series [rs6000] Make load cost more in vectorization cost for P8/P9 | expand

Commit Message

Kewen.Lin Nov. 4, 2019, 7:16 a.m. UTC
Hi,

To align with rs6000_insn_cost costing more for load type insns,
this patch is to make load insns cost more in vectorization cost
function.  Considering that the result of load usually is used
somehow later (true-dep) but store won't, we keep the store as
before.

The SPEC2017 performance evaluation on Power8 shows 525.x264_r
+9.56%, 511.povray_r +2.08%, 527.cam4_r 1.16% gains, no 
significant degradation, SPECINT geomean +0.88%, SPECFP geomean
+0.26%.

The SPEC2017 performance evaluation on Power9 shows no significant
improvement or degradation, SPECINT geomean +0.04%, SPECFP geomean
+0.04%.

The SPEC2006 performance evaluation on Power8 shows 454.calculix
+4.41% gain but 416.gamess -1.19% and 453.povray -3.83% degradation.
I looked into the two degradation bmks, the degradation were NOT
due to hotspot changes by vectorization, were all side effects.
SPECINT geomean +0.10%, SPECFP geomean no changed considering
the degradation.

Bootstrapped and regress tested on powerpc64le-linux-gnu.  
Is OK for trunk?


BR,
Kewen

---

gcc/ChangeLog

2019-11-04  Kewen Lin  <linkw@gcc.gnu.org>

	* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Make
	scalar_load, vector_load, unaligned_load and vector_gather_load cost
	a bit more on Power8 and up.

Comments

Segher Boessenkool Nov. 4, 2019, 8:21 p.m. UTC | #1
Hi!

On Mon, Nov 04, 2019 at 03:16:06PM +0800, Kewen.Lin wrote:
> To align with rs6000_insn_cost costing more for load type insns,

(Which itself has history in rs6000_rtx_costs).

> this patch is to make load insns cost more in vectorization cost
> function.  Considering that the result of load usually is used
> somehow later (true-dep) but store won't, we keep the store as
> before.

The latency of load insns is about twice that of "simple" instructions;
2 vs. 1 on older cores, and 4 (or so) vs. 2 on newer cores.

> The SPEC2017 performance evaluation on Power8 shows 525.x264_r
> +9.56%, 511.povray_r +2.08%, 527.cam4_r 1.16% gains, no 
> significant degradation, SPECINT geomean +0.88%, SPECFP geomean
> +0.26%.

Nice :-)

> The SPEC2017 performance evaluation on Power9 shows no significant
> improvement or degradation, SPECINT geomean +0.04%, SPECFP geomean
> +0.04%.
> 
> The SPEC2006 performance evaluation on Power8 shows 454.calculix
> +4.41% gain but 416.gamess -1.19% and 453.povray -3.83% degradation.
> I looked into the two degradation bmks, the degradation were NOT
> due to hotspot changes by vectorization, were all side effects.
> SPECINT geomean +0.10%, SPECFP geomean no changed considering
> the degradation.

Also nice.

> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -4763,15 +4763,22 @@ rs6000_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>    switch (type_of_cost)
>      {
>        case scalar_stmt:
> -      case scalar_load:
>        case scalar_store:
>        case vector_stmt:
> -      case vector_load:
>        case vector_store:
>        case vec_to_scalar:
>        case scalar_to_vec:
>        case cond_branch_not_taken:
>          return 1;
> +      case scalar_load:
> +      case vector_load:
> +	/* Like rs6000_insn_cost, make load insns cost a bit more. FIXME: the

(two spaces after full stop).

> +	   benefits were observed on Power8 and up, we can unify it if similar
> +	   profits are measured on Power6 and Power7.  */
> +	if (TARGET_P8_VECTOR)
> +	  return 2;
> +	else
> +	  return 1;

Hrm, but you showed benchmark improvements for p9 as well?

What happens if you enable this for everything as well?

> -        return 2;
> +	/* Like rs6000_insn_cost, make load insns cost a bit more. FIXME: the
> +	   benefits were observed on Power8 and up, we can unify it if similar
> +	   profits are measured on Power6 and Power7.  */
> +	if (TARGET_P8_VECTOR)
> +	  return 4;
> +	else
> +	  return 2;

And this.


Segher
Kewen.Lin Nov. 5, 2019, 2:14 a.m. UTC | #2
Hi Segher,

Thanks for the comments!

on 2019/11/5 上午4:21, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Nov 04, 2019 at 03:16:06PM +0800, Kewen.Lin wrote:
>> To align with rs6000_insn_cost costing more for load type insns,
> 
> (Which itself has history in rs6000_rtx_costs).
> 
>> this patch is to make load insns cost more in vectorization cost
>> function.  Considering that the result of load usually is used
>> somehow later (true-dep) but store won't, we keep the store as
>> before.
> 
> The latency of load insns is about twice that of "simple" instructions;
> 2 vs. 1 on older cores, and 4 (or so) vs. 2 on newer cores.
> 

Yes, for latest Power9, general load takes 4, vsx load takes 5.

>> The SPEC2017 performance evaluation on Power8 shows 525.x264_r
>> +9.56%, 511.povray_r +2.08%, 527.cam4_r 1.16% gains, no 
>> significant degradation, SPECINT geomean +0.88%, SPECFP geomean
>> +0.26%.
> 
> Nice :-)
> 
>> The SPEC2017 performance evaluation on Power9 shows no significant
>> improvement or degradation, SPECINT geomean +0.04%, SPECFP geomean
>> +0.04%.
>>
>> The SPEC2006 performance evaluation on Power8 shows 454.calculix
>> +4.41% gain but 416.gamess -1.19% and 453.povray -3.83% degradation.
>> I looked into the two degradation bmks, the degradation were NOT
>> due to hotspot changes by vectorization, were all side effects.
>> SPECINT geomean +0.10%, SPECFP geomean no changed considering
>> the degradation.
> 
> Also nice.
> 
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -4763,15 +4763,22 @@ rs6000_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>>    switch (type_of_cost)
>>      {
>>        case scalar_stmt:
>> -      case scalar_load:
>>        case scalar_store:
>>        case vector_stmt:
>> -      case vector_load:
>>        case vector_store:
>>        case vec_to_scalar:
>>        case scalar_to_vec:
>>        case cond_branch_not_taken:
>>          return 1;
>> +      case scalar_load:
>> +      case vector_load:
>> +	/* Like rs6000_insn_cost, make load insns cost a bit more. FIXME: the
> 
> (two spaces after full stop).
> 

Good catch!  Will fix it (and the others).

>> +	   benefits were observed on Power8 and up, we can unify it if similar
>> +	   profits are measured on Power6 and Power7.  */
>> +	if (TARGET_P8_VECTOR)
>> +	  return 2;
>> +	else
>> +	  return 1;
> 
> Hrm, but you showed benchmark improvements for p9 as well?
> 

No significant gains but no degradation as well, so I thought it's fine to align
it together.  Does it make sense?

> What happens if you enable this for everything as well?
> 

My concern was that if we enable it for everything, it's possible to introduce
degradation for some benchmarks on P6 or P7 where we didn't evaluate the
performance impact.  Although it's reasonable from the point view of load latency,
it's possible to get worse result in the actual benchmarks based on my fine grain
cost adjustment experiment before.  

Or do you suggest enabling it everywhere and solve the degradation issue if exposed?
I'm also fine with that.  :)


BR,
Kewen
Segher Boessenkool Nov. 6, 2019, 5:38 p.m. UTC | #3
Hi!

On Tue, Nov 05, 2019 at 10:14:46AM +0800, Kewen.Lin wrote:
> >> +	   benefits were observed on Power8 and up, we can unify it if similar
> >> +	   profits are measured on Power6 and Power7.  */
> >> +	if (TARGET_P8_VECTOR)
> >> +	  return 2;
> >> +	else
> >> +	  return 1;
> > 
> > Hrm, but you showed benchmark improvements for p9 as well?
> > 
> 
> No significant gains but no degradation as well, so I thought it's fine to align
> it together.  Does it make sense?

It's a bit strange at this point to do tunings for p8 that do we do not
do for later cpus.

> > What happens if you enable this for everything as well?
> 
> My concern was that if we enable it for everything, it's possible to introduce
> degradation for some benchmarks on P6 or P7 where we didn't evaluate the
> performance impact.

No one cares about p6.

We reasonably expect it will work just as well on p7 as on p8 and later.
That you haven't tested on p7 yet says something about how important that
platform is now ;-)

> Although it's reasonable from the point view of load latency,
> it's possible to get worse result in the actual benchmarks based on my fine grain
> cost adjustment experiment before.  
> 
> Or do you suggest enabling it everywhere and solve the degradation issue if exposed?
> I'm also fine with that.  :)

Yeah, let's just enable it everywhere.


Segher
diff mbox series

Patch

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5876714..876c7ef 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4763,15 +4763,22 @@  rs6000_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
   switch (type_of_cost)
     {
       case scalar_stmt:
-      case scalar_load:
       case scalar_store:
       case vector_stmt:
-      case vector_load:
       case vector_store:
       case vec_to_scalar:
       case scalar_to_vec:
       case cond_branch_not_taken:
         return 1;
+      case scalar_load:
+      case vector_load:
+	/* Like rs6000_insn_cost, make load insns cost a bit more. FIXME: the
+	   benefits were observed on Power8 and up, we can unify it if similar
+	   profits are measured on Power6 and Power7.  */
+	if (TARGET_P8_VECTOR)
+	  return 2;
+	else
+	  return 1;
 
       case vec_perm:
 	/* Power7 has only one permute unit, make it a bit expensive.  */
@@ -4792,8 +4799,9 @@  rs6000_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 
       case unaligned_load:
       case vector_gather_load:
+	/* Like rs6000_insn_cost, make load insns cost a bit more.  */
 	if (TARGET_EFFICIENT_UNALIGNED_VSX)
-	  return 1;
+	  return 2;
 
         if (TARGET_VSX && TARGET_ALLOW_MOVMISALIGN)
           {
@@ -4827,7 +4835,13 @@  rs6000_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
           /* Misaligned loads are not supported.  */
           gcc_unreachable ();
 
-        return 2;
+	/* Like rs6000_insn_cost, make load insns cost a bit more. FIXME: the
+	   benefits were observed on Power8 and up, we can unify it if similar
+	   profits are measured on Power6 and Power7.  */
+	if (TARGET_P8_VECTOR)
+	  return 4;
+	else
+	  return 2;
 
       case unaligned_store:
       case vector_scatter_store: