
Fix PR82255 (vectorizer cost model overcounts some vector load costs)

Message ID 7570cb71-cb74-d97f-3b7a-b161631e36c5@linux.vnet.ibm.com
State New

Commit Message

Bill Schmidt Sept. 19, 2017, 5:38 p.m. UTC
Hi,

https://gcc.gnu.org/PR82255 identifies a problem in the vector cost model
where a vectorized load is treated as having the cost of a strided load
in a case where we will not actually generate a strided load.  This is
simply a mismatch between the conditions tested in the cost model and
those tested in the code that generates vectorized instructions.  This
patch fixes the problem by recognizing when only a single non-strided
load will be generated and reporting the cost accordingly.
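
For reference, the loop in question has roughly this shape (a reduced
sketch of the test case added below; the group of 16 contiguous byte
loads exactly fills one 16-byte vector, so a single contiguous vector
load is generated and no vec_construct is needed):

#include <stdlib.h>

/* Reduced from the test case: sixteen contiguous unsigned char loads
   per call, forming a group that exactly fills a 16-byte vector.  */
static int
sum_abs_diff (unsigned char *w, unsigned char *x)
{
  int tot = 0;
  for (int b = 0; b < 16; b++)
    tot += abs (w[b] - x[b]);
  return tot;
}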

I believe this patch is sufficient to catch all such cases, but I admit
that the code in vectorizable_load is complex enough that I could have
missed a trick.

I've added a test in the PowerPC cost model subdirectory.  Even though
this isn't a target-specific issue, the test does rely on a 16-byte 
vector size, so this seems safest.

Bootstrapped and tested on powerpc64le-linux-gnu with no regressions.
Is this ok for trunk?

Thanks!
Bill


[gcc]

2017-09-19  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	PR tree-optimization/82255
	* tree-vect-stmts.c (vect_model_load_cost): Don't count
	vec_construct cost when a true strided load isn't present.

[gcc/testsuite]

2017-09-19  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	PR tree-optimization/82255
	* gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c: New file.

Comments

Bill Schmidt Sept. 19, 2017, 7:28 p.m. UTC | #1
On 9/19/17 12:38 PM, Bill Schmidt wrote:
> Hi,
>
> https://gcc.gnu.org/PR82255 identifies a problem in the vector cost model
> where a vectorized load is treated as having the cost of a strided load
> in a case where we will not actually generate a strided load.  This is
> simply a mismatch between the conditions tested in the cost model and
> those tested in the code that generates vectorized instructions.  This
> patch fixes the problem by recognizing when only a single non-strided
> load will be generated and reporting the cost accordingly.
>
> I believe this patch is sufficient to catch all such cases, but I admit
> that the code in vectorizable_load is complex enough that I could have
> missed a trick.
>
> I've added a test in the PowerPC cost model subdirectory.  Even though
> this isn't a target-specific issue, the test does rely on a 16-byte 
> vector size, so this seems safest.
>
> Bootstrapped and tested on powerpc64le-linux-gnu with no regressions.
> Is this ok for trunk?

After posting, I realized that I had wrongly recalculated stmt_info
in the patch.  Here's a new version (also passing regstrap) that removes
that flaw.

[gcc]

2017-09-19  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	PR tree-optimization/82255
	* tree-vect-stmts.c (vect_model_load_cost): Don't count
	vec_construct cost when a true strided load isn't present.

[gcc/testsuite]

2017-09-19  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	PR tree-optimization/82255
	* gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c: New file.


Index: gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c	(nonexistent)
+++ gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c	(working copy)
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+/* PR82255: Ensure we don't require a vec_construct cost when we aren't
+   going to generate a strided load.  */
+
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+/* { dg-final { scan-tree-dump-times "vec_construct required" 0 "vect" } } */
+
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 252760)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
 			prologue_cost_vec, body_cost_vec, true);
   if (memory_access_type == VMAT_ELEMENTWISE
       || memory_access_type == VMAT_STRIDED_SLP)
-    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
-				     stmt_info, 0, vect_body);
+    {
+      int group_size = GROUP_SIZE (stmt_info);
+      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
+      if (group_size < nunits)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_load_cost: vec_construct required");
+	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
+					   vec_construct, stmt_info, 0,
+					   vect_body);
+	}
+    }
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
Richard Sandiford Sept. 19, 2017, 8:58 p.m. UTC | #2
Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c	(revision 252760)
> +++ gcc/tree-vect-stmts.c	(working copy)
> @@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>  			prologue_cost_vec, body_cost_vec, true);
>    if (memory_access_type == VMAT_ELEMENTWISE
>        || memory_access_type == VMAT_STRIDED_SLP)
> -    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
> -				     stmt_info, 0, vect_body);
> +    {
> +      int group_size = GROUP_SIZE (stmt_info);
> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
> +      if (group_size < nunits)
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "vect_model_load_cost: vec_construct required");
> +	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
> +					   vec_construct, stmt_info, 0,
> +					   vect_body);
> +	}
> +    }
>  
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_NOTE, vect_location,

This feels like we've probably got the wrong memory_access_type.
If it's a just a contiguous load then it should be VMAT_CONTIGUOUS
instead.

Thanks,
Richard
Bill Schmidt Sept. 19, 2017, 9:21 p.m. UTC | #3
On Sep 19, 2017, at 3:58 PM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
> 
> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
>> Index: gcc/tree-vect-stmts.c
>> ===================================================================
>> --- gcc/tree-vect-stmts.c	(revision 252760)
>> +++ gcc/tree-vect-stmts.c	(working copy)
>> @@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>> 			prologue_cost_vec, body_cost_vec, true);
>>   if (memory_access_type == VMAT_ELEMENTWISE
>>       || memory_access_type == VMAT_STRIDED_SLP)
>> -    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
>> -				     stmt_info, 0, vect_body);
>> +    {
>> +      int group_size = GROUP_SIZE (stmt_info);
>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>> +      if (group_size < nunits)
>> +	{
>> +	  if (dump_enabled_p ())
>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>> +			     "vect_model_load_cost: vec_construct required");
>> +	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
>> +					   vec_construct, stmt_info, 0,
>> +					   vect_body);
>> +	}
>> +    }
>> 
>>   if (dump_enabled_p ())
>>     dump_printf_loc (MSG_NOTE, vect_location,
> 
> This feels like we've probably got the wrong memory_access_type.
> If it's a just a contiguous load then it should be VMAT_CONTIGUOUS
> instead.

I tend to agree -- that will take more surgery to the code that detects strided loads in
both the cost model analysis and in vectorizable_load.

Bill

> 
> Thanks,
> Richard
> 
>
Bill Schmidt Sept. 19, 2017, 10:44 p.m. UTC | #4
On Sep 19, 2017, at 4:21 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
> 
> On Sep 19, 2017, at 3:58 PM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>> 
>> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
>>> Index: gcc/tree-vect-stmts.c
>>> ===================================================================
>>> --- gcc/tree-vect-stmts.c	(revision 252760)
>>> +++ gcc/tree-vect-stmts.c	(working copy)
>>> @@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>>> 			prologue_cost_vec, body_cost_vec, true);
>>>  if (memory_access_type == VMAT_ELEMENTWISE
>>>      || memory_access_type == VMAT_STRIDED_SLP)
>>> -    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
>>> -				     stmt_info, 0, vect_body);
>>> +    {
>>> +      int group_size = GROUP_SIZE (stmt_info);
>>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>> +      if (group_size < nunits)
>>> +	{
>>> +	  if (dump_enabled_p ())
>>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>>> +			     "vect_model_load_cost: vec_construct required");
>>> +	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
>>> +					   vec_construct, stmt_info, 0,
>>> +					   vect_body);
>>> +	}
>>> +    }
>>> 
>>>  if (dump_enabled_p ())
>>>    dump_printf_loc (MSG_NOTE, vect_location,
>> 
>> This feels like we've probably got the wrong memory_access_type.
>> If it's a just a contiguous load then it should be VMAT_CONTIGUOUS
>> instead.
> 
> I tend to agree -- that will take more surgery to the code that detects strided loads in
> both the cost model analysis and in vectorizable_load.
> 
It's just a little less clear how to clean this up in vect_analyze_data_refs.  The
too-simplistic test there now is:

      else if (is_a <loop_vec_info> (vinfo)
               && TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
        {
          if (nested_in_vect_loop_p (loop, stmt))
            {
              if (dump_enabled_p ())
		{
                  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                   "not vectorized: not suitable for strided "
                                   "load ");
                  dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
                }
              return false;
            }
          STMT_VINFO_STRIDED_P (stmt_info) = true;
        }

Furthermore, this is being set while processing each data reference, not after
all information has been gathered and analyzed.  So this is too early.

Next place to fix this would be in vect_analyze_slp_cost_1, where

      vect_memory_access_type memory_access_type
        = (STMT_VINFO_STRIDED_P (stmt_info)
           ? VMAT_STRIDED_SLP
           : VMAT_CONTIGUOUS);

is too simplistic.  I guess we could do something here, but it would require unpacking
all the grouped accesses and duplicating all the logic from scratch to determine 
whether it's a single load or not.

So it's going to be a little painful to do it "right."  I submitted this patch because I felt
it would be the simplest solution.

Thanks,
Bill


> Bill
> 
>> 
>> Thanks,
>> Richard
Richard Sandiford Sept. 20, 2017, 8:49 a.m. UTC | #5
Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
> On Sep 19, 2017, at 4:21 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>> On Sep 19, 2017, at 3:58 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
>>>> Index: gcc/tree-vect-stmts.c
>>>> ===================================================================
>>>> --- gcc/tree-vect-stmts.c	(revision 252760)
>>>> +++ gcc/tree-vect-stmts.c	(working copy)
>>>> @@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>>>> 			prologue_cost_vec, body_cost_vec, true);
>>>>  if (memory_access_type == VMAT_ELEMENTWISE
>>>>      || memory_access_type == VMAT_STRIDED_SLP)
>>>> -    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
>>>> -				     stmt_info, 0, vect_body);
>>>> +    {
>>>> +      int group_size = GROUP_SIZE (stmt_info);
>>>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>> +      if (group_size < nunits)
>>>> +	{
>>>> +	  if (dump_enabled_p ())
>>>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>>>> +			     "vect_model_load_cost: vec_construct required");
>>>> +	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
>>>> +					   vec_construct, stmt_info, 0,
>>>> +					   vect_body);
>>>> +	}
>>>> +    }
>>>> 
>>>>  if (dump_enabled_p ())
>>>>    dump_printf_loc (MSG_NOTE, vect_location,
>>> 
>>> This feels like we've probably got the wrong memory_access_type.
>>> If it's a just a contiguous load then it should be VMAT_CONTIGUOUS
>>> instead.
>> 
>> I tend to agree -- that will take more surgery to the code that
>> detects strided loads in
>> both the cost model analysis and in vectorizable_load.
>> 
> It's just a little less clear how to clean this up in
> vect_analyze_data_refs.  The too-simplistic test there now is:
>
>       else if (is_a <loop_vec_info> (vinfo)
>                && TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
>         {
>           if (nested_in_vect_loop_p (loop, stmt))
>             {
>               if (dump_enabled_p ())
> 		{
>                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                                    "not vectorized: not suitable for strided "
>                                    "load ");
>                   dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
>                 }
>               return false;
>             }
>           STMT_VINFO_STRIDED_P (stmt_info) = true;
>         }
>
> Furthermore, this is being set while processing each data reference, not after
> all information has been gathered and analyzed.  So this is too early.

Yeah, agree we can't tell at this stage.

> Next place to fix this would be in vect_analyze_slp_cost_1, where
>
>       vect_memory_access_type memory_access_type
>         = (STMT_VINFO_STRIDED_P (stmt_info)
>            ? VMAT_STRIDED_SLP
>            : VMAT_CONTIGUOUS);
>
> is too simplistic.  I guess we could do something here, but it would
> require unpacking all the grouped accesses and duplicating all the
> logic from scratch to determine whether it's a single load or not.

I think we can use:

    SLP_INSTANCE_UNROLLING_FACTOR (instance) * ncopies_for_cost

to calculate the number of times that the vector SLP statements repeat
the original scalar group.  We should then only use STRIDED_SLP if
that's greater than 1.

This affects stores as well as loads, so fixing it here will work
for both.

But I think this shows up another problem.  In the vectorised loop,
we have 1 copy of the load and 4 copies of the ABS (after unpacking).
But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
even for the loads.  So I think it's a double whammy: we're quadrupling
the real cost via the excessive ncopies_for_cost, and we're using the
wrong access type too.

The patch below should fix the second problem but isn't much use without
the first.  I think we need to adjust ncopies_for_cost in the recursive
calls as well.

Of course, fixing an overestimate here is likely to expose an
underestimate elsewhere. :-)

Interesting test case though.  It seems counterproductive to unroll
a loop like that before vectorisation.

Thanks,
Richard


Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	(revision 252986)
+++ gcc/tree-vect-slp.c	(working copy)
@@ -1687,8 +1687,10 @@
   stmt_info = vinfo_for_stmt (stmt);
   if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
     {
+      unsigned int group_ncopies = (SLP_INSTANCE_UNROLLING_FACTOR (instance)
+				   * ncopies_for_cost);
       vect_memory_access_type memory_access_type
-	= (STMT_VINFO_STRIDED_P (stmt_info)
+	= (STMT_VINFO_STRIDED_P (stmt_info) && group_ncopies > 1
 	   ? VMAT_STRIDED_SLP
 	   : VMAT_CONTIGUOUS);
       if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
Bill Schmidt Sept. 20, 2017, 2:45 p.m. UTC | #6
On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
> 
> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
>> On Sep 19, 2017, at 4:21 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>>> On Sep 19, 2017, at 3:58 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
>>>>> Index: gcc/tree-vect-stmts.c
>>>>> ===================================================================
>>>>> --- gcc/tree-vect-stmts.c	(revision 252760)
>>>>> +++ gcc/tree-vect-stmts.c	(working copy)
>>>>> @@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>>>>> 			prologue_cost_vec, body_cost_vec, true);
>>>>> if (memory_access_type == VMAT_ELEMENTWISE
>>>>>     || memory_access_type == VMAT_STRIDED_SLP)
>>>>> -    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
>>>>> -				     stmt_info, 0, vect_body);
>>>>> +    {
>>>>> +      int group_size = GROUP_SIZE (stmt_info);
>>>>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>>> +      if (group_size < nunits)
>>>>> +	{
>>>>> +	  if (dump_enabled_p ())
>>>>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>>>>> +			     "vect_model_load_cost: vec_construct required");
>>>>> +	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
>>>>> +					   vec_construct, stmt_info, 0,
>>>>> +					   vect_body);
>>>>> +	}
>>>>> +    }
>>>>> 
>>>>> if (dump_enabled_p ())
>>>>>   dump_printf_loc (MSG_NOTE, vect_location,
>>>> 
>>>> This feels like we've probably got the wrong memory_access_type.
>>>> If it's a just a contiguous load then it should be VMAT_CONTIGUOUS
>>>> instead.
>>> 
>>> I tend to agree -- that will take more surgery to the code that
>>> detects strided loads in
>>> both the cost model analysis and in vectorizable_load.
>>> 
>> It's just a little less clear how to clean this up in
>> vect_analyze_data_refs.  The too-simplistic test there now is:
>> 
>>      else if (is_a <loop_vec_info> (vinfo)
>>               && TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
>>        {
>>          if (nested_in_vect_loop_p (loop, stmt))
>>            {
>>              if (dump_enabled_p ())
>> 		{
>>                  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>                                   "not vectorized: not suitable for strided "
>>                                   "load ");
>>                  dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
>>                }
>>              return false;
>>            }
>>          STMT_VINFO_STRIDED_P (stmt_info) = true;
>>        }
>> 
>> Furthermore, this is being set while processing each data reference, not after
>> all information has been gathered and analyzed.  So this is too early.
> 
> Yeah, agree we can't tell at this stage.
> 
>> Next place to fix this would be in vect_analyze_slp_cost_1, where
>> 
>>      vect_memory_access_type memory_access_type
>>        = (STMT_VINFO_STRIDED_P (stmt_info)
>>           ? VMAT_STRIDED_SLP
>>           : VMAT_CONTIGUOUS);
>> 
>> is too simplistic.  I guess we could do something here, but it would
>> require unpacking all the grouped accesses and duplicating all the
>> logic from scratch to determine whether it's a single load or not.
> 
> I think we can use:
> 
>    SLP_INSTANCE_UNROLLING_FACTOR (instance) * ncopies_for_cost
> 
> to calculate the number of times that the vector SLP statements repeat
> the original scalar group.  We should then only use STRIDED_SLP if
> that's greater than 1.
> 
> This affects stores as well as loads, so fixing it here will work
> for both.

Thanks, Richard, that makes sense.  Let me play around with it a bit.
> 
> But I think this shows up another problem.  In the vectorised loop,
> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
> even for the loads.  So I think it's a double whammy: we've quadrupling
> the real cost via the excessive ncopies_for_cost, and we're using the
> wrong access type too.

Absolutely right, I failed to pay attention to this!  Good catch.
> 
> The patch below should fix the second problem but isn't much use without
> the first.  I think we need to adjust ncopies_for_cost in the recursive
> calls as well.

Just to be clear, I think you mean it fixes the original problem (wrong
access type).  I'll look into how to fix the excessive ncopies_for_cost.
> 
> Of course, fixing an overestimate here is likely to expose an
> underestimate elsewhere. :-)

Right...it was ever thus... 

Very much appreciate your assistance!

Bill
> 
> Interesting test case though.  It seems counterproductive to unroll
> a loop like that before vectorisation.
> 
> Thanks,
> Richard
> 
> 
> Index: gcc/tree-vect-slp.c
> ===================================================================
> --- gcc/tree-vect-slp.c	(revision 252986)
> +++ gcc/tree-vect-slp.c	(working copy)
> @@ -1687,8 +1687,10 @@
>   stmt_info = vinfo_for_stmt (stmt);
>   if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>     {
> +      unsigned int group_ncopies = (SLP_INSTANCE_UNROLLING_FACTOR (instance)
> +				   * ncopies_for_cost);
>       vect_memory_access_type memory_access_type
> -	= (STMT_VINFO_STRIDED_P (stmt_info)
> +	= (STMT_VINFO_STRIDED_P (stmt_info) && group_ncopies > 1
> 	   ? VMAT_STRIDED_SLP
> 	   : VMAT_CONTIGUOUS);
>       if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
Bill Schmidt Sept. 20, 2017, 7:14 p.m. UTC | #7
On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
> 
> But I think this shows up another problem.  In the vectorised loop,
> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
> even for the loads.  So I think it's a double whammy: we've quadrupling
> the real cost via the excessive ncopies_for_cost, and we're using the
> wrong access type too.
> 
> The patch below should fix the second problem but isn't much use without
> the first.  I think we need to adjust ncopies_for_cost in the recursive
> calls as well.

Unfortunately we have to deal with the fact that ncopies_for_cost is set
once for the whole SLP instance, which is not ideal since we know the
number of loads and stores doesn't follow the same rules.

What about the following?  I *think* that group_size / nunits is sufficient for
VMAT_STRIDED_SLP and VMAT_CONTIGUOUS for which permuted loads are
not possible (already handled differently), but I could well be missing something.

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	(revision 252760)
+++ gcc/tree-vect-slp.c	(working copy)
@@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
   stmt_info = vinfo_for_stmt (stmt);
   if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
     {
+      int group_size = GROUP_SIZE (stmt_info);
+      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
       vect_memory_access_type memory_access_type
-	= (STMT_VINFO_STRIDED_P (stmt_info)
+	= (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
 	   ? VMAT_STRIDED_SLP
 	   : VMAT_CONTIGUOUS);
       if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
-	vect_model_store_cost (stmt_info, ncopies_for_cost,
+	vect_model_store_cost (stmt_info, group_size / nunits,
 			       memory_access_type, vect_uninitialized_def,
 			       node, prologue_cost_vec, body_cost_vec);
       else
@@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
 			      + nunits - 1) / nunits);
 	      ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
 	    }
+	  else
+	    ncopies_for_cost = group_size / nunits;
 	  /* Record the cost for the vector loads.  */
 	  vect_model_load_cost (stmt_info, ncopies_for_cost,
 				memory_access_type, node, prologue_cost_vec,

Bill
Bill Schmidt Sept. 20, 2017, 7:17 p.m. UTC | #8
On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
> 
> On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>> 
>> But I think this shows up another problem.  In the vectorised loop,
>> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
>> even for the loads.  So I think it's a double whammy: we've quadrupling
>> the real cost via the excessive ncopies_for_cost, and we're using the
>> wrong access type too.
>> 
>> The patch below should fix the second problem but isn't much use without
>> the first.  I think we need to adjust ncopies_for_cost in the recursive
>> calls as well.
> 
> Unfortunately we have to deal with the fact that ncopies_for_cost is set
> once for the whole SLP instance, which is not ideal since we know the
> number of loads and stores doesn't follow the same rules.
> 
> What about the following?  I *think* that group_size / nunits is sufficient for
> VMAT_STRIDED_SLP and VMAT_CONTIGUOUS for which permuted loads are
> not possible (already handled differently), but I could well be missing something.

Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...
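
For what it's worth, here is a small standalone sketch (plain C, not GCC
internals) of the quantities involved, using the PR's sizes of 16 unsigned
chars per group and 16 elements per vector:

#include <stdio.h>

/* Standalone sketch only (not GCC code): copy count from the rounding
   above, plus the vec_construct condition from the original patch.  */
static void
show_group (int group_size, int nunits)
{
  int ncopies = (group_size + nunits - 1) / nunits;  /* vector loads needed */
  int needs_construct = group_size < nunits;         /* condition in the patch */
  printf ("group_size=%d nunits=%d -> ncopies=%d vec_construct=%s\n",
          group_size, nunits, ncopies, needs_construct ? "yes" : "no");
}

int
main (void)
{
  show_group (16, 16);  /* PR82255 case: one full vector load, no construct */
  show_group (4, 16);   /* small group: partial loads assembled into a vector */
  show_group (32, 16);  /* larger group: two full vector loads */
  return 0;
}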

Bill
> 
> Index: gcc/tree-vect-slp.c
> ===================================================================
> --- gcc/tree-vect-slp.c	(revision 252760)
> +++ gcc/tree-vect-slp.c	(working copy)
> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>   stmt_info = vinfo_for_stmt (stmt);
>   if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>     {
> +      int group_size = GROUP_SIZE (stmt_info);
> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>       vect_memory_access_type memory_access_type
> -	= (STMT_VINFO_STRIDED_P (stmt_info)
> +	= (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
> 	   ? VMAT_STRIDED_SLP
> 	   : VMAT_CONTIGUOUS);
>       if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
> -	vect_model_store_cost (stmt_info, ncopies_for_cost,
> +	vect_model_store_cost (stmt_info, group_size / nunits,
> 			       memory_access_type, vect_uninitialized_def,
> 			       node, prologue_cost_vec, body_cost_vec);
>       else
> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
> 			      + nunits - 1) / nunits);
> 	      ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
> 	    }
> +	  else
> +	    ncopies_for_cost = group_size / nunits;
> 	  /* Record the cost for the vector loads.  */
> 	  vect_model_load_cost (stmt_info, ncopies_for_cost,
> 				memory_access_type, node, prologue_cost_vec,
> 
> Bill
>
Richard Biener Sept. 21, 2017, 7:37 a.m. UTC | #9
On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>>
>> On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>>>
>>> But I think this shows up another problem.  In the vectorised loop,
>>> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
>>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
>>> even for the loads.  So I think it's a double whammy: we've quadrupling
>>> the real cost via the excessive ncopies_for_cost, and we're using the
>>> wrong access type too.
>>>
>>> The patch below should fix the second problem but isn't much use without
>>> the first.  I think we need to adjust ncopies_for_cost in the recursive
>>> calls as well.
>>
>> Unfortunately we have to deal with the fact that ncopies_for_cost is set
>> once for the whole SLP instance, which is not ideal since we know the
>> number of loads and stores doesn't follow the same rules.
>>
>> What about the following?  I *think* that group_size / nunits is sufficient for
>> VMAT_STRIDED_SLP and VMAT_CONTIGUOUS for which permuted loads are
>> not possible (already handled differently), but I could well be missing something.
>
> Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...

I don't think any such simple approach will result in appropriate costing.
The current costing is conservative and you definitely don't want to be
overly optimistic here ;)

You'd have to fully emulate the vectorizable_load code.  Which means we should
probably refactor the code-gen part

  if (memory_access_type == VMAT_ELEMENTWISE
      || memory_access_type == VMAT_STRIDED_SLP)
    {
...

where it computes ltype, nloads, etc., do that computation up-front in the
analysis phase, storing the result somewhere in stmt_vinfo (err, or slp_node
or both -- think of hybrid SLP and/or the stmt used in different SLP
contexts) and use the info during transform.  And hope to have enough info
to statically compute the number of loads and their size/alignment at
costing time.

Or go the full way of restructuring costing to have a prologue/body_cost_vec
in stmt_info and slp_node so that we can compute and store the cost
from vectorizable_* rather than duplicating the hard parts in SLP costing code.
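
A minimal standalone sketch of the first option (hypothetical names, not
actual GCC structures): compute the load decomposition once in an analysis
step, cache it, and let costing read the cached answer instead of
re-deriving it:

#include <stdio.h>

/* Hypothetical, simplified model only; none of these names exist in GCC.
   The "analysis" step decides how a grouped load would be decomposed and
   caches the answer so costing and code generation agree by construction.  */
struct load_shape
{
  int nloads;           /* loads emitted per group */
  int load_elts;        /* scalar elements per load */
  int needs_construct;  /* whether a vec_construct is required */
};

static struct load_shape
analyze_load_shape (int group_size, int nunits)
{
  struct load_shape s;
  if (group_size >= nunits)
    {
      s.nloads = (group_size + nunits - 1) / nunits;
      s.load_elts = nunits;
      s.needs_construct = 0;
    }
  else
    {
      /* Element-by-element in this toy model.  */
      s.nloads = group_size;
      s.load_elts = 1;
      s.needs_construct = 1;
    }
  return s;
}

static int
cost_load (struct load_shape s)
{
  /* Toy costing: one unit per load plus one per vec_construct.  */
  return s.nloads + (s.needs_construct ? 1 : 0);
}

int
main (void)
{
  struct load_shape s = analyze_load_shape (16, 16);  /* PR82255 case */
  printf ("nloads=%d cost=%d\n", s.nloads, cost_load (s));
  return 0;
}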

Richard.

> Bill
>>
>> Index: gcc/tree-vect-slp.c
>> ===================================================================
>> --- gcc/tree-vect-slp.c       (revision 252760)
>> +++ gcc/tree-vect-slp.c       (working copy)
>> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>   stmt_info = vinfo_for_stmt (stmt);
>>   if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>>     {
>> +      int group_size = GROUP_SIZE (stmt_info);
>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>       vect_memory_access_type memory_access_type
>> -     = (STMT_VINFO_STRIDED_P (stmt_info)
>> +     = (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
>>          ? VMAT_STRIDED_SLP
>>          : VMAT_CONTIGUOUS);
>>       if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
>> -     vect_model_store_cost (stmt_info, ncopies_for_cost,
>> +     vect_model_store_cost (stmt_info, group_size / nunits,
>>                              memory_access_type, vect_uninitialized_def,
>>                              node, prologue_cost_vec, body_cost_vec);
>>       else
>> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>                             + nunits - 1) / nunits);
>>             ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
>>           }
>> +       else
>> +         ncopies_for_cost = group_size / nunits;
>>         /* Record the cost for the vector loads.  */
>>         vect_model_load_cost (stmt_info, ncopies_for_cost,
>>                               memory_access_type, node, prologue_cost_vec,
>>
>> Bill
>>
>
Bill Schmidt Sept. 21, 2017, 1:15 p.m. UTC | #10
On Sep 21, 2017, at 2:37 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
>> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>>> 
>>> On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>>>> 
>>>> But I think this shows up another problem.  In the vectorised loop,
>>>> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
>>>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
>>>> even for the loads.  So I think it's a double whammy: we've quadrupling
>>>> the real cost via the excessive ncopies_for_cost, and we're using the
>>>> wrong access type too.
>>>> 
>>>> The patch below should fix the second problem but isn't much use without
>>>> the first.  I think we need to adjust ncopies_for_cost in the recursive
>>>> calls as well.
>>> 
>>> Unfortunately we have to deal with the fact that ncopies_for_cost is set
>>> once for the whole SLP instance, which is not ideal since we know the
>>> number of loads and stores doesn't follow the same rules.
>>> 
>>> What about the following?  I *think* that group_size / nunits is sufficient for
>>> VMAT_STRIDED_SLP and VMAT_CONTIGUOUS for which permuted loads are
>>> not possible (already handled differently), but I could well be missing something.
>> 
>> Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...
> 
> I don't think any such simple approach will result in appropriate costing.  The
> current costing is conservative and you definitely don't want to be
> overly optimistic
> here ;)
> 
> You'd have to fully emulate the vectorizable_load code.  Which means we should
> probably refactor the code-gen part
> 
>  if (memory_access_type == VMAT_ELEMENTWISE
>      || memory_access_type == VMAT_STRIDED_SLP)
>    {
> ...
> 
> where it computes ltype, nloads, etc., do that computation up-front in
> the analysis
> phase, storing the result somewhere in stmt_vinfo (err, or slp_node or
> both -- think
> of hybrid SLP and/or the stmt used in different SLP contexts) and use the info
> during transform.  And hope to have enough info to statically compute the
> number of loads and their size/alignment at costing time.

I suppose so.  I was thinking that if these are truly strided loads, then the
cost is predictable -- but only for targets that have hardware strided loads
available, which are few, so this isn't conservative enough.  The code in
question didn't deal with VMAT_ELEMENTWISE, so the factoring is even
more complicated -- the structures of the costing logic and the code gen
logic don't match well at all.

I'm unlikely to have time for a complex solution right now, so I'll stick a pointer
to this conversation in the bugzilla and take my name off in case somebody
gets to it sooner.  But I'll probably keep poking at it in spare moments.

Thanks,
Bill
> 
> Or go the full way of restructuring costing to have a prologue/body_cost_vec
> in stmt_info and slp_node so that we can compute and store the cost
> from vectorizable_* rather than duplicating the hard parts in SLP costing code.
> 
> Richard.
> 
>> Bill
>>> 
>>> Index: gcc/tree-vect-slp.c
>>> ===================================================================
>>> --- gcc/tree-vect-slp.c       (revision 252760)
>>> +++ gcc/tree-vect-slp.c       (working copy)
>>> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>  stmt_info = vinfo_for_stmt (stmt);
>>>  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>>>    {
>>> +      int group_size = GROUP_SIZE (stmt_info);
>>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>      vect_memory_access_type memory_access_type
>>> -     = (STMT_VINFO_STRIDED_P (stmt_info)
>>> +     = (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
>>>         ? VMAT_STRIDED_SLP
>>>         : VMAT_CONTIGUOUS);
>>>      if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
>>> -     vect_model_store_cost (stmt_info, ncopies_for_cost,
>>> +     vect_model_store_cost (stmt_info, group_size / nunits,
>>>                             memory_access_type, vect_uninitialized_def,
>>>                             node, prologue_cost_vec, body_cost_vec);
>>>      else
>>> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>                            + nunits - 1) / nunits);
>>>            ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
>>>          }
>>> +       else
>>> +         ncopies_for_cost = group_size / nunits;
>>>        /* Record the cost for the vector loads.  */
>>>        vect_model_load_cost (stmt_info, ncopies_for_cost,
>>>                              memory_access_type, node, prologue_cost_vec,
>>> 
>>> Bill
>>> 
>> 
>
Richard Biener Sept. 21, 2017, 2:38 p.m. UTC | #11
On Thu, Sep 21, 2017 at 3:15 PM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Sep 21, 2017, at 2:37 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>
>> On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt
>> <wschmidt@linux.vnet.ibm.com> wrote:
>>> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>>>>
>>>> On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>>>>>
>>>>> But I think this shows up another problem.  In the vectorised loop,
>>>>> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
>>>>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
>>>>> even for the loads.  So I think it's a double whammy: we've quadrupling
>>>>> the real cost via the excessive ncopies_for_cost, and we're using the
>>>>> wrong access type too.
>>>>>
>>>>> The patch below should fix the second problem but isn't much use without
>>>>> the first.  I think we need to adjust ncopies_for_cost in the recursive
>>>>> calls as well.
>>>>
>>>> Unfortunately we have to deal with the fact that ncopies_for_cost is set
>>>> once for the whole SLP instance, which is not ideal since we know the
>>>> number of loads and stores doesn't follow the same rules.
>>>>
>>>> What about the following?  I *think* that group_size / nunits is sufficient for
>>>> VMAT_STRIDED_SLP and VMAT_CONTIGUOUS for which permuted loads are
>>>> not possible (already handled differently), but I could well be missing something.
>>>
>>> Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...
>>
>> I don't think any such simple approach will result in appropriate costing.  The
>> current costing is conservative and you definitely don't want to be
>> overly optimistic
>> here ;)
>>
>> You'd have to fully emulate the vectorizable_load code.  Which means we should
>> probably refactor the code-gen part
>>
>>  if (memory_access_type == VMAT_ELEMENTWISE
>>      || memory_access_type == VMAT_STRIDED_SLP)
>>    {
>> ...
>>
>> where it computes ltype, nloads, etc., do that computation up-front in
>> the analysis
>> phase, storing the result somewhere in stmt_vinfo (err, or slp_node or
>> both -- think
>> of hybrid SLP and/or the stmt used in different SLP contexts) and use the info
>> during transform.  And hope to have enough info to statically compute the
>> number of loads and their size/alignment at costing time.
>
> I suppose so.  I was thinking that if these are truly strided loads, then the
> cost is predictable -- but only for targets that have hardware strided loads
> available, which are few, so this isn't conservative enough.  The code in
> question didn't deal with VMAT_ELEMENTWISE, so the factoring is even
> more complicated -- the structures of the costing logic and the code gen
> logic don't match well at all.
>
> I'm unlikely to have time for a complex solution right now, so I'll stick a pointer
> to this conversation in the bugzilla and take my name off in case somebody
> gets to it sooner.  But I'll probably keep poking at it in spare moments.

I hope to get to some costmodel cleanups for GCC 8.

Richard.

> Thanks,
> Bill
>>
>> Or go the full way of restructuring costing to have a prologue/body_cost_vec
>> in stmt_info and slp_node so that we can compute and store the cost
>> from vectorizable_* rather than duplicating the hard parts in SLP costing code.
>>
>> Richard.
>>
>>> Bill
>>>>
>>>> Index: gcc/tree-vect-slp.c
>>>> ===================================================================
>>>> --- gcc/tree-vect-slp.c       (revision 252760)
>>>> +++ gcc/tree-vect-slp.c       (working copy)
>>>> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>>  stmt_info = vinfo_for_stmt (stmt);
>>>>  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>>>>    {
>>>> +      int group_size = GROUP_SIZE (stmt_info);
>>>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>>      vect_memory_access_type memory_access_type
>>>> -     = (STMT_VINFO_STRIDED_P (stmt_info)
>>>> +     = (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
>>>>         ? VMAT_STRIDED_SLP
>>>>         : VMAT_CONTIGUOUS);
>>>>      if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
>>>> -     vect_model_store_cost (stmt_info, ncopies_for_cost,
>>>> +     vect_model_store_cost (stmt_info, group_size / nunits,
>>>>                             memory_access_type, vect_uninitialized_def,
>>>>                             node, prologue_cost_vec, body_cost_vec);
>>>>      else
>>>> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>>                            + nunits - 1) / nunits);
>>>>            ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
>>>>          }
>>>> +       else
>>>> +         ncopies_for_cost = group_size / nunits;
>>>>        /* Record the cost for the vector loads.  */
>>>>        vect_model_load_cost (stmt_info, ncopies_for_cost,
>>>>                              memory_access_type, node, prologue_cost_vec,
>>>>
>>>> Bill
>>>>
>>>
>>
>

Patch

Index: gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c	(nonexistent)
+++ gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c	(working copy)
@@ -0,0 +1,30 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+/* PR82255: Ensure we don't require a vec_construct cost when we aren't
+   going to generate a strided load.  */
+
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+/* { dg-final { scan-tree-dump-times "vec_construct required" 0 "vect" } } */
+
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 252760)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -1091,8 +1091,20 @@  vect_model_load_cost (stmt_vec_info stmt_info, int
 			prologue_cost_vec, body_cost_vec, true);
   if (memory_access_type == VMAT_ELEMENTWISE
       || memory_access_type == VMAT_STRIDED_SLP)
-    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
-				     stmt_info, 0, vect_body);
+    {
+      stmt_vec_info stmt_info = vinfo_for_stmt (first_stmt);
+      int group_size = GROUP_SIZE (stmt_info);
+      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
+      if (group_size < nunits)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_load_cost: vec_construct required");
+	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
+					   vec_construct, stmt_info, 0,
+					   vect_body);
+	}
+    }
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,