diff mbox series

[PR81740] Enforce dependence check for outer loop vectorization

Message ID DB5PR0801MB27422285F855F93CFAEA210BE70B0@DB5PR0801MB2742.eurprd08.prod.outlook.com
State New
Headers show
Series [PR81740] Enforce dependence check for outer loop vectorization | expand

Commit Message

Bin Cheng Dec. 15, 2017, 11:30 a.m. UTC
Hi,
As explained in the PR, given below test case:
int a[8][10] = { [2][5] = 4 }, c;

int
main ()
{
  short b;
  int i, d;
  for (b = 4; b >= 0; b--)
    for (c = 0; c <= 6; c++)
      a[c + 1][b + 2] = a[c][b + 1];
  for (i = 0; i < 8; i++)
    for (d = 0; d < 10; d++)
      if (a[i][d] != (i == 3 && d == 6) * 4)
        __builtin_abort ();
  return 0;

the loop nest is illegal for vectorization without reversing inner loop.  The issue
is in data dependence checking of vectorizer, I believe the mentioned revision just
exposed this.  Previously the vectorization is skipped because of unsupported memory
operation.  The outer loop vectorization unrolls the outer loop into:

  for (b = 4; b > 0; b -= 4)
  {
    for (c = 0; c <= 6; c++)
      a[c + 1][6] = a[c][5];
    for (c = 0; c <= 6; c++)
      a[c + 1][5] = a[c][4];
    for (c = 0; c <= 6; c++)
      a[c + 1][4] = a[c][3];
    for (c = 0; c <= 6; c++)
      a[c + 1][3] = a[c][2];
  }
Then four inner loops are fused into:
  for (b = 4; b > 0; b -= 4)
  {
    for (c = 0; c <= 6; c++)
    {
      a[c + 1][6] = a[c][5];  // S1
      a[c + 1][5] = a[c][4];  // S2
      a[c + 1][4] = a[c][3];
      a[c + 1][3] = a[c][2];
    }
  }
The loop fusion needs to meet the dependence requirement.  Basically, GCC's data
dependence analyzer does not model dep between references in sibling loops, but
in practice, fusion requirement can be checked by analyzing all data references
after fusion, and there is no backward data dependence.

Apparently, the requirement is violated because we have backward data dependence
between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse the inner
loop, the outer loop would become legal for vectorization.

This patch fixes the issue by enforcing dependence check.  It also adds two tests
with one shouldn't be vectorized and the other should.  Bootstrap and test on x86_64
and AArch64.  Is it OK?

Thanks,
bin
2017-12-15  Bin Cheng  <bin.cheng@arm.com>

	PR tree-optimization/81740
	* tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case
	of outer loop vectorization, check backward dependence at inner loop
	if dependence at outer loop is reversed.

gcc/testsuite
2017-12-15  Bin Cheng  <bin.cheng@arm.com>

	PR tree-optimization/81740
	* gcc.dg/vect/pr81740-1.c: New test.
	* gcc.dg/vect/pr81740-2.c: Refine test.
From c0c8cfae08c0bde2cec41a8d3abcbfea0bd2e211 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binche01@e108451-lin.cambridge.arm.com>
Date: Thu, 14 Dec 2017 15:32:02 +0000
Subject: [PATCH] pr81740-20171212.txt

---
 gcc/testsuite/gcc.dg/vect/pr81740-1.c | 17 +++++++++++++++++
 gcc/testsuite/gcc.dg/vect/pr81740-2.c | 21 +++++++++++++++++++++
 gcc/tree-vect-data-refs.c             | 11 +++++++++++
 3 files changed, 49 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr81740-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr81740-2.c

Comments

Richard Biener Dec. 15, 2017, 11:55 a.m. UTC | #1
On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> As explained in the PR, given below test case:
> int a[8][10] = { [2][5] = 4 }, c;
>
> int
> main ()
> {
>   short b;
>   int i, d;
>   for (b = 4; b >= 0; b--)
>     for (c = 0; c <= 6; c++)
>       a[c + 1][b + 2] = a[c][b + 1];
>   for (i = 0; i < 8; i++)
>     for (d = 0; d < 10; d++)
>       if (a[i][d] != (i == 3 && d == 6) * 4)
>         __builtin_abort ();
>   return 0;
>
> the loop nest is illegal for vectorization without reversing inner loop.  The issue
> is in data dependence checking of vectorizer, I believe the mentioned revision just
> exposed this.  Previously the vectorization is skipped because of unsupported memory
> operation.  The outer loop vectorization unrolls the outer loop into:
>
>   for (b = 4; b > 0; b -= 4)
>   {
>     for (c = 0; c <= 6; c++)
>       a[c + 1][6] = a[c][5];
>     for (c = 0; c <= 6; c++)
>       a[c + 1][5] = a[c][4];
>     for (c = 0; c <= 6; c++)
>       a[c + 1][4] = a[c][3];
>     for (c = 0; c <= 6; c++)
>       a[c + 1][3] = a[c][2];
>   }
> Then four inner loops are fused into:
>   for (b = 4; b > 0; b -= 4)
>   {
>     for (c = 0; c <= 6; c++)
>     {
>       a[c + 1][6] = a[c][5];  // S1
>       a[c + 1][5] = a[c][4];  // S2
>       a[c + 1][4] = a[c][3];
>       a[c + 1][3] = a[c][2];
>     }
>   }

Note that they are not really "fused" but they are interleaved.  With
GIMPLE in mind
that makes a difference, you should get the equivalent of

   for (c = 0; c <= 6; c++)
     {
       tem1 = a[c][5];
       tem2 = a[c][4];
       tem3 = a[c][3];
       tem4 = a[c][2];
       a[c+1][6] = tem1;
       a[c +1][5] = tem2;
        a[c+1][4] = tem3;
        a[c+1][3] = tem4;
     }

> The loop fusion needs to meet the dependence requirement.  Basically, GCC's data
> dependence analyzer does not model dep between references in sibling loops, but
> in practice, fusion requirement can be checked by analyzing all data references
> after fusion, and there is no backward data dependence.
>
> Apparently, the requirement is violated because we have backward data dependence
> between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse the inner
> loop, the outer loop would become legal for vectorization.
>
> This patch fixes the issue by enforcing dependence check.  It also adds two tests
> with one shouldn't be vectorized and the other should.  Bootstrap and test on x86_64
> and AArch64.  Is it OK?

I think you have identified the spot where things go wrong but I'm not
sure you fix the
problem fully.  The spot you pacth is (loop is the outer loop):

  loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
...
  FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
    {
      int dist = dist_v[loop_depth];
...
      if (dist > 0 && DDR_REVERSED_P (ddr))
        {
          /* If DDR_REVERSED_P the order of the data-refs in DDR was
             reversed (to make distance vector positive), and the actual
             distance is negative.  */
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "dependence distance negative.\n");

where you add

+         /* When doing outer loop vectorization, we need to check if there is
+            backward dependence at inner loop level if dependence at the outer
+            loop is reversed.  See PR81740 for more information.  */
+         if (nested_in_vect_loop_p (loop, DR_STMT (dra))
+             || nested_in_vect_loop_p (loop, DR_STMT (drb)))
+           {
+             unsigned inner_depth = index_in_loop_nest (loop->inner->num,
+                                                        DDR_LOOP_NEST (ddr));
+             if (dist_v[inner_depth] < 0)
+               return true;
+           }

but I don't understand how the dependence direction with respect to the
outer loop matters here.

Given there's DDR_REVERSED on the outer loop distance what does that
mean for the inner loop distance given the quite non-obvious code handling
this case in tree-data-ref.c:

      /* Verify a basic constraint: classic distance vectors should
         always be lexicographically positive.

         Data references are collected in the order of execution of
         the program, thus for the following loop

         | for (i = 1; i < 100; i++)
         |   for (j = 1; j < 100; j++)
         |     {
         |       t = T[j+1][i-1];  // A
         |       T[j][i] = t + 2;  // B
         |     }

         references are collected following the direction of the wind:
         A then B.  The data dependence tests are performed also
         following this order, such that we're looking at the distance
         separating the elements accessed by A from the elements later
         accessed by B.  But in this example, the distance returned by
         test_dep (A, B) is lexicographically negative (-1, 1), that
         means that the access A occurs later than B with respect to
         the outer loop, ie. we're actually looking upwind.  In this
         case we solve test_dep (B, A) looking downwind to the
         lexicographically positive solution, that returns the
         distance vector (1, -1).  */
      if (!lambda_vector_lexico_pos (dist_v, DDR_NB_LOOPS (ddr)))
        {
          lambda_vector save_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
          if (!subscript_dependence_tester_1 (ddr, 1, 0, loop_nest))
            return false;
          compute_subscript_distance (ddr);
          if (!build_classic_dist_vector_1 (ddr, 1, 0, save_v, &init_b,
                                            &index_carry))
            return false;
          save_dist_v (ddr, save_v);
          DDR_REVERSED_P (ddr) = true;

that is, what does dist_v[inner_depth] < 0 tell us here?  Does it tell us
that the dependence direction is reverse of the outer loop?

CCing Micha who might remember some more details from unroll-and-jam.

Thanks,
Richard.

> Thanks,
> bin
> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>
>         PR tree-optimization/81740
>         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case
>         of outer loop vectorization, check backward dependence at inner loop
>         if dependence at outer loop is reversed.
>
> gcc/testsuite
> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>
>         PR tree-optimization/81740
>         * gcc.dg/vect/pr81740-1.c: New test.
>         * gcc.dg/vect/pr81740-2.c: Refine test.
Bin.Cheng Dec. 15, 2017, 12:09 p.m. UTC | #2
On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> As explained in the PR, given below test case:
>> int a[8][10] = { [2][5] = 4 }, c;
>>
>> int
>> main ()
>> {
>>   short b;
>>   int i, d;
>>   for (b = 4; b >= 0; b--)
>>     for (c = 0; c <= 6; c++)
>>       a[c + 1][b + 2] = a[c][b + 1];
>>   for (i = 0; i < 8; i++)
>>     for (d = 0; d < 10; d++)
>>       if (a[i][d] != (i == 3 && d == 6) * 4)
>>         __builtin_abort ();
>>   return 0;
>>
>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
>> is in data dependence checking of vectorizer, I believe the mentioned revision just
>> exposed this.  Previously the vectorization is skipped because of unsupported memory
>> operation.  The outer loop vectorization unrolls the outer loop into:
>>
>>   for (b = 4; b > 0; b -= 4)
>>   {
>>     for (c = 0; c <= 6; c++)
>>       a[c + 1][6] = a[c][5];
>>     for (c = 0; c <= 6; c++)
>>       a[c + 1][5] = a[c][4];
>>     for (c = 0; c <= 6; c++)
>>       a[c + 1][4] = a[c][3];
>>     for (c = 0; c <= 6; c++)
>>       a[c + 1][3] = a[c][2];
>>   }
>> Then four inner loops are fused into:
>>   for (b = 4; b > 0; b -= 4)
>>   {
>>     for (c = 0; c <= 6; c++)
>>     {
>>       a[c + 1][6] = a[c][5];  // S1
>>       a[c + 1][5] = a[c][4];  // S2
>>       a[c + 1][4] = a[c][3];
>>       a[c + 1][3] = a[c][2];
>>     }
>>   }
>
> Note that they are not really "fused" but they are interleaved.  With
> GIMPLE in mind
> that makes a difference, you should get the equivalent of
>
>    for (c = 0; c <= 6; c++)
>      {
>        tem1 = a[c][5];
>        tem2 = a[c][4];
>        tem3 = a[c][3];
>        tem4 = a[c][2];
>        a[c+1][6] = tem1;
>        a[c +1][5] = tem2;
>         a[c+1][4] = tem3;
>         a[c+1][3] = tem4;
>      }
Yeah, I will double check if this abstract breaks the patch and how.

>
>> The loop fusion needs to meet the dependence requirement.  Basically, GCC's data
>> dependence analyzer does not model dep between references in sibling loops, but
>> in practice, fusion requirement can be checked by analyzing all data references
>> after fusion, and there is no backward data dependence.
>>
>> Apparently, the requirement is violated because we have backward data dependence
>> between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse the inner
>> loop, the outer loop would become legal for vectorization.
>>
>> This patch fixes the issue by enforcing dependence check.  It also adds two tests
>> with one shouldn't be vectorized and the other should.  Bootstrap and test on x86_64
>> and AArch64.  Is it OK?
>
> I think you have identified the spot where things go wrong but I'm not
> sure you fix the
> problem fully.  The spot you pacth is (loop is the outer loop):
>
>   loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
> ...
>   FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>     {
>       int dist = dist_v[loop_depth];
> ...
>       if (dist > 0 && DDR_REVERSED_P (ddr))
>         {
>           /* If DDR_REVERSED_P the order of the data-refs in DDR was
>              reversed (to make distance vector positive), and the actual
>              distance is negative.  */
>           if (dump_enabled_p ())
>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                              "dependence distance negative.\n");
>
> where you add
>
> +         /* When doing outer loop vectorization, we need to check if there is
> +            backward dependence at inner loop level if dependence at the outer
> +            loop is reversed.  See PR81740 for more information.  */
> +         if (nested_in_vect_loop_p (loop, DR_STMT (dra))
> +             || nested_in_vect_loop_p (loop, DR_STMT (drb)))
> +           {
> +             unsigned inner_depth = index_in_loop_nest (loop->inner->num,
> +                                                        DDR_LOOP_NEST (ddr));
> +             if (dist_v[inner_depth] < 0)
> +               return true;
> +           }
>
> but I don't understand how the dependence direction with respect to the
> outer loop matters here.
If the direction wrto outer loop is positive by itself, i.e,
reversed_p equals to false, then dist is checked against max_vf.  In
this case, it's not possible to have references refer to the same
object?
On the other hand, dist is not checked at all for reversed case.
Maybe an additional check "dist < max_vf" can relax the patch a bit.
>
> Given there's DDR_REVERSED on the outer loop distance what does that
> mean for the inner loop distance given the quite non-obvious code handling
> this case in tree-data-ref.c:
>
>       /* Verify a basic constraint: classic distance vectors should
>          always be lexicographically positive.
>
>          Data references are collected in the order of execution of
>          the program, thus for the following loop
>
>          | for (i = 1; i < 100; i++)
>          |   for (j = 1; j < 100; j++)
>          |     {
>          |       t = T[j+1][i-1];  // A
>          |       T[j][i] = t + 2;  // B
>          |     }
>
>          references are collected following the direction of the wind:
>          A then B.  The data dependence tests are performed also
>          following this order, such that we're looking at the distance
>          separating the elements accessed by A from the elements later
>          accessed by B.  But in this example, the distance returned by
>          test_dep (A, B) is lexicographically negative (-1, 1), that
>          means that the access A occurs later than B with respect to
>          the outer loop, ie. we're actually looking upwind.  In this
>          case we solve test_dep (B, A) looking downwind to the
>          lexicographically positive solution, that returns the
>          distance vector (1, -1).  */
>       if (!lambda_vector_lexico_pos (dist_v, DDR_NB_LOOPS (ddr)))
>         {
>           lambda_vector save_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
>           if (!subscript_dependence_tester_1 (ddr, 1, 0, loop_nest))
>             return false;
>           compute_subscript_distance (ddr);
>           if (!build_classic_dist_vector_1 (ddr, 1, 0, save_v, &init_b,
>                                             &index_carry))
>             return false;
>           save_dist_v (ddr, save_v);
>           DDR_REVERSED_P (ddr) = true;
>
> that is, what does dist_v[inner_depth] < 0 tell us here?  Does it tell us
> that the dependence direction is reverse of the outer loop?
Hmm, this part of code is a bit confusing for me.  So we always
compute classic distance by sorting two data references in lexico
order, which could be the opposite given we collect references by dom
order.  IIUC, the negative dist at inner loop means a backward
dependence for that index/loop.  It's not related to outer loop or the
reversed_p flag.

Thanks,
bin
>
> CCing Micha who might remember some more details from unroll-and-jam.
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>
>>         PR tree-optimization/81740
>>         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case
>>         of outer loop vectorization, check backward dependence at inner loop
>>         if dependence at outer loop is reversed.
>>
>> gcc/testsuite
>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>
>>         PR tree-optimization/81740
>>         * gcc.dg/vect/pr81740-1.c: New test.
>>         * gcc.dg/vect/pr81740-2.c: Refine test.
Bin.Cheng Dec. 15, 2017, 12:35 p.m. UTC | #3
On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> As explained in the PR, given below test case:
>>> int a[8][10] = { [2][5] = 4 }, c;
>>>
>>> int
>>> main ()
>>> {
>>>   short b;
>>>   int i, d;
>>>   for (b = 4; b >= 0; b--)
>>>     for (c = 0; c <= 6; c++)
>>>       a[c + 1][b + 2] = a[c][b + 1];
>>>   for (i = 0; i < 8; i++)
>>>     for (d = 0; d < 10; d++)
>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
>>>         __builtin_abort ();
>>>   return 0;
>>>
>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
>>> operation.  The outer loop vectorization unrolls the outer loop into:
>>>
>>>   for (b = 4; b > 0; b -= 4)
>>>   {
>>>     for (c = 0; c <= 6; c++)
>>>       a[c + 1][6] = a[c][5];
>>>     for (c = 0; c <= 6; c++)
>>>       a[c + 1][5] = a[c][4];
>>>     for (c = 0; c <= 6; c++)
>>>       a[c + 1][4] = a[c][3];
>>>     for (c = 0; c <= 6; c++)
>>>       a[c + 1][3] = a[c][2];
>>>   }
>>> Then four inner loops are fused into:
>>>   for (b = 4; b > 0; b -= 4)
>>>   {
>>>     for (c = 0; c <= 6; c++)
>>>     {
>>>       a[c + 1][6] = a[c][5];  // S1
>>>       a[c + 1][5] = a[c][4];  // S2
>>>       a[c + 1][4] = a[c][3];
>>>       a[c + 1][3] = a[c][2];
>>>     }
>>>   }
>>
>> Note that they are not really "fused" but they are interleaved.  With
>> GIMPLE in mind
>> that makes a difference, you should get the equivalent of
>>
>>    for (c = 0; c <= 6; c++)
>>      {
>>        tem1 = a[c][5];
>>        tem2 = a[c][4];
>>        tem3 = a[c][3];
>>        tem4 = a[c][2];
>>        a[c+1][6] = tem1;
>>        a[c +1][5] = tem2;
>>         a[c+1][4] = tem3;
>>         a[c+1][3] = tem4;
>>      }
> Yeah, I will double check if this abstract breaks the patch and how.
Hmm, I think this doesn't break it, well at least for part of the
analysis, because it is loop carried (backward) dependence goes wrong,
interleaving or not with the same iteration doesn't matter here.

Thanks,
bin
>
>>
>>> The loop fusion needs to meet the dependence requirement.  Basically, GCC's data
>>> dependence analyzer does not model dep between references in sibling loops, but
>>> in practice, fusion requirement can be checked by analyzing all data references
>>> after fusion, and there is no backward data dependence.
>>>
>>> Apparently, the requirement is violated because we have backward data dependence
>>> between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse the inner
>>> loop, the outer loop would become legal for vectorization.
>>>
>>> This patch fixes the issue by enforcing dependence check.  It also adds two tests
>>> with one shouldn't be vectorized and the other should.  Bootstrap and test on x86_64
>>> and AArch64.  Is it OK?
>>
>> I think you have identified the spot where things go wrong but I'm not
>> sure you fix the
>> problem fully.  The spot you pacth is (loop is the outer loop):
>>
>>   loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
>> ...
>>   FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>>     {
>>       int dist = dist_v[loop_depth];
>> ...
>>       if (dist > 0 && DDR_REVERSED_P (ddr))
>>         {
>>           /* If DDR_REVERSED_P the order of the data-refs in DDR was
>>              reversed (to make distance vector positive), and the actual
>>              distance is negative.  */
>>           if (dump_enabled_p ())
>>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>                              "dependence distance negative.\n");
>>
>> where you add
>>
>> +         /* When doing outer loop vectorization, we need to check if there is
>> +            backward dependence at inner loop level if dependence at the outer
>> +            loop is reversed.  See PR81740 for more information.  */
>> +         if (nested_in_vect_loop_p (loop, DR_STMT (dra))
>> +             || nested_in_vect_loop_p (loop, DR_STMT (drb)))
>> +           {
>> +             unsigned inner_depth = index_in_loop_nest (loop->inner->num,
>> +                                                        DDR_LOOP_NEST (ddr));
>> +             if (dist_v[inner_depth] < 0)
>> +               return true;
>> +           }
>>
>> but I don't understand how the dependence direction with respect to the
>> outer loop matters here.
> If the direction wrto outer loop is positive by itself, i.e,
> reversed_p equals to false, then dist is checked against max_vf.  In
> this case, it's not possible to have references refer to the same
> object?
> On the other hand, dist is not checked at all for reversed case.
> Maybe an additional check "dist < max_vf" can relax the patch a bit.
>>
>> Given there's DDR_REVERSED on the outer loop distance what does that
>> mean for the inner loop distance given the quite non-obvious code handling
>> this case in tree-data-ref.c:
>>
>>       /* Verify a basic constraint: classic distance vectors should
>>          always be lexicographically positive.
>>
>>          Data references are collected in the order of execution of
>>          the program, thus for the following loop
>>
>>          | for (i = 1; i < 100; i++)
>>          |   for (j = 1; j < 100; j++)
>>          |     {
>>          |       t = T[j+1][i-1];  // A
>>          |       T[j][i] = t + 2;  // B
>>          |     }
>>
>>          references are collected following the direction of the wind:
>>          A then B.  The data dependence tests are performed also
>>          following this order, such that we're looking at the distance
>>          separating the elements accessed by A from the elements later
>>          accessed by B.  But in this example, the distance returned by
>>          test_dep (A, B) is lexicographically negative (-1, 1), that
>>          means that the access A occurs later than B with respect to
>>          the outer loop, ie. we're actually looking upwind.  In this
>>          case we solve test_dep (B, A) looking downwind to the
>>          lexicographically positive solution, that returns the
>>          distance vector (1, -1).  */
>>       if (!lambda_vector_lexico_pos (dist_v, DDR_NB_LOOPS (ddr)))
>>         {
>>           lambda_vector save_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
>>           if (!subscript_dependence_tester_1 (ddr, 1, 0, loop_nest))
>>             return false;
>>           compute_subscript_distance (ddr);
>>           if (!build_classic_dist_vector_1 (ddr, 1, 0, save_v, &init_b,
>>                                             &index_carry))
>>             return false;
>>           save_dist_v (ddr, save_v);
>>           DDR_REVERSED_P (ddr) = true;
>>
>> that is, what does dist_v[inner_depth] < 0 tell us here?  Does it tell us
>> that the dependence direction is reverse of the outer loop?
> Hmm, this part of code is a bit confusing for me.  So we always
> compute classic distance by sorting two data references in lexico
> order, which could be the opposite given we collect references by dom
> order.  IIUC, the negative dist at inner loop means a backward
> dependence for that index/loop.  It's not related to outer loop or the
> reversed_p flag.
>
> Thanks,
> bin
>>
>> CCing Micha who might remember some more details from unroll-and-jam.
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> bin
>>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         PR tree-optimization/81740
>>>         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case
>>>         of outer loop vectorization, check backward dependence at inner loop
>>>         if dependence at outer loop is reversed.
>>>
>>> gcc/testsuite
>>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         PR tree-optimization/81740
>>>         * gcc.dg/vect/pr81740-1.c: New test.
>>>         * gcc.dg/vect/pr81740-2.c: Refine test.
Richard Biener Dec. 15, 2017, 1:19 p.m. UTC | #4
On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> As explained in the PR, given below test case:
>>>> int a[8][10] = { [2][5] = 4 }, c;
>>>>
>>>> int
>>>> main ()
>>>> {
>>>>   short b;
>>>>   int i, d;
>>>>   for (b = 4; b >= 0; b--)
>>>>     for (c = 0; c <= 6; c++)
>>>>       a[c + 1][b + 2] = a[c][b + 1];
>>>>   for (i = 0; i < 8; i++)
>>>>     for (d = 0; d < 10; d++)
>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
>>>>         __builtin_abort ();
>>>>   return 0;
>>>>
>>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
>>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
>>>> operation.  The outer loop vectorization unrolls the outer loop into:
>>>>
>>>>   for (b = 4; b > 0; b -= 4)
>>>>   {
>>>>     for (c = 0; c <= 6; c++)
>>>>       a[c + 1][6] = a[c][5];
>>>>     for (c = 0; c <= 6; c++)
>>>>       a[c + 1][5] = a[c][4];
>>>>     for (c = 0; c <= 6; c++)
>>>>       a[c + 1][4] = a[c][3];
>>>>     for (c = 0; c <= 6; c++)
>>>>       a[c + 1][3] = a[c][2];
>>>>   }
>>>> Then four inner loops are fused into:
>>>>   for (b = 4; b > 0; b -= 4)
>>>>   {
>>>>     for (c = 0; c <= 6; c++)
>>>>     {
>>>>       a[c + 1][6] = a[c][5];  // S1
>>>>       a[c + 1][5] = a[c][4];  // S2
>>>>       a[c + 1][4] = a[c][3];
>>>>       a[c + 1][3] = a[c][2];
>>>>     }
>>>>   }
>>>
>>> Note that they are not really "fused" but they are interleaved.  With
>>> GIMPLE in mind
>>> that makes a difference, you should get the equivalent of
>>>
>>>    for (c = 0; c <= 6; c++)
>>>      {
>>>        tem1 = a[c][5];
>>>        tem2 = a[c][4];
>>>        tem3 = a[c][3];
>>>        tem4 = a[c][2];
>>>        a[c+1][6] = tem1;
>>>        a[c +1][5] = tem2;
>>>         a[c+1][4] = tem3;
>>>         a[c+1][3] = tem4;
>>>      }
>> Yeah, I will double check if this abstract breaks the patch and how.
> Hmm, I think this doesn't break it, well at least for part of the
> analysis, because it is loop carried (backward) dependence goes wrong,
> interleaving or not with the same iteration doesn't matter here.

I think the idea is that forward dependences are always fine (negative distance)
to vectorize.  But with backward dependences we have to adhere to max_vf.

It looks like for outer loop vectorization we only look at the distances in the
outer loop but never at inner ones.  But here the same applies but isn't that
independend on the distances with respect to the outer loop?

But maybe I'm misunderstanding how "distances" work here.

Richard.

> Thanks,
> bin
>>
>>>
>>>> The loop fusion needs to meet the dependence requirement.  Basically, GCC's data
>>>> dependence analyzer does not model dep between references in sibling loops, but
>>>> in practice, fusion requirement can be checked by analyzing all data references
>>>> after fusion, and there is no backward data dependence.
>>>>
>>>> Apparently, the requirement is violated because we have backward data dependence
>>>> between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse the inner
>>>> loop, the outer loop would become legal for vectorization.
>>>>
>>>> This patch fixes the issue by enforcing dependence check.  It also adds two tests
>>>> with one shouldn't be vectorized and the other should.  Bootstrap and test on x86_64
>>>> and AArch64.  Is it OK?
>>>
>>> I think you have identified the spot where things go wrong but I'm not
>>> sure you fix the
>>> problem fully.  The spot you pacth is (loop is the outer loop):
>>>
>>>   loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
>>> ...
>>>   FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>>>     {
>>>       int dist = dist_v[loop_depth];
>>> ...
>>>       if (dist > 0 && DDR_REVERSED_P (ddr))
>>>         {
>>>           /* If DDR_REVERSED_P the order of the data-refs in DDR was
>>>              reversed (to make distance vector positive), and the actual
>>>              distance is negative.  */
>>>           if (dump_enabled_p ())
>>>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>                              "dependence distance negative.\n");
>>>
>>> where you add
>>>
>>> +         /* When doing outer loop vectorization, we need to check if there is
>>> +            backward dependence at inner loop level if dependence at the outer
>>> +            loop is reversed.  See PR81740 for more information.  */
>>> +         if (nested_in_vect_loop_p (loop, DR_STMT (dra))
>>> +             || nested_in_vect_loop_p (loop, DR_STMT (drb)))
>>> +           {
>>> +             unsigned inner_depth = index_in_loop_nest (loop->inner->num,
>>> +                                                        DDR_LOOP_NEST (ddr));
>>> +             if (dist_v[inner_depth] < 0)
>>> +               return true;
>>> +           }
>>>
>>> but I don't understand how the dependence direction with respect to the
>>> outer loop matters here.
>> If the direction wrto outer loop is positive by itself, i.e,
>> reversed_p equals to false, then dist is checked against max_vf.  In
>> this case, it's not possible to have references refer to the same
>> object?
>> On the other hand, dist is not checked at all for reversed case.
>> Maybe an additional check "dist < max_vf" can relax the patch a bit.
>>>
>>> Given there's DDR_REVERSED on the outer loop distance what does that
>>> mean for the inner loop distance given the quite non-obvious code handling
>>> this case in tree-data-ref.c:
>>>
>>>       /* Verify a basic constraint: classic distance vectors should
>>>          always be lexicographically positive.
>>>
>>>          Data references are collected in the order of execution of
>>>          the program, thus for the following loop
>>>
>>>          | for (i = 1; i < 100; i++)
>>>          |   for (j = 1; j < 100; j++)
>>>          |     {
>>>          |       t = T[j+1][i-1];  // A
>>>          |       T[j][i] = t + 2;  // B
>>>          |     }
>>>
>>>          references are collected following the direction of the wind:
>>>          A then B.  The data dependence tests are performed also
>>>          following this order, such that we're looking at the distance
>>>          separating the elements accessed by A from the elements later
>>>          accessed by B.  But in this example, the distance returned by
>>>          test_dep (A, B) is lexicographically negative (-1, 1), that
>>>          means that the access A occurs later than B with respect to
>>>          the outer loop, ie. we're actually looking upwind.  In this
>>>          case we solve test_dep (B, A) looking downwind to the
>>>          lexicographically positive solution, that returns the
>>>          distance vector (1, -1).  */
>>>       if (!lambda_vector_lexico_pos (dist_v, DDR_NB_LOOPS (ddr)))
>>>         {
>>>           lambda_vector save_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
>>>           if (!subscript_dependence_tester_1 (ddr, 1, 0, loop_nest))
>>>             return false;
>>>           compute_subscript_distance (ddr);
>>>           if (!build_classic_dist_vector_1 (ddr, 1, 0, save_v, &init_b,
>>>                                             &index_carry))
>>>             return false;
>>>           save_dist_v (ddr, save_v);
>>>           DDR_REVERSED_P (ddr) = true;
>>>
>>> that is, what does dist_v[inner_depth] < 0 tell us here?  Does it tell us
>>> that the dependence direction is reverse of the outer loop?
>> Hmm, this part of code is a bit confusing for me.  So we always
>> compute classic distance by sorting two data references in lexico
>> order, which could be the opposite given we collect references by dom
>> order.  IIUC, the negative dist at inner loop means a backward
>> dependence for that index/loop.  It's not related to outer loop or the
>> reversed_p flag.
>>
>> Thanks,
>> bin
>>>
>>> CCing Micha who might remember some more details from unroll-and-jam.
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>>>
>>>>         PR tree-optimization/81740
>>>>         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case
>>>>         of outer loop vectorization, check backward dependence at inner loop
>>>>         if dependence at outer loop is reversed.
>>>>
>>>> gcc/testsuite
>>>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>>>
>>>>         PR tree-optimization/81740
>>>>         * gcc.dg/vect/pr81740-1.c: New test.
>>>>         * gcc.dg/vect/pr81740-2.c: Refine test.
Bin.Cheng Dec. 15, 2017, 6:39 p.m. UTC | #5
On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>> Hi,
>>>>> As explained in the PR, given below test case:
>>>>> int a[8][10] = { [2][5] = 4 }, c;
>>>>>
>>>>> int
>>>>> main ()
>>>>> {
>>>>>   short b;
>>>>>   int i, d;
>>>>>   for (b = 4; b >= 0; b--)
>>>>>     for (c = 0; c <= 6; c++)
>>>>>       a[c + 1][b + 2] = a[c][b + 1];
>>>>>   for (i = 0; i < 8; i++)
>>>>>     for (d = 0; d < 10; d++)
>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
>>>>>         __builtin_abort ();
>>>>>   return 0;
>>>>>
>>>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
>>>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
>>>>>
>>>>>   for (b = 4; b > 0; b -= 4)
>>>>>   {
>>>>>     for (c = 0; c <= 6; c++)
>>>>>       a[c + 1][6] = a[c][5];
>>>>>     for (c = 0; c <= 6; c++)
>>>>>       a[c + 1][5] = a[c][4];
>>>>>     for (c = 0; c <= 6; c++)
>>>>>       a[c + 1][4] = a[c][3];
>>>>>     for (c = 0; c <= 6; c++)
>>>>>       a[c + 1][3] = a[c][2];
>>>>>   }
>>>>> Then four inner loops are fused into:
>>>>>   for (b = 4; b > 0; b -= 4)
>>>>>   {
>>>>>     for (c = 0; c <= 6; c++)
>>>>>     {
>>>>>       a[c + 1][6] = a[c][5];  // S1
>>>>>       a[c + 1][5] = a[c][4];  // S2
>>>>>       a[c + 1][4] = a[c][3];
>>>>>       a[c + 1][3] = a[c][2];
>>>>>     }
>>>>>   }
>>>>
>>>> Note that they are not really "fused" but they are interleaved.  With
>>>> GIMPLE in mind
>>>> that makes a difference, you should get the equivalent of
>>>>
>>>>    for (c = 0; c <= 6; c++)
>>>>      {
>>>>        tem1 = a[c][5];
>>>>        tem2 = a[c][4];
>>>>        tem3 = a[c][3];
>>>>        tem4 = a[c][2];
>>>>        a[c+1][6] = tem1;
>>>>        a[c +1][5] = tem2;
>>>>         a[c+1][4] = tem3;
>>>>         a[c+1][3] = tem4;
>>>>      }
>>> Yeah, I will double check if this abstract breaks the patch and how.
>> Hmm, I think this doesn't break it, well at least for part of the
>> analysis, because it is loop carried (backward) dependence goes wrong,
>> interleaving or not with the same iteration doesn't matter here.
>
> I think the idea is that forward dependences are always fine (negative distance)
> to vectorize.  But with backward dependences we have to adhere to max_vf.
>
> It looks like for outer loop vectorization we only look at the distances in the
> outer loop but never at inner ones.  But here the same applies but isn't that
> independend on the distances with respect to the outer loop?
>
> But maybe I'm misunderstanding how "distances" work here.
Hmm, I am not sure I understand "distance" correctly.  With
description as in book like "Optimizing compilers for Modern
Architectures", distance is "# of iteration of sink ref - # of
iteration of source ref".  Given below example:
  for (i = 0; i < N; ++i)
    {
      x = arr[idx_1];  // S1
      arr[idx_2] = x;  // S2
    }
if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
i;
If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
this is backward dependence.  For example idx_1 is i and idx_2 is i +
1;

In GCC, we always try to subtract idx_2 from idx_1 first in computing
classic distance, we could result in negative distance in case of
backward dependence.  When this happens at dependence carried level,
we manually reverse it.  When this happens at inner level loop, we
simply keep the negative distance.  And it's meaningless to talk about
forward/backward given dependence is carried at outer level loop.

Outer loop vectorization is interesting.  The problematic case has
backward dependence carried by outer loop.  Because we don't check
dist vs. max_vf for such dep, the unrolled references could have outer
loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
like we have outer loop index resolved as equal.  Now it has
dependence only if c == c' + 1.  I think previous comment on fusion
still hold for this and we now need to check backward dependence
between the two reference at inner level loop.  I guess it's a bit
trick to model dependence between a[c][5] and a[c+1][5] using the
original references and dependence.  I think it's okay to do that
because distance of a[c][5] and a[c+1][5] is what we computed
previously for the original references at inner level loop.

Not sure if I have missed something important.

Thanks,
bin
>
> Richard.
>
>> Thanks,
>> bin
>>>
>>>>
>>>>> The loop fusion needs to meet the dependence requirement.  Basically, GCC's data
>>>>> dependence analyzer does not model dep between references in sibling loops, but
>>>>> in practice, fusion requirement can be checked by analyzing all data references
>>>>> after fusion, and there is no backward data dependence.
>>>>>
>>>>> Apparently, the requirement is violated because we have backward data dependence
>>>>> between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse the inner
>>>>> loop, the outer loop would become legal for vectorization.
>>>>>
>>>>> This patch fixes the issue by enforcing dependence check.  It also adds two tests
>>>>> with one shouldn't be vectorized and the other should.  Bootstrap and test on x86_64
>>>>> and AArch64.  Is it OK?
>>>>
>>>> I think you have identified the spot where things go wrong but I'm not
>>>> sure you fix the
>>>> problem fully.  The spot you pacth is (loop is the outer loop):
>>>>
>>>>   loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
>>>> ...
>>>>   FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>>>>     {
>>>>       int dist = dist_v[loop_depth];
>>>> ...
>>>>       if (dist > 0 && DDR_REVERSED_P (ddr))
>>>>         {
>>>>           /* If DDR_REVERSED_P the order of the data-refs in DDR was
>>>>              reversed (to make distance vector positive), and the actual
>>>>              distance is negative.  */
>>>>           if (dump_enabled_p ())
>>>>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>                              "dependence distance negative.\n");
>>>>
>>>> where you add
>>>>
>>>> +         /* When doing outer loop vectorization, we need to check if there is
>>>> +            backward dependence at inner loop level if dependence at the outer
>>>> +            loop is reversed.  See PR81740 for more information.  */
>>>> +         if (nested_in_vect_loop_p (loop, DR_STMT (dra))
>>>> +             || nested_in_vect_loop_p (loop, DR_STMT (drb)))
>>>> +           {
>>>> +             unsigned inner_depth = index_in_loop_nest (loop->inner->num,
>>>> +                                                        DDR_LOOP_NEST (ddr));
>>>> +             if (dist_v[inner_depth] < 0)
>>>> +               return true;
>>>> +           }
>>>>
>>>> but I don't understand how the dependence direction with respect to the
>>>> outer loop matters here.
>>> If the direction wrto outer loop is positive by itself, i.e,
>>> reversed_p equals to false, then dist is checked against max_vf.  In
>>> this case, it's not possible to have references refer to the same
>>> object?
>>> On the other hand, dist is not checked at all for reversed case.
>>> Maybe an additional check "dist < max_vf" can relax the patch a bit.
>>>>
>>>> Given there's DDR_REVERSED on the outer loop distance what does that
>>>> mean for the inner loop distance given the quite non-obvious code handling
>>>> this case in tree-data-ref.c:
>>>>
>>>>       /* Verify a basic constraint: classic distance vectors should
>>>>          always be lexicographically positive.
>>>>
>>>>          Data references are collected in the order of execution of
>>>>          the program, thus for the following loop
>>>>
>>>>          | for (i = 1; i < 100; i++)
>>>>          |   for (j = 1; j < 100; j++)
>>>>          |     {
>>>>          |       t = T[j+1][i-1];  // A
>>>>          |       T[j][i] = t + 2;  // B
>>>>          |     }
>>>>
>>>>          references are collected following the direction of the wind:
>>>>          A then B.  The data dependence tests are performed also
>>>>          following this order, such that we're looking at the distance
>>>>          separating the elements accessed by A from the elements later
>>>>          accessed by B.  But in this example, the distance returned by
>>>>          test_dep (A, B) is lexicographically negative (-1, 1), that
>>>>          means that the access A occurs later than B with respect to
>>>>          the outer loop, ie. we're actually looking upwind.  In this
>>>>          case we solve test_dep (B, A) looking downwind to the
>>>>          lexicographically positive solution, that returns the
>>>>          distance vector (1, -1).  */
>>>>       if (!lambda_vector_lexico_pos (dist_v, DDR_NB_LOOPS (ddr)))
>>>>         {
>>>>           lambda_vector save_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
>>>>           if (!subscript_dependence_tester_1 (ddr, 1, 0, loop_nest))
>>>>             return false;
>>>>           compute_subscript_distance (ddr);
>>>>           if (!build_classic_dist_vector_1 (ddr, 1, 0, save_v, &init_b,
>>>>                                             &index_carry))
>>>>             return false;
>>>>           save_dist_v (ddr, save_v);
>>>>           DDR_REVERSED_P (ddr) = true;
>>>>
>>>> that is, what does dist_v[inner_depth] < 0 tell us here?  Does it tell us
>>>> that the dependence direction is reverse of the outer loop?
>>> Hmm, this part of code is a bit confusing for me.  So we always
>>> compute classic distance by sorting two data references in lexico
>>> order, which could be the opposite given we collect references by dom
>>> order.  IIUC, the negative dist at inner loop means a backward
>>> dependence for that index/loop.  It's not related to outer loop or the
>>> reversed_p flag.
>>>
>>> Thanks,
>>> bin
>>>>
>>>> CCing Micha who might remember some more details from unroll-and-jam.
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>> bin
>>>>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>>>>
>>>>>         PR tree-optimization/81740
>>>>>         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case
>>>>>         of outer loop vectorization, check backward dependence at inner loop
>>>>>         if dependence at outer loop is reversed.
>>>>>
>>>>> gcc/testsuite
>>>>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>>>>
>>>>>         PR tree-optimization/81740
>>>>>         * gcc.dg/vect/pr81740-1.c: New test.
>>>>>         * gcc.dg/vect/pr81740-2.c: Refine test.
Richard Biener Dec. 18, 2017, 12:37 p.m. UTC | #6
On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>> Hi,
>>>>>> As explained in the PR, given below test case:
>>>>>> int a[8][10] = { [2][5] = 4 }, c;
>>>>>>
>>>>>> int
>>>>>> main ()
>>>>>> {
>>>>>>   short b;
>>>>>>   int i, d;
>>>>>>   for (b = 4; b >= 0; b--)
>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>       a[c + 1][b + 2] = a[c][b + 1];
>>>>>>   for (i = 0; i < 8; i++)
>>>>>>     for (d = 0; d < 10; d++)
>>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
>>>>>>         __builtin_abort ();
>>>>>>   return 0;
>>>>>>
>>>>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
>>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
>>>>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
>>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
>>>>>>
>>>>>>   for (b = 4; b > 0; b -= 4)
>>>>>>   {
>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>       a[c + 1][6] = a[c][5];
>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>       a[c + 1][5] = a[c][4];
>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>       a[c + 1][4] = a[c][3];
>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>       a[c + 1][3] = a[c][2];
>>>>>>   }
>>>>>> Then four inner loops are fused into:
>>>>>>   for (b = 4; b > 0; b -= 4)
>>>>>>   {
>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>     {
>>>>>>       a[c + 1][6] = a[c][5];  // S1
>>>>>>       a[c + 1][5] = a[c][4];  // S2
>>>>>>       a[c + 1][4] = a[c][3];
>>>>>>       a[c + 1][3] = a[c][2];
>>>>>>     }
>>>>>>   }
>>>>>
>>>>> Note that they are not really "fused" but they are interleaved.  With
>>>>> GIMPLE in mind
>>>>> that makes a difference, you should get the equivalent of
>>>>>
>>>>>    for (c = 0; c <= 6; c++)
>>>>>      {
>>>>>        tem1 = a[c][5];
>>>>>        tem2 = a[c][4];
>>>>>        tem3 = a[c][3];
>>>>>        tem4 = a[c][2];
>>>>>        a[c+1][6] = tem1;
>>>>>        a[c +1][5] = tem2;
>>>>>         a[c+1][4] = tem3;
>>>>>         a[c+1][3] = tem4;
>>>>>      }
>>>> Yeah, I will double check if this abstract breaks the patch and how.
>>> Hmm, I think this doesn't break it, well at least for part of the
>>> analysis, because it is loop carried (backward) dependence goes wrong,
>>> interleaving or not with the same iteration doesn't matter here.
>>
>> I think the idea is that forward dependences are always fine (negative distance)
>> to vectorize.  But with backward dependences we have to adhere to max_vf.
>>
>> It looks like for outer loop vectorization we only look at the distances in the
>> outer loop but never at inner ones.  But here the same applies but isn't that
>> independend on the distances with respect to the outer loop?
>>
>> But maybe I'm misunderstanding how "distances" work here.
> Hmm, I am not sure I understand "distance" correctly.  With
> description as in book like "Optimizing compilers for Modern
> Architectures", distance is "# of iteration of sink ref - # of
> iteration of source ref".  Given below example:
>   for (i = 0; i < N; ++i)
>     {
>       x = arr[idx_1];  // S1
>       arr[idx_2] = x;  // S2
>     }
> if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
> this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
> i;
> If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
> this is backward dependence.  For example idx_1 is i and idx_2 is i +
> 1;
>
> In GCC, we always try to subtract idx_2 from idx_1 first in computing
> classic distance, we could result in negative distance in case of
> backward dependence.  When this happens at dependence carried level,
> we manually reverse it.  When this happens at inner level loop, we
> simply keep the negative distance.  And it's meaningless to talk about
> forward/backward given dependence is carried at outer level loop.
>
> Outer loop vectorization is interesting.  The problematic case has
> backward dependence carried by outer loop.  Because we don't check
> dist vs. max_vf for such dep, the unrolled references could have outer
> loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
> like we have outer loop index resolved as equal.  Now it has
> dependence only if c == c' + 1.  I think previous comment on fusion
> still hold for this and we now need to check backward dependence
> between the two reference at inner level loop.  I guess it's a bit
> trick to model dependence between a[c][5] and a[c+1][5] using the
> original references and dependence.  I think it's okay to do that
> because distance of a[c][5] and a[c+1][5] is what we computed
> previously for the original references at inner level loop.
>
> Not sure if I have missed something important.

Not sure either.  unroll-and-jam does

      FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
        {
          /* A distance (a,b) is at worst transformed into (a/N,b) by the
             unrolling (factor N), so the transformation is valid if
             a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
             factor needs to be limited so that the first condition holds.
             That may limit the factor down to zero in the worst case.  */
          int dist = dist_v[0];
          if (dist < 0)
            gcc_unreachable ();
          else if ((unsigned)dist >= *unroll)
            ;
          else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
                   || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
                       && dist > 0))
            ;
          else
            *unroll = dist;

where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.

The vectorizer does way more complicated things and only looks at the
distance with respect to the outer loop as far as I can see which can
be negative.

Not sure if fusion and vectorizer "interleaving" makes a difference here.
I think the idea was that when interleaving stmt-by-stmt then forward
dependences would be preserved and thus we don't need to check the
inner loop dependences.  speaking with "forward vs. backward" dependences
again, not distances...

This also means that unroll-and-jam could be enhanced to "interleave"
stmts and thus cover more cases?

Again, I hope Micha can have a look here...

Richard.

> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>>
>>>>>
>>>>>> The loop fusion needs to meet the dependence requirement.  Basically, GCC's data
>>>>>> dependence analyzer does not model dep between references in sibling loops, but
>>>>>> in practice, fusion requirement can be checked by analyzing all data references
>>>>>> after fusion, and there is no backward data dependence.
>>>>>>
>>>>>> Apparently, the requirement is violated because we have backward data dependence
>>>>>> between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse the inner
>>>>>> loop, the outer loop would become legal for vectorization.
>>>>>>
>>>>>> This patch fixes the issue by enforcing dependence check.  It also adds two tests
>>>>>> with one shouldn't be vectorized and the other should.  Bootstrap and test on x86_64
>>>>>> and AArch64.  Is it OK?
>>>>>
>>>>> I think you have identified the spot where things go wrong but I'm not
>>>>> sure you fix the
>>>>> problem fully.  The spot you pacth is (loop is the outer loop):
>>>>>
>>>>>   loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
>>>>> ...
>>>>>   FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>>>>>     {
>>>>>       int dist = dist_v[loop_depth];
>>>>> ...
>>>>>       if (dist > 0 && DDR_REVERSED_P (ddr))
>>>>>         {
>>>>>           /* If DDR_REVERSED_P the order of the data-refs in DDR was
>>>>>              reversed (to make distance vector positive), and the actual
>>>>>              distance is negative.  */
>>>>>           if (dump_enabled_p ())
>>>>>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>>                              "dependence distance negative.\n");
>>>>>
>>>>> where you add
>>>>>
>>>>> +         /* When doing outer loop vectorization, we need to check if there is
>>>>> +            backward dependence at inner loop level if dependence at the outer
>>>>> +            loop is reversed.  See PR81740 for more information.  */
>>>>> +         if (nested_in_vect_loop_p (loop, DR_STMT (dra))
>>>>> +             || nested_in_vect_loop_p (loop, DR_STMT (drb)))
>>>>> +           {
>>>>> +             unsigned inner_depth = index_in_loop_nest (loop->inner->num,
>>>>> +                                                        DDR_LOOP_NEST (ddr));
>>>>> +             if (dist_v[inner_depth] < 0)
>>>>> +               return true;
>>>>> +           }
>>>>>
>>>>> but I don't understand how the dependence direction with respect to the
>>>>> outer loop matters here.
>>>> If the direction wrto outer loop is positive by itself, i.e,
>>>> reversed_p equals to false, then dist is checked against max_vf.  In
>>>> this case, it's not possible to have references refer to the same
>>>> object?
>>>> On the other hand, dist is not checked at all for reversed case.
>>>> Maybe an additional check "dist < max_vf" can relax the patch a bit.
>>>>>
>>>>> Given there's DDR_REVERSED on the outer loop distance what does that
>>>>> mean for the inner loop distance given the quite non-obvious code handling
>>>>> this case in tree-data-ref.c:
>>>>>
>>>>>       /* Verify a basic constraint: classic distance vectors should
>>>>>          always be lexicographically positive.
>>>>>
>>>>>          Data references are collected in the order of execution of
>>>>>          the program, thus for the following loop
>>>>>
>>>>>          | for (i = 1; i < 100; i++)
>>>>>          |   for (j = 1; j < 100; j++)
>>>>>          |     {
>>>>>          |       t = T[j+1][i-1];  // A
>>>>>          |       T[j][i] = t + 2;  // B
>>>>>          |     }
>>>>>
>>>>>          references are collected following the direction of the wind:
>>>>>          A then B.  The data dependence tests are performed also
>>>>>          following this order, such that we're looking at the distance
>>>>>          separating the elements accessed by A from the elements later
>>>>>          accessed by B.  But in this example, the distance returned by
>>>>>          test_dep (A, B) is lexicographically negative (-1, 1), that
>>>>>          means that the access A occurs later than B with respect to
>>>>>          the outer loop, ie. we're actually looking upwind.  In this
>>>>>          case we solve test_dep (B, A) looking downwind to the
>>>>>          lexicographically positive solution, that returns the
>>>>>          distance vector (1, -1).  */
>>>>>       if (!lambda_vector_lexico_pos (dist_v, DDR_NB_LOOPS (ddr)))
>>>>>         {
>>>>>           lambda_vector save_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
>>>>>           if (!subscript_dependence_tester_1 (ddr, 1, 0, loop_nest))
>>>>>             return false;
>>>>>           compute_subscript_distance (ddr);
>>>>>           if (!build_classic_dist_vector_1 (ddr, 1, 0, save_v, &init_b,
>>>>>                                             &index_carry))
>>>>>             return false;
>>>>>           save_dist_v (ddr, save_v);
>>>>>           DDR_REVERSED_P (ddr) = true;
>>>>>
>>>>> that is, what does dist_v[inner_depth] < 0 tell us here?  Does it tell us
>>>>> that the dependence direction is reverse of the outer loop?
>>>> Hmm, this part of code is a bit confusing for me.  So we always
>>>> compute classic distance by sorting two data references in lexico
>>>> order, which could be the opposite given we collect references by dom
>>>> order.  IIUC, the negative dist at inner loop means a backward
>>>> dependence for that index/loop.  It's not related to outer loop or the
>>>> reversed_p flag.
>>>>
>>>> Thanks,
>>>> bin
>>>>>
>>>>> CCing Micha who might remember some more details from unroll-and-jam.
>>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>> bin
>>>>>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>>>>>
>>>>>>         PR tree-optimization/81740
>>>>>>         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case
>>>>>>         of outer loop vectorization, check backward dependence at inner loop
>>>>>>         if dependence at outer loop is reversed.
>>>>>>
>>>>>> gcc/testsuite
>>>>>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>>>>>
>>>>>>         PR tree-optimization/81740
>>>>>>         * gcc.dg/vect/pr81740-1.c: New test.
>>>>>>         * gcc.dg/vect/pr81740-2.c: Refine test.
Richard Biener Dec. 18, 2017, 1:32 p.m. UTC | #7
On Mon, Dec 18, 2017 at 1:37 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>> Hi,
>>>>>>> As explained in the PR, given below test case:
>>>>>>> int a[8][10] = { [2][5] = 4 }, c;
>>>>>>>
>>>>>>> int
>>>>>>> main ()
>>>>>>> {
>>>>>>>   short b;
>>>>>>>   int i, d;
>>>>>>>   for (b = 4; b >= 0; b--)
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>       a[c + 1][b + 2] = a[c][b + 1];
>>>>>>>   for (i = 0; i < 8; i++)
>>>>>>>     for (d = 0; d < 10; d++)
>>>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
>>>>>>>         __builtin_abort ();
>>>>>>>   return 0;
>>>>>>>
>>>>>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
>>>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
>>>>>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
>>>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
>>>>>>>
>>>>>>>   for (b = 4; b > 0; b -= 4)
>>>>>>>   {
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>       a[c + 1][6] = a[c][5];
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>       a[c + 1][5] = a[c][4];
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>       a[c + 1][4] = a[c][3];
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>       a[c + 1][3] = a[c][2];
>>>>>>>   }
>>>>>>> Then four inner loops are fused into:
>>>>>>>   for (b = 4; b > 0; b -= 4)
>>>>>>>   {
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>     {
>>>>>>>       a[c + 1][6] = a[c][5];  // S1
>>>>>>>       a[c + 1][5] = a[c][4];  // S2
>>>>>>>       a[c + 1][4] = a[c][3];
>>>>>>>       a[c + 1][3] = a[c][2];
>>>>>>>     }
>>>>>>>   }
>>>>>>
>>>>>> Note that they are not really "fused" but they are interleaved.  With
>>>>>> GIMPLE in mind
>>>>>> that makes a difference, you should get the equivalent of
>>>>>>
>>>>>>    for (c = 0; c <= 6; c++)
>>>>>>      {
>>>>>>        tem1 = a[c][5];
>>>>>>        tem2 = a[c][4];
>>>>>>        tem3 = a[c][3];
>>>>>>        tem4 = a[c][2];
>>>>>>        a[c+1][6] = tem1;
>>>>>>        a[c +1][5] = tem2;
>>>>>>         a[c+1][4] = tem3;
>>>>>>         a[c+1][3] = tem4;
>>>>>>      }
>>>>> Yeah, I will double check if this abstract breaks the patch and how.
>>>> Hmm, I think this doesn't break it, well at least for part of the
>>>> analysis, because it is loop carried (backward) dependence goes wrong,
>>>> interleaving or not with the same iteration doesn't matter here.
>>>
>>> I think the idea is that forward dependences are always fine (negative distance)
>>> to vectorize.  But with backward dependences we have to adhere to max_vf.
>>>
>>> It looks like for outer loop vectorization we only look at the distances in the
>>> outer loop but never at inner ones.  But here the same applies but isn't that
>>> independend on the distances with respect to the outer loop?
>>>
>>> But maybe I'm misunderstanding how "distances" work here.
>> Hmm, I am not sure I understand "distance" correctly.  With
>> description as in book like "Optimizing compilers for Modern
>> Architectures", distance is "# of iteration of sink ref - # of
>> iteration of source ref".  Given below example:
>>   for (i = 0; i < N; ++i)
>>     {
>>       x = arr[idx_1];  // S1
>>       arr[idx_2] = x;  // S2
>>     }
>> if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
>> this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
>> i;
>> If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
>> this is backward dependence.  For example idx_1 is i and idx_2 is i +
>> 1;
>>
>> In GCC, we always try to subtract idx_2 from idx_1 first in computing
>> classic distance, we could result in negative distance in case of
>> backward dependence.  When this happens at dependence carried level,
>> we manually reverse it.  When this happens at inner level loop, we
>> simply keep the negative distance.  And it's meaningless to talk about
>> forward/backward given dependence is carried at outer level loop.
>>
>> Outer loop vectorization is interesting.  The problematic case has
>> backward dependence carried by outer loop.  Because we don't check
>> dist vs. max_vf for such dep, the unrolled references could have outer
>> loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
>> like we have outer loop index resolved as equal.  Now it has
>> dependence only if c == c' + 1.  I think previous comment on fusion
>> still hold for this and we now need to check backward dependence
>> between the two reference at inner level loop.  I guess it's a bit
>> trick to model dependence between a[c][5] and a[c+1][5] using the
>> original references and dependence.  I think it's okay to do that
>> because distance of a[c][5] and a[c+1][5] is what we computed
>> previously for the original references at inner level loop.
>>
>> Not sure if I have missed something important.
>
> Not sure either.  unroll-and-jam does
>
>       FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>         {
>           /* A distance (a,b) is at worst transformed into (a/N,b) by the
>              unrolling (factor N), so the transformation is valid if
>              a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
>              factor needs to be limited so that the first condition holds.
>              That may limit the factor down to zero in the worst case.  */
>           int dist = dist_v[0];
>           if (dist < 0)
>             gcc_unreachable ();
>           else if ((unsigned)dist >= *unroll)
>             ;
>           else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
>                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
>                        && dist > 0))
>             ;
>           else
>             *unroll = dist;
>
> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
>
> The vectorizer does way more complicated things and only looks at the
> distance with respect to the outer loop as far as I can see which can
> be negative.
>
> Not sure if fusion and vectorizer "interleaving" makes a difference here.
> I think the idea was that when interleaving stmt-by-stmt then forward
> dependences would be preserved and thus we don't need to check the
> inner loop dependences.  speaking with "forward vs. backward" dependences
> again, not distances...
>
> This also means that unroll-and-jam could be enhanced to "interleave"
> stmts and thus cover more cases?
>
> Again, I hope Micha can have a look here...

And _maybe_ PR60276 and its fix (adding STMT_VINFO_MIN_NEG_DIST)
is related.

Richard.

> Richard.
>
>> Thanks,
>> bin
>>>
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>>>
>>>>>>
>>>>>>> The loop fusion needs to meet the dependence requirement.  Basically, GCC's data
>>>>>>> dependence analyzer does not model dep between references in sibling loops, but
>>>>>>> in practice, fusion requirement can be checked by analyzing all data references
>>>>>>> after fusion, and there is no backward data dependence.
>>>>>>>
>>>>>>> Apparently, the requirement is violated because we have backward data dependence
>>>>>>> between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse the inner
>>>>>>> loop, the outer loop would become legal for vectorization.
>>>>>>>
>>>>>>> This patch fixes the issue by enforcing dependence check.  It also adds two tests
>>>>>>> with one shouldn't be vectorized and the other should.  Bootstrap and test on x86_64
>>>>>>> and AArch64.  Is it OK?
>>>>>>
>>>>>> I think you have identified the spot where things go wrong but I'm not
>>>>>> sure you fix the
>>>>>> problem fully.  The spot you pacth is (loop is the outer loop):
>>>>>>
>>>>>>   loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
>>>>>> ...
>>>>>>   FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>>>>>>     {
>>>>>>       int dist = dist_v[loop_depth];
>>>>>> ...
>>>>>>       if (dist > 0 && DDR_REVERSED_P (ddr))
>>>>>>         {
>>>>>>           /* If DDR_REVERSED_P the order of the data-refs in DDR was
>>>>>>              reversed (to make distance vector positive), and the actual
>>>>>>              distance is negative.  */
>>>>>>           if (dump_enabled_p ())
>>>>>>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>>>                              "dependence distance negative.\n");
>>>>>>
>>>>>> where you add
>>>>>>
>>>>>> +         /* When doing outer loop vectorization, we need to check if there is
>>>>>> +            backward dependence at inner loop level if dependence at the outer
>>>>>> +            loop is reversed.  See PR81740 for more information.  */
>>>>>> +         if (nested_in_vect_loop_p (loop, DR_STMT (dra))
>>>>>> +             || nested_in_vect_loop_p (loop, DR_STMT (drb)))
>>>>>> +           {
>>>>>> +             unsigned inner_depth = index_in_loop_nest (loop->inner->num,
>>>>>> +                                                        DDR_LOOP_NEST (ddr));
>>>>>> +             if (dist_v[inner_depth] < 0)
>>>>>> +               return true;
>>>>>> +           }
>>>>>>
>>>>>> but I don't understand how the dependence direction with respect to the
>>>>>> outer loop matters here.
>>>>> If the direction wrto outer loop is positive by itself, i.e,
>>>>> reversed_p equals to false, then dist is checked against max_vf.  In
>>>>> this case, it's not possible to have references refer to the same
>>>>> object?
>>>>> On the other hand, dist is not checked at all for reversed case.
>>>>> Maybe an additional check "dist < max_vf" can relax the patch a bit.
>>>>>>
>>>>>> Given there's DDR_REVERSED on the outer loop distance what does that
>>>>>> mean for the inner loop distance given the quite non-obvious code handling
>>>>>> this case in tree-data-ref.c:
>>>>>>
>>>>>>       /* Verify a basic constraint: classic distance vectors should
>>>>>>          always be lexicographically positive.
>>>>>>
>>>>>>          Data references are collected in the order of execution of
>>>>>>          the program, thus for the following loop
>>>>>>
>>>>>>          | for (i = 1; i < 100; i++)
>>>>>>          |   for (j = 1; j < 100; j++)
>>>>>>          |     {
>>>>>>          |       t = T[j+1][i-1];  // A
>>>>>>          |       T[j][i] = t + 2;  // B
>>>>>>          |     }
>>>>>>
>>>>>>          references are collected following the direction of the wind:
>>>>>>          A then B.  The data dependence tests are performed also
>>>>>>          following this order, such that we're looking at the distance
>>>>>>          separating the elements accessed by A from the elements later
>>>>>>          accessed by B.  But in this example, the distance returned by
>>>>>>          test_dep (A, B) is lexicographically negative (-1, 1), that
>>>>>>          means that the access A occurs later than B with respect to
>>>>>>          the outer loop, ie. we're actually looking upwind.  In this
>>>>>>          case we solve test_dep (B, A) looking downwind to the
>>>>>>          lexicographically positive solution, that returns the
>>>>>>          distance vector (1, -1).  */
>>>>>>       if (!lambda_vector_lexico_pos (dist_v, DDR_NB_LOOPS (ddr)))
>>>>>>         {
>>>>>>           lambda_vector save_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
>>>>>>           if (!subscript_dependence_tester_1 (ddr, 1, 0, loop_nest))
>>>>>>             return false;
>>>>>>           compute_subscript_distance (ddr);
>>>>>>           if (!build_classic_dist_vector_1 (ddr, 1, 0, save_v, &init_b,
>>>>>>                                             &index_carry))
>>>>>>             return false;
>>>>>>           save_dist_v (ddr, save_v);
>>>>>>           DDR_REVERSED_P (ddr) = true;
>>>>>>
>>>>>> that is, what does dist_v[inner_depth] < 0 tell us here?  Does it tell us
>>>>>> that the dependence direction is reverse of the outer loop?
>>>>> Hmm, this part of code is a bit confusing for me.  So we always
>>>>> compute classic distance by sorting two data references in lexico
>>>>> order, which could be the opposite given we collect references by dom
>>>>> order.  IIUC, the negative dist at inner loop means a backward
>>>>> dependence for that index/loop.  It's not related to outer loop or the
>>>>> reversed_p flag.
>>>>>
>>>>> Thanks,
>>>>> bin
>>>>>>
>>>>>> CCing Micha who might remember some more details from unroll-and-jam.
>>>>>>
>>>>>> Thanks,
>>>>>> Richard.
>>>>>>
>>>>>>> Thanks,
>>>>>>> bin
>>>>>>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>>>>>>
>>>>>>>         PR tree-optimization/81740
>>>>>>>         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case
>>>>>>>         of outer loop vectorization, check backward dependence at inner loop
>>>>>>>         if dependence at outer loop is reversed.
>>>>>>>
>>>>>>> gcc/testsuite
>>>>>>> 2017-12-15  Bin Cheng  <bin.cheng@arm.com>
>>>>>>>
>>>>>>>         PR tree-optimization/81740
>>>>>>>         * gcc.dg/vect/pr81740-1.c: New test.
>>>>>>>         * gcc.dg/vect/pr81740-2.c: Refine test.
Michael Matz Dec. 18, 2017, 2:35 p.m. UTC | #8
Hi,

On Mon, 18 Dec 2017, Richard Biener wrote:

> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.

[0] is always outermost loop.

> The vectorizer does way more complicated things and only looks at the 
> distance with respect to the outer loop as far as I can see which can be 
> negative.
> 
> Not sure if fusion and vectorizer "interleaving" makes a difference 
> here. I think the idea was that when interleaving stmt-by-stmt then 
> forward dependences would be preserved and thus we don't need to check 
> the inner loop dependences.  speaking with "forward vs. backward" 
> dependences again, not distances...
> 
> This also means that unroll-and-jam could be enhanced to "interleave" 
> stmts and thus cover more cases?
> 
> Again, I hope Micha can have a look here...

Haven't yet looked at the patch, but some comments anyway:

fusion and interleaving interact in the following way in outer loop 
vectorization, conceptually:
* (1) the outer loop is unrolled
* (2) the inner loops are fused
* (3) the (now single) inner body is rescheduled/shuffled/interleaved.

(1) is always okay.  But (2) and (3) as individual transformations must 
both be checked for validity.  If fusion isn't possible the whole 
transformation is invalid, and if interleaving isn't possible the same is 
true.  In the specific example:

  for (b = 4; b >= 0; b--)
    for (c = 0; c <= 6; c++)
      t = a[c][b + 1];      // S1
      a[c + 1][b + 2] = t;  // S2

it's already the fusion step that's invalid.  There's a 
dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0) 
with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5].  So a 
write-after-read.  After fusing:

   for (c = 0; c <= 6; c++)
     {
       t = a[c][5];              // S1
       a[c + 1][6] = t;
       t = a[c][4];
       a[c + 1][5] = t;          // S2
       a[c + 1][4] = a[c][3];
       a[c + 1][3] = a[c][2];
     }

here we have at iterations (c) = (0) comes-before (1), at S2(0) writing 
a[1][5] and S1(1) writing a[1][5].  I.e. now it's a read-after-write (the 
write in iteration 0 overwrites the value that is going to be read at 
iteration 1, which wasn't the case in the original loop).  The dependence 
switched direction --> invalid.

The simple interleaving of statements can't rectify this.  
Interleaving is an inner-body reordering but the brokenness comes from a 
cross-iteration ordering.

This example can be unroll-jammed or outer-loop vectorized if one of the 
two loops is reversed.  Let's say we reverse the inner loop, so that it 
runs in the same direction as the outer loop (reversal is possible here).

It'd then be something like:

   for (c = 6; c >= 0; c--)
     {
       t = a[c][5];              // S1
       a[c + 1][6] = t;
       t = a[c][4];
       a[c + 1][5] = t;          // S2
       a[c + 1][4] = a[c][3];
       a[c + 1][3] = a[c][2];
     }

The dependence between S1/S2 would still be a write-after-read, and all 
would be well.  This reversal of the inner loop can partly be simulated by 
not only interleaving the inner insns, but by also _reodering_ them.  But 
AFAIK the vectorizer doesn't do this?


Ciao,
Michael.
Bin.Cheng Dec. 19, 2017, 11:58 a.m. UTC | #9
On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Mon, 18 Dec 2017, Richard Biener wrote:
>
>> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
>
> [0] is always outermost loop.
>
>> The vectorizer does way more complicated things and only looks at the
>> distance with respect to the outer loop as far as I can see which can be
>> negative.
>>
>> Not sure if fusion and vectorizer "interleaving" makes a difference
>> here. I think the idea was that when interleaving stmt-by-stmt then
>> forward dependences would be preserved and thus we don't need to check
>> the inner loop dependences.  speaking with "forward vs. backward"
>> dependences again, not distances...
>>
>> This also means that unroll-and-jam could be enhanced to "interleave"
>> stmts and thus cover more cases?
>>
>> Again, I hope Micha can have a look here...
>
> Haven't yet looked at the patch, but some comments anyway:
>
> fusion and interleaving interact in the following way in outer loop
> vectorization, conceptually:
> * (1) the outer loop is unrolled
> * (2) the inner loops are fused
> * (3) the (now single) inner body is rescheduled/shuffled/interleaved.
>
Thanks Michael for explaining issue clearer, this is what I meant.  As
for PR60276, I think it's actually the other side of the problem,
which only relates to dependence validity of interleaving.

Thanks,
bin
> (1) is always okay.  But (2) and (3) as individual transformations must
> both be checked for validity.  If fusion isn't possible the whole
> transformation is invalid, and if interleaving isn't possible the same is
> true.  In the specific example:
>
>   for (b = 4; b >= 0; b--)
>     for (c = 0; c <= 6; c++)
>       t = a[c][b + 1];      // S1
>       a[c + 1][b + 2] = t;  // S2
>
> it's already the fusion step that's invalid.  There's a
> dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0)
> with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5].  So a
> write-after-read.  After fusing:
>
>    for (c = 0; c <= 6; c++)
>      {
>        t = a[c][5];              // S1
>        a[c + 1][6] = t;
>        t = a[c][4];
>        a[c + 1][5] = t;          // S2
>        a[c + 1][4] = a[c][3];
>        a[c + 1][3] = a[c][2];
>      }
>
> here we have at iterations (c) = (0) comes-before (1), at S2(0) writing
> a[1][5] and S1(1) writing a[1][5].  I.e. now it's a read-after-write (the
> write in iteration 0 overwrites the value that is going to be read at
> iteration 1, which wasn't the case in the original loop).  The dependence
> switched direction --> invalid.
>
> The simple interleaving of statements can't rectify this.
> Interleaving is an inner-body reordering but the brokenness comes from a
> cross-iteration ordering.
>
> This example can be unroll-jammed or outer-loop vectorized if one of the
> two loops is reversed.  Let's say we reverse the inner loop, so that it
> runs in the same direction as the outer loop (reversal is possible here).
>
> It'd then be something like:
>
>    for (c = 6; c >= 0; c--)
>      {
>        t = a[c][5];              // S1
>        a[c + 1][6] = t;
>        t = a[c][4];
>        a[c + 1][5] = t;          // S2
>        a[c + 1][4] = a[c][3];
>        a[c + 1][3] = a[c][2];
>      }
>
> The dependence between S1/S2 would still be a write-after-read, and all
> would be well.  This reversal of the inner loop can partly be simulated by
> not only interleaving the inner insns, but by also _reodering_ them.  But
> AFAIK the vectorizer doesn't do this?
>
>
> Ciao,
> Michael.
Richard Biener Dec. 19, 2017, 12:56 p.m. UTC | #10
On Tue, Dec 19, 2017 at 12:58 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz <matz@suse.de> wrote:
>> Hi,
>>
>> On Mon, 18 Dec 2017, Richard Biener wrote:
>>
>>> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
>>
>> [0] is always outermost loop.
>>
>>> The vectorizer does way more complicated things and only looks at the
>>> distance with respect to the outer loop as far as I can see which can be
>>> negative.
>>>
>>> Not sure if fusion and vectorizer "interleaving" makes a difference
>>> here. I think the idea was that when interleaving stmt-by-stmt then
>>> forward dependences would be preserved and thus we don't need to check
>>> the inner loop dependences.  speaking with "forward vs. backward"
>>> dependences again, not distances...
>>>
>>> This also means that unroll-and-jam could be enhanced to "interleave"
>>> stmts and thus cover more cases?
>>>
>>> Again, I hope Micha can have a look here...
>>
>> Haven't yet looked at the patch, but some comments anyway:
>>
>> fusion and interleaving interact in the following way in outer loop
>> vectorization, conceptually:
>> * (1) the outer loop is unrolled
>> * (2) the inner loops are fused
>> * (3) the (now single) inner body is rescheduled/shuffled/interleaved.
>>
> Thanks Michael for explaining issue clearer, this is what I meant.  As
> for PR60276, I think it's actually the other side of the problem,
> which only relates to dependence validity of interleaving.

Interleaving validity is what is checked by the current code, I don't
see any checking for validity of (2).  Now, the current code only
looks at the outer loop distances to verify interleaving validity.

I think we need to verify whether fusion is valid, preferably clearly
separated from the current code checking interleaving validity.

I'm not 100% convinced the interleaving validity check is correct for
the outer loop vectorization case.

I think it helps to reduce the dependence checking code to what we do
for unroll-and-jam:

Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c   (revision 255777)
+++ gcc/tree-vect-data-refs.c   (working copy)
@@ -378,7 +378,26 @@ vect_analyze_data_ref_dependence (struct
        dump_printf_loc (MSG_NOTE, vect_location,
                          "dependence distance  = %d.\n", dist);

-      if (dist == 0)
+      if (dist < 0)
+       gcc_unreachable ();
+
+      else if (dist >= *max_vf)
+       {
+         /* Dependence distance does not create dependence, as far as
+            vectorization is concerned, in this case.  */
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_NOTE, vect_location,
+                            "dependence distance >= VF.\n");
+         continue;
+       }
+
+      else if (DDR_NB_LOOPS (ddr) == 2
+              && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
+                  || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
+                      && dist > 0)))
+       continue;
+
+      else if (dist == 0)
        {
          if (dump_enabled_p ())
            {
@@ -427,26 +446,7 @@ vect_analyze_data_ref_dependence (struct
          continue;
        }

-      if (dist > 0 && DDR_REVERSED_P (ddr))
-       {
-         /* If DDR_REVERSED_P the order of the data-refs in DDR was
-            reversed (to make distance vector positive), and the actual
-            distance is negative.  */
-         if (dump_enabled_p ())
-           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                            "dependence distance negative.\n");
-         /* Record a negative dependence distance to later limit the
-            amount of stmt copying / unrolling we can perform.
-            Only need to handle read-after-write dependence.  */
-         if (DR_IS_READ (drb)
-             && (STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) == 0
-                 || STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) > (unsigned)dist))
-           STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) = dist;
-         continue;
-       }
-
-      if (abs (dist) >= 2
-         && abs (dist) < *max_vf)
+      if (dist >= 2)
        {
          /* The dependence distance requires reduction of the maximal
             vectorization factor.  */
@@ -455,15 +455,6 @@ vect_analyze_data_ref_dependence (struct
            dump_printf_loc (MSG_NOTE, vect_location,
                             "adjusting maximal vectorization factor to %i\n",
                             *max_vf);
-       }
-
-      if (abs (dist) >= *max_vf)
-       {
-         /* Dependence distance does not create dependence, as far as
-            vectorization is concerned, in this case.  */
-         if (dump_enabled_p ())
-           dump_printf_loc (MSG_NOTE, vect_location,
-                            "dependence distance >= VF.\n");
          continue;
        }


that fixes the testcase (probably by removing the DDR_REVERSED
special-casing) exposes in C vect.exp:

FAIL: gcc.dg/vect/pr36630.c scan-tree-dump-times vect "vectorized 1
loops" 1 (found 0 times)
FAIL: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops"
FAIL: gcc.dg/vect/vect-outer-5.c execution test
FAIL: gcc.dg/vect/vect-outer-5.c scan-tree-dump-times vect "OUTER LOOP
VECTORIZED" 1 (found 2 times)

For the interleaving correctness a negative distance is fine since
we're only making it possibly
more negative by executing stmts from the n + 1 iteration before stmts
from the n'th iteration, right?

But of course if the outer loop has negative distance that says
nothing about the inner loop.

So sth like

      else if (DDR_REVERSED_P (ddr)
               && (! nested_in_vect_loop_p (loop, DR_STMT (dra))
                   || ! nested_in_vect_loop_p (loop, DR_STMT (drb))
                   || dist_v[1] >= 0
                   || abs (dist_v[1]) >= *max_vf))
        {
          /* If DDR_REVERSED_P the order of the data-refs in DDR was
             reversed (to make distance vector positive), and the actual
             distance is negative.  */
          if (dump_enabled_p ())
...

but then I wonder if PR60276 can also happen for inner loop refs.  Also
what about dependences between stmts in inner and stmts in the outer
loop?

With the above gcc.dg/vect/vect-outer-5.c still FAILs...

  /* Outer-loop 2: Not vectorizable because of dependence distance. */
  for (i = 0; i < 4; i++)
    {
      s = 0;
      for (j=0; j<N; j+=4)
        s += C[j];
      B[i+1] = B[i] + s;
    }

that's because

      else if (DDR_NB_LOOPS (ddr) == 2
               && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
                   || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
                       && dist > 0)))
        continue;

hits.  We simply don't have a perfect nest to begin with...

Richard.

> Thanks,
> bin
>> (1) is always okay.  But (2) and (3) as individual transformations must
>> both be checked for validity.  If fusion isn't possible the whole
>> transformation is invalid, and if interleaving isn't possible the same is
>> true.  In the specific example:
>>
>>   for (b = 4; b >= 0; b--)
>>     for (c = 0; c <= 6; c++)
>>       t = a[c][b + 1];      // S1
>>       a[c + 1][b + 2] = t;  // S2
>>
>> it's already the fusion step that's invalid.  There's a
>> dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0)
>> with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5].  So a
>> write-after-read.  After fusing:
>>
>>    for (c = 0; c <= 6; c++)
>>      {
>>        t = a[c][5];              // S1
>>        a[c + 1][6] = t;
>>        t = a[c][4];
>>        a[c + 1][5] = t;          // S2
>>        a[c + 1][4] = a[c][3];
>>        a[c + 1][3] = a[c][2];
>>      }
>>
>> here we have at iterations (c) = (0) comes-before (1), at S2(0) writing
>> a[1][5] and S1(1) writing a[1][5].  I.e. now it's a read-after-write (the
>> write in iteration 0 overwrites the value that is going to be read at
>> iteration 1, which wasn't the case in the original loop).  The dependence
>> switched direction --> invalid.
>>
>> The simple interleaving of statements can't rectify this.
>> Interleaving is an inner-body reordering but the brokenness comes from a
>> cross-iteration ordering.
>>
>> This example can be unroll-jammed or outer-loop vectorized if one of the
>> two loops is reversed.  Let's say we reverse the inner loop, so that it
>> runs in the same direction as the outer loop (reversal is possible here).
>>
>> It'd then be something like:
>>
>>    for (c = 6; c >= 0; c--)
>>      {
>>        t = a[c][5];              // S1
>>        a[c + 1][6] = t;
>>        t = a[c][4];
>>        a[c + 1][5] = t;          // S2
>>        a[c + 1][4] = a[c][3];
>>        a[c + 1][3] = a[c][2];
>>      }
>>
>> The dependence between S1/S2 would still be a write-after-read, and all
>> would be well.  This reversal of the inner loop can partly be simulated by
>> not only interleaving the inner insns, but by also _reodering_ them.  But
>> AFAIK the vectorizer doesn't do this?
>>
>>
>> Ciao,
>> Michael.
Bin.Cheng Dec. 20, 2017, 2:56 p.m. UTC | #11
On Tue, Dec 19, 2017 at 12:56 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Dec 19, 2017 at 12:58 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz <matz@suse.de> wrote:
>>> Hi,
>>>
>>> On Mon, 18 Dec 2017, Richard Biener wrote:
>>>
>>>> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
>>>
>>> [0] is always outermost loop.
>>>
>>>> The vectorizer does way more complicated things and only looks at the
>>>> distance with respect to the outer loop as far as I can see which can be
>>>> negative.
>>>>
>>>> Not sure if fusion and vectorizer "interleaving" makes a difference
>>>> here. I think the idea was that when interleaving stmt-by-stmt then
>>>> forward dependences would be preserved and thus we don't need to check
>>>> the inner loop dependences.  speaking with "forward vs. backward"
>>>> dependences again, not distances...
>>>>
>>>> This also means that unroll-and-jam could be enhanced to "interleave"
>>>> stmts and thus cover more cases?
>>>>
>>>> Again, I hope Micha can have a look here...
>>>
>>> Haven't yet looked at the patch, but some comments anyway:
>>>
>>> fusion and interleaving interact in the following way in outer loop
>>> vectorization, conceptually:
>>> * (1) the outer loop is unrolled
>>> * (2) the inner loops are fused
>>> * (3) the (now single) inner body is rescheduled/shuffled/interleaved.
>>>
>> Thanks Michael for explaining issue clearer, this is what I meant.  As
>> for PR60276, I think it's actually the other side of the problem,
>> which only relates to dependence validity of interleaving.
>
> Interleaving validity is what is checked by the current code, I don't
> see any checking for validity of (2).  Now, the current code only
> looks at the outer loop distances to verify interleaving validity.
>
> I think we need to verify whether fusion is valid, preferably clearly
> separated from the current code checking interleaving validity.
>
> I'm not 100% convinced the interleaving validity check is correct for
> the outer loop vectorization case.
>
> I think it helps to reduce the dependence checking code to what we do
> for unroll-and-jam:
>
> Index: gcc/tree-vect-data-refs.c
> ===================================================================
> --- gcc/tree-vect-data-refs.c   (revision 255777)
> +++ gcc/tree-vect-data-refs.c   (working copy)
> @@ -378,7 +378,26 @@ vect_analyze_data_ref_dependence (struct
>         dump_printf_loc (MSG_NOTE, vect_location,
>                           "dependence distance  = %d.\n", dist);
>
> -      if (dist == 0)
> +      if (dist < 0)
> +       gcc_unreachable ();
> +
> +      else if (dist >= *max_vf)
> +       {
> +         /* Dependence distance does not create dependence, as far as
> +            vectorization is concerned, in this case.  */
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_NOTE, vect_location,
> +                            "dependence distance >= VF.\n");
> +         continue;
> +       }
> +
> +      else if (DDR_NB_LOOPS (ddr) == 2
> +              && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> +                  || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> +                      && dist > 0)))
> +       continue;
> +
> +      else if (dist == 0)
>         {
>           if (dump_enabled_p ())
>             {
> @@ -427,26 +446,7 @@ vect_analyze_data_ref_dependence (struct
>           continue;
>         }
>
> -      if (dist > 0 && DDR_REVERSED_P (ddr))
> -       {
> -         /* If DDR_REVERSED_P the order of the data-refs in DDR was
> -            reversed (to make distance vector positive), and the actual
> -            distance is negative.  */
> -         if (dump_enabled_p ())
> -           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                            "dependence distance negative.\n");
> -         /* Record a negative dependence distance to later limit the
> -            amount of stmt copying / unrolling we can perform.
> -            Only need to handle read-after-write dependence.  */
> -         if (DR_IS_READ (drb)
> -             && (STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) == 0
> -                 || STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) > (unsigned)dist))
> -           STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) = dist;
> -         continue;
> -       }
Simply removing DDR_REVERSED_P doesn't work, as below failures indicating.
> -
> -      if (abs (dist) >= 2
> -         && abs (dist) < *max_vf)
> +      if (dist >= 2)
>         {
>           /* The dependence distance requires reduction of the maximal
>              vectorization factor.  */
> @@ -455,15 +455,6 @@ vect_analyze_data_ref_dependence (struct
>             dump_printf_loc (MSG_NOTE, vect_location,
>                              "adjusting maximal vectorization factor to %i\n",
>                              *max_vf);
> -       }
> -
> -      if (abs (dist) >= *max_vf)
> -       {
> -         /* Dependence distance does not create dependence, as far as
> -            vectorization is concerned, in this case.  */
> -         if (dump_enabled_p ())
> -           dump_printf_loc (MSG_NOTE, vect_location,
> -                            "dependence distance >= VF.\n");
>           continue;
>         }
>
>
> that fixes the testcase (probably by removing the DDR_REVERSED
> special-casing) exposes in C vect.exp:
>
> FAIL: gcc.dg/vect/pr36630.c scan-tree-dump-times vect "vectorized 1
> loops" 1 (found 0 times)
> FAIL: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops"
> FAIL: gcc.dg/vect/vect-outer-5.c execution test
> FAIL: gcc.dg/vect/vect-outer-5.c scan-tree-dump-times vect "OUTER LOOP
> VECTORIZED" 1 (found 2 times)
>
> For the interleaving correctness a negative distance is fine since
> we're only making it possibly
> more negative by executing stmts from the n + 1 iteration before stmts
> from the n'th iteration, right?
>
> But of course if the outer loop has negative distance that says
> nothing about the inner loop.
>
> So sth like
>
>       else if (DDR_REVERSED_P (ddr)
>                && (! nested_in_vect_loop_p (loop, DR_STMT (dra))
>                    || ! nested_in_vect_loop_p (loop, DR_STMT (drb))
>                    || dist_v[1] >= 0
>                    || abs (dist_v[1]) >= *max_vf))
This is incomplete/bogus?  It still misses DDR_REVERSED_P in innermost
loop vectorization?  Because both "! nested_in_vect_loop_p" and
"dist_v[1]" are only happens for outer loop vectorization?
>         {
>           /* If DDR_REVERSED_P the order of the data-refs in DDR was
>              reversed (to make distance vector positive), and the actual
>              distance is negative.  */
>           if (dump_enabled_p ())
> ...
>
> but then I wonder if PR60276 can also happen for inner loop refs.  Also
> what about dependences between stmts in inner and stmts in the outer
> loop?
IIRC, we can't really compute dependence for such cases.

Thanks,
bin
>
> With the above gcc.dg/vect/vect-outer-5.c still FAILs...
>
>   /* Outer-loop 2: Not vectorizable because of dependence distance. */
>   for (i = 0; i < 4; i++)
>     {
>       s = 0;
>       for (j=0; j<N; j+=4)
>         s += C[j];
>       B[i+1] = B[i] + s;
>     }
>
> that's because
>
>       else if (DDR_NB_LOOPS (ddr) == 2
>                && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
>                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
>                        && dist > 0)))
>         continue;
>
> hits.  We simply don't have a perfect nest to begin with...
>
> Richard.
>
>> Thanks,
>> bin
>>> (1) is always okay.  But (2) and (3) as individual transformations must
>>> both be checked for validity.  If fusion isn't possible the whole
>>> transformation is invalid, and if interleaving isn't possible the same is
>>> true.  In the specific example:
>>>
>>>   for (b = 4; b >= 0; b--)
>>>     for (c = 0; c <= 6; c++)
>>>       t = a[c][b + 1];      // S1
>>>       a[c + 1][b + 2] = t;  // S2
>>>
>>> it's already the fusion step that's invalid.  There's a
>>> dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0)
>>> with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5].  So a
>>> write-after-read.  After fusing:
>>>
>>>    for (c = 0; c <= 6; c++)
>>>      {
>>>        t = a[c][5];              // S1
>>>        a[c + 1][6] = t;
>>>        t = a[c][4];
>>>        a[c + 1][5] = t;          // S2
>>>        a[c + 1][4] = a[c][3];
>>>        a[c + 1][3] = a[c][2];
>>>      }
>>>
>>> here we have at iterations (c) = (0) comes-before (1), at S2(0) writing
>>> a[1][5] and S1(1) writing a[1][5].  I.e. now it's a read-after-write (the
>>> write in iteration 0 overwrites the value that is going to be read at
>>> iteration 1, which wasn't the case in the original loop).  The dependence
>>> switched direction --> invalid.
>>>
>>> The simple interleaving of statements can't rectify this.
>>> Interleaving is an inner-body reordering but the brokenness comes from a
>>> cross-iteration ordering.
>>>
>>> This example can be unroll-jammed or outer-loop vectorized if one of the
>>> two loops is reversed.  Let's say we reverse the inner loop, so that it
>>> runs in the same direction as the outer loop (reversal is possible here).
>>>
>>> It'd then be something like:
>>>
>>>    for (c = 6; c >= 0; c--)
>>>      {
>>>        t = a[c][5];              // S1
>>>        a[c + 1][6] = t;
>>>        t = a[c][4];
>>>        a[c + 1][5] = t;          // S2
>>>        a[c + 1][4] = a[c][3];
>>>        a[c + 1][3] = a[c][2];
>>>      }
>>>
>>> The dependence between S1/S2 would still be a write-after-read, and all
>>> would be well.  This reversal of the inner loop can partly be simulated by
>>> not only interleaving the inner insns, but by also _reodering_ them.  But
>>> AFAIK the vectorizer doesn't do this?
>>>
>>>
>>> Ciao,
>>> Michael.
Richard Biener March 21, 2019, 12:23 p.m. UTC | #12
On Mon, Dec 18, 2017 at 1:37 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
> > <richard.guenther@gmail.com> wrote:
> >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
> >>>> <richard.guenther@gmail.com> wrote:
> >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> >>>>>> Hi,
> >>>>>> As explained in the PR, given below test case:
> >>>>>> int a[8][10] = { [2][5] = 4 }, c;
> >>>>>>
> >>>>>> int
> >>>>>> main ()
> >>>>>> {
> >>>>>>   short b;
> >>>>>>   int i, d;
> >>>>>>   for (b = 4; b >= 0; b--)
> >>>>>>     for (c = 0; c <= 6; c++)
> >>>>>>       a[c + 1][b + 2] = a[c][b + 1];
> >>>>>>   for (i = 0; i < 8; i++)
> >>>>>>     for (d = 0; d < 10; d++)
> >>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
> >>>>>>         __builtin_abort ();
> >>>>>>   return 0;
> >>>>>>
> >>>>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
> >>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
> >>>>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
> >>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
> >>>>>>
> >>>>>>   for (b = 4; b > 0; b -= 4)
> >>>>>>   {
> >>>>>>     for (c = 0; c <= 6; c++)
> >>>>>>       a[c + 1][6] = a[c][5];
> >>>>>>     for (c = 0; c <= 6; c++)
> >>>>>>       a[c + 1][5] = a[c][4];
> >>>>>>     for (c = 0; c <= 6; c++)
> >>>>>>       a[c + 1][4] = a[c][3];
> >>>>>>     for (c = 0; c <= 6; c++)
> >>>>>>       a[c + 1][3] = a[c][2];
> >>>>>>   }
> >>>>>> Then four inner loops are fused into:
> >>>>>>   for (b = 4; b > 0; b -= 4)
> >>>>>>   {
> >>>>>>     for (c = 0; c <= 6; c++)
> >>>>>>     {
> >>>>>>       a[c + 1][6] = a[c][5];  // S1
> >>>>>>       a[c + 1][5] = a[c][4];  // S2
> >>>>>>       a[c + 1][4] = a[c][3];
> >>>>>>       a[c + 1][3] = a[c][2];
> >>>>>>     }
> >>>>>>   }
> >>>>>
> >>>>> Note that they are not really "fused" but they are interleaved.  With
> >>>>> GIMPLE in mind
> >>>>> that makes a difference, you should get the equivalent of
> >>>>>
> >>>>>    for (c = 0; c <= 6; c++)
> >>>>>      {
> >>>>>        tem1 = a[c][5];
> >>>>>        tem2 = a[c][4];
> >>>>>        tem3 = a[c][3];
> >>>>>        tem4 = a[c][2];
> >>>>>        a[c+1][6] = tem1;
> >>>>>        a[c +1][5] = tem2;
> >>>>>         a[c+1][4] = tem3;
> >>>>>         a[c+1][3] = tem4;
> >>>>>      }
> >>>> Yeah, I will double check if this abstract breaks the patch and how.
> >>> Hmm, I think this doesn't break it, well at least for part of the
> >>> analysis, because it is loop carried (backward) dependence goes wrong,
> >>> interleaving or not with the same iteration doesn't matter here.
> >>
> >> I think the idea is that forward dependences are always fine (negative distance)
> >> to vectorize.  But with backward dependences we have to adhere to max_vf.
> >>
> >> It looks like for outer loop vectorization we only look at the distances in the
> >> outer loop but never at inner ones.  But here the same applies but isn't that
> >> independend on the distances with respect to the outer loop?
> >>
> >> But maybe I'm misunderstanding how "distances" work here.
> > Hmm, I am not sure I understand "distance" correctly.  With
> > description as in book like "Optimizing compilers for Modern
> > Architectures", distance is "# of iteration of sink ref - # of
> > iteration of source ref".  Given below example:
> >   for (i = 0; i < N; ++i)
> >     {
> >       x = arr[idx_1];  // S1
> >       arr[idx_2] = x;  // S2
> >     }
> > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
> > this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
> > i;
> > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
> > this is backward dependence.  For example idx_1 is i and idx_2 is i +
> > 1;
> >
> > In GCC, we always try to subtract idx_2 from idx_1 first in computing
> > classic distance, we could result in negative distance in case of
> > backward dependence.  When this happens at dependence carried level,
> > we manually reverse it.  When this happens at inner level loop, we
> > simply keep the negative distance.  And it's meaningless to talk about
> > forward/backward given dependence is carried at outer level loop.
> >
> > Outer loop vectorization is interesting.  The problematic case has
> > backward dependence carried by outer loop.  Because we don't check
> > dist vs. max_vf for such dep, the unrolled references could have outer
> > loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
> > like we have outer loop index resolved as equal.  Now it has
> > dependence only if c == c' + 1.  I think previous comment on fusion
> > still hold for this and we now need to check backward dependence
> > between the two reference at inner level loop.  I guess it's a bit
> > trick to model dependence between a[c][5] and a[c+1][5] using the
> > original references and dependence.  I think it's okay to do that
> > because distance of a[c][5] and a[c+1][5] is what we computed
> > previously for the original references at inner level loop.
> >
> > Not sure if I have missed something important.
>
> Not sure either.  unroll-and-jam does
>
>       FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>         {
>           /* A distance (a,b) is at worst transformed into (a/N,b) by the
>              unrolling (factor N), so the transformation is valid if
>              a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
>              factor needs to be limited so that the first condition holds.
>              That may limit the factor down to zero in the worst case.  */
>           int dist = dist_v[0];
>           if (dist < 0)
>             gcc_unreachable ();
>           else if ((unsigned)dist >= *unroll)
>             ;
>           else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
>                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
>                        && dist > 0))
>             ;
>           else
>             *unroll = dist;
>
> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
>
> The vectorizer does way more complicated things and only looks at the
> distance with respect to the outer loop as far as I can see which can
> be negative.
>
> Not sure if fusion and vectorizer "interleaving" makes a difference here.
> I think the idea was that when interleaving stmt-by-stmt then forward
> dependences would be preserved and thus we don't need to check the
> inner loop dependences.  speaking with "forward vs. backward" dependences
> again, not distances...
>
> This also means that unroll-and-jam could be enhanced to "interleave"
> stmts and thus cover more cases?
>
> Again, I hope Micha can have a look here...

Coming back to this...  the vectorizer doesn't look at the inner loop distances
at all at the moment.  Bins orginal patch adds that for the case where the
outer loop evolution results in a negative dependence distance.  But what
if the outer loop evolution results in a zero distance or a positive distance
within the bounds of a sensible max_vf?  So I think we need to "simply"
look at all distance vector components, not just that of the outer loop
and perform all regular checks.  That also allows adjustment of max_vf
for inner loop dependences in case vectorization with smaller vectors is
still possible.

Doing that by making loop_depth iterate
fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c
if we make the code look at ->safelen of the inner loop when processing
inner loop distances (because safelen is 0 there).  Just using ->safelen
from the outer loop doesn't regress anything it seems.

So I am going to test the following attached.  I have added the
testcase also with reversed inner loop to verify we'd apply outer loop
vectorization to that one.

Any comments?

Bootstrap & regtest running on x86_64-unknown-linux-gnu.

Richard.

2019-03-21  Richard Biener  <rguenther@suse.de>

        PR tree-optimization/81740
        * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
        Perform distance vector related dependence checks for all
        loops.

        * gcc.dg/vect/pr81740-1.c: New testcase.
        * gcc.dg/vect/pr81740-2.c: Likewise.
Bin.Cheng March 22, 2019, 6:12 a.m. UTC | #13
On Thu, Mar 21, 2019 at 8:24 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Mon, Dec 18, 2017 at 1:37 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
> > >>>> <richard.guenther@gmail.com> wrote:
> > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> > >>>>>> Hi,
> > >>>>>> As explained in the PR, given below test case:
> > >>>>>> int a[8][10] = { [2][5] = 4 }, c;
> > >>>>>>
> > >>>>>> int
> > >>>>>> main ()
> > >>>>>> {
> > >>>>>>   short b;
> > >>>>>>   int i, d;
> > >>>>>>   for (b = 4; b >= 0; b--)
> > >>>>>>     for (c = 0; c <= 6; c++)
> > >>>>>>       a[c + 1][b + 2] = a[c][b + 1];
> > >>>>>>   for (i = 0; i < 8; i++)
> > >>>>>>     for (d = 0; d < 10; d++)
> > >>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
> > >>>>>>         __builtin_abort ();
> > >>>>>>   return 0;
> > >>>>>>
> > >>>>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
> > >>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
> > >>>>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
> > >>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
> > >>>>>>
> > >>>>>>   for (b = 4; b > 0; b -= 4)
> > >>>>>>   {
> > >>>>>>     for (c = 0; c <= 6; c++)
> > >>>>>>       a[c + 1][6] = a[c][5];
> > >>>>>>     for (c = 0; c <= 6; c++)
> > >>>>>>       a[c + 1][5] = a[c][4];
> > >>>>>>     for (c = 0; c <= 6; c++)
> > >>>>>>       a[c + 1][4] = a[c][3];
> > >>>>>>     for (c = 0; c <= 6; c++)
> > >>>>>>       a[c + 1][3] = a[c][2];
> > >>>>>>   }
> > >>>>>> Then four inner loops are fused into:
> > >>>>>>   for (b = 4; b > 0; b -= 4)
> > >>>>>>   {
> > >>>>>>     for (c = 0; c <= 6; c++)
> > >>>>>>     {
> > >>>>>>       a[c + 1][6] = a[c][5];  // S1
> > >>>>>>       a[c + 1][5] = a[c][4];  // S2
> > >>>>>>       a[c + 1][4] = a[c][3];
> > >>>>>>       a[c + 1][3] = a[c][2];
> > >>>>>>     }
> > >>>>>>   }
> > >>>>>
> > >>>>> Note that they are not really "fused" but they are interleaved.  With
> > >>>>> GIMPLE in mind
> > >>>>> that makes a difference, you should get the equivalent of
> > >>>>>
> > >>>>>    for (c = 0; c <= 6; c++)
> > >>>>>      {
> > >>>>>        tem1 = a[c][5];
> > >>>>>        tem2 = a[c][4];
> > >>>>>        tem3 = a[c][3];
> > >>>>>        tem4 = a[c][2];
> > >>>>>        a[c+1][6] = tem1;
> > >>>>>        a[c +1][5] = tem2;
> > >>>>>         a[c+1][4] = tem3;
> > >>>>>         a[c+1][3] = tem4;
> > >>>>>      }
> > >>>> Yeah, I will double check if this abstract breaks the patch and how.
> > >>> Hmm, I think this doesn't break it, well at least for part of the
> > >>> analysis, because it is loop carried (backward) dependence goes wrong,
> > >>> interleaving or not with the same iteration doesn't matter here.
> > >>
> > >> I think the idea is that forward dependences are always fine (negative distance)
> > >> to vectorize.  But with backward dependences we have to adhere to max_vf.
> > >>
> > >> It looks like for outer loop vectorization we only look at the distances in the
> > >> outer loop but never at inner ones.  But here the same applies but isn't that
> > >> independend on the distances with respect to the outer loop?
> > >>
> > >> But maybe I'm misunderstanding how "distances" work here.
> > > Hmm, I am not sure I understand "distance" correctly.  With
> > > description as in book like "Optimizing compilers for Modern
> > > Architectures", distance is "# of iteration of sink ref - # of
> > > iteration of source ref".  Given below example:
> > >   for (i = 0; i < N; ++i)
> > >     {
> > >       x = arr[idx_1];  // S1
> > >       arr[idx_2] = x;  // S2
> > >     }
> > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
> > > this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
> > > i;
> > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
> > > this is backward dependence.  For example idx_1 is i and idx_2 is i +
> > > 1;
> > >
> > > In GCC, we always try to subtract idx_2 from idx_1 first in computing
> > > classic distance, we could result in negative distance in case of
> > > backward dependence.  When this happens at dependence carried level,
> > > we manually reverse it.  When this happens at inner level loop, we
> > > simply keep the negative distance.  And it's meaningless to talk about
> > > forward/backward given dependence is carried at outer level loop.
> > >
> > > Outer loop vectorization is interesting.  The problematic case has
> > > backward dependence carried by outer loop.  Because we don't check
> > > dist vs. max_vf for such dep, the unrolled references could have outer
> > > loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
> > > like we have outer loop index resolved as equal.  Now it has
> > > dependence only if c == c' + 1.  I think previous comment on fusion
> > > still hold for this and we now need to check backward dependence
> > > between the two reference at inner level loop.  I guess it's a bit
> > > trick to model dependence between a[c][5] and a[c+1][5] using the
> > > original references and dependence.  I think it's okay to do that
> > > because distance of a[c][5] and a[c+1][5] is what we computed
> > > previously for the original references at inner level loop.
> > >
> > > Not sure if I have missed something important.
> >
> > Not sure either.  unroll-and-jam does
> >
> >       FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> >         {
> >           /* A distance (a,b) is at worst transformed into (a/N,b) by the
> >              unrolling (factor N), so the transformation is valid if
> >              a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
> >              factor needs to be limited so that the first condition holds.
> >              That may limit the factor down to zero in the worst case.  */
> >           int dist = dist_v[0];
> >           if (dist < 0)
> >             gcc_unreachable ();
> >           else if ((unsigned)dist >= *unroll)
> >             ;
> >           else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> >                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> >                        && dist > 0))
> >             ;
> >           else
> >             *unroll = dist;
> >
> > where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
> >
> > The vectorizer does way more complicated things and only looks at the
> > distance with respect to the outer loop as far as I can see which can
> > be negative.
> >
> > Not sure if fusion and vectorizer "interleaving" makes a difference here.
> > I think the idea was that when interleaving stmt-by-stmt then forward
> > dependences would be preserved and thus we don't need to check the
> > inner loop dependences.  speaking with "forward vs. backward" dependences
> > again, not distances...
> >
> > This also means that unroll-and-jam could be enhanced to "interleave"
> > stmts and thus cover more cases?
> >
> > Again, I hope Micha can have a look here...
>
> Coming back to this...  the vectorizer doesn't look at the inner loop distances
> at all at the moment.  Bins orginal patch adds that for the case where the
> outer loop evolution results in a negative dependence distance.  But what
> if the outer loop evolution results in a zero distance or a positive distance
> within the bounds of a sensible max_vf?  So I think we need to "simply"
> look at all distance vector components, not just that of the outer loop
> and perform all regular checks.  That also allows adjustment of max_vf
> for inner loop dependences in case vectorization with smaller vectors is
> still possible.
>
> Doing that by making loop_depth iterate
> fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c
> if we make the code look at ->safelen of the inner loop when processing
> inner loop distances (because safelen is 0 there).  Just using ->safelen
> from the outer loop doesn't regress anything it seems.
>
> So I am going to test the following attached.  I have added the
> testcase also with reversed inner loop to verify we'd apply outer loop
> vectorization to that one.
>
> Any comments?
I kind of think that we are checking for different things in this way.
IIUC, dependence checks in outer loop vectorizer could be categorized
into three parts.  The regular vect related check on outer loop;
dependence check to fuse inner loop iterations; and dependence check
to shuffle scalar statement/reference together.  To pass the fusion
check, we only need to make sure no backward dependence is generated
in the result.  So for case of zero/positive distance in outer loop
evolution, it won't result in backward dependence?  For example, if we
adjust the original loop like:

   for (b = 4; b >= 0; b--)
     for (c = 0; c <= 6; c++)
-       a[c + 1][b + 2] = a[c][b + 1];
+      a[c + 1][b + 1] = a[c][b + 2];

The fusion result would be
   for (b = 4; b > 0; b -= 4)
   {
     for (c = 0; c <= 6; c++)
     {
       a[c + 1][5] = a[c][6];  // S1
       a[c + 1][4] = a[c][5];  // S2
       a[c + 1][3] = a[c][4];
       a[c + 1][2] = a[c][3];
     }
   }
Though there is dependence between s1/s2, it's forward dependence now.
Even we need to check the negative distance for inner loop, using the
regular checks for fusion part for inner loop(s) is a bit strange.
Using outer loop's safelen here is also unclear.

I am not sure where we do the shuffle related check, is it (dist == 0) case?
Is that the reason that the patch employs the whole regular checks?

Thanks,
bin
>
> Bootstrap & regtest running on x86_64-unknown-linux-gnu.
>
> Richard.
>
> 2019-03-21  Richard Biener  <rguenther@suse.de>
>
>         PR tree-optimization/81740
>         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
>         Perform distance vector related dependence checks for all
>         loops.
>
>         * gcc.dg/vect/pr81740-1.c: New testcase.
>         * gcc.dg/vect/pr81740-2.c: Likewise.
Richard Biener March 25, 2019, 1:25 p.m. UTC | #14
On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng <amker.cheng@gmail.com> wrote:
>
> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
> > > > <richard.guenther@gmail.com> wrote:
> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
> > > >>>> <richard.guenther@gmail.com> wrote:
> > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> > > >>>>>> Hi,
> > > >>>>>> As explained in the PR, given below test case:
> > > >>>>>> int a[8][10] = { [2][5] = 4 }, c;
> > > >>>>>>
> > > >>>>>> int
> > > >>>>>> main ()
> > > >>>>>> {
> > > >>>>>>   short b;
> > > >>>>>>   int i, d;
> > > >>>>>>   for (b = 4; b >= 0; b--)
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>       a[c + 1][b + 2] = a[c][b + 1];
> > > >>>>>>   for (i = 0; i < 8; i++)
> > > >>>>>>     for (d = 0; d < 10; d++)
> > > >>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
> > > >>>>>>         __builtin_abort ();
> > > >>>>>>   return 0;
> > > >>>>>>
> > > >>>>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
> > > >>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
> > > >>>>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
> > > >>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
> > > >>>>>>
> > > >>>>>>   for (b = 4; b > 0; b -= 4)
> > > >>>>>>   {
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>       a[c + 1][6] = a[c][5];
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>       a[c + 1][5] = a[c][4];
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>       a[c + 1][4] = a[c][3];
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>       a[c + 1][3] = a[c][2];
> > > >>>>>>   }
> > > >>>>>> Then four inner loops are fused into:
> > > >>>>>>   for (b = 4; b > 0; b -= 4)
> > > >>>>>>   {
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>     {
> > > >>>>>>       a[c + 1][6] = a[c][5];  // S1
> > > >>>>>>       a[c + 1][5] = a[c][4];  // S2
> > > >>>>>>       a[c + 1][4] = a[c][3];
> > > >>>>>>       a[c + 1][3] = a[c][2];
> > > >>>>>>     }
> > > >>>>>>   }
> > > >>>>>
> > > >>>>> Note that they are not really "fused" but they are interleaved.  With
> > > >>>>> GIMPLE in mind
> > > >>>>> that makes a difference, you should get the equivalent of
> > > >>>>>
> > > >>>>>    for (c = 0; c <= 6; c++)
> > > >>>>>      {
> > > >>>>>        tem1 = a[c][5];
> > > >>>>>        tem2 = a[c][4];
> > > >>>>>        tem3 = a[c][3];
> > > >>>>>        tem4 = a[c][2];
> > > >>>>>        a[c+1][6] = tem1;
> > > >>>>>        a[c +1][5] = tem2;
> > > >>>>>         a[c+1][4] = tem3;
> > > >>>>>         a[c+1][3] = tem4;
> > > >>>>>      }
> > > >>>> Yeah, I will double check if this abstract breaks the patch and how.
> > > >>> Hmm, I think this doesn't break it, well at least for part of the
> > > >>> analysis, because it is loop carried (backward) dependence goes wrong,
> > > >>> interleaving or not with the same iteration doesn't matter here.
> > > >>
> > > >> I think the idea is that forward dependences are always fine (negative distance)
> > > >> to vectorize.  But with backward dependences we have to adhere to max_vf.
> > > >>
> > > >> It looks like for outer loop vectorization we only look at the distances in the
> > > >> outer loop but never at inner ones.  But here the same applies but isn't that
> > > >> independend on the distances with respect to the outer loop?
> > > >>
> > > >> But maybe I'm misunderstanding how "distances" work here.
> > > > Hmm, I am not sure I understand "distance" correctly.  With
> > > > description as in book like "Optimizing compilers for Modern
> > > > Architectures", distance is "# of iteration of sink ref - # of
> > > > iteration of source ref".  Given below example:
> > > >   for (i = 0; i < N; ++i)
> > > >     {
> > > >       x = arr[idx_1];  // S1
> > > >       arr[idx_2] = x;  // S2
> > > >     }
> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
> > > > this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
> > > > i;
> > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
> > > > this is backward dependence.  For example idx_1 is i and idx_2 is i +
> > > > 1;
> > > >
> > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing
> > > > classic distance, we could result in negative distance in case of
> > > > backward dependence.  When this happens at dependence carried level,
> > > > we manually reverse it.  When this happens at inner level loop, we
> > > > simply keep the negative distance.  And it's meaningless to talk about
> > > > forward/backward given dependence is carried at outer level loop.
> > > >
> > > > Outer loop vectorization is interesting.  The problematic case has
> > > > backward dependence carried by outer loop.  Because we don't check
> > > > dist vs. max_vf for such dep, the unrolled references could have outer
> > > > loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
> > > > like we have outer loop index resolved as equal.  Now it has
> > > > dependence only if c == c' + 1.  I think previous comment on fusion
> > > > still hold for this and we now need to check backward dependence
> > > > between the two reference at inner level loop.  I guess it's a bit
> > > > trick to model dependence between a[c][5] and a[c+1][5] using the
> > > > original references and dependence.  I think it's okay to do that
> > > > because distance of a[c][5] and a[c+1][5] is what we computed
> > > > previously for the original references at inner level loop.
> > > >
> > > > Not sure if I have missed something important.
> > >
> > > Not sure either.  unroll-and-jam does
> > >
> > >       FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> > >         {
> > >           /* A distance (a,b) is at worst transformed into (a/N,b) by the
> > >              unrolling (factor N), so the transformation is valid if
> > >              a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
> > >              factor needs to be limited so that the first condition holds.
> > >              That may limit the factor down to zero in the worst case.  */
> > >           int dist = dist_v[0];
> > >           if (dist < 0)
> > >             gcc_unreachable ();
> > >           else if ((unsigned)dist >= *unroll)
> > >             ;
> > >           else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> > >                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> > >                        && dist > 0))
> > >             ;
> > >           else
> > >             *unroll = dist;
> > >
> > > where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
> > >
> > > The vectorizer does way more complicated things and only looks at the
> > > distance with respect to the outer loop as far as I can see which can
> > > be negative.
> > >
> > > Not sure if fusion and vectorizer "interleaving" makes a difference here.
> > > I think the idea was that when interleaving stmt-by-stmt then forward
> > > dependences would be preserved and thus we don't need to check the
> > > inner loop dependences.  speaking with "forward vs. backward" dependences
> > > again, not distances...
> > >
> > > This also means that unroll-and-jam could be enhanced to "interleave"
> > > stmts and thus cover more cases?
> > >
> > > Again, I hope Micha can have a look here...
> >
> > Coming back to this...  the vectorizer doesn't look at the inner loop distances
> > at all at the moment.  Bins orginal patch adds that for the case where the
> > outer loop evolution results in a negative dependence distance.  But what
> > if the outer loop evolution results in a zero distance or a positive distance
> > within the bounds of a sensible max_vf?  So I think we need to "simply"
> > look at all distance vector components, not just that of the outer loop
> > and perform all regular checks.  That also allows adjustment of max_vf
> > for inner loop dependences in case vectorization with smaller vectors is
> > still possible.
> >
> > Doing that by making loop_depth iterate
> > fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c
> > if we make the code look at ->safelen of the inner loop when processing
> > inner loop distances (because safelen is 0 there).  Just using ->safelen
> > from the outer loop doesn't regress anything it seems.
> >
> > So I am going to test the following attached.  I have added the
> > testcase also with reversed inner loop to verify we'd apply outer loop
> > vectorization to that one.
> >
> > Any comments?
> I kind of think that we are checking for different things in this way.
> IIUC, dependence checks in outer loop vectorizer could be categorized
> into three parts.  The regular vect related check on outer loop;
> dependence check to fuse inner loop iterations; and dependence check
> to shuffle scalar statement/reference together.  To pass the fusion
> check, we only need to make sure no backward dependence is generated
> in the result.  So for case of zero/positive distance in outer loop
> evolution, it won't result in backward dependence?  For example, if we
> adjust the original loop like:
>
>    for (b = 4; b >= 0; b--)
>      for (c = 0; c <= 6; c++)
> -       a[c + 1][b + 2] = a[c][b + 1];
> +      a[c + 1][b + 1] = a[c][b + 2];
>
> The fusion result would be
>    for (b = 4; b > 0; b -= 4)
>    {
>      for (c = 0; c <= 6; c++)
>      {
>        a[c + 1][5] = a[c][6];  // S1
>        a[c + 1][4] = a[c][5];  // S2
>        a[c + 1][3] = a[c][4];
>        a[c + 1][2] = a[c][3];
>      }
>    }
> Though there is dependence between s1/s2, it's forward dependence now.

Hmm, OK.  But this doesn't vectorize with or without the patch, we claim

t.c:9:3: note:   dependence distance  = 1.
t.c:11:29: missed:   not vectorized, possible dependence between
data-refs a[c.3_15][_48] and a[_3][_50]
t.c:9:3: missed:  bad data dependence.
t.c:9:3: missed: couldn't vectorize loop

> Even we need to check the negative distance for inner loop, using the
> regular checks for fusion part for inner loop(s) is a bit strange.
> Using outer loop's safelen here is also unclear.
>
> I am not sure where we do the shuffle related check, is it (dist == 0) case?
> Is that the reason that the patch employs the whole regular checks?

I think the suffling needs to look at the outer loop distance and it is the
== 0 and the forward dependence case where we check/constrain
against the vectorization factor?

I agree that formulating the outer loop dependence testing in a way to
check the unroll-and-jam condition and the shuffling part separately
would be best.  I was mainly worried that your patch only hooks into
the negative "outer" distance case and not the zero or positive
distance one.  For example for unroll-and-jam we constrain the
maximum unroll factor.  I wonder if we should simply call into
its adjust_unroll_factor from the vectorizer or copy&paste it
to the vectorizer.  I'll note that with known dependence distances
it seems to never compute failure...  at least it computes the
correct maximum vectorization factor for me.

I also wonder about dependences for DRs in the outer loop
which we'd even more heavily re-order.

Anyways, we should somehow fix this.

Attached is a patch variant cut&pasting the unroll-and-jam
testing.

Does this look better?

Thanks,
Richard.

> Thanks,
> bin
> >
> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> >
> > Richard.
> >
> > 2019-03-21  Richard Biener  <rguenther@suse.de>
> >
> >         PR tree-optimization/81740
> >         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
> >         Perform distance vector related dependence checks for all
> >         loops.
> >
> >         * gcc.dg/vect/pr81740-1.c: New testcase.
> >         * gcc.dg/vect/pr81740-2.c: Likewise.
Richard Sandiford March 26, 2019, 12:56 a.m. UTC | #15
Richard Biener <richard.guenther@gmail.com> writes:
> On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng <amker.cheng@gmail.com> wrote:
>>
>> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener
>> <richard.guenther@gmail.com> wrote:
>> >
>> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener
>> > <richard.guenther@gmail.com> wrote:
>> > >
>> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
>> > > > <richard.guenther@gmail.com> wrote:
>> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
>> > > >>>> <richard.guenther@gmail.com> wrote:
>> > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> > > >>>>>> Hi,
>> > > >>>>>> As explained in the PR, given below test case:
>> > > >>>>>> int a[8][10] = { [2][5] = 4 }, c;
>> > > >>>>>>
>> > > >>>>>> int
>> > > >>>>>> main ()
>> > > >>>>>> {
>> > > >>>>>>   short b;
>> > > >>>>>>   int i, d;
>> > > >>>>>>   for (b = 4; b >= 0; b--)
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>       a[c + 1][b + 2] = a[c][b + 1];
>> > > >>>>>>   for (i = 0; i < 8; i++)
>> > > >>>>>>     for (d = 0; d < 10; d++)
>> > > >>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
>> > > >>>>>>         __builtin_abort ();
>> > > >>>>>>   return 0;
>> > > >>>>>>
>> > > >>>>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
>> > > >>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
>> > > >>>>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
>> > > >>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
>> > > >>>>>>
>> > > >>>>>>   for (b = 4; b > 0; b -= 4)
>> > > >>>>>>   {
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>       a[c + 1][6] = a[c][5];
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>       a[c + 1][5] = a[c][4];
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>       a[c + 1][4] = a[c][3];
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>       a[c + 1][3] = a[c][2];
>> > > >>>>>>   }
>> > > >>>>>> Then four inner loops are fused into:
>> > > >>>>>>   for (b = 4; b > 0; b -= 4)
>> > > >>>>>>   {
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>     {
>> > > >>>>>>       a[c + 1][6] = a[c][5];  // S1
>> > > >>>>>>       a[c + 1][5] = a[c][4];  // S2
>> > > >>>>>>       a[c + 1][4] = a[c][3];
>> > > >>>>>>       a[c + 1][3] = a[c][2];
>> > > >>>>>>     }
>> > > >>>>>>   }
>> > > >>>>>
>> > > >>>>> Note that they are not really "fused" but they are interleaved.  With
>> > > >>>>> GIMPLE in mind
>> > > >>>>> that makes a difference, you should get the equivalent of
>> > > >>>>>
>> > > >>>>>    for (c = 0; c <= 6; c++)
>> > > >>>>>      {
>> > > >>>>>        tem1 = a[c][5];
>> > > >>>>>        tem2 = a[c][4];
>> > > >>>>>        tem3 = a[c][3];
>> > > >>>>>        tem4 = a[c][2];
>> > > >>>>>        a[c+1][6] = tem1;
>> > > >>>>>        a[c +1][5] = tem2;
>> > > >>>>>         a[c+1][4] = tem3;
>> > > >>>>>         a[c+1][3] = tem4;
>> > > >>>>>      }
>> > > >>>> Yeah, I will double check if this abstract breaks the patch and how.
>> > > >>> Hmm, I think this doesn't break it, well at least for part of the
>> > > >>> analysis, because it is loop carried (backward) dependence goes wrong,
>> > > >>> interleaving or not with the same iteration doesn't matter here.
>> > > >>
>> > > >> I think the idea is that forward dependences are always fine (negative distance)
>> > > >> to vectorize.  But with backward dependences we have to adhere to max_vf.
>> > > >>
>> > > >> It looks like for outer loop vectorization we only look at the distances in the
>> > > >> outer loop but never at inner ones.  But here the same applies but isn't that
>> > > >> independend on the distances with respect to the outer loop?
>> > > >>
>> > > >> But maybe I'm misunderstanding how "distances" work here.
>> > > > Hmm, I am not sure I understand "distance" correctly.  With
>> > > > description as in book like "Optimizing compilers for Modern
>> > > > Architectures", distance is "# of iteration of sink ref - # of
>> > > > iteration of source ref".  Given below example:
>> > > >   for (i = 0; i < N; ++i)
>> > > >     {
>> > > >       x = arr[idx_1];  // S1
>> > > >       arr[idx_2] = x;  // S2
>> > > >     }
>> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
>> > > > this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
>> > > > i;
>> > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
>> > > > this is backward dependence.  For example idx_1 is i and idx_2 is i +
>> > > > 1;
>> > > >
>> > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing
>> > > > classic distance, we could result in negative distance in case of
>> > > > backward dependence.  When this happens at dependence carried level,
>> > > > we manually reverse it.  When this happens at inner level loop, we
>> > > > simply keep the negative distance.  And it's meaningless to talk about
>> > > > forward/backward given dependence is carried at outer level loop.
>> > > >
>> > > > Outer loop vectorization is interesting.  The problematic case has
>> > > > backward dependence carried by outer loop.  Because we don't check
>> > > > dist vs. max_vf for such dep, the unrolled references could have outer
>> > > > loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
>> > > > like we have outer loop index resolved as equal.  Now it has
>> > > > dependence only if c == c' + 1.  I think previous comment on fusion
>> > > > still hold for this and we now need to check backward dependence
>> > > > between the two reference at inner level loop.  I guess it's a bit
>> > > > trick to model dependence between a[c][5] and a[c+1][5] using the
>> > > > original references and dependence.  I think it's okay to do that
>> > > > because distance of a[c][5] and a[c+1][5] is what we computed
>> > > > previously for the original references at inner level loop.
>> > > >
>> > > > Not sure if I have missed something important.
>> > >
>> > > Not sure either.  unroll-and-jam does
>> > >
>> > >       FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>> > >         {
>> > >           /* A distance (a,b) is at worst transformed into (a/N,b) by the
>> > >              unrolling (factor N), so the transformation is valid if
>> > >              a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
>> > >              factor needs to be limited so that the first condition holds.
>> > >              That may limit the factor down to zero in the worst case.  */
>> > >           int dist = dist_v[0];
>> > >           if (dist < 0)
>> > >             gcc_unreachable ();
>> > >           else if ((unsigned)dist >= *unroll)
>> > >             ;
>> > >           else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
>> > >                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
>> > >                        && dist > 0))
>> > >             ;
>> > >           else
>> > >             *unroll = dist;
>> > >
>> > > where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
>> > >
>> > > The vectorizer does way more complicated things and only looks at the
>> > > distance with respect to the outer loop as far as I can see which can
>> > > be negative.
>> > >
>> > > Not sure if fusion and vectorizer "interleaving" makes a difference here.
>> > > I think the idea was that when interleaving stmt-by-stmt then forward
>> > > dependences would be preserved and thus we don't need to check the
>> > > inner loop dependences.  speaking with "forward vs. backward" dependences
>> > > again, not distances...
>> > >
>> > > This also means that unroll-and-jam could be enhanced to "interleave"
>> > > stmts and thus cover more cases?
>> > >
>> > > Again, I hope Micha can have a look here...
>> >
>> > Coming back to this...  the vectorizer doesn't look at the inner loop distances
>> > at all at the moment.  Bins orginal patch adds that for the case where the
>> > outer loop evolution results in a negative dependence distance.  But what
>> > if the outer loop evolution results in a zero distance or a positive distance
>> > within the bounds of a sensible max_vf?  So I think we need to "simply"
>> > look at all distance vector components, not just that of the outer loop
>> > and perform all regular checks.  That also allows adjustment of max_vf
>> > for inner loop dependences in case vectorization with smaller vectors is
>> > still possible.
>> >
>> > Doing that by making loop_depth iterate
>> > fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c
>> > if we make the code look at ->safelen of the inner loop when processing
>> > inner loop distances (because safelen is 0 there).  Just using ->safelen
>> > from the outer loop doesn't regress anything it seems.
>> >
>> > So I am going to test the following attached.  I have added the
>> > testcase also with reversed inner loop to verify we'd apply outer loop
>> > vectorization to that one.
>> >
>> > Any comments?
>> I kind of think that we are checking for different things in this way.
>> IIUC, dependence checks in outer loop vectorizer could be categorized
>> into three parts.  The regular vect related check on outer loop;
>> dependence check to fuse inner loop iterations; and dependence check
>> to shuffle scalar statement/reference together.  To pass the fusion
>> check, we only need to make sure no backward dependence is generated
>> in the result.  So for case of zero/positive distance in outer loop
>> evolution, it won't result in backward dependence?  For example, if we
>> adjust the original loop like:
>>
>>    for (b = 4; b >= 0; b--)
>>      for (c = 0; c <= 6; c++)
>> -       a[c + 1][b + 2] = a[c][b + 1];
>> +      a[c + 1][b + 1] = a[c][b + 2];
>>
>> The fusion result would be
>>    for (b = 4; b > 0; b -= 4)
>>    {
>>      for (c = 0; c <= 6; c++)
>>      {
>>        a[c + 1][5] = a[c][6];  // S1
>>        a[c + 1][4] = a[c][5];  // S2
>>        a[c + 1][3] = a[c][4];
>>        a[c + 1][2] = a[c][3];
>>      }
>>    }
>> Though there is dependence between s1/s2, it's forward dependence now.
>
> Hmm, OK.  But this doesn't vectorize with or without the patch, we claim
>
> t.c:9:3: note:   dependence distance  = 1.
> t.c:11:29: missed:   not vectorized, possible dependence between
> data-refs a[c.3_15][_48] and a[_3][_50]
> t.c:9:3: missed:  bad data dependence.
> t.c:9:3: missed: couldn't vectorize loop
>
>> Even we need to check the negative distance for inner loop, using the
>> regular checks for fusion part for inner loop(s) is a bit strange.
>> Using outer loop's safelen here is also unclear.
>>
>> I am not sure where we do the shuffle related check, is it (dist == 0) case?
>> Is that the reason that the patch employs the whole regular checks?
>
> I think the suffling needs to look at the outer loop distance and it is the
> == 0 and the forward dependence case where we check/constrain
> against the vectorization factor?
>
> I agree that formulating the outer loop dependence testing in a way to
> check the unroll-and-jam condition and the shuffling part separately
> would be best.  I was mainly worried that your patch only hooks into
> the negative "outer" distance case and not the zero or positive
> distance one.  For example for unroll-and-jam we constrain the
> maximum unroll factor.  I wonder if we should simply call into
> its adjust_unroll_factor from the vectorizer or copy&paste it
> to the vectorizer.  I'll note that with known dependence distances
> it seems to never compute failure...  at least it computes the
> correct maximum vectorization factor for me.
>
> I also wonder about dependences for DRs in the outer loop
> which we'd even more heavily re-order.
>
> Anyways, we should somehow fix this.
>
> Attached is a patch variant cut&pasting the unroll-and-jam
> testing.
>
> Does this look better?
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>> >
>> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
>> >
>> > Richard.
>> >
>> > 2019-03-21  Richard Biener  <rguenther@suse.de>
>> >
>> >         PR tree-optimization/81740
>> >         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
>> >         Perform distance vector related dependence checks for all
>> >         loops.
>> >
>> >         * gcc.dg/vect/pr81740-1.c: New testcase.
>> >         * gcc.dg/vect/pr81740-2.c: Likewise.
>
> 2019-03-25  Richard Biener  <rguenther@suse.de>
>
>         PR tree-optimization/81740
> 	* tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
> 	Add explicit unroll-and-jam dependence testing.
>
> 	* gcc.dg/vect/pr81740-1.c: New testcase.
> 	* gcc.dg/vect/pr81740-2.c: Likewise.
>
> Index: gcc/tree-vect-data-refs.c
> ===================================================================
> --- gcc/tree-vect-data-refs.c	(revision 269908)
> +++ gcc/tree-vect-data-refs.c	(working copy)
> @@ -411,6 +411,45 @@ vect_analyze_data_ref_dependence (struct
>      }
>  
>    loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
> +  gcc_assert (loop_depth == 0);
> +
> +  /* Perform unroll-and-jam test.  */
> +  if (nested_in_vect_loop_p (loop, stmtinfo_a))
> +    {
> +      FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> +	{
> +	  /* A distance (a,b) is at worst transformed into (a/N,b) by the
> +	     unrolling (factor N), so the transformation is valid if
> +	     a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
> +	     factor needs to be limited so that the first condition holds.
> +	     That may limit the factor down to zero in the worst case.  */

I think it would be good to say that we're applying a two-stage
check here (unroll-and-jam, then statement-level interleaving).  As it
stands, it sounds like proving (a>0 && b==0) || b>0 is enough to prove
that outer-loop vectorisation is valid for any VF, whereas that isn't
true for the interleaving part.  That said...

> +	  int dist = dist_v[0];
> +	  if (dist < 0)
> +	    gcc_unreachable ();
> +	  else if ((unsigned)dist >= *max_vf)
> +	    ;
> +	  else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> +		   || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> +		       && dist > 0))
> +	    ;
> +	  else if (dist <= 1)
> +	    return opt_result::failure_at (stmtinfo_a->stmt,
> +					   "not vectorized, possible dependence"
> +					   " between data-refs %T and %T\n",
> +					   DR_REF (dra), DR_REF (drb));
> +	  else
> +	    {
> +	      /* Dependence distance does not create dependence, as far as
> +		 vectorization is concerned, in this case.  */
> +	      if (dump_enabled_p ())
> +		dump_printf_loc (MSG_NOTE, vect_location,
> +				 "dependence between data-refs %T and %T "
> +				 "constrains vectorization factor to %d\n",
> +				 DR_REF (dra), DR_REF (drb), dist);
> +	      *max_vf = dist;
> +	    }
> +	}
> +    }

...I guess this is looping back to the original discussion, sorry, but I'm
not sure taking it as a two-stage check really helps.  We're going from:

A:
   for i in [0:n]:
     for j in [0:n]:
       A(i,j)
       B(i,j)

to:

B:
   for i in [0:n:2]:
     for j in [0:n]:
       A(i,j)
       B(i,j)
     for j in [0:n]:
       A(i+1,j)
       B(i+1,j)

to:

C:
   for i in [0:n:2]:
     for j in [0:n]:
       A(i,j)
       B(i,j)
       A(i+1,j)
       B(i+1,j)

to:

D:
   for i in [0:n:2]:
     for j in [0:n]:
       A(i,j)
       A(i+1,j)
       B(i,j)
       B(i+1,j)

but since outer loop vectorisation doesn't change the order of
statements for each "i", IMO it's clearer to go directly from B to D
by treating the inner loop as though we were completely unrolling it.
C doesn't seem like a useful midway point.

I think the only cases the patch is handling on top of existing checks are:

(a) a<0 && b>0, normalised to a>0 && b<0 with a reverse flag.
(b) a==0 && b==0, which we now reject earlier

I don't think (b) makes any difference in practice because
(i) we already reject accesses with a zero step and (ii)
without the support for runtime aliasing checks for outer loops,
we're not going to be able to add any extra disambiguation for
things like:

    d[i][j] = ...;
    ... = d[i][j];

relative to other loop accesses, so there won't be any cases we can
handle here that wouldn't have been propagated earlier.

So I think we're really left with (a) being the useful case,
which is also the one added by Bin's original patch.

Based on the "complete unrolling" view, if we number statements as
(i, n), where i is the outer loop iteration and n is a statement number
in the (completely unrolled) loop body, then the original scalar code
executes in lexicographical order while for the vector loop:

(1) (i,n) executes before (i+ix,n+nx) for all ix>=0, nx>=1, regardless of VF
(2) (i,n) executes before (i+ix,n-nx) for all ix>=VF, nx>=0
      (well, nx unrestricted, but only nx>=0 is useful given (1))

So for any kind of dependence between (i,n) and (i+ix,n-nx), ix>=1, nx>=0
we need to restrict VF to ix so that (2) ensures the right order.
This means that the unnormalised distances of interest are:

- (ix, -nx), ix>=1, nx>=0
- (-ix, nx), ix>=1, nx>=0

But the second gets normalised to the first, which is actually useful
in this case :-).

In terms of the existing code, I think that means we want to change
the handling of nested statements (only) to:

- ignore DDR_REVERSED_P (ddr)
- restrict the main dist > 0 case to when the inner distance is <= 0.

This should have the side effect of allowing outer-loop vectorisation for:

void __attribute__ ((noipa))
f (int a[][N], int b[restrict])
{
  for (int i = N - 1; i-- > 0; )
    for (int j = 0; j < N - 1; ++j)
      a[j + 1][i] = a[j][i + 1] + b[i];
}

At the moment we reject this, but AFAICT it should be OK.
(We do allow it for s/i + 1/i/, since then the outer distance is 0.)

I think we can also skip the existing dist==0 code for nested statements
since any case that really matters (such as zero steps) should also have
a second dependence distance (1, 0) that we'd handle as above.  Doing that
is probably something for stage 1 though.

Hope at least some of that was vaguely right. ;-)

Thanks,
Richard
Bin.Cheng March 26, 2019, 6:37 a.m. UTC | #16
On Tue, Mar 26, 2019 at 8:56 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng <amker.cheng@gmail.com> wrote:
> >>
> >> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener
> >> <richard.guenther@gmail.com> wrote:
> >> >
> >> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener
> >> > <richard.guenther@gmail.com> wrote:
> >> > >
> >> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
> >> > > > <richard.guenther@gmail.com> wrote:
> >> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >> > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
> >> > > >>>> <richard.guenther@gmail.com> wrote:
> >> > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> >> > > >>>>>> Hi,
> >> > > >>>>>> As explained in the PR, given below test case:
> >> > > >>>>>> int a[8][10] = { [2][5] = 4 }, c;
> >> > > >>>>>>
> >> > > >>>>>> int
> >> > > >>>>>> main ()
> >> > > >>>>>> {
> >> > > >>>>>>   short b;
> >> > > >>>>>>   int i, d;
> >> > > >>>>>>   for (b = 4; b >= 0; b--)
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>       a[c + 1][b + 2] = a[c][b + 1];
> >> > > >>>>>>   for (i = 0; i < 8; i++)
> >> > > >>>>>>     for (d = 0; d < 10; d++)
> >> > > >>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
> >> > > >>>>>>         __builtin_abort ();
> >> > > >>>>>>   return 0;
> >> > > >>>>>>
> >> > > >>>>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
> >> > > >>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
> >> > > >>>>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
> >> > > >>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
> >> > > >>>>>>
> >> > > >>>>>>   for (b = 4; b > 0; b -= 4)
> >> > > >>>>>>   {
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>       a[c + 1][6] = a[c][5];
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>       a[c + 1][5] = a[c][4];
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>       a[c + 1][4] = a[c][3];
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>       a[c + 1][3] = a[c][2];
> >> > > >>>>>>   }
> >> > > >>>>>> Then four inner loops are fused into:
> >> > > >>>>>>   for (b = 4; b > 0; b -= 4)
> >> > > >>>>>>   {
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>     {
> >> > > >>>>>>       a[c + 1][6] = a[c][5];  // S1
> >> > > >>>>>>       a[c + 1][5] = a[c][4];  // S2
> >> > > >>>>>>       a[c + 1][4] = a[c][3];
> >> > > >>>>>>       a[c + 1][3] = a[c][2];
> >> > > >>>>>>     }
> >> > > >>>>>>   }
> >> > > >>>>>
> >> > > >>>>> Note that they are not really "fused" but they are interleaved.  With
> >> > > >>>>> GIMPLE in mind
> >> > > >>>>> that makes a difference, you should get the equivalent of
> >> > > >>>>>
> >> > > >>>>>    for (c = 0; c <= 6; c++)
> >> > > >>>>>      {
> >> > > >>>>>        tem1 = a[c][5];
> >> > > >>>>>        tem2 = a[c][4];
> >> > > >>>>>        tem3 = a[c][3];
> >> > > >>>>>        tem4 = a[c][2];
> >> > > >>>>>        a[c+1][6] = tem1;
> >> > > >>>>>        a[c +1][5] = tem2;
> >> > > >>>>>         a[c+1][4] = tem3;
> >> > > >>>>>         a[c+1][3] = tem4;
> >> > > >>>>>      }
> >> > > >>>> Yeah, I will double check if this abstract breaks the patch and how.
> >> > > >>> Hmm, I think this doesn't break it, well at least for part of the
> >> > > >>> analysis, because it is loop carried (backward) dependence goes wrong,
> >> > > >>> interleaving or not with the same iteration doesn't matter here.
> >> > > >>
> >> > > >> I think the idea is that forward dependences are always fine (negative distance)
> >> > > >> to vectorize.  But with backward dependences we have to adhere to max_vf.
> >> > > >>
> >> > > >> It looks like for outer loop vectorization we only look at the distances in the
> >> > > >> outer loop but never at inner ones.  But here the same applies but isn't that
> >> > > >> independend on the distances with respect to the outer loop?
> >> > > >>
> >> > > >> But maybe I'm misunderstanding how "distances" work here.
> >> > > > Hmm, I am not sure I understand "distance" correctly.  With
> >> > > > description as in book like "Optimizing compilers for Modern
> >> > > > Architectures", distance is "# of iteration of sink ref - # of
> >> > > > iteration of source ref".  Given below example:
> >> > > >   for (i = 0; i < N; ++i)
> >> > > >     {
> >> > > >       x = arr[idx_1];  // S1
> >> > > >       arr[idx_2] = x;  // S2
> >> > > >     }
> >> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
> >> > > > this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
> >> > > > i;
> >> > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
> >> > > > this is backward dependence.  For example idx_1 is i and idx_2 is i +
> >> > > > 1;
> >> > > >
> >> > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing
> >> > > > classic distance, we could result in negative distance in case of
> >> > > > backward dependence.  When this happens at dependence carried level,
> >> > > > we manually reverse it.  When this happens at inner level loop, we
> >> > > > simply keep the negative distance.  And it's meaningless to talk about
> >> > > > forward/backward given dependence is carried at outer level loop.
> >> > > >
> >> > > > Outer loop vectorization is interesting.  The problematic case has
> >> > > > backward dependence carried by outer loop.  Because we don't check
> >> > > > dist vs. max_vf for such dep, the unrolled references could have outer
> >> > > > loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
> >> > > > like we have outer loop index resolved as equal.  Now it has
> >> > > > dependence only if c == c' + 1.  I think previous comment on fusion
> >> > > > still hold for this and we now need to check backward dependence
> >> > > > between the two reference at inner level loop.  I guess it's a bit
> >> > > > trick to model dependence between a[c][5] and a[c+1][5] using the
> >> > > > original references and dependence.  I think it's okay to do that
> >> > > > because distance of a[c][5] and a[c+1][5] is what we computed
> >> > > > previously for the original references at inner level loop.
> >> > > >
> >> > > > Not sure if I have missed something important.
> >> > >
> >> > > Not sure either.  unroll-and-jam does
> >> > >
> >> > >       FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> >> > >         {
> >> > >           /* A distance (a,b) is at worst transformed into (a/N,b) by the
> >> > >              unrolling (factor N), so the transformation is valid if
> >> > >              a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
> >> > >              factor needs to be limited so that the first condition holds.
> >> > >              That may limit the factor down to zero in the worst case.  */
> >> > >           int dist = dist_v[0];
> >> > >           if (dist < 0)
> >> > >             gcc_unreachable ();
> >> > >           else if ((unsigned)dist >= *unroll)
> >> > >             ;
> >> > >           else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> >> > >                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> >> > >                        && dist > 0))
> >> > >             ;
> >> > >           else
> >> > >             *unroll = dist;
> >> > >
> >> > > where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
> >> > >
> >> > > The vectorizer does way more complicated things and only looks at the
> >> > > distance with respect to the outer loop as far as I can see which can
> >> > > be negative.
> >> > >
> >> > > Not sure if fusion and vectorizer "interleaving" makes a difference here.
> >> > > I think the idea was that when interleaving stmt-by-stmt then forward
> >> > > dependences would be preserved and thus we don't need to check the
> >> > > inner loop dependences.  speaking with "forward vs. backward" dependences
> >> > > again, not distances...
> >> > >
> >> > > This also means that unroll-and-jam could be enhanced to "interleave"
> >> > > stmts and thus cover more cases?
> >> > >
> >> > > Again, I hope Micha can have a look here...
> >> >
> >> > Coming back to this...  the vectorizer doesn't look at the inner loop distances
> >> > at all at the moment.  Bins orginal patch adds that for the case where the
> >> > outer loop evolution results in a negative dependence distance.  But what
> >> > if the outer loop evolution results in a zero distance or a positive distance
> >> > within the bounds of a sensible max_vf?  So I think we need to "simply"
> >> > look at all distance vector components, not just that of the outer loop
> >> > and perform all regular checks.  That also allows adjustment of max_vf
> >> > for inner loop dependences in case vectorization with smaller vectors is
> >> > still possible.
> >> >
> >> > Doing that by making loop_depth iterate
> >> > fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c
> >> > if we make the code look at ->safelen of the inner loop when processing
> >> > inner loop distances (because safelen is 0 there).  Just using ->safelen
> >> > from the outer loop doesn't regress anything it seems.
> >> >
> >> > So I am going to test the following attached.  I have added the
> >> > testcase also with reversed inner loop to verify we'd apply outer loop
> >> > vectorization to that one.
> >> >
> >> > Any comments?
> >> I kind of think that we are checking for different things in this way.
> >> IIUC, dependence checks in outer loop vectorizer could be categorized
> >> into three parts.  The regular vect related check on outer loop;
> >> dependence check to fuse inner loop iterations; and dependence check
> >> to shuffle scalar statement/reference together.  To pass the fusion
> >> check, we only need to make sure no backward dependence is generated
> >> in the result.  So for case of zero/positive distance in outer loop
> >> evolution, it won't result in backward dependence?  For example, if we
> >> adjust the original loop like:
> >>
> >>    for (b = 4; b >= 0; b--)
> >>      for (c = 0; c <= 6; c++)
> >> -       a[c + 1][b + 2] = a[c][b + 1];
> >> +      a[c + 1][b + 1] = a[c][b + 2];
> >>
> >> The fusion result would be
> >>    for (b = 4; b > 0; b -= 4)
> >>    {
> >>      for (c = 0; c <= 6; c++)
> >>      {
> >>        a[c + 1][5] = a[c][6];  // S1
> >>        a[c + 1][4] = a[c][5];  // S2
> >>        a[c + 1][3] = a[c][4];
> >>        a[c + 1][2] = a[c][3];
> >>      }
> >>    }
> >> Though there is dependence between s1/s2, it's forward dependence now.
> >
> > Hmm, OK.  But this doesn't vectorize with or without the patch, we claim
> >
> > t.c:9:3: note:   dependence distance  = 1.
> > t.c:11:29: missed:   not vectorized, possible dependence between
> > data-refs a[c.3_15][_48] and a[_3][_50]
> > t.c:9:3: missed:  bad data dependence.
> > t.c:9:3: missed: couldn't vectorize loop
> >
> >> Even we need to check the negative distance for inner loop, using the
> >> regular checks for fusion part for inner loop(s) is a bit strange.
> >> Using outer loop's safelen here is also unclear.
> >>
> >> I am not sure where we do the shuffle related check, is it (dist == 0) case?
> >> Is that the reason that the patch employs the whole regular checks?
> >
> > I think the suffling needs to look at the outer loop distance and it is the
> > == 0 and the forward dependence case where we check/constrain
> > against the vectorization factor?
> >
> > I agree that formulating the outer loop dependence testing in a way to
> > check the unroll-and-jam condition and the shuffling part separately
> > would be best.  I was mainly worried that your patch only hooks into
> > the negative "outer" distance case and not the zero or positive
> > distance one.  For example for unroll-and-jam we constrain the
> > maximum unroll factor.  I wonder if we should simply call into
> > its adjust_unroll_factor from the vectorizer or copy&paste it
> > to the vectorizer.  I'll note that with known dependence distances
> > it seems to never compute failure...  at least it computes the
> > correct maximum vectorization factor for me.
> >
> > I also wonder about dependences for DRs in the outer loop
> > which we'd even more heavily re-order.
> >
> > Anyways, we should somehow fix this.
> >
> > Attached is a patch variant cut&pasting the unroll-and-jam
> > testing.
> >
> > Does this look better?
> >
> > Thanks,
> > Richard.
> >
> >> Thanks,
> >> bin
> >> >
> >> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> >> >
> >> > Richard.
> >> >
> >> > 2019-03-21  Richard Biener  <rguenther@suse.de>
> >> >
> >> >         PR tree-optimization/81740
> >> >         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
> >> >         Perform distance vector related dependence checks for all
> >> >         loops.
> >> >
> >> >         * gcc.dg/vect/pr81740-1.c: New testcase.
> >> >         * gcc.dg/vect/pr81740-2.c: Likewise.
> >
> > 2019-03-25  Richard Biener  <rguenther@suse.de>
> >
> >         PR tree-optimization/81740
> >       * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
> >       Add explicit unroll-and-jam dependence testing.
> >
> >       * gcc.dg/vect/pr81740-1.c: New testcase.
> >       * gcc.dg/vect/pr81740-2.c: Likewise.
> >
> > Index: gcc/tree-vect-data-refs.c
> > ===================================================================
> > --- gcc/tree-vect-data-refs.c (revision 269908)
> > +++ gcc/tree-vect-data-refs.c (working copy)
> > @@ -411,6 +411,45 @@ vect_analyze_data_ref_dependence (struct
> >      }
> >
> >    loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
> > +  gcc_assert (loop_depth == 0);
> > +
> > +  /* Perform unroll-and-jam test.  */
> > +  if (nested_in_vect_loop_p (loop, stmtinfo_a))
> > +    {
> > +      FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> > +     {
> > +       /* A distance (a,b) is at worst transformed into (a/N,b) by the
> > +          unrolling (factor N), so the transformation is valid if
> > +          a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
> > +          factor needs to be limited so that the first condition holds.
> > +          That may limit the factor down to zero in the worst case.  */
>
> I think it would be good to say that we're applying a two-stage
> check here (unroll-and-jam, then statement-level interleaving).  As it
> stands, it sounds like proving (a>0 && b==0) || b>0 is enough to prove
> that outer-loop vectorisation is valid for any VF, whereas that isn't
> true for the interleaving part.  That said...
>
> > +       int dist = dist_v[0];
> > +       if (dist < 0)
> > +         gcc_unreachable ();
> > +       else if ((unsigned)dist >= *max_vf)
> > +         ;
> > +       else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> > +                || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> > +                    && dist > 0))
> > +         ;
> > +       else if (dist <= 1)
> > +         return opt_result::failure_at (stmtinfo_a->stmt,
> > +                                        "not vectorized, possible dependence"
> > +                                        " between data-refs %T and %T\n",
> > +                                        DR_REF (dra), DR_REF (drb));
> > +       else
> > +         {
> > +           /* Dependence distance does not create dependence, as far as
> > +              vectorization is concerned, in this case.  */
> > +           if (dump_enabled_p ())
> > +             dump_printf_loc (MSG_NOTE, vect_location,
> > +                              "dependence between data-refs %T and %T "
> > +                              "constrains vectorization factor to %d\n",
> > +                              DR_REF (dra), DR_REF (drb), dist);
> > +           *max_vf = dist;
> > +         }
> > +     }
> > +    }
>
> ...I guess this is looping back to the original discussion, sorry, but I'm
> not sure taking it as a two-stage check really helps.  We're going from:
>
> A:
>    for i in [0:n]:
>      for j in [0:n]:
>        A(i,j)
>        B(i,j)
>
> to:
>
> B:
>    for i in [0:n:2]:
>      for j in [0:n]:
>        A(i,j)
>        B(i,j)
>      for j in [0:n]:
>        A(i+1,j)
>        B(i+1,j)
>
> to:
>
> C:
>    for i in [0:n:2]:
>      for j in [0:n]:
>        A(i,j)
>        B(i,j)
>        A(i+1,j)
>        B(i+1,j)
>
> to:
>
> D:
>    for i in [0:n:2]:
>      for j in [0:n]:
>        A(i,j)
>        A(i+1,j)
>        B(i,j)
>        B(i+1,j)
>
> but since outer loop vectorisation doesn't change the order of
> statements for each "i", IMO it's clearer to go directly from B to D
> by treating the inner loop as though we were completely unrolling it.
> C doesn't seem like a useful midway point.
>
> I think the only cases the patch is handling on top of existing checks are:
>
> (a) a<0 && b>0, normalised to a>0 && b<0 with a reverse flag.
> (b) a==0 && b==0, which we now reject earlier
>
> I don't think (b) makes any difference in practice because
> (i) we already reject accesses with a zero step and (ii)
> without the support for runtime aliasing checks for outer loops,
> we're not going to be able to add any extra disambiguation for
> things like:
>
>     d[i][j] = ...;
>     ... = d[i][j];
>
> relative to other loop accesses, so there won't be any cases we can
> handle here that wouldn't have been propagated earlier.
>
> So I think we're really left with (a) being the useful case,
> which is also the one added by Bin's original patch.
I felt that my patch is enough to fix the issue, but didn't explicitly
mention because I failed to reason about it case by case.  If it can
be somehow proven, I would suggest first apply it now and
revisit/refactor vect dependence check in the light of your comments
in Stage 1.  BTW, I am not fully following the unrolling view.  Seems
to me unroll doesn't change order, but there is possible reordering
here.

Thanks,
bin
>
> Based on the "complete unrolling" view, if we number statements as
> (i, n), where i is the outer loop iteration and n is a statement number
> in the (completely unrolled) loop body, then the original scalar code
> executes in lexicographical order while for the vector loop:
>
> (1) (i,n) executes before (i+ix,n+nx) for all ix>=0, nx>=1, regardless of VF
> (2) (i,n) executes before (i+ix,n-nx) for all ix>=VF, nx>=0
>       (well, nx unrestricted, but only nx>=0 is useful given (1))
>
> So for any kind of dependence between (i,n) and (i+ix,n-nx), ix>=1, nx>=0
> we need to restrict VF to ix so that (2) ensures the right order.
> This means that the unnormalised distances of interest are:
>
> - (ix, -nx), ix>=1, nx>=0
> - (-ix, nx), ix>=1, nx>=0
>
> But the second gets normalised to the first, which is actually useful
> in this case :-).
>
> In terms of the existing code, I think that means we want to change
> the handling of nested statements (only) to:
>
> - ignore DDR_REVERSED_P (ddr)
> - restrict the main dist > 0 case to when the inner distance is <= 0.
>
> This should have the side effect of allowing outer-loop vectorisation for:
>
> void __attribute__ ((noipa))
> f (int a[][N], int b[restrict])
> {
>   for (int i = N - 1; i-- > 0; )
>     for (int j = 0; j < N - 1; ++j)
>       a[j + 1][i] = a[j][i + 1] + b[i];
> }
>
> At the moment we reject this, but AFAICT it should be OK.
> (We do allow it for s/i + 1/i/, since then the outer distance is 0.)
>
> I think we can also skip the existing dist==0 code for nested statements
> since any case that really matters (such as zero steps) should also have
> a second dependence distance (1, 0) that we'd handle as above.  Doing that
> is probably something for stage 1 though.
>
> Hope at least some of that was vaguely right. ;-)
>
> Thanks,
> Richard
Richard Biener March 26, 2019, 8:30 a.m. UTC | #17
On Tue, Mar 26, 2019 at 1:56 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng <amker.cheng@gmail.com> wrote:
> >>
> >> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener
> >> <richard.guenther@gmail.com> wrote:
> >> >
> >> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener
> >> > <richard.guenther@gmail.com> wrote:
> >> > >
> >> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
> >> > > > <richard.guenther@gmail.com> wrote:
> >> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> >> > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
> >> > > >>>> <richard.guenther@gmail.com> wrote:
> >> > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> >> > > >>>>>> Hi,
> >> > > >>>>>> As explained in the PR, given below test case:
> >> > > >>>>>> int a[8][10] = { [2][5] = 4 }, c;
> >> > > >>>>>>
> >> > > >>>>>> int
> >> > > >>>>>> main ()
> >> > > >>>>>> {
> >> > > >>>>>>   short b;
> >> > > >>>>>>   int i, d;
> >> > > >>>>>>   for (b = 4; b >= 0; b--)
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>       a[c + 1][b + 2] = a[c][b + 1];
> >> > > >>>>>>   for (i = 0; i < 8; i++)
> >> > > >>>>>>     for (d = 0; d < 10; d++)
> >> > > >>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
> >> > > >>>>>>         __builtin_abort ();
> >> > > >>>>>>   return 0;
> >> > > >>>>>>
> >> > > >>>>>> the loop nest is illegal for vectorization without reversing inner loop.  The issue
> >> > > >>>>>> is in data dependence checking of vectorizer, I believe the mentioned revision just
> >> > > >>>>>> exposed this.  Previously the vectorization is skipped because of unsupported memory
> >> > > >>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
> >> > > >>>>>>
> >> > > >>>>>>   for (b = 4; b > 0; b -= 4)
> >> > > >>>>>>   {
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>       a[c + 1][6] = a[c][5];
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>       a[c + 1][5] = a[c][4];
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>       a[c + 1][4] = a[c][3];
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>       a[c + 1][3] = a[c][2];
> >> > > >>>>>>   }
> >> > > >>>>>> Then four inner loops are fused into:
> >> > > >>>>>>   for (b = 4; b > 0; b -= 4)
> >> > > >>>>>>   {
> >> > > >>>>>>     for (c = 0; c <= 6; c++)
> >> > > >>>>>>     {
> >> > > >>>>>>       a[c + 1][6] = a[c][5];  // S1
> >> > > >>>>>>       a[c + 1][5] = a[c][4];  // S2
> >> > > >>>>>>       a[c + 1][4] = a[c][3];
> >> > > >>>>>>       a[c + 1][3] = a[c][2];
> >> > > >>>>>>     }
> >> > > >>>>>>   }
> >> > > >>>>>
> >> > > >>>>> Note that they are not really "fused" but they are interleaved.  With
> >> > > >>>>> GIMPLE in mind
> >> > > >>>>> that makes a difference, you should get the equivalent of
> >> > > >>>>>
> >> > > >>>>>    for (c = 0; c <= 6; c++)
> >> > > >>>>>      {
> >> > > >>>>>        tem1 = a[c][5];
> >> > > >>>>>        tem2 = a[c][4];
> >> > > >>>>>        tem3 = a[c][3];
> >> > > >>>>>        tem4 = a[c][2];
> >> > > >>>>>        a[c+1][6] = tem1;
> >> > > >>>>>        a[c +1][5] = tem2;
> >> > > >>>>>         a[c+1][4] = tem3;
> >> > > >>>>>         a[c+1][3] = tem4;
> >> > > >>>>>      }
> >> > > >>>> Yeah, I will double check if this abstract breaks the patch and how.
> >> > > >>> Hmm, I think this doesn't break it, well at least for part of the
> >> > > >>> analysis, because it is loop carried (backward) dependence goes wrong,
> >> > > >>> interleaving or not with the same iteration doesn't matter here.
> >> > > >>
> >> > > >> I think the idea is that forward dependences are always fine (negative distance)
> >> > > >> to vectorize.  But with backward dependences we have to adhere to max_vf.
> >> > > >>
> >> > > >> It looks like for outer loop vectorization we only look at the distances in the
> >> > > >> outer loop but never at inner ones.  But here the same applies but isn't that
> >> > > >> independend on the distances with respect to the outer loop?
> >> > > >>
> >> > > >> But maybe I'm misunderstanding how "distances" work here.
> >> > > > Hmm, I am not sure I understand "distance" correctly.  With
> >> > > > description as in book like "Optimizing compilers for Modern
> >> > > > Architectures", distance is "# of iteration of sink ref - # of
> >> > > > iteration of source ref".  Given below example:
> >> > > >   for (i = 0; i < N; ++i)
> >> > > >     {
> >> > > >       x = arr[idx_1];  // S1
> >> > > >       arr[idx_2] = x;  // S2
> >> > > >     }
> >> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
> >> > > > this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
> >> > > > i;
> >> > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
> >> > > > this is backward dependence.  For example idx_1 is i and idx_2 is i +
> >> > > > 1;
> >> > > >
> >> > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing
> >> > > > classic distance, we could result in negative distance in case of
> >> > > > backward dependence.  When this happens at dependence carried level,
> >> > > > we manually reverse it.  When this happens at inner level loop, we
> >> > > > simply keep the negative distance.  And it's meaningless to talk about
> >> > > > forward/backward given dependence is carried at outer level loop.
> >> > > >
> >> > > > Outer loop vectorization is interesting.  The problematic case has
> >> > > > backward dependence carried by outer loop.  Because we don't check
> >> > > > dist vs. max_vf for such dep, the unrolled references could have outer
> >> > > > loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
> >> > > > like we have outer loop index resolved as equal.  Now it has
> >> > > > dependence only if c == c' + 1.  I think previous comment on fusion
> >> > > > still hold for this and we now need to check backward dependence
> >> > > > between the two reference at inner level loop.  I guess it's a bit
> >> > > > trick to model dependence between a[c][5] and a[c+1][5] using the
> >> > > > original references and dependence.  I think it's okay to do that
> >> > > > because distance of a[c][5] and a[c+1][5] is what we computed
> >> > > > previously for the original references at inner level loop.
> >> > > >
> >> > > > Not sure if I have missed something important.
> >> > >
> >> > > Not sure either.  unroll-and-jam does
> >> > >
> >> > >       FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> >> > >         {
> >> > >           /* A distance (a,b) is at worst transformed into (a/N,b) by the
> >> > >              unrolling (factor N), so the transformation is valid if
> >> > >              a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
> >> > >              factor needs to be limited so that the first condition holds.
> >> > >              That may limit the factor down to zero in the worst case.  */
> >> > >           int dist = dist_v[0];
> >> > >           if (dist < 0)
> >> > >             gcc_unreachable ();
> >> > >           else if ((unsigned)dist >= *unroll)
> >> > >             ;
> >> > >           else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> >> > >                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> >> > >                        && dist > 0))
> >> > >             ;
> >> > >           else
> >> > >             *unroll = dist;
> >> > >
> >> > > where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
> >> > >
> >> > > The vectorizer does way more complicated things and only looks at the
> >> > > distance with respect to the outer loop as far as I can see which can
> >> > > be negative.
> >> > >
> >> > > Not sure if fusion and vectorizer "interleaving" makes a difference here.
> >> > > I think the idea was that when interleaving stmt-by-stmt then forward
> >> > > dependences would be preserved and thus we don't need to check the
> >> > > inner loop dependences.  speaking with "forward vs. backward" dependences
> >> > > again, not distances...
> >> > >
> >> > > This also means that unroll-and-jam could be enhanced to "interleave"
> >> > > stmts and thus cover more cases?
> >> > >
> >> > > Again, I hope Micha can have a look here...
> >> >
> >> > Coming back to this...  the vectorizer doesn't look at the inner loop distances
> >> > at all at the moment.  Bins orginal patch adds that for the case where the
> >> > outer loop evolution results in a negative dependence distance.  But what
> >> > if the outer loop evolution results in a zero distance or a positive distance
> >> > within the bounds of a sensible max_vf?  So I think we need to "simply"
> >> > look at all distance vector components, not just that of the outer loop
> >> > and perform all regular checks.  That also allows adjustment of max_vf
> >> > for inner loop dependences in case vectorization with smaller vectors is
> >> > still possible.
> >> >
> >> > Doing that by making loop_depth iterate
> >> > fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c
> >> > if we make the code look at ->safelen of the inner loop when processing
> >> > inner loop distances (because safelen is 0 there).  Just using ->safelen
> >> > from the outer loop doesn't regress anything it seems.
> >> >
> >> > So I am going to test the following attached.  I have added the
> >> > testcase also with reversed inner loop to verify we'd apply outer loop
> >> > vectorization to that one.
> >> >
> >> > Any comments?
> >> I kind of think that we are checking for different things in this way.
> >> IIUC, dependence checks in outer loop vectorizer could be categorized
> >> into three parts.  The regular vect related check on outer loop;
> >> dependence check to fuse inner loop iterations; and dependence check
> >> to shuffle scalar statement/reference together.  To pass the fusion
> >> check, we only need to make sure no backward dependence is generated
> >> in the result.  So for case of zero/positive distance in outer loop
> >> evolution, it won't result in backward dependence?  For example, if we
> >> adjust the original loop like:
> >>
> >>    for (b = 4; b >= 0; b--)
> >>      for (c = 0; c <= 6; c++)
> >> -       a[c + 1][b + 2] = a[c][b + 1];
> >> +      a[c + 1][b + 1] = a[c][b + 2];
> >>
> >> The fusion result would be
> >>    for (b = 4; b > 0; b -= 4)
> >>    {
> >>      for (c = 0; c <= 6; c++)
> >>      {
> >>        a[c + 1][5] = a[c][6];  // S1
> >>        a[c + 1][4] = a[c][5];  // S2
> >>        a[c + 1][3] = a[c][4];
> >>        a[c + 1][2] = a[c][3];
> >>      }
> >>    }
> >> Though there is dependence between s1/s2, it's forward dependence now.
> >
> > Hmm, OK.  But this doesn't vectorize with or without the patch, we claim
> >
> > t.c:9:3: note:   dependence distance  = 1.
> > t.c:11:29: missed:   not vectorized, possible dependence between
> > data-refs a[c.3_15][_48] and a[_3][_50]
> > t.c:9:3: missed:  bad data dependence.
> > t.c:9:3: missed: couldn't vectorize loop
> >
> >> Even we need to check the negative distance for inner loop, using the
> >> regular checks for fusion part for inner loop(s) is a bit strange.
> >> Using outer loop's safelen here is also unclear.
> >>
> >> I am not sure where we do the shuffle related check, is it (dist == 0) case?
> >> Is that the reason that the patch employs the whole regular checks?
> >
> > I think the suffling needs to look at the outer loop distance and it is the
> > == 0 and the forward dependence case where we check/constrain
> > against the vectorization factor?
> >
> > I agree that formulating the outer loop dependence testing in a way to
> > check the unroll-and-jam condition and the shuffling part separately
> > would be best.  I was mainly worried that your patch only hooks into
> > the negative "outer" distance case and not the zero or positive
> > distance one.  For example for unroll-and-jam we constrain the
> > maximum unroll factor.  I wonder if we should simply call into
> > its adjust_unroll_factor from the vectorizer or copy&paste it
> > to the vectorizer.  I'll note that with known dependence distances
> > it seems to never compute failure...  at least it computes the
> > correct maximum vectorization factor for me.
> >
> > I also wonder about dependences for DRs in the outer loop
> > which we'd even more heavily re-order.
> >
> > Anyways, we should somehow fix this.
> >
> > Attached is a patch variant cut&pasting the unroll-and-jam
> > testing.
> >
> > Does this look better?
> >
> > Thanks,
> > Richard.
> >
> >> Thanks,
> >> bin
> >> >
> >> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> >> >
> >> > Richard.
> >> >
> >> > 2019-03-21  Richard Biener  <rguenther@suse.de>
> >> >
> >> >         PR tree-optimization/81740
> >> >         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
> >> >         Perform distance vector related dependence checks for all
> >> >         loops.
> >> >
> >> >         * gcc.dg/vect/pr81740-1.c: New testcase.
> >> >         * gcc.dg/vect/pr81740-2.c: Likewise.
> >
> > 2019-03-25  Richard Biener  <rguenther@suse.de>
> >
> >         PR tree-optimization/81740
> >       * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
> >       Add explicit unroll-and-jam dependence testing.
> >
> >       * gcc.dg/vect/pr81740-1.c: New testcase.
> >       * gcc.dg/vect/pr81740-2.c: Likewise.
> >
> > Index: gcc/tree-vect-data-refs.c
> > ===================================================================
> > --- gcc/tree-vect-data-refs.c (revision 269908)
> > +++ gcc/tree-vect-data-refs.c (working copy)
> > @@ -411,6 +411,45 @@ vect_analyze_data_ref_dependence (struct
> >      }
> >
> >    loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
> > +  gcc_assert (loop_depth == 0);
> > +
> > +  /* Perform unroll-and-jam test.  */
> > +  if (nested_in_vect_loop_p (loop, stmtinfo_a))
> > +    {
> > +      FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> > +     {
> > +       /* A distance (a,b) is at worst transformed into (a/N,b) by the
> > +          unrolling (factor N), so the transformation is valid if
> > +          a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
> > +          factor needs to be limited so that the first condition holds.
> > +          That may limit the factor down to zero in the worst case.  */
>
> I think it would be good to say that we're applying a two-stage
> check here (unroll-and-jam, then statement-level interleaving).  As it
> stands, it sounds like proving (a>0 && b==0) || b>0 is enough to prove
> that outer-loop vectorisation is valid for any VF, whereas that isn't
> true for the interleaving part.  That said...
>
> > +       int dist = dist_v[0];
> > +       if (dist < 0)
> > +         gcc_unreachable ();
> > +       else if ((unsigned)dist >= *max_vf)
> > +         ;
> > +       else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> > +                || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> > +                    && dist > 0))
> > +         ;
> > +       else if (dist <= 1)
> > +         return opt_result::failure_at (stmtinfo_a->stmt,
> > +                                        "not vectorized, possible dependence"
> > +                                        " between data-refs %T and %T\n",
> > +                                        DR_REF (dra), DR_REF (drb));
> > +       else
> > +         {
> > +           /* Dependence distance does not create dependence, as far as
> > +              vectorization is concerned, in this case.  */
> > +           if (dump_enabled_p ())
> > +             dump_printf_loc (MSG_NOTE, vect_location,
> > +                              "dependence between data-refs %T and %T "
> > +                              "constrains vectorization factor to %d\n",
> > +                              DR_REF (dra), DR_REF (drb), dist);
> > +           *max_vf = dist;
> > +         }
> > +     }
> > +    }
>
> ...I guess this is looping back to the original discussion, sorry, but I'm
> not sure taking it as a two-stage check really helps.  We're going from:
>
> A:
>    for i in [0:n]:
>      for j in [0:n]:
>        A(i,j)
>        B(i,j)
>
> to:
>
> B:
>    for i in [0:n:2]:
>      for j in [0:n]:
>        A(i,j)
>        B(i,j)
>      for j in [0:n]:
>        A(i+1,j)
>        B(i+1,j)
>
> to:
>
> C:
>    for i in [0:n:2]:
>      for j in [0:n]:
>        A(i,j)
>        B(i,j)
>        A(i+1,j)
>        B(i+1,j)
>
> to:
>
> D:
>    for i in [0:n:2]:
>      for j in [0:n]:
>        A(i,j)
>        A(i+1,j)
>        B(i,j)
>        B(i+1,j)
>
> but since outer loop vectorisation doesn't change the order of
> statements for each "i", IMO it's clearer to go directly from B to D
> by treating the inner loop as though we were completely unrolling it.
> C doesn't seem like a useful midway point.
>
> I think the only cases the patch is handling on top of existing checks are:
>
> (a) a<0 && b>0, normalised to a>0 && b<0 with a reverse flag.
> (b) a==0 && b==0, which we now reject earlier
>
> I don't think (b) makes any difference in practice because
> (i) we already reject accesses with a zero step and (ii)
> without the support for runtime aliasing checks for outer loops,
> we're not going to be able to add any extra disambiguation for
> things like:
>
>     d[i][j] = ...;
>     ... = d[i][j];
>
> relative to other loop accesses, so there won't be any cases we can
> handle here that wouldn't have been propagated earlier.
>
> So I think we're really left with (a) being the useful case,
> which is also the one added by Bin's original patch.
>
> Based on the "complete unrolling" view, if we number statements as
> (i, n), where i is the outer loop iteration and n is a statement number
> in the (completely unrolled) loop body, then the original scalar code
> executes in lexicographical order while for the vector loop:
>
> (1) (i,n) executes before (i+ix,n+nx) for all ix>=0, nx>=1, regardless of VF
> (2) (i,n) executes before (i+ix,n-nx) for all ix>=VF, nx>=0
>       (well, nx unrestricted, but only nx>=0 is useful given (1))
>
> So for any kind of dependence between (i,n) and (i+ix,n-nx), ix>=1, nx>=0
> we need to restrict VF to ix so that (2) ensures the right order.
> This means that the unnormalised distances of interest are:
>
> - (ix, -nx), ix>=1, nx>=0
> - (-ix, nx), ix>=1, nx>=0
>
> But the second gets normalised to the first, which is actually useful
> in this case :-).
>
> In terms of the existing code, I think that means we want to change
> the handling of nested statements (only) to:
>
> - ignore DDR_REVERSED_P (ddr)
> - restrict the main dist > 0 case to when the inner distance is <= 0.
>
> This should have the side effect of allowing outer-loop vectorisation for:
>
> void __attribute__ ((noipa))
> f (int a[][N], int b[restrict])
> {
>   for (int i = N - 1; i-- > 0; )
>     for (int j = 0; j < N - 1; ++j)
>       a[j + 1][i] = a[j][i + 1] + b[i];
> }
>
> At the moment we reject this, but AFAICT it should be OK.
> (We do allow it for s/i + 1/i/, since then the outer distance is 0.)

Can you file an enhancement request so we don't forget?

> I think we can also skip the existing dist==0 code for nested statements
> since any case that really matters (such as zero steps) should also have
> a second dependence distance (1, 0) that we'd handle as above.  Doing that
> is probably something for stage 1 though.
>
> Hope at least some of that was vaguely right. ;-)

OK, so there seems to be consensus that Bins patch catched all
cases so lets go with it.

I'm going to re-test & apply it.

Richard.

>
> Thanks,
> Richard
Richard Sandiford April 1, 2019, 5:26 p.m. UTC | #18
Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Mar 26, 2019 at 1:56 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> Based on the "complete unrolling" view, if we number statements as
>> (i, n), where i is the outer loop iteration and n is a statement number
>> in the (completely unrolled) loop body, then the original scalar code
>> executes in lexicographical order while for the vector loop:
>>
>> (1) (i,n) executes before (i+ix,n+nx) for all ix>=0, nx>=1, regardless of VF
>> (2) (i,n) executes before (i+ix,n-nx) for all ix>=VF, nx>=0
>>       (well, nx unrestricted, but only nx>=0 is useful given (1))
>>
>> So for any kind of dependence between (i,n) and (i+ix,n-nx), ix>=1, nx>=0
>> we need to restrict VF to ix so that (2) ensures the right order.
>> This means that the unnormalised distances of interest are:
>>
>> - (ix, -nx), ix>=1, nx>=0
>> - (-ix, nx), ix>=1, nx>=0
>>
>> But the second gets normalised to the first, which is actually useful
>> in this case :-).
>>
>> In terms of the existing code, I think that means we want to change
>> the handling of nested statements (only) to:
>>
>> - ignore DDR_REVERSED_P (ddr)
>> - restrict the main dist > 0 case to when the inner distance is <= 0.
>>
>> This should have the side effect of allowing outer-loop vectorisation for:
>>
>> void __attribute__ ((noipa))
>> f (int a[][N], int b[restrict])
>> {
>>   for (int i = N - 1; i-- > 0; )
>>     for (int j = 0; j < N - 1; ++j)
>>       a[j + 1][i] = a[j][i + 1] + b[i];
>> }
>>
>> At the moment we reject this, but AFAICT it should be OK.
>> (We do allow it for s/i + 1/i/, since then the outer distance is 0.)
>
> Can you file an enhancement request so we don't forget?

OK, for the record it's https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89908
diff mbox series

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/pr81740-1.c b/gcc/testsuite/gcc.dg/vect/pr81740-1.c
new file mode 100644
index 0000000..d90aba5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr81740-1.c
@@ -0,0 +1,17 @@ 
+/* { dg-do run } */
+int a[8][10] = { [2][5] = 4 }, c;
+
+int
+main ()
+{
+  short b;
+  int i, d;
+  for (b = 4; b >= 0; b--)
+    for (c = 0; c <= 6; c++)
+      a[c + 1][b + 2] = a[c][b + 1];
+  for (i = 0; i < 8; i++)
+    for (d = 0; d < 10; d++)
+      if (a[i][d] != (i == 3 && d == 6) * 4)
+        __builtin_abort ();
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr81740-2.c b/gcc/testsuite/gcc.dg/vect/pr81740-2.c
new file mode 100644
index 0000000..fb5b300
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr81740-2.c
@@ -0,0 +1,21 @@ 
+/* { dg-do run } */
+/* { dg-require-effective-target vect_int } */
+
+int a[8][10] = { [2][5] = 4 }, c;
+
+int
+main ()
+{
+  short b;
+  int i, d;
+  for (b = 4; b >= 0; b--)
+    for (c = 6; c >= 0; c--)
+      a[c + 1][b + 2] = a[c][b + 1];
+  for (i = 0; i < 8; i++)
+    for (d = 0; d < 10; d++)
+      if (a[i][d] != (i == 3 && d == 6) * 4)
+        __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect"  } } */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 996d156..3b780cf1 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -435,6 +435,17 @@  vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 	                     "dependence distance negative.\n");
+	  /* When doing outer loop vectorization, we need to check if there is
+	     backward dependence at inner loop level if dependence at the outer
+	     loop is reversed.  See PR81740 for more information.  */
+	  if (nested_in_vect_loop_p (loop, DR_STMT (dra))
+	      || nested_in_vect_loop_p (loop, DR_STMT (drb)))
+	    {
+	      unsigned inner_depth = index_in_loop_nest (loop->inner->num,
+							 DDR_LOOP_NEST (ddr));
+	      if (dist_v[inner_depth] < 0)
+		return true;
+	    }
 	  /* Record a negative dependence distance to later limit the
 	     amount of stmt copying / unrolling we can perform.
 	     Only need to handle read-after-write dependence.  */