Patchwork Fix PR53397

login
register
mail settings
Submitter venkataramanan.kumar@amd.com
Date Oct. 1, 2012, 3:50 p.m.
Message ID <20121001155033.14499.64527.sendpatchset@adcelk01.amd.com>
Download mbox | patch
Permalink /patch/188320/
State New
Headers show

Comments

venkataramanan.kumar@amd.com - Oct. 1, 2012, 3:50 p.m.
Hi, 

The below patch fixes the FFT/Scimark regression caused by useless prefetch
generation.

This fix tries to make prefetch less aggressive by prefetching arrays in the
inner loop, when the step is invariant in the entire loop nest.

GCC currently tries to prefetch invariant steps when they are in the inner
loop. But does not check if the step is variant in outer loops.

In the scimark FFT case, the trip count of the inner loop varies by a non
constant step, which is invariant in the inner loop. 
But the step variable is varying in outer loop. This makes
inner loop trip count small (at run time varies sometimes as small as 1
iteration) 

Prefetching ahead x iteration when the inner loop trip count is smaller than x
leads to useless prefetches. 

Flag used: -O3 -march=amdfam10 

Before 
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:          550.50
FFT             Mflops:    38.66    (N=1024)
SOR             Mflops:   617.61    (100 x 100)
MonteCarlo:     Mflops:   173.74
Sparse matmult  Mflops:   675.63    (N=1000, nz=5000)
LU              Mflops:  1246.88    (M=100, N=100)


After 
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:          639.20
FFT             Mflops:   479.19    (N=1024)
SOR             Mflops:   617.61    (100 x 100)
MonteCarlo:     Mflops:   173.18
Sparse matmult  Mflops:   679.13    (N=1000, nz=5000)
LU              Mflops:  1246.88    (M=100, N=100)

GCC regression "make check -k" passes with x86_64-unknown-linux-gnu
New tests that PASS:

gcc.dg/pr53397-1.c scan-assembler prefetcht0
gcc.dg/pr53397-1.c scan-tree-dump aprefetch "Issued prefetch"
gcc.dg/pr53397-1.c (test for excess errors)
gcc.dg/pr53397-2.c scan-tree-dump aprefetch "loop variant step"
gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching"
gcc.dg/pr53397-2.c (test for excess errors)


Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted.

Ok to commit in trunk?

regards,
Venkat

gcc/ChangeLog
+2012-10-01  Venkataramanan Kumar  <venkataramanan.kumar@amd.com>
+
+   * tree-ssa-loop-prefetch.c (gather_memory_references_ref):$
+   Perform non constant step prefetching in inner loop, only $
+   when it is invariant in the entire loop nest.  $
+   * testsuite/gcc.dg/pr53397-1.c: New test case $
+   Checks we are prefecthing for loop invariant steps$
+   * testsuite/gcc.dg/pr53397-2.c: New test case$
+   Checks we are not prefecthing for loop variant steps
+
Richard Guenther - Oct. 2, 2012, 12:11 p.m.
On Mon, 1 Oct 2012, venkataramanan.kumar@amd.com wrote:

> Hi, 
> 
> The below patch fixes the FFT/Scimark regression caused by useless prefetch
> generation.
> 
> This fix tries to make prefetch less aggressive by prefetching arrays in the
> inner loop, when the step is invariant in the entire loop nest.
> 
> GCC currently tries to prefetch invariant steps when they are in the inner
> loop. But does not check if the step is variant in outer loops.
> 
> In the scimark FFT case, the trip count of the inner loop varies by a non
> constant step, which is invariant in the inner loop. 
> But the step variable is varying in outer loop. This makes
> inner loop trip count small (at run time varies sometimes as small as 1
> iteration) 
> 
> Prefetching ahead x iteration when the inner loop trip count is smaller than x
> leads to useless prefetches. 
> 
> Flag used: -O3 -march=amdfam10 
> 
> Before 
> **                                                              **
> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
> ** for details. (Results can be submitted to pozo@nist.gov)     **
> **                                                              **
> Using       2.00 seconds min time per kenel.
> Composite Score:          550.50
> FFT             Mflops:    38.66    (N=1024)
> SOR             Mflops:   617.61    (100 x 100)
> MonteCarlo:     Mflops:   173.74
> Sparse matmult  Mflops:   675.63    (N=1000, nz=5000)
> LU              Mflops:  1246.88    (M=100, N=100)
> 
> 
> After 
> **                                                              **
> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
> ** for details. (Results can be submitted to pozo@nist.gov)     **
> **                                                              **
> Using       2.00 seconds min time per kenel.
> Composite Score:          639.20
> FFT             Mflops:   479.19    (N=1024)
> SOR             Mflops:   617.61    (100 x 100)
> MonteCarlo:     Mflops:   173.18
> Sparse matmult  Mflops:   679.13    (N=1000, nz=5000)
> LU              Mflops:  1246.88    (M=100, N=100)
> 
> GCC regression "make check -k" passes with x86_64-unknown-linux-gnu
> New tests that PASS:
> 
> gcc.dg/pr53397-1.c scan-assembler prefetcht0
> gcc.dg/pr53397-1.c scan-tree-dump aprefetch "Issued prefetch"
> gcc.dg/pr53397-1.c (test for excess errors)
> gcc.dg/pr53397-2.c scan-tree-dump aprefetch "loop variant step"
> gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching"
> gcc.dg/pr53397-2.c (test for excess errors)
> 
> 
> Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted.
> 
> Ok to commit in trunk?
> 
> regards,
> Venkat
> 
> gcc/ChangeLog
> +2012-10-01  Venkataramanan Kumar  <venkataramanan.kumar@amd.com>
> +
> +   * tree-ssa-loop-prefetch.c (gather_memory_references_ref):$
> +   Perform non constant step prefetching in inner loop, only $
> +   when it is invariant in the entire loop nest.  $
> +   * testsuite/gcc.dg/pr53397-1.c: New test case $
> +   Checks we are prefecthing for loop invariant steps$
> +   * testsuite/gcc.dg/pr53397-2.c: New test case$
> +   Checks we are not prefecthing for loop variant steps
> +
> 
> 
> Index: gcc/testsuite/gcc.dg/pr53397-1.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/pr53397-1.c	(revision 0)
> +++ gcc/testsuite/gcc.dg/pr53397-1.c	(revision 0)
> @@ -0,0 +1,28 @@
> +/* Prefetching when the step is loop invariant.  */
> +
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */
> +
> +
> +double data[16384];
> +void prefetch_when_non_constant_step_is_invariant(int step, int n)
> +{
> +     int a;
> +     int b;
> +     for (a = 1; a < step; a++) {
> +        for (b = 0; b < n; b += 2 * step) {
> +
> +          int i = 2*(b + a);
> +          int j = 2*(b + a + step);
> +
> +
> +          data[j]   = data[i];
> +          data[j+1] = data[i+1];
> +        }
> +     }
> +}
> +
> +/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */
> +/* { dg-final { scan-assembler "prefetcht0" } } */

This (and the case below) needs to be adjusted to only run on the
appropriate hardware.  See for example gcc.dg/tree-ssa/prefetch-8.c
for how to do this.

> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */
> Index: gcc/testsuite/gcc.dg/pr53397-2.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/pr53397-2.c	(revision 0)
> +++ gcc/testsuite/gcc.dg/pr53397-2.c	(revision 0)
> @@ -0,0 +1,29 @@
> +/* Not prefetching when the step is loop variant.  */
> +
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */
> +
> +
> +double data[16384];
> +void donot_prefetch_when_non_constant_step_is_variant(int step, int n)
> +{ 
> +     int a;
> +     int b;
> +     for (a = 1; a < step; a++,step*=2) {
> +        for (b = 0; b < n; b += 2 * step) {
> +
> +          int i = 2*(b + a);
> +          int j = 2*(b + a + step);
> +
> +
> +          data[j]   = data[i];
> +          data[j+1] = data[i+1];
> +        }
> +     } 
> +}
> +
> +/* { dg-final { scan-tree-dump "Not prefetching" "aprefetch" } } */
> +/* { dg-final { scan-tree-dump "loop variant step" "aprefetch" } } */
> +
> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */
> +
> Index: gcc/tree-ssa-loop-prefetch.c
> ===================================================================
> --- gcc/tree-ssa-loop-prefetch.c	(revision 191642)
> +++ gcc/tree-ssa-loop-prefetch.c	(working copy)
> @@ -523,6 +523,7 @@
>    tree base, step;
>    HOST_WIDE_INT delta;
>    struct mem_ref_group *agrp;
> +  loop_p ploop;
>  
>    if (get_base_address (ref) == NULL)
>      return false;
> @@ -537,10 +538,50 @@
>    if (may_be_nonaddressable_p (base))
>      return false;
>  
> -  /* Limit non-constant step prefetching only to the innermost loops.  */
> -  if (!cst_and_fits_in_hwi (step) && loop->inner != NULL)
> -    return false;
> +  /* Limit non-constant step prefetching only to the innermost loops and 
> +     only when the step is invariant in the entire loop nest. */
> +  if (!cst_and_fits_in_hwi (step))
> +    {
> +      if( loop->inner != NULL)
> +        {
> +          if (dump_file && (dump_flags & TDF_DETAILS))
> +            {
> +              fprintf (dump_file, "Reference %p:\n", (void *) ref);
> +              fprintf (dump_file, "(base " );
> +              print_generic_expr (dump_file, base, TDF_SLIM);
> +              fprintf (dump_file, ", step ");
> +              print_generic_expr (dump_file, step, TDF_TREE);
> +              fprintf (dump_file, ")\n");

No need to repeat this - all references are dumped when we gather them.

> +              fprintf (dump_file, "Ignoring %p, non-constant step prefetching\
> +                is limited to inner most loops \n",(void *) ref);
> +            }
> +            return false;    
> +         }
> +      else
> +        {
> +          ploop = loop;
>  
> +          while (loop_outer (ploop))
> +            {
> +              if (!expr_invariant_in_loop_p (ploop , step))
> +                {
> +                  if (dump_file && (dump_flags & TDF_DETAILS))
> +                    {
> +                      fprintf (dump_file, "Reference %p:\n", (void *) ref);
> +                      fprintf (dump_file, "(base " );
> +                      print_generic_expr (dump_file, base, TDF_SLIM);
> +                      fprintf (dump_file, ", step ");
> +                      print_generic_expr (dump_file, step, TDF_TREE);
> +                      fprintf (dump_file, ")\n");

Likewise.

> +                      fprintf (dump_file, "Not prefetching, ignoring %p due to loop variant step\n",(void *) ref);
> +                    }
> +                      return false;                 
> +                }
> +                ploop = loop_outer (ploop);
> +            }

Instead of the loop checking expr_invariant_in_loop_p for each sub-loop
you can query expr_invariant_in_loop_p once, for the outermost loop.

Please add a new function to cfgloop.h to access this outermost loop:

/* Returns the outermost loop of the loop nest that contains LOOP.  */

static inline struct loop *
loop_outermost (const struct loop *loop)
{
  unsigned n = VEC_length (loop_p, loop->superloops);

  if (n <= 1)
    return loop;

  return VEC_index (loop_p, loop->superloops, 1);
}

Thanks,
Richard.
venkataramanan.kumar@amd.com - Oct. 2, 2012, 4:40 p.m.
Hi Richi,

(Snip)
> + (!cst_and_fits_in_hwi (step))
> +    {
> +      if( loop->inner != NULL)
> +        {
> +          if (dump_file && (dump_flags & TDF_DETAILS))
> +            {
> +              fprintf (dump_file, "Reference %p:\n", (void *) ref);
> +              fprintf (dump_file, "(base " );
> +              print_generic_expr (dump_file, base, TDF_SLIM);
> +              fprintf (dump_file, ", step ");
> +              print_generic_expr (dump_file, step, TDF_TREE);
> +              fprintf (dump_file, ")\n");

No need to repeat this - all references are dumped when we gather them.
(Snip)

The dumping happens at "record_ref" which is called after these statements to record these references.

When the step is invariant  we return from the function without recording the references. 

 so I thought of dumping the references here.

Is there a cleaner way to dump the references at one place?

Regards,
Venkat.



-----Original Message-----
From: Richard Guenther [mailto:rguenther@suse.de] 
Sent: Tuesday, October 02, 2012 5:42 PM
To: Kumar, Venkataramanan
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [Patch] Fix PR53397

On Mon, 1 Oct 2012, venkataramanan.kumar@amd.com wrote:

> Hi,
> 
> The below patch fixes the FFT/Scimark regression caused by useless 
> prefetch generation.
> 
> This fix tries to make prefetch less aggressive by prefetching arrays 
> in the inner loop, when the step is invariant in the entire loop nest.
> 
> GCC currently tries to prefetch invariant steps when they are in the 
> inner loop. But does not check if the step is variant in outer loops.
> 
> In the scimark FFT case, the trip count of the inner loop varies by a 
> non constant step, which is invariant in the inner loop.
> But the step variable is varying in outer loop. This makes inner loop 
> trip count small (at run time varies sometimes as small as 1
> iteration)
> 
> Prefetching ahead x iteration when the inner loop trip count is 
> smaller than x leads to useless prefetches.
> 
> Flag used: -O3 -march=amdfam10
> 
> Before 
> **                                                              **
> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
> ** for details. (Results can be submitted to pozo@nist.gov)     **
> **                                                              **
> Using       2.00 seconds min time per kenel.
> Composite Score:          550.50
> FFT             Mflops:    38.66    (N=1024)
> SOR             Mflops:   617.61    (100 x 100)
> MonteCarlo:     Mflops:   173.74
> Sparse matmult  Mflops:   675.63    (N=1000, nz=5000)
> LU              Mflops:  1246.88    (M=100, N=100)
> 
> 
> After 
> **                                                              **
> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
> ** for details. (Results can be submitted to pozo@nist.gov)     **
> **                                                              **
> Using       2.00 seconds min time per kenel.
> Composite Score:          639.20
> FFT             Mflops:   479.19    (N=1024)
> SOR             Mflops:   617.61    (100 x 100)
> MonteCarlo:     Mflops:   173.18
> Sparse matmult  Mflops:   679.13    (N=1000, nz=5000)
> LU              Mflops:  1246.88    (M=100, N=100)
> 
> GCC regression "make check -k" passes with x86_64-unknown-linux-gnu 
> New tests that PASS:
> 
> gcc.dg/pr53397-1.c scan-assembler prefetcht0 gcc.dg/pr53397-1.c 
> scan-tree-dump aprefetch "Issued prefetch"
> gcc.dg/pr53397-1.c (test for excess errors) gcc.dg/pr53397-2.c 
> scan-tree-dump aprefetch "loop variant step"
> gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching"
> gcc.dg/pr53397-2.c (test for excess errors)
> 
> 
> Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted.
> 
> Ok to commit in trunk?
> 
> regards,
> Venkat
> 
> gcc/ChangeLog
> +2012-10-01  Venkataramanan Kumar  <venkataramanan.kumar@amd.com>
> +
> +   * tree-ssa-loop-prefetch.c (gather_memory_references_ref):$
> +   Perform non constant step prefetching in inner loop, only $
> +   when it is invariant in the entire loop nest.  $
> +   * testsuite/gcc.dg/pr53397-1.c: New test case $
> +   Checks we are prefecthing for loop invariant steps$
> +   * testsuite/gcc.dg/pr53397-2.c: New test case$
> +   Checks we are not prefecthing for loop variant steps
> +
> 
> 
> Index: gcc/testsuite/gcc.dg/pr53397-1.c 
> ===================================================================
> --- gcc/testsuite/gcc.dg/pr53397-1.c	(revision 0)
> +++ gcc/testsuite/gcc.dg/pr53397-1.c	(revision 0)
> @@ -0,0 +1,28 @@
> +/* Prefetching when the step is loop invariant.  */
> +
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fprefetch-loop-arrays 
> +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 
> +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } 
> +*/
> +
> +
> +double data[16384];
> +void prefetch_when_non_constant_step_is_invariant(int step, int n) {
> +     int a;
> +     int b;
> +     for (a = 1; a < step; a++) {
> +        for (b = 0; b < n; b += 2 * step) {
> +
> +          int i = 2*(b + a);
> +          int j = 2*(b + a + step);
> +
> +
> +          data[j]   = data[i];
> +          data[j+1] = data[i+1];
> +        }
> +     }
> +}
> +
> +/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */
> +/* { dg-final { scan-assembler "prefetcht0" } } */

This (and the case below) needs to be adjusted to only run on the appropriate hardware.  See for example gcc.dg/tree-ssa/prefetch-8.c for how to do this.

> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */
> Index: gcc/testsuite/gcc.dg/pr53397-2.c 
> ===================================================================
> --- gcc/testsuite/gcc.dg/pr53397-2.c	(revision 0)
> +++ gcc/testsuite/gcc.dg/pr53397-2.c	(revision 0)
> @@ -0,0 +1,29 @@
> +/* Not prefetching when the step is loop variant.  */
> +
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fprefetch-loop-arrays 
> +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 
> +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } 
> +*/
> +
> +
> +double data[16384];
> +void donot_prefetch_when_non_constant_step_is_variant(int step, int 
> +n) {
> +     int a;
> +     int b;
> +     for (a = 1; a < step; a++,step*=2) {
> +        for (b = 0; b < n; b += 2 * step) {
> +
> +          int i = 2*(b + a);
> +          int j = 2*(b + a + step);
> +
> +
> +          data[j]   = data[i];
> +          data[j+1] = data[i+1];
> +        }
> +     }
> +}
> +
> +/* { dg-final { scan-tree-dump "Not prefetching" "aprefetch" } } */
> +/* { dg-final { scan-tree-dump "loop variant step" "aprefetch" } } */
> +
> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */
> +
> Index: gcc/tree-ssa-loop-prefetch.c
> ===================================================================
> --- gcc/tree-ssa-loop-prefetch.c	(revision 191642)
> +++ gcc/tree-ssa-loop-prefetch.c	(working copy)
> @@ -523,6 +523,7 @@
>    tree base, step;
>    HOST_WIDE_INT delta;
>    struct mem_ref_group *agrp;
> +  loop_p ploop;
>  
>    if (get_base_address (ref) == NULL)
>      return false;
> @@ -537,10 +538,50 @@
>    if (may_be_nonaddressable_p (base))
>      return false;
>  
> -  /* Limit non-constant step prefetching only to the innermost loops.  
> */
> -  if (!cst_and_fits_in_hwi (step) && loop->inner != NULL)
> -    return false;
> +  /* Limit non-constant step prefetching only to the innermost loops and 
> +     only when the step is invariant in the entire loop nest. */  if 
> + (!cst_and_fits_in_hwi (step))
> +    {
> +      if( loop->inner != NULL)
> +        {
> +          if (dump_file && (dump_flags & TDF_DETAILS))
> +            {
> +              fprintf (dump_file, "Reference %p:\n", (void *) ref);
> +              fprintf (dump_file, "(base " );
> +              print_generic_expr (dump_file, base, TDF_SLIM);
> +              fprintf (dump_file, ", step ");
> +              print_generic_expr (dump_file, step, TDF_TREE);
> +              fprintf (dump_file, ")\n");

No need to repeat this - all references are dumped when we gather them.

> +              fprintf (dump_file, "Ignoring %p, non-constant step prefetching\
> +                is limited to inner most loops \n",(void *) ref);
> +            }
> +            return false;    
> +         }
> +      else
> +        {
> +          ploop = loop;
>  
> +          while (loop_outer (ploop))
> +            {
> +              if (!expr_invariant_in_loop_p (ploop , step))
> +                {
> +                  if (dump_file && (dump_flags & TDF_DETAILS))
> +                    {
> +                      fprintf (dump_file, "Reference %p:\n", (void *) ref);
> +                      fprintf (dump_file, "(base " );
> +                      print_generic_expr (dump_file, base, TDF_SLIM);
> +                      fprintf (dump_file, ", step ");
> +                      print_generic_expr (dump_file, step, TDF_TREE);
> +                      fprintf (dump_file, ")\n");

Likewise.

> +                      fprintf (dump_file, "Not prefetching, ignoring %p due to loop variant step\n",(void *) ref);
> +                    }
> +                      return false;                 
> +                }
> +                ploop = loop_outer (ploop);
> +            }

Instead of the loop checking expr_invariant_in_loop_p for each sub-loop you can query expr_invariant_in_loop_p once, for the outermost loop.

Please add a new function to cfgloop.h to access this outermost loop:

/* Returns the outermost loop of the loop nest that contains LOOP.  */

static inline struct loop *
loop_outermost (const struct loop *loop) {
  unsigned n = VEC_length (loop_p, loop->superloops);

  if (n <= 1)
    return loop;

  return VEC_index (loop_p, loop->superloops, 1); }

Thanks,
Richard.
Richard Guenther - Oct. 4, 2012, 12:55 p.m.
On Tue, Oct 2, 2012 at 6:40 PM, Kumar, Venkataramanan
<Venkataramanan.Kumar@amd.com> wrote:
> Hi Richi,
>
> (Snip)
>> + (!cst_and_fits_in_hwi (step))
>> +    {
>> +      if( loop->inner != NULL)
>> +        {
>> +          if (dump_file && (dump_flags & TDF_DETAILS))
>> +            {
>> +              fprintf (dump_file, "Reference %p:\n", (void *) ref);
>> +              fprintf (dump_file, "(base " );
>> +              print_generic_expr (dump_file, base, TDF_SLIM);
>> +              fprintf (dump_file, ", step ");
>> +              print_generic_expr (dump_file, step, TDF_TREE);
>> +              fprintf (dump_file, ")\n");
>
> No need to repeat this - all references are dumped when we gather them.
> (Snip)
>
> The dumping happens at "record_ref" which is called after these statements to record these references.
>
> When the step is invariant  we return from the function without recording the references.
>
>  so I thought of dumping the references here.
>
> Is there a cleaner way to dump the references at one place?

Yes, call dump_mem_ref then, instead of repeating parts of its body.

Richard.

> Regards,
> Venkat.
>
>
>
> -----Original Message-----
> From: Richard Guenther [mailto:rguenther@suse.de]
> Sent: Tuesday, October 02, 2012 5:42 PM
> To: Kumar, Venkataramanan
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [Patch] Fix PR53397
>
> On Mon, 1 Oct 2012, venkataramanan.kumar@amd.com wrote:
>
>> Hi,
>>
>> The below patch fixes the FFT/Scimark regression caused by useless
>> prefetch generation.
>>
>> This fix tries to make prefetch less aggressive by prefetching arrays
>> in the inner loop, when the step is invariant in the entire loop nest.
>>
>> GCC currently tries to prefetch invariant steps when they are in the
>> inner loop. But does not check if the step is variant in outer loops.
>>
>> In the scimark FFT case, the trip count of the inner loop varies by a
>> non constant step, which is invariant in the inner loop.
>> But the step variable is varying in outer loop. This makes inner loop
>> trip count small (at run time varies sometimes as small as 1
>> iteration)
>>
>> Prefetching ahead x iteration when the inner loop trip count is
>> smaller than x leads to useless prefetches.
>>
>> Flag used: -O3 -march=amdfam10
>>
>> Before
>> **                                                              **
>> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
>> ** for details. (Results can be submitted to pozo@nist.gov)     **
>> **                                                              **
>> Using       2.00 seconds min time per kenel.
>> Composite Score:          550.50
>> FFT             Mflops:    38.66    (N=1024)
>> SOR             Mflops:   617.61    (100 x 100)
>> MonteCarlo:     Mflops:   173.74
>> Sparse matmult  Mflops:   675.63    (N=1000, nz=5000)
>> LU              Mflops:  1246.88    (M=100, N=100)
>>
>>
>> After
>> **                                                              **
>> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
>> ** for details. (Results can be submitted to pozo@nist.gov)     **
>> **                                                              **
>> Using       2.00 seconds min time per kenel.
>> Composite Score:          639.20
>> FFT             Mflops:   479.19    (N=1024)
>> SOR             Mflops:   617.61    (100 x 100)
>> MonteCarlo:     Mflops:   173.18
>> Sparse matmult  Mflops:   679.13    (N=1000, nz=5000)
>> LU              Mflops:  1246.88    (M=100, N=100)
>>
>> GCC regression "make check -k" passes with x86_64-unknown-linux-gnu
>> New tests that PASS:
>>
>> gcc.dg/pr53397-1.c scan-assembler prefetcht0 gcc.dg/pr53397-1.c
>> scan-tree-dump aprefetch "Issued prefetch"
>> gcc.dg/pr53397-1.c (test for excess errors) gcc.dg/pr53397-2.c
>> scan-tree-dump aprefetch "loop variant step"
>> gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching"
>> gcc.dg/pr53397-2.c (test for excess errors)
>>
>>
>> Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted.
>>
>> Ok to commit in trunk?
>>
>> regards,
>> Venkat
>>
>> gcc/ChangeLog
>> +2012-10-01  Venkataramanan Kumar  <venkataramanan.kumar@amd.com>
>> +
>> +   * tree-ssa-loop-prefetch.c (gather_memory_references_ref):$
>> +   Perform non constant step prefetching in inner loop, only $
>> +   when it is invariant in the entire loop nest.  $
>> +   * testsuite/gcc.dg/pr53397-1.c: New test case $
>> +   Checks we are prefecthing for loop invariant steps$
>> +   * testsuite/gcc.dg/pr53397-2.c: New test case$
>> +   Checks we are not prefecthing for loop variant steps
>> +
>>
>>
>> Index: gcc/testsuite/gcc.dg/pr53397-1.c
>> ===================================================================
>> --- gcc/testsuite/gcc.dg/pr53397-1.c  (revision 0)
>> +++ gcc/testsuite/gcc.dg/pr53397-1.c  (revision 0)
>> @@ -0,0 +1,28 @@
>> +/* Prefetching when the step is loop invariant.  */
>> +
>> +/* { dg-do compile } */
>> +/* { dg-options "-O3 -fprefetch-loop-arrays
>> +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3
>> +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" }
>> +*/
>> +
>> +
>> +double data[16384];
>> +void prefetch_when_non_constant_step_is_invariant(int step, int n) {
>> +     int a;
>> +     int b;
>> +     for (a = 1; a < step; a++) {
>> +        for (b = 0; b < n; b += 2 * step) {
>> +
>> +          int i = 2*(b + a);
>> +          int j = 2*(b + a + step);
>> +
>> +
>> +          data[j]   = data[i];
>> +          data[j+1] = data[i+1];
>> +        }
>> +     }
>> +}
>> +
>> +/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */
>> +/* { dg-final { scan-assembler "prefetcht0" } } */
>
> This (and the case below) needs to be adjusted to only run on the appropriate hardware.  See for example gcc.dg/tree-ssa/prefetch-8.c for how to do this.
>
>> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */
>> Index: gcc/testsuite/gcc.dg/pr53397-2.c
>> ===================================================================
>> --- gcc/testsuite/gcc.dg/pr53397-2.c  (revision 0)
>> +++ gcc/testsuite/gcc.dg/pr53397-2.c  (revision 0)
>> @@ -0,0 +1,29 @@
>> +/* Not prefetching when the step is loop variant.  */
>> +
>> +/* { dg-do compile } */
>> +/* { dg-options "-O3 -fprefetch-loop-arrays
>> +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3
>> +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" }
>> +*/
>> +
>> +
>> +double data[16384];
>> +void donot_prefetch_when_non_constant_step_is_variant(int step, int
>> +n) {
>> +     int a;
>> +     int b;
>> +     for (a = 1; a < step; a++,step*=2) {
>> +        for (b = 0; b < n; b += 2 * step) {
>> +
>> +          int i = 2*(b + a);
>> +          int j = 2*(b + a + step);
>> +
>> +
>> +          data[j]   = data[i];
>> +          data[j+1] = data[i+1];
>> +        }
>> +     }
>> +}
>> +
>> +/* { dg-final { scan-tree-dump "Not prefetching" "aprefetch" } } */
>> +/* { dg-final { scan-tree-dump "loop variant step" "aprefetch" } } */
>> +
>> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */
>> +
>> Index: gcc/tree-ssa-loop-prefetch.c
>> ===================================================================
>> --- gcc/tree-ssa-loop-prefetch.c      (revision 191642)
>> +++ gcc/tree-ssa-loop-prefetch.c      (working copy)
>> @@ -523,6 +523,7 @@
>>    tree base, step;
>>    HOST_WIDE_INT delta;
>>    struct mem_ref_group *agrp;
>> +  loop_p ploop;
>>
>>    if (get_base_address (ref) == NULL)
>>      return false;
>> @@ -537,10 +538,50 @@
>>    if (may_be_nonaddressable_p (base))
>>      return false;
>>
>> -  /* Limit non-constant step prefetching only to the innermost loops.
>> */
>> -  if (!cst_and_fits_in_hwi (step) && loop->inner != NULL)
>> -    return false;
>> +  /* Limit non-constant step prefetching only to the innermost loops and
>> +     only when the step is invariant in the entire loop nest. */  if
>> + (!cst_and_fits_in_hwi (step))
>> +    {
>> +      if( loop->inner != NULL)
>> +        {
>> +          if (dump_file && (dump_flags & TDF_DETAILS))
>> +            {
>> +              fprintf (dump_file, "Reference %p:\n", (void *) ref);
>> +              fprintf (dump_file, "(base " );
>> +              print_generic_expr (dump_file, base, TDF_SLIM);
>> +              fprintf (dump_file, ", step ");
>> +              print_generic_expr (dump_file, step, TDF_TREE);
>> +              fprintf (dump_file, ")\n");
>
> No need to repeat this - all references are dumped when we gather them.
>
>> +              fprintf (dump_file, "Ignoring %p, non-constant step prefetching\
>> +                is limited to inner most loops \n",(void *) ref);
>> +            }
>> +            return false;
>> +         }
>> +      else
>> +        {
>> +          ploop = loop;
>>
>> +          while (loop_outer (ploop))
>> +            {
>> +              if (!expr_invariant_in_loop_p (ploop , step))
>> +                {
>> +                  if (dump_file && (dump_flags & TDF_DETAILS))
>> +                    {
>> +                      fprintf (dump_file, "Reference %p:\n", (void *) ref);
>> +                      fprintf (dump_file, "(base " );
>> +                      print_generic_expr (dump_file, base, TDF_SLIM);
>> +                      fprintf (dump_file, ", step ");
>> +                      print_generic_expr (dump_file, step, TDF_TREE);
>> +                      fprintf (dump_file, ")\n");
>
> Likewise.
>
>> +                      fprintf (dump_file, "Not prefetching, ignoring %p due to loop variant step\n",(void *) ref);
>> +                    }
>> +                      return false;
>> +                }
>> +                ploop = loop_outer (ploop);
>> +            }
>
> Instead of the loop checking expr_invariant_in_loop_p for each sub-loop you can query expr_invariant_in_loop_p once, for the outermost loop.
>
> Please add a new function to cfgloop.h to access this outermost loop:
>
> /* Returns the outermost loop of the loop nest that contains LOOP.  */
>
> static inline struct loop *
> loop_outermost (const struct loop *loop) {
>   unsigned n = VEC_length (loop_p, loop->superloops);
>
>   if (n <= 1)
>     return loop;
>
>   return VEC_index (loop_p, loop->superloops, 1); }
>
> Thanks,
> Richard.
>
>

Patch

Index: gcc/testsuite/gcc.dg/pr53397-1.c
===================================================================
--- gcc/testsuite/gcc.dg/pr53397-1.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr53397-1.c	(revision 0)
@@ -0,0 +1,28 @@ 
+/* Prefetching when the step is loop invariant.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */
+
+
+double data[16384];
+void prefetch_when_non_constant_step_is_invariant(int step, int n)
+{
+     int a;
+     int b;
+     for (a = 1; a < step; a++) {
+        for (b = 0; b < n; b += 2 * step) {
+
+          int i = 2*(b + a);
+          int j = 2*(b + a + step);
+
+
+          data[j]   = data[i];
+          data[j+1] = data[i+1];
+        }
+     }
+}
+
+/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */
+/* { dg-final { scan-assembler "prefetcht0" } } */
+
+/* { dg-final { cleanup-tree-dump "aprefetch" } } */
Index: gcc/testsuite/gcc.dg/pr53397-2.c
===================================================================
--- gcc/testsuite/gcc.dg/pr53397-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr53397-2.c	(revision 0)
@@ -0,0 +1,29 @@ 
+/* Not prefetching when the step is loop variant.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */
+
+
+double data[16384];
+void donot_prefetch_when_non_constant_step_is_variant(int step, int n)
+{ 
+     int a;
+     int b;
+     for (a = 1; a < step; a++,step*=2) {
+        for (b = 0; b < n; b += 2 * step) {
+
+          int i = 2*(b + a);
+          int j = 2*(b + a + step);
+
+
+          data[j]   = data[i];
+          data[j+1] = data[i+1];
+        }
+     } 
+}
+
+/* { dg-final { scan-tree-dump "Not prefetching" "aprefetch" } } */
+/* { dg-final { scan-tree-dump "loop variant step" "aprefetch" } } */
+
+/* { dg-final { cleanup-tree-dump "aprefetch" } } */
+
Index: gcc/tree-ssa-loop-prefetch.c
===================================================================
--- gcc/tree-ssa-loop-prefetch.c	(revision 191642)
+++ gcc/tree-ssa-loop-prefetch.c	(working copy)
@@ -523,6 +523,7 @@ 
   tree base, step;
   HOST_WIDE_INT delta;
   struct mem_ref_group *agrp;
+  loop_p ploop;
 
   if (get_base_address (ref) == NULL)
     return false;
@@ -537,10 +538,50 @@ 
   if (may_be_nonaddressable_p (base))
     return false;
 
-  /* Limit non-constant step prefetching only to the innermost loops.  */
-  if (!cst_and_fits_in_hwi (step) && loop->inner != NULL)
-    return false;
+  /* Limit non-constant step prefetching only to the innermost loops and 
+     only when the step is invariant in the entire loop nest. */
+  if (!cst_and_fits_in_hwi (step))
+    {
+      if( loop->inner != NULL)
+        {
+          if (dump_file && (dump_flags & TDF_DETAILS))
+            {
+              fprintf (dump_file, "Reference %p:\n", (void *) ref);
+              fprintf (dump_file, "(base " );
+              print_generic_expr (dump_file, base, TDF_SLIM);
+              fprintf (dump_file, ", step ");
+              print_generic_expr (dump_file, step, TDF_TREE);
+              fprintf (dump_file, ")\n");
+              fprintf (dump_file, "Ignoring %p, non-constant step prefetching\
+                is limited to inner most loops \n",(void *) ref);
+            }
+            return false;    
+         }
+      else
+        {
+          ploop = loop;
 
+          while (loop_outer (ploop))
+            {
+              if (!expr_invariant_in_loop_p (ploop , step))
+                {
+                  if (dump_file && (dump_flags & TDF_DETAILS))
+                    {
+                      fprintf (dump_file, "Reference %p:\n", (void *) ref);
+                      fprintf (dump_file, "(base " );
+                      print_generic_expr (dump_file, base, TDF_SLIM);
+                      fprintf (dump_file, ", step ");
+                      print_generic_expr (dump_file, step, TDF_TREE);
+                      fprintf (dump_file, ")\n");
+                      fprintf (dump_file, "Not prefetching, ignoring %p due to loop variant step\n",(void *) ref);
+                    }
+                      return false;                 
+                }
+                ploop = loop_outer (ploop);
+            }
+        }
+    }
+
   /* Now we know that REF = &BASE + STEP * iter + DELTA, where DELTA and STEP
      are integer constants.  */
   agrp = find_or_create_group (refs, base, step);