diff mbox

[Openacc] Adjust automatic loop partitioning

Message ID b0e06bc4-1414-6858-5ca9-a0782fa70094@acm.org
State New
Headers show

Commit Message

Nathan Sidwell April 29, 2016, 2 p.m. UTC
Jakub,
currently automatic loop partitioning assigns from the innermost loop outwards 
-- that was the simplest thing to implement.  A better algorithm is to assign 
the outermost loop to the outermost available axis, and then assign from the 
innermost loop outwards.   That way we (generally) get gang partitioning on the 
outermost loop.  Just inside that we'll get non-partitioned loops if the nest is 
too deep, and the two innermost nested loops will get worker and vector 
partitioning.

This patch has been on the gomp4 branch for a while.  ok for trunk?

nathan

Comments

Jakub Jelinek May 2, 2016, 7:14 a.m. UTC | #1
On Fri, Apr 29, 2016 at 10:00:43AM -0400, Nathan Sidwell wrote:
> Jakub,
> currently automatic loop partitioning assigns from the innermost loop
> outwards -- that was the simplest thing to implement.  A better algorithm is
> to assign the outermost loop to the outermost available axis, and then
> assign from the innermost loop outwards.   That way we (generally) get gang
> partitioning on the outermost loop.  Just inside that we'll get
> non-partitioned loops if the nest is too deep, and the two innermost nested
> loops will get worker and vector partitioning.
> 
> This patch has been on the gomp4 branch for a while.  ok for trunk?
> 
> nathan

> 2016-04-29  Nathan Sidwell  <nathan@codesourcery.com>
> 
> 	gcc/
> 	* omp-low.c (struct oacc_loop): Add 'inner' field.
> 	(new_oacc_loop_raw): Initialize it to zero.
> 	(oacc_loop_fixed_partitions): Initialize it.
> 	(oacc_loop_auto_partitions): Partition outermost loop to outermost
> 	available partitioning.
> 
> 	gcc/testsuite/
> 	* c-c++-common/goacc/loop-auto-1.c: Adjust expected warnings.
> 
> 	libgomp/
> 	* testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: Adjust
> 	expected partitioning.

Ok.

	Jakub
Thomas Schwinge May 3, 2016, 10:34 a.m. UTC | #2
Hi Nathan!

On Fri, 29 Apr 2016 10:00:43 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> currently automatic loop partitioning assigns from the innermost loop outwards 
> -- that was the simplest thing to implement.  A better algorithm is to assign 
> the outermost loop to the outermost available axis, and then assign from the 
> innermost loop outwards.   That way we (generally) get gang partitioning on the 
> outermost loop.  Just inside that we'll get non-partitioned loops if the nest is 
> too deep, and the two innermost nested loops will get worker and vector 
> partitioning.

> 	gcc/
> 	* omp-low.c (struct oacc_loop): Add 'inner' field.
> 	(new_oacc_loop_raw): Initialize it to zero.
> 	(oacc_loop_fixed_partitions): Initialize it.
> 	(oacc_loop_auto_partitions): Partition outermost loop to outermost
> 	available partitioning.

I'm now observing the sporadic failures (that you had mentioned before)
of libgomp.oacc-c-c++-common/atomic_capture-1.c and
libgomp.oacc-fortran/atomic_capture-1.f90.  I suppose the problem is that
constructs such as libgomp.oacc-c-c++-common/atomic_capture-1.c:

      fgot = 1.0;
      fexp = 0.0;
    
    #pragma acc data copy (fgot, fdata[0:N])
      {
    #pragma acc parallel loop
        for (i = 0; i < N; i++)
          {
            float expr = 32.0;
    
    #pragma acc atomic capture
            fdata[i] = fgot = expr - fgot;
          }
      }
    
      for (i = 0; i < N; i++)
        if (i % 2 == 0)
          {
            if (fdata[i] != 31.0)
              abort ();
          }
        else
          {
            if (fdata[i] != 1.0)
              abort ();
          }

... are no longer executed in stable/ascending order, and instead of the
exact "i % 2 == 0" classifier, we should now instead verify what the 31.0
and 1.0 cases each appear with probability 0.5?  Are you looking into
resolving that, or should somebody else have a look?


I'm also seeing the following regression for C and C++,
libgomp.oacc-c-c++-common/loop-auto-1.c with -O2:

    source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: In function 'vector_1._omp_fn.0':
    source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c:104:9: internal compiler error: Segmentation fault
     #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)
             ^

    #4  0x0000000000f73d46 in internal_error (gmsgid=gmsgid@entry=0x105be63 "%s")
        at [...]/source-gcc/gcc/diagnostic.c:1270
    #5  0x00000000009fccb0 in crash_signal (signo=<optimized out>)
        at [...]/source-gcc/gcc/toplev.c:333
    #6  <signal handler called>
    #7  0x0000000000beaf2e in same_succ_flush_bb (bb=<optimized out>, bb=<optimized out>)
        at [...]/source-gcc/gcc/hash-table.h:919
    #8  0x0000000000bec499 in same_succ_flush_bbs (bbs=<optimized out>)
        at [...]/source-gcc/gcc/tree-ssa-tail-merge.c:823
    #9  update_worklist () at [...]/source-gcc/gcc/tree-ssa-tail-merge.c:870
    #10 tail_merge_optimize (todo=todo@entry=32)
        at [...]/source-gcc/gcc/tree-ssa-tail-merge.c:1716
    #11 0x0000000000b99057 in (anonymous namespace)::pass_pre::execute (this=<optimized out>, fun=<optimized out>)
        at [...]/source-gcc/gcc/tree-ssa-pre.c:4818
    #12 0x0000000000937e9d in execute_one_pass (pass=pass@entry=0x1530970)
        at [...]/source-gcc/gcc/passes.c:2348
    #13 0x00000000009384b8 in execute_pass_list_1 (pass=0x1530970)
        at [...]/source-gcc/gcc/passes.c:2432
    #14 0x00000000009384ca in execute_pass_list_1 (pass=0x152fa10)
        at [...]/source-gcc/gcc/passes.c:2433
    #15 0x0000000000938515 in execute_pass_list (fn=0x7ffff69a5930, pass=<optimized out>)
        at [...]/source-gcc/gcc/passes.c:2443
    #16 0x00000000005fdded in cgraph_node::expand (this=this@entry=0x7ffff6990170)
        at [...]/source-gcc/gcc/cgraphunit.c:1982
    #17 0x00000000005ff8c4 in expand_all_functions ()
        at [...]/source-gcc/gcc/cgraphunit.c:2118
    #18 symbol_table::compile (this=0x7ffff68d2000) at [...]/source-gcc/gcc/cgraphunit.c:2474
    #19 0x0000000000561db8 in lto_main () at [...]/source-gcc/gcc/lto/lto.c:3328
    #20 0x00000000009fccef in compile_file () at [...]/source-gcc/gcc/toplev.c:463
    #21 0x000000000052e5ba in do_compile () at [...]/source-gcc/gcc/toplev.c:1987
    #22 toplev::main (this=this@entry=0x7fffffffcc80, argc=argc@entry=18, argv=0x150aec0, argv@entry=0x7fffffffcd88)
        at [...]/source-gcc/gcc/toplev.c:2095
    #23 0x0000000000530247 in main (argc=18, argv=0x7fffffffcd88)
        at [...]/source-gcc/gcc/main.c:39

Are you seeing that, too?  I can't remember seeing that on
gomp-4_0-branch, so it may be due to a recent trunk change, independent
of your omp-low change.  Are you going to have a look, or want me to?


Grüße
 Thomas
diff mbox

Patch

2016-04-29  Nathan Sidwell  <nathan@codesourcery.com>

	gcc/
	* omp-low.c (struct oacc_loop): Add 'inner' field.
	(new_oacc_loop_raw): Initialize it to zero.
	(oacc_loop_fixed_partitions): Initialize it.
	(oacc_loop_auto_partitions): Partition outermost loop to outermost
	available partitioning.

	gcc/testsuite/
	* c-c++-common/goacc/loop-auto-1.c: Adjust expected warnings.

	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: Adjust
	expected partitioning.

Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c	(revision 235511)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c	(working copy)
@@ -103,9 +103,11 @@  int vector_1 (int *ary, int size)
   
 #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)
   {
+#pragma acc loop gang
+    for (int jx = 0; jx < 1; jx++)
 #pragma acc loop auto
-    for (int ix = 0; ix < size; ix++)
-      ary[ix] = place ();
+      for (int ix = 0; ix < size; ix++)
+	ary[ix] = place ();
   }
 
   return check (ary, size, 0, 0, 1);
@@ -118,7 +120,7 @@  int vector_2 (int *ary, int size)
 #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)
   {
 #pragma acc loop worker
-    for (int jx = 0; jx <  size  / 64; jx++)
+    for (int jx = 0; jx < size  / 64; jx++)
 #pragma acc loop auto
       for (int ix = 0; ix < 64; ix++)
 	ary[ix + jx * 64] = place ();
@@ -133,30 +135,16 @@  int worker_1 (int *ary, int size)
   
 #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)
   {
+#pragma acc loop gang
+    for (int kx = 0; kx < 1; kx++)
 #pragma acc loop auto
-    for (int jx = 0; jx <  size  / 64; jx++)
+      for (int jx = 0; jx <  size  / 64; jx++)
 #pragma acc loop vector
-      for (int ix = 0; ix < 64; ix++)
-	ary[ix + jx * 64] = place ();
-  }
-
-  return check (ary, size, 0, 1, 1);
-}
-
-int worker_2 (int *ary, int size)
-{
-  clear (ary, size);
-  
-#pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)
-  {
-#pragma acc loop auto
-    for (int jx = 0; jx <  size  / 64; jx++)
-#pragma acc loop auto
-      for (int ix = 0; ix < 64; ix++)
-	ary[ix + jx * 64] = place ();
+	for (int ix = 0; ix < 64; ix++)
+	  ary[ix + jx * 64] = place ();
   }
 
-  return check (ary, size, 0, 1, 1);
+  return check (ary, size, 0,  1, 1);
 }
 
 int gang_1 (int *ary, int size)
@@ -193,6 +181,22 @@  int gang_2 (int *ary, int size)
   return check (ary, size, 1, 1, 1);
 }
 
+int gang_3 (int *ary, int size)
+{
+  clear (ary, size);
+  
+#pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)
+  {
+#pragma acc loop auto
+    for (int jx = 0; jx <  size  / 64; jx++)
+#pragma acc loop auto
+      for (int ix = 0; ix < 64; ix++)
+	ary[ix + jx * 64] = place ();
+  }
+
+  return check (ary, size, 1, 0, 1);
+}
+
 #define N (32*32*32)
 int main ()
 {
@@ -214,13 +218,13 @@  int main ()
 
   if (worker_1 (ary,  N))
     return 1;
-  if (worker_2 (ary,  N))
-    return 1;
   
   if (gang_1 (ary,  N))
     return 1;
   if (gang_2 (ary,  N))
     return 1;
+  if (gang_3 (ary,  N))
+    return 1;
 
   return 0;
 }
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 235511)
+++ gcc/omp-low.c	(working copy)
@@ -241,6 +241,7 @@  struct oacc_loop
   tree routine;  /* Pseudo-loop enclosing a routine.  */
 
   unsigned mask;   /* Partitioning mask.  */
+  unsigned inner;  /* Partitioning of inner loops.  */
   unsigned flags;  /* Partitioning flags.  */
   unsigned ifns;   /* Contained loop abstraction functions.  */
   tree chunk_size; /* Chunk size.  */
@@ -18921,7 +18922,7 @@  new_oacc_loop_raw (oacc_loop *parent, lo
   memset (loop->tails, 0, sizeof (loop->tails));
   loop->routine = NULL_TREE;
 
-  loop->mask = loop->flags = 0;
+  loop->mask = loop->flags = loop->inner = 0;
   loop->ifns = 0;
   loop->chunk_size = 0;
   loop->head_end = NULL;
@@ -19449,8 +19450,11 @@  oacc_loop_fixed_partitions (oacc_loop *l
   mask_all |= this_mask;
   
   if (loop->child)
-    mask_all |= oacc_loop_fixed_partitions (loop->child,
-					    outer_mask | this_mask);
+    {
+      loop->inner = oacc_loop_fixed_partitions (loop->child,
+						outer_mask | this_mask); 
+      mask_all |= loop->inner;
+    }
 
   if (loop->sibling)
     mask_all |= oacc_loop_fixed_partitions (loop->sibling, outer_mask);
@@ -19466,7 +19470,7 @@  oacc_loop_fixed_partitions (oacc_loop *l
 static unsigned
 oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask)
 {
-  unsigned inner_mask = 0;
+  bool assign = (loop->flags & OLF_AUTO) && (loop->flags & OLF_INDEPENDENT);
   bool noisy = true;
 
 #ifdef ACCEL_COMPILER
@@ -19475,16 +19479,33 @@  oacc_loop_auto_partitions (oacc_loop *lo
   noisy = false;
 #endif
 
+  if (assign && outer_mask < GOMP_DIM_MASK (GOMP_DIM_MAX - 1))
+    {
+      /* Allocate the outermost loop at the outermost available
+	 level.  */
+      unsigned this_mask = outer_mask + 1;
+
+      if (!(this_mask & loop->inner))
+	loop->mask = this_mask;
+    }
+
   if (loop->child)
-    inner_mask |= oacc_loop_auto_partitions (loop->child,
-					     outer_mask | loop->mask);
+    {
+      unsigned child_mask = outer_mask | loop->mask;
+
+      if (loop->mask || assign)
+	child_mask |= GOMP_DIM_MASK (GOMP_DIM_MAX);
 
-  if ((loop->flags & OLF_AUTO) && (loop->flags & OLF_INDEPENDENT))
+      loop->inner = oacc_loop_auto_partitions (loop->child, child_mask);
+    }
+
+  if (assign && !loop->mask)
     {
+      /* Allocate the loop at the innermost available level.  */
       unsigned this_mask = 0;
       
       /* Determine the outermost partitioning used within this loop. */
-      this_mask = inner_mask | GOMP_DIM_MASK (GOMP_DIM_MAX);
+      this_mask = loop->inner | GOMP_DIM_MASK (GOMP_DIM_MAX);
       this_mask = (this_mask & -this_mask);
 
       /* Pick the partitioning just inside that one.  */
@@ -19497,17 +19518,20 @@  oacc_loop_auto_partitions (oacc_loop *lo
 	warning_at (loop->loc, 0,
 		    "insufficient partitioning available to parallelize loop");
 
-      if (dump_file)
-	fprintf (dump_file, "Auto loop %s:%d assigned %d\n",
-		 LOCATION_FILE (loop->loc), LOCATION_LINE (loop->loc),
-		 this_mask);
-
       loop->mask = this_mask;
     }
-  inner_mask |= loop->mask;
+
+  if (assign && dump_file)
+    fprintf (dump_file, "Auto loop %s:%d assigned %d\n",
+	     LOCATION_FILE (loop->loc), LOCATION_LINE (loop->loc),
+	     loop->mask);
+
+  unsigned inner_mask = 0;
   
   if (loop->sibling)
     inner_mask |= oacc_loop_auto_partitions (loop->sibling, outer_mask);
+  
+  inner_mask |= loop->inner | loop->mask;
 
   return inner_mask;
 }
Index: gcc/testsuite/c-c++-common/goacc/loop-auto-1.c
===================================================================
--- gcc/testsuite/c-c++-common/goacc/loop-auto-1.c	(revision 235511)
+++ gcc/testsuite/c-c++-common/goacc/loop-auto-1.c	(working copy)
@@ -186,10 +186,10 @@  void Worker (void)
 	for (int jx = 0; jx < 10; jx++) {}
       }
 
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto
     for (int ix = 0; ix < 10; ix++)
       {
-#pragma acc loop auto
+#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++)
 	  {
 #pragma acc loop auto
@@ -214,10 +214,10 @@  void Vector (void)
 #pragma acc loop auto
     for (int ix = 0; ix < 10; ix++) {}
 
-#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
+#pragma acc loop auto
     for (int ix = 0; ix < 10; ix++)
       {
-#pragma acc loop auto
+#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */
 	for (int jx = 0; jx < 10; jx++) {}
       }
 }