[RFC] openmp: ensure variables in offload table are streamed out (PRs 94848 + 95551) (was: Re: [Patch][RFC] openmp: don't add artificial const decl to offload table (PRs 94848 + 95551))

Message ID 22cdd4ec-7d14-03b1-6789-54788a573bca@codesourcery.com
State New

Commit Message

Tobias Burnus June 8, 2020, 7:36 p.m. UTC
Hi Jakub,

On 6/8/20 5:30 PM, Jakub Jelinek wrote:

> I really don't see what is special exactly on TREE_READONLY DECL_ARTIFICIAL

I have now split-off the missed-optimization task to a new
PR, PR95583, to be handled in a proper way instead of trying
to cook-up a hackish special-case version.

This patch now simply sets the force_output flag.

(a) As output_offload_tables() (i.e. LTO streamout)
     comes very early, one could just set the force_output flag
     in this file without further checks or omp-offload.c changes
(b) Alternatively, one could check that it really works by using
       gcc_assert (symtab_node::get (it));
     in either or both files.
(c) or assuming that some optimization worked, one could use:
   if (!symtab_node::get (it))
     continue;

The patch does (c), as trimming it to (b) or (a) is trivial.

All three should currently give the same result: the assert checks
for this, and the "if (...)" is proof against future optimizations.
But I fear that it makes no difference until passes are added
before output_offload_tables() (→ new PR).
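
To make the difference between (b) and (c) concrete, here is a minimal toy model. It is not GCC code: ToyNode, ToySymtab, toy_get, stream_out_skip, and stream_out_assert are hypothetical stand-ins, with toy_get playing the role of symtab_node::get and the offload table reduced to a list of names.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for a symtab node; only the flag of interest is modeled.
struct ToyNode { bool force_output = false; };

// Hypothetical symbol table: a name maps to a node only while the
// symbol still exists.
using ToySymtab = std::unordered_map<std::string, ToyNode>;

// Plays the role of symtab_node::get: nullptr once the symbol is gone.
ToyNode *toy_get(ToySymtab &symtab, const std::string &decl) {
  auto it = symtab.find(decl);
  return it == symtab.end() ? nullptr : &it->second;
}

// Variant (c): skip entries whose node was removed, pin the rest.
std::vector<std::string> stream_out_skip(
    ToySymtab &symtab, const std::vector<std::string> &offload_table) {
  std::vector<std::string> streamed;
  for (const auto &decl : offload_table) {
    ToyNode *node = toy_get(symtab, decl);
    if (!node)
      continue;                 // entry was optimized away earlier
    node->force_output = true;  // keep it alive from here on
    streamed.push_back(decl);
  }
  return streamed;
}

// Variant (b): assume nothing was removed and assert on it.
// Variant (a) would be the same loop without the assert.
std::vector<std::string> stream_out_assert(
    ToySymtab &symtab, const std::vector<std::string> &offload_table) {
  std::vector<std::string> streamed;
  for (const auto &decl : offload_table) {
    ToyNode *node = toy_get(symtab, decl);
    assert(node && "offload table entry lost its node");
    node->force_output = true;
    streamed.push_back(decl);
  }
  return streamed;
}
```

With a table of "a", "A.10", "b" and a symbol table that only holds "a" and "b", stream_out_skip returns the two surviving names, whereas stream_out_assert would abort on "A.10".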

(omp_finish_file runs late enough, but as the LTO data has already
been written by then, that does not help.)

OK? What about backporting to GCC 10?

Tobias

-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Alexander Walter

Comments

Jakub Jelinek June 8, 2020, 7:58 p.m. UTC | #1
On Mon, Jun 08, 2020 at 09:36:29PM +0200, Tobias Burnus wrote:
> I have now split-off the missed-optimization task to a new
> PR, PR95583, to be handled in a proper way instead of trying
> to cook-up a hackish special-case version.
> 
> This patch now simply sets the force_output flag.
> 
> (a) As output_offload_tables() (i.e. LTO streamout)
>     comes very early, one could just set the force_output flag
>     in this file without further checks or omp-offload.c changes
> (b) Alternatively, one could check that it really works by using
>       gcc_assert (symtab_node::get (it));
>     in either or both files.
> (c) or assuming that some optimization worked, one could use:
>   if (!symtab_node::get (it))
>     continue;
> 
> The patch does (c), as trimming it to (b) or (a) is trivial.

I prefer the patch as is; output_offload_tables() isn't actually that early,
as all the early optimizations run before it.
And if we don't optimize it early enough, perhaps we need a targeted unused
target variable removal subpass (early ipa).

> OK? What about backporting to GCC 10?

Ok.  Please wait a few days before backporting.

	Jakub
Tobias Burnus June 9, 2020, 2:02 p.m. UTC | #2
It turned out that this patch fails with LTO and partitions,
causing runtime failures such as
   libgomp: Duplicate node
via libgomp/splay-tree.c's splay_tree_insert.

In the test case, the problem occurred for functions - namely
main._omp_fn.* on the host.
If the code is run in LTO context, the filtering-out should
have already happened via the stream-out/stream-in and hence no
additional check is needed for omp_finish_file.

OK?

Tobias

PS: The streaming-in is done via:
   input_offload_tables (/* do_force_output = */ !flag_ltrans);
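
As a rough illustration of the revised rule (skip removed items only outside LTO), here is a toy model. It is not GCC code: ToyEntry and build_table are hypothetical, the in_lto parameter stands in for in_lto_p, and has_node stands in for a non-null symtab_node::get (it). Why a node can be missing for an entry in an LTO partition is not modeled here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy offload-table entry: a name plus whether a symtab node is
// visible for it in the current compilation.
struct ToyEntry {
  std::string name;
  bool has_node;
};

// Emit the table. Outside LTO, entries without a node are dropped.
// In LTO mode every entry is kept, on the assumption that the
// filtering already happened at stream-out/stream-in time.
std::vector<std::string> build_table(const std::vector<ToyEntry> &entries,
                                     bool in_lto) {
  std::vector<std::string> table;
  for (const auto &e : entries) {
    if (!in_lto && !e.has_node)
      continue;
    table.push_back(e.name);
  }
  return table;
}
```

For two entries main._omp_fn.0 (no node visible) and main._omp_fn.1 (node visible), build_table keeps both when in_lto is true and only the second when it is false.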

Jakub Jelinek June 9, 2020, 2:11 p.m. UTC | #3
On Tue, Jun 09, 2020 at 04:02:19PM +0200, Tobias Burnus wrote:
> It turned out that this patch fails with LTO and partitions,
> causing runtime failures such as
>   libgomp: Duplicate node
> via libgomp/splay-tree.c's splay_tree_insert.
> 
> In the test case, the problem occurred for functions - namely
> main._omp_fn.* on the host.
> If the code is run in LTO context, the filtering-out should
> have already happened via the stream-out/stream-in and hence no
> additional check is needed for omp_finish_file.
> 
> OK?

Was this caught in the testsuite, or do you have some short testcase
that could be used in the testsuite?
> 
> Tobias
> 
> PS: The streaming-in is done via:
>   input_offload_tables (/* do_force_output = */ !flag_ltrans);

> openmp: ensure variables in offload table are streamed out (PRs 94848 + 95551)
> 
> gcc/ChangeLog:
> 
> 	* omp-offload.c (add_decls_addresses_to_decl_constructor,
> 	omp_finish_file): With in_lto_p, stream out all offload-table
> 	items even if the symtab_node does not exist.

Ok with or without the testcase.

	Jakub
Thomas Schwinge June 12, 2020, 10:15 a.m. UTC | #4
Hi!

On 2020-06-09T16:11:03+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> On Tue, Jun 09, 2020 at 04:02:19PM +0200, Tobias Burnus wrote:
>> It turned out that this patch fails with LTO and partitions,
>> causing runtime failures such as
>>   libgomp: Duplicate node
>> via libgomp/splay-tree.c's splay_tree_insert.
>>
>> In the test case, the problem occurred for functions - namely
>> main._omp_fn.* on the host.
>> If the code is run in LTO context, the filtering-out should
>> have already happened via the stream-out/stream-in and hence no
>> additional check is needed for omp_finish_file.
>
> Was this caught in the testsuite

I saw it show up as:

    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  execution test
    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2  (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2  execution test

    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-clauses-parallel-ipa-pta.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-clauses-parallel-ipa-pta.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  execution test
    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-clauses-parallel-ipa-pta.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2  (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-clauses-parallel-ipa-pta.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2  execution test

Same for C++.

Not sure if that constitutes sufficient testsuite coverage.

> do you have some short testcase
> that could be used in the testsuite?

Can we do some LTO-compile-time tree scanning?


Grüße
 Thomas
Patch

openmp: ensure variables in offload table are streamed out (PRs 94848 + 95551)

gcc/ChangeLog:

	PR lto/94848
	PR middle-end/95551
	* omp-offload.c (add_decls_addresses_to_decl_constructor,
	omp_finish_file): Skip removed items.
	* lto-cgraph.c (output_offload_tables): Likewise; set force_output
	to this node for variables and functions.

libgomp/ChangeLog:

	PR lto/94848
	PR middle-end/95551
	* testsuite/libgomp.fortran/target-var.f90: New test.

 gcc/lto-cgraph.c                                 |  8 ++++++
 gcc/omp-offload.c                                | 12 ++++++++-
 libgomp/testsuite/libgomp.fortran/target-var.f90 | 32 ++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/gcc/lto-cgraph.c b/gcc/lto-cgraph.c
index a671c671fa7..93a99f3465b 100644
--- a/gcc/lto-cgraph.c
+++ b/gcc/lto-cgraph.c
@@ -1069,6 +1069,10 @@  output_offload_tables (void)
 
   for (unsigned i = 0; i < vec_safe_length (offload_funcs); i++)
     {
+      symtab_node *node = symtab_node::get ((*offload_funcs)[i]);
+      if (!node)
+	continue;
+      node->force_output = true;
       streamer_write_enum (ob->main_stream, LTO_symtab_tags,
 			   LTO_symtab_last_tag, LTO_symtab_unavail_node);
       lto_output_fn_decl_ref (ob->decl_state, ob->main_stream,
@@ -1077,6 +1081,10 @@  output_offload_tables (void)
 
   for (unsigned i = 0; i < vec_safe_length (offload_vars); i++)
     {
+      symtab_node *node = symtab_node::get ((*offload_vars)[i]);
+      if (!node)
+	continue;
+      node->force_output = true;
       streamer_write_enum (ob->main_stream, LTO_symtab_tags,
 			   LTO_symtab_last_tag, LTO_symtab_variable);
       lto_output_var_decl_ref (ob->decl_state, ob->main_stream,
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index b2df91a5724..4e44cfc9d0a 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -125,6 +125,10 @@  add_decls_addresses_to_decl_constructor (vec<tree, va_gc> *v_decls,
 #endif
 	  && lookup_attribute ("omp declare target link", DECL_ATTRIBUTES (it));
 
+      /* See also omp_finish_file and output_offload_tables in lto-cgraph.c.  */
+      if (!symtab_node::get (it))
+	continue;
+
       tree size = NULL_TREE;
       if (is_var)
 	size = fold_convert (const_ptr_type_node, DECL_SIZE_UNIT (it));
@@ -341,7 +345,7 @@  omp_finish_file (void)
       add_decls_addresses_to_decl_constructor (offload_vars, v_v);
 
       tree vars_decl_type = build_array_type_nelts (pointer_sized_int_node,
-						    num_vars * 2);
+						    vec_safe_length (v_v));
       tree funcs_decl_type = build_array_type_nelts (pointer_sized_int_node,
 						     num_funcs);
       SET_TYPE_ALIGN (vars_decl_type, TYPE_ALIGN (pointer_sized_int_node));
@@ -376,11 +380,17 @@  omp_finish_file (void)
       for (unsigned i = 0; i < num_funcs; i++)
 	{
 	  tree it = (*offload_funcs)[i];
+	  /* See also add_decls_addresses_to_decl_constructor
+	     and output_offload_tables in lto-cgraph.c.  */
+	  if (!symtab_node::get (it))
+	    continue;
 	  targetm.record_offload_symbol (it);
 	}
       for (unsigned i = 0; i < num_vars; i++)
 	{
 	  tree it = (*offload_vars)[i];
+	  if (!symtab_node::get (it))
+	    continue;
 #ifdef ACCEL_COMPILER
 	  if (DECL_HAS_VALUE_EXPR_P (it)
 	      && lookup_attribute ("omp declare target link",
diff --git a/libgomp/testsuite/libgomp.fortran/target-var.f90 b/libgomp/testsuite/libgomp.fortran/target-var.f90
new file mode 100644
index 00000000000..5e5ccd47c96
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/target-var.f90
@@ -0,0 +1,32 @@ 
+! { dg-additional-options "-O3" }
+!
+! With -O3 the static local variable A.10 generated for
+! the array constructor [-2, -4, ..., -20] is optimized
+! away - which has to be handled in the offload_vars table.
+!
+program main
+  implicit none (type, external)
+  integer :: j
+  integer, allocatable :: A(:)
+
+  A = [(3*j, j=1, 10)]
+  call bar (A)
+  deallocate (A)
+contains
+  subroutine bar (array)
+    integer :: i
+    integer :: array(:)
+
+    !$omp target map(from:array)
+    !$acc parallel copyout(array)
+    array = [(-2*i, i = 1, size(array))]
+    !$omp do private(array)
+    !$acc loop gang private(array)
+    do i = 1, 10
+      array(i) = 9*i
+    end do
+    if (any (array /= [(-2*i, i = 1, 10)])) error stop 2
+    !$omp end target
+    !$acc end parallel
+  end subroutine bar
+end
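
A note on the omp_finish_file hunk, which changes the array length from num_vars * 2 to vec_safe_length (v_v): once removed items are skipped, the constructor vector can be shorter than twice the number of variables. The toy model below is not GCC code (ToyVar and build_var_table are hypothetical). It assumes, as the original num_vars * 2 suggests, that each surviving variable contributes two entries, its address and its size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy variable record; has_node mimics a non-null symtab_node::get.
struct ToyVar {
  std::uintptr_t addr;
  std::size_t size;
  bool has_node;
};

// Build the flat (address, size) table, skipping removed variables.
// The length of the result is what the array type must be sized
// from, not vars.size() * 2.
std::vector<std::uintptr_t> build_var_table(const std::vector<ToyVar> &vars) {
  std::vector<std::uintptr_t> table;
  for (const auto &var : vars) {
    if (!var.has_node)
      continue;  // removed item: contributes no entries
    table.push_back(var.addr);
    table.push_back(static_cast<std::uintptr_t>(var.size));
  }
  return table;
}
```

With three variables of which one was removed, the table holds four entries rather than six, so sizing the array from the variable count would over-allocate.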