Message ID: 564270D6.6090303@acm.org
> I've been unable to introduce a testcase for this.  The difficulty is we want
> to check an rtl dump from the acceleration compiler, and there doesn't
> appear to be existing machinery for that in the testsuite.  Perhaps
> something to be added later?

I haven't tried it, but doesn't
  /* { dg-options "-foffload=-fdump-rtl-..." } */
with
  /* { dg-final { scan-rtl-dump ... } } */
work?

  -- Ilya
On 11/10/2015 11:33 PM, Nathan Sidwell wrote:
> I've committed this patch to trunk.  It implements a partitioning
> optimization for a loop partitioned over both vector and worker axes.
> We can elide the inner vector partitioning state propagation, if there
> are no intervening instructions in the worker-partitioned outer loop
> other than the forking and joining.  We simply execute the worker
> propagation on all vectors.

Patch LGTM, although I wonder if you really need the extra option rather
than just optimize.

> I've been unable to introduce a testcase for this.  The difficulty is we
> want to check an rtl dump from the acceleration compiler, and there
> doesn't appear to be existing machinery for that in the testsuite.
> Perhaps something to be added later?

What's the difficulty exactly?  Getting a dump should be possible with
-foffload=-fdump-whatever; does the testsuite have a problem finding the
right filename?


Bernd
On 11/10/15 17:45, Ilya Verbin wrote:
>> I've been unable to introduce a testcase for this.  The difficulty is we want
>> to check an rtl dump from the acceleration compiler, and there doesn't
>> appear to be existing machinery for that in the testsuite.  Perhaps
>> something to be added later?
>
> I haven't tried it, but doesn't
> /* { dg-options "-foffload=-fdump-rtl-..." } */
> with
> /* { dg-final { scan-rtl-dump ... } } */
> work?

In the gcc testsuite directories?  That's the approach I was going for.
The issue is detecting when the test should be run.  target==nvptx-*-*
isn't right, as the target is the x86 host machine.  There doesn't seem
to be an existing dejagnu predicate there to select for
'accel_target==FOO'.  Am I missing something?

nathan
On 11/11/15 07:06, Bernd Schmidt wrote:
> On 11/10/2015 11:33 PM, Nathan Sidwell wrote:
>> I've committed this patch to trunk.  It implements a partitioning
>> optimization for a loop partitioned over both vector and worker axes.
>> We can elide the inner vector partitioning state propagation, if there
>> are no intervening instructions in the worker-partitioned outer loop
>> other than the forking and joining.  We simply execute the worker
>> propagation on all vectors.
>
> Patch LGTM, although I wonder if you really need the extra option rather than
> just optimize.

The reason I added the option was to be able to turn it off
independently of the other optimizations (for instance, when debugging).

>> I've been unable to introduce a testcase for this.  The difficulty is we
>> want to check an rtl dump from the acceleration compiler, and there
>> doesn't appear to be existing machinery for that in the testsuite.
>> Perhaps something to be added later?
>
> What's the difficulty exactly?  Getting a dump should be possible with
> -foffload=-fdump-whatever, does the testsuite have a problem finding the right
> filename?

That's not the problem.  How to conditionally enable the test is the
difficulty.  I suspect porting something concerning accel_compiler from
the libgomp testsuite is needed?

nathan
On 11/11/2015 02:59 PM, Nathan Sidwell wrote:
> That's not the problem.  How to conditionally enable the test is the
> difficulty.  I suspect porting something concerning accel_compiler from
> the libgomp testsuite is needed?

Maybe a check_effective_target_offload_nvptx which tries to see if
-foffload=nvptx gives an error (I would hope it does if it's
unsupported).


Bernd
Hi!

On Wed, 11 Nov 2015 08:59:17 -0500, Nathan Sidwell <nathan@acm.org> wrote:
> On 11/11/15 07:06, Bernd Schmidt wrote:
> > On 11/10/2015 11:33 PM, Nathan Sidwell wrote:
> >> I've been unable to introduce a testcase for this.

(But you still committed an update to gcc/testsuite/ChangeLog.)

You'll need to put such an offloading test into the libgomp testsuite --
offloading compilation requires linking, and during that, the offloading
compiler(s) will be invoked, which only the libgomp testsuite is set up
to do, as discussed before.

> >> The difficulty is we
> >> want to check an rtl dump from the acceleration compiler, and there
> >> doesn't appear to be existing machinery for that in the testsuite.
> >> Perhaps something to be added later?
> >
> > What's the difficulty exactly?  Getting a dump should be possible with
> > -foffload=-fdump-whatever, does the testsuite have a problem finding the right
> > filename?

Currently, this will create cc* files, for example ccdjj2z9.o.271r.final
for -foffload=-fdump-rtl-final.  (I don't know if you can come up with
dg-* directives to scan these.)  The reason is -- I think -- because of
the lto-wrapper and/or mkoffloads not specifying a more suitable "base
name" for the temporary input files to lto1.

> That's not the problem.  How to conditionally enable the test is the
> difficulty.  I suspect porting something concerning accel_compiler from
> the libgomp testsuite is needed?

Use "{ target openacc_nvidia_accel_selected }", as implemented by
libgomp/testsuite/lib/libgomp.exp:check_effective_target_openacc_nvidia_accel_selected
(already present on trunk).


Regards,
Thomas
2015-11-10  Nathan Sidwell  <nathan@codesourcery.com>

	* config/nvptx/nvptx.opt (moptimize): New flag.
	* config/nvptx/nvptx.c (nvptx_option_override): Set nvptx_optimize
	default.
	(nvptx_optimize_inner): New.
	(nvptx_process_pars): Call it when optimizing.
	* doc/invoke.texi (Nvidia PTX Options): Document -moptimize.

Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 230112)
+++ config/nvptx/nvptx.c	(working copy)
@@ -137,6 +137,9 @@ nvptx_option_override (void)
   write_symbols = NO_DEBUG;
   debug_info_level = DINFO_LEVEL_NONE;
 
+  if (nvptx_optimize < 0)
+    nvptx_optimize = optimize > 0;
+
   declared_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
@@ -2942,6 +2945,69 @@ nvptx_skip_par (unsigned mask, parallel
   nvptx_single (mask, par->forked_block, pre_tail);
 }
 
+/* If PAR has a single inner parallel and PAR itself only contains
+   empty entry and exit blocks, swallow the inner PAR.  */
+
+static void
+nvptx_optimize_inner (parallel *par)
+{
+  parallel *inner = par->inner;
+
+  /* We mustn't be the outer dummy par.  */
+  if (!par->mask)
+    return;
+
+  /* We must have a single inner par.  */
+  if (!inner || inner->next)
+    return;
+
+  /* We must only contain 2 blocks ourselves -- the head and tail of
+     the inner par.  */
+  if (par->blocks.length () != 2)
+    return;
+
+  /* We must be disjoint partitioning.  As we only have vector and
+     worker partitioning, this is sufficient to guarantee the pars
+     have adjacent partitioning.  */
+  if ((par->mask & inner->mask) & (GOMP_DIM_MASK (GOMP_DIM_MAX) - 1))
+    /* This indicates malformed code generation.  */
+    return;
+
+  /* The outer forked insn should be immediately followed by the inner
+     fork insn.  */
+  rtx_insn *forked = par->forked_insn;
+  rtx_insn *fork = BB_END (par->forked_block);
+
+  if (NEXT_INSN (forked) != fork)
+    return;
+  gcc_checking_assert (recog_memoized (fork) == CODE_FOR_nvptx_fork);
+
+  /* The outer joining insn must immediately follow the inner join
+     insn.  */
+  rtx_insn *joining = par->joining_insn;
+  rtx_insn *join = inner->join_insn;
+  if (NEXT_INSN (join) != joining)
+    return;
+
+  /* Preconditions met.  Swallow the inner par.  */
+  if (dump_file)
+    fprintf (dump_file, "Merging loop %x [%d,%d] into %x [%d,%d]\n",
+	     inner->mask, inner->forked_block->index,
+	     inner->join_block->index,
+	     par->mask, par->forked_block->index, par->join_block->index);
+
+  par->mask |= inner->mask & (GOMP_DIM_MASK (GOMP_DIM_MAX) - 1);
+
+  par->blocks.reserve (inner->blocks.length ());
+  while (inner->blocks.length ())
+    par->blocks.quick_push (inner->blocks.pop ());
+
+  par->inner = inner->inner;
+  inner->inner = NULL;
+
+  delete inner;
+}
+
 /* Process the parallel PAR and all its contained parallels.  We
    do everything but the neutering.  Return mask of partitioned modes
    used within this parallel.  */
@@ -2949,6 +3015,9 @@ nvptx_skip_par (unsigned mask, parallel
 static unsigned
 nvptx_process_pars (parallel *par)
 {
+  if (nvptx_optimize)
+    nvptx_optimize_inner (par);
+
   unsigned inner_mask = par->mask;
 
   /* Do the inner parallels first.  */
Index: config/nvptx/nvptx.opt
===================================================================
--- config/nvptx/nvptx.opt	(revision 230112)
+++ config/nvptx/nvptx.opt	(working copy)
@@ -28,3 +28,7 @@ Generate code for a 64-bit ABI.
 mmainkernel
 Target Report RejectNegative
 Link in code for a __main kernel.
+
+moptimize
+Target Report Var(nvptx_optimize) Init(-1)
+Optimize partition neutering
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 230112)
+++ doc/invoke.texi	(working copy)
@@ -873,7 +873,7 @@ Objective-C and Objective-C++ Dialects}.
 -march=@var{arch} -mbmx -mno-bmx -mcdx -mno-cdx}
 
 @emph{Nvidia PTX Options}
-@gccoptlist{-m32 -m64 -mmainkernel}
+@gccoptlist{-m32 -m64 -mmainkernel -moptimize}
 
 @emph{PDP-11 Options}
 @gccoptlist{-mfpu -msoft-float -mac0 -mno-ac0 -m40 -m45 -m10 @gol
@@ -18960,6 +18960,11 @@ Generate code for 32-bit or 64-bit ABI.
 Link in code for a __main kernel.  This is for stand-alone instead of
 offloading execution.
 
+@item -moptimize
+@opindex moptimize
+Apply partitioned execution optimizations.  This is the default when any
+level of optimization is selected.
+
 @end table
 
 @node PDP-11 Options