
[gomp4] Preserve NVPTX "reconvergence" points

Message ID 5583E68A.9020608@codesourcery.com

Commit Message

Bernd Schmidt June 19, 2015, 9:53 a.m. UTC
On 05/28/2015 05:08 PM, Jakub Jelinek wrote:

> I understand it is more work, I'd just like to ask that when designing stuff
> for the OpenACC offloading you (plural) try to take the other offloading
> devices and host fallback into account.

The problem is that many of the transformations we need to do are really 
GPU specific, and with the current structure of omplow/ompexp they are 
being done in the host compiler. The offloading scheme we decided on 
does not give us the means to write out multiple versions of an 
offloaded function where each target gets a different one. For that 
reason I think we should postpone these lowering decisions until we're 
in the accel compiler, where they could be controlled by target hooks, 
and over the last two weeks I've been doing some experiments to see how 
that could be achieved.

The basic idea is to delay expanding the inner regions of an OpenACC 
target region during ompexp, write out offload LTO (almost) immediately 
afterwards, and then have another ompexp phase which runs on the accel 
compiler to take the offloaded function to its final form. The first 
attempt really did write LTO immediately after, before moving to SSA 
phase. It seems that this could be made to work, but the pass manager 
and LTO code rather expects that what is being read in is in SSA form 
already. Also, some offloaded code is produced by OpenACC kernels 
expansion much later in the compilation, so with this approach we have 
an inconsistency where functions we get back from LTO are at very 
different levels of lowering.

The next attempt was to run the into-ssa passes after ompexpand, and 
only then write things out. I've changed the gimple representation of 
some OMP statements (primarily gimple_omp_for) so that they are 
relatively normal statements with operands that can be transformed into 
SSA form. As far as what's easier to work with - I believe some of the 
transformations we have to do could benefit from being in SSA, but on 
the other hand the OpenACC predication code has given me some trouble. 
I've still not completely convinced myself that the update_ssa call I've 
added will actually do the right thing after we've mucked up the CFG.

I'm appending a proof-of-concept patch. This is intended to show the 
general outline of what I have in mind, rather than pass the testsuite. 
It's good enough to compile some of the OpenACC testcases (let's say 
worker-single-3 if you need one). Let me know what you think.


Bernd

Comments

Jakub Jelinek June 19, 2015, 12:25 p.m. UTC | #1
On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
> On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
> 
> >I understand it is more work, I'd just like to ask that when designing stuff
> >for the OpenACC offloading you (plural) try to take the other offloading
> >devices and host fallback into account.
> 
> The problem is that many of the transformations we need to do are really GPU
> specific, and with the current structure of omplow/ompexp they are being
> done in the host compiler. The offloading scheme we decided on does not give
> us the means to write out multiple versions of an offloaded function where
> each target gets a different one. For that reason I think we should postpone
> these lowering decisions until we're in the accel compiler, where they could
> be controlled by target hooks, and over the last two weeks I've been doing
> some experiments to see how that could be achieved.

Emitting PTX specific code from current ompexp is highly undesirable of
course, but I must say I'm not a big fan of keeping the GOMP_* gimple trees
around for too long either; they were never meant to be used in low gimple,
and even all the early optimization passes could screw them up badly,
they are also very much OpenMP or OpenACC specific, rather than representing
language neutral behavior, so there is a problem that you'd need M x N
different expansions of those constructs, which is not really maintainable
(M being number of supported offloading standards, right now 2, and N
number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).

I wonder why struct loop flags and other info together with function
attributes and/or cgraph flags and other info aren't sufficient for the
OpenACC needs.
Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ simd?

Why can't the execution model (normal, vector-single and worker-single)
be simply attributes on functions or cgraph node flags and the kind of
#acc loop simply be flags on struct loop, like already OpenMP simd
/ Cilk+ simd is?

I mean, you need to implement the PTX broadcasting etc. for the 3 different
modes: one where each thread executes everything; another where only the
first thread in a warp executes everything and the other threads only call
functions with the same mode or specially marked loops; and another where
only a single thread (in the CTA) executes everything and the other threads
only call functions with the same mode or specially marked loops.  That is
because if you have #acc routine (something), that is just an attribute of a
function, not really some construct in the body of it.
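
As a concrete (made-up) example of that last point, in plain OpenACC
syntax - the worker level applies to the whole function, so it naturally
maps to a function attribute / cgraph flag, while only the inner loop
could map to a flag on struct loop:

  #pragma acc routine worker
  void
  do_row (float *row, int n)
  {
  #pragma acc loop vector
    for (int i = 0; i < n; i++)
      row[i] *= 2.0f;
  }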

The vector level parallelism is something where on the host/host_noshm/XeonPhi
(dunno about HSA) you want vectorization to happen, and that is already
implemented in the vectorizer pass; implementing it again elsewhere is
highly undesirable.  For PTX the implementation is of course different,
and the vectorizer is likely not the right pass to handle it, but why
can't the same struct loop flags be used by the pass that handles the
conditionalization of execution for 2 of the 3 above modes?

Then there is the worker level parallelism, but I'd hope it can be handled
similarly, and supposedly the pass that handles vector-single and
worker-single lowering for PTX could do the same for non-PTX targets
- if the OpenACC execution model is that all the (e.g. pthread based)
threads are started immediately and in worker-single mode the work is
skipped on all but the first thread, then it needs to behave similarly to PTX,
just probably needs to use library calls rather than PTX builtins to query
the thread number.
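
For reference, a rough sketch (not from the patch) of the kind of code a
vector-single lowering pass would have to emit around a conditional;
lane_id and warp_broadcast are made-up placeholders for whatever PTX
builtin or library call a given target provides:

  extern int lane_id (void);             /* placeholder thread-number query */
  extern int warp_broadcast (int, int);  /* placeholder broadcast primitive */
  extern void do_then (void), do_else (void);

  void
  vector_single_branch (int *flag)
  {
    int cond = 0;
    if (lane_id () == 0)
      cond = (*flag != 0);               /* only the active lane evaluates */
    cond = warp_broadcast (cond, 0);     /* inactive lanes follow along */
    if (cond)                            /* all lanes take the same branch */
      do_then ();
    else
      do_else ();
  }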

	Jakub
Bernd Schmidt June 19, 2015, 1:03 p.m. UTC | #2
On 06/19/2015 02:25 PM, Jakub Jelinek wrote:
> Emitting PTX specific code from current ompexp is highly undesirable of
> course, but I must say I'm not a big fan of keeping the GOMP_* gimple trees
> around for too long either, they've never meant to be used in low gimple,
> and even all the early optimization passes could screw them up badly,

The idea is not to keep them around for very long, but I think there's 
no reason why they couldn't survive a while longer. Between ompexpand 
and the end of build_ssa_passes, we have (ignoring things like chkp and 
ubsan which can just be turned off for offloaded functions if necessary):
   NEXT_PASS (pass_ipa_free_lang_data);
   NEXT_PASS (pass_ipa_function_and_variable_visibility);
       NEXT_PASS (pass_fixup_cfg);
       NEXT_PASS (pass_init_datastructures);
       NEXT_PASS (pass_build_ssa);
       NEXT_PASS (pass_early_warn_uninitialized);
       NEXT_PASS (pass_nothrow);

Nothing in there strikes me as particularly problematic if we can make 
things like GIMPLE_OMP_FOR survive into-ssa - which I think I did in my 
patch. Besides, the OpenACC kernels path generates them in SSA form 
anyway during parloops so one could make the argument that this is a 
step towards better consistency.

> they are also very much OpenMP or OpenACC specific, rather than representing
> language neutral behavior, so there is a problem that you'd need M x N
> different expansions of those constructs, which is not really maintainable
> (M being number of supported offloading standards, right now 2, and N
> number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).

Well, that's a problem we have anyway, independent of how we implement 
all these devices and standards. I don't see how that's relevant to the 
discussion.

> I wonder why struct loop flags and other info together with function
> attributes and/or cgraph flags and other info aren't sufficient for the
> OpenACC needs.
> Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ simd?

> Why can't the execution model (normal, vector-single and worker-single)
> be simply attributes on functions or cgraph node flags and the kind of
> #acc loop simply be flags on struct loop, like already OpenMP simd
> / Cilk+ simd is?

We haven't looked at Cilk+ or anything like that. You suggest using 
attributes and flags, but at what point do you intend to actually lower 
the IR so that it represents what's going on?

> The vector level parallelism is something where on the host/host_noshm/XeonPhi
> (dunno about HSA) you want vectorization to happen, and that is already
> implemented in the vectorizer pass, implementing it again elsewhere is
> highly undesirable.  For PTX the implementation is of course different,
> and the vectorizer is likely not the right pass to handle them, but why
> can't the same struct loop flags be used by the pass that handles the
> conditionalization of execution for the 2 of the 3 above modes?

Agreed on wanting the vectorizer to handle things for "normal" machines, 
that is one of the motivations for pushing the lowering past the offload 
LTO writeout stage. The problem with OpenACC on GPUs is that the 
predication really changes the CFG and the data flow - I fear 
unpredictable effects if we let any optimizers run before lowering 
OpenACC to the point where we actually represent what's going on in the 
function.


Bernd
Jakub Jelinek June 19, 2015, 1:45 p.m. UTC | #3
On Fri, Jun 19, 2015 at 03:03:38PM +0200, Bernd Schmidt wrote:
> >they are also very much OpenMP or OpenACC specific, rather than representing
> >language neutral behavior, so there is a problem that you'd need M x N
> >different expansions of those constructs, which is not really maintainable
> >(M being number of supported offloading standards, right now 2, and N
> >number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).
> 
> Well, that's a problem we have anyway, independent on how we implement all
> these devices and standards. I don't see how that's relevant to the
> discussion.

It is relevant, because if you lower early (omplower/ompexp) into some IL
form common to all the offloading standards, then it is M + N.

> >I wonder why struct loop flags and other info together with function
> >attributes and/or cgraph flags and other info aren't sufficient for the
> >OpenACC needs.
> >Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ simd?
> 
> >Why can't the execution model (normal, vector-single and worker-single)
> >be simply attributes on functions or cgraph node flags and the kind of
> >#acc loop simply be flags on struct loop, like already OpenMP simd
> >/ Cilk+ simd is?
> 
> We haven't looked at Cilk+ or anything like that. You suggest using
> attributes and flags, but at what point do you intend to actually lower the
> IR to actually represent what's going on?

I think around where the vectorizer is, perhaps before the loop optimization
pass queue (or after it, some investigation is needed).

> >The vector level parallelism is something where on the host/host_noshm/XeonPhi
> >(dunno about HSA) you want vectorization to happen, and that is already
> >implemented in the vectorizer pass, implementing it again elsewhere is
> >highly undesirable.  For PTX the implementation is of course different,
> >and the vectorizer is likely not the right pass to handle them, but why
> >can't the same struct loop flags be used by the pass that handles the
> >conditionalization of execution for the 2 of the 3 above modes?
> 
> Agreed on wanting the vectorizer to handle things for "normal" machines,
> that is one of the motivations for pushing the lowering past the offload LTO
> writeout stage. The problem with OpenACC on GPUs is that the predication
> really changes the CFG and the data flow - I fear unpredictable effects if
> we let any optimizers run before lowering OpenACC to the point where we
> actually represent what's going on in the function.

I actually believe having some optimization passes in between the ompexp
and the lowering of the IR into the form PTX wants is highly desirable;
the form with the worker-single or vector-single mode lowered will contain
too complex a CFG for many optimizations to be really effective, especially
if it uses abnormal edges.  E.g. inlining would supposedly have a harder job,
etc.  What exact unpredictable effects do you fear?
If the loop remains in the IL (isn't optimized away as unreachable or
isn't removed, e.g. as a non-loop - say if it contains a noreturn call),
the flags on struct loop should still be there.  For the loop clauses
(reduction always, and private/lastprivate if addressable etc.) for
OpenMP simd / Cilk+ simd we use special arrays indexed by internal
functions; during vectorization these are shrunk (but in theory could
be expanded too) to the right vectorization factor if the loop is
vectorized, with accesses within the loop vectorized using SIMD, and
shrunk to 1 element if it is not.  So the PTX IL lowering pass could use
the same arrays ("omp simd array" attribute) to transform the decls into
thread-local vars as opposed to vars shared by the whole CTA.
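
For reference, a small sketch of what that looks like for a private
variable (the decl names and the exact spelling of the internal function
below are illustrative, not a real dump):

  void
  scale (float *a, float *b, int n)
  {
    float t;
  #pragma omp simd private(t)
    for (int i = 0; i < n; i++)
      {
        t = a[i] * 2.0f;
        b[i] = t + 1.0f;
      }
  }

  /* After omplower, conceptually:

       float t.arr[MAX_VF];                  <-- "omp simd array" decl
       _lane = .GOMP_SIMD_LANE (simduid.0);
       t.arr[_lane] = a[i] * 2.0f;
       b[i] = t.arr[_lane] + 1.0f;

     If the loop is vectorized, t.arr is shrunk to the vectorization
     factor and the accesses are vectorized; if not, it is shrunk to a
     single element and FRE etc. normally remove the memory accesses.  */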

	Jakub
Julian Brown June 22, 2015, 1:55 p.m. UTC | #4
On Fri, 19 Jun 2015 14:25:57 +0200
Jakub Jelinek <jakub@redhat.com> wrote:

> On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
> > On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
> > 
> > >I understand it is more work, I'd just like to ask that when
> > >designing stuff for the OpenACC offloading you (plural) try to
> > >take the other offloading devices and host fallback into account.
> > 
> > The problem is that many of the transformations we need to do are
> > really GPU specific, and with the current structure of
> > omplow/ompexp they are being done in the host compiler. The
> > offloading scheme we decided on does not give us the means to write
> > out multiple versions of an offloaded function where each target
> > gets a different one. For that reason I think we should postpone
> > these lowering decisions until we're in the accel compiler, where
> > they could be controlled by target hooks, and over the last two
> > weeks I've been doing some experiments to see how that could be
> > achieved.

> I wonder why struct loop flags and other info together with function
> attributes and/or cgraph flags and other info aren't sufficient for
> the OpenACC needs.
> Have you or Thomas looked what we're doing for OpenMP simd / Cilk+
> simd?
> 
> Why can't the execution model (normal, vector-single and
> worker-single) be simply attributes on functions or cgraph node flags
> and the kind of #acc loop simply be flags on struct loop, like
> already OpenMP simd / Cilk+ simd is?

One problem is that (at least on the GPU hardware we've considered so
far) we're somewhat constrained in how much control we have over how the
underlying hardware executes code: it's possible to draw up a scheme
where OpenACC source-level control-flow semantics are reflected directly
in the PTX assembly output (e.g. to say "all threads in a CTA/warp will
be coherent after such-and-such a loop"), and lowering OpenACC
directives quite early seems to make that relatively tractable. (Even
if the resulting code is relatively un-optimisable due to the abnormal
edges inserted to make sure that the CFG doesn't become "ill-formed".)

If arbitrary optimisations are done between OMP-lowering time and
somewhere around vectorisation (say), it's less clear if that
correspondence can be maintained. Say if the code executed by half the
threads in a warp becomes physically separated from the code executed
by the other half of the threads in a warp due to some loop
optimisation, we can no longer easily determine where that warp will
reconverge, and certain other operations (relying on coherent warps --
e.g. CTA synchronisation) become impossible. A similar issue exists for
warps within a CTA.

So, essentially -- I don't know how "late" loop lowering would interact
with:

(a) Maintaining a CFG that will work with PTX.

(b) Predication for worker-single and/or vector-single modes
(actually all currently-proposed schemes have problems with proper
representation of data-dependencies for variables and
compiler-generated temporaries between predicated regions.)

Julian
Bernd Schmidt June 22, 2015, 1:59 p.m. UTC | #5
On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
> I actually believe having some optimization passes in between the ompexp
> and the lowering of the IR into the form PTX wants is highly desirable,
> the form with the worker-single or vector-single mode lowered will contain
> too complex CFG for many optimizations to be really effective, especially
> if it uses abnormal edges.  E.g. inlining supposedly would have harder job
> etc.  What exact unpredictable effects do you fear?

Mostly the ones I can't predict. But let's take one example, LICM: let's 
say you pull some assignment out of a loop; then you find yourself in 
one of two possible situations: either the value is not actually 
available inside the loop (because the data and control flow is not 
described correctly and the compiler doesn't know what's going on), or, 
to avoid that, you introduce additional broadcasting operations when 
entering the loop, which might be quite expensive.
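
A made-up sketch of that second case:

  void
  f (float *a, int n, float m, float k)
  {
  #pragma acc parallel loop vector copy(a[0:n])
    for (int i = 0; i < n; i++)
      {
        float x = m * k;      /* loop-invariant */
        a[i] += x;
      }
  }

  /* LICM would happily hoist "x = m * k" in front of the loop.  After
     the PTX lowering, the loop body is vector-partitioned but the code
     before the loop is vector-single, so either the other lanes never
     see x, or the lowering has to broadcast x to the whole warp at loop
     entry.  */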


Bernd
Jakub Jelinek June 22, 2015, 2:11 p.m. UTC | #6
On Mon, Jun 22, 2015 at 03:59:57PM +0200, Bernd Schmidt wrote:
> On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
> >I actually believe having some optimization passes in between the ompexp
> >and the lowering of the IR into the form PTX wants is highly desirable,
> >the form with the worker-single or vector-single mode lowered will contain
> >too complex CFG for many optimizations to be really effective, especially
> >if it uses abnormal edges.  E.g. inlining supposedly would have harder job
> >etc.  What exact unpredictable effects do you fear?
> 
> Mostly the ones I can't predict. But let's take one example, LICM: let's say
> you pull some assignment out of a loop, then you find yourself in one of two
> possible situations: either it's become not actually available inside the
> loop (because the data and control flow is not described correctly and the
> compiler doesn't know what's going on), or, to avoid that, you introduce

Why do you think that would happen?  E.g. for non-addressable gimple types you'd
most likely just have a PHI for it on the loop.

> additional broadcasting operations when entering the loop, which might be
> quite expensive.

If the PHI has cheap initialization, it is not a problem to emit it as
initialization in the loop instead of a broadcast (kind of like RA
rematerialization).  And by actually adding such an optimization, you help
even code that has a computation in vector-single code and uses it in a
vector acc loop.

	Jakub
Jakub Jelinek June 22, 2015, 2:24 p.m. UTC | #7
On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> One problem is that (at least on the GPU hardware we've considered so
> far) we're somewhat constrained in how much control we have over how the
> underlying hardware executes code: it's possible to draw up a scheme
> where OpenACC source-level control-flow semantics are reflected directly
> in the PTX assembly output (e.g. to say "all threads in a CTA/warp will
> be coherent after such-and-such a loop"), and lowering OpenACC
> directives quite early seems to make that relatively tractable. (Even
> if the resulting code is relatively un-optimisable due to the abnormal
> edges inserted to make sure that the CFG doesn't become "ill-formed".)
> 
> If arbitrary optimisations are done between OMP-lowering time and
> somewhere around vectorisation (say), it's less clear if that
> correspondence can be maintained. Say if the code executed by half the
> threads in a warp becomes physically separated from the code executed
> by the other half of the threads in a warp due to some loop
> optimisation, we can no longer easily determine where that warp will
> reconverge, and certain other operations (relying on coherent warps --
> e.g. CTA synchronisation) become impossible. A similar issue exists for
> warps within a CTA.
> 
> So, essentially -- I don't know how "late" loop lowering would interact
> with:
> 
> (a) Maintaining a CFG that will work with PTX.
> 
> (b) Predication for worker-single and/or vector-single modes
> (actually all currently-proposed schemes have problems with proper
> representation of data-dependencies for variables and
> compiler-generated temporaries between predicated regions.)

I don't understand why lowering the way you suggest helps here at all.
In the proposed scheme, you essentially have the whole function
in e.g. worker-single or vector-single mode, which you need to be able to
handle properly in any case, because users can write such routines
themselves.  And then you can have a loop in such a function that
has some special attribute, a hint that it is desirable to vectorize it
(for PTX the PTX way) or use vector-single mode for it in a worker-single
function.  So, the special pass then of course needs to handle all the
needed broadcasting and reduction required to change the mode from e.g.
worker-single to vector-single, but the convergence points still would be
either on the boundary of such loops to be vectorized or parallelized, or
wherever else they appear in normal vector-single or worker-single functions
(around the calls to certainly calls?).

	Jakub
Julian Brown June 22, 2015, 3:17 p.m. UTC | #8
On Mon, 22 Jun 2015 16:24:56 +0200
Jakub Jelinek <jakub@redhat.com> wrote:

> On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> > One problem is that (at least on the GPU hardware we've considered
> > so far) we're somewhat constrained in how much control we have over
> > how the underlying hardware executes code: it's possible to draw up
> > a scheme where OpenACC source-level control-flow semantics are
> > reflected directly in the PTX assembly output (e.g. to say "all
> > threads in a CTA/warp will be coherent after such-and-such a
> > loop"), and lowering OpenACC directives quite early seems to make
> > that relatively tractable. (Even if the resulting code is
> > relatively un-optimisable due to the abnormal edges inserted to
> > make sure that the CFG doesn't become "ill-formed".)
> > 
> > If arbitrary optimisations are done between OMP-lowering time and
> > somewhere around vectorisation (say), it's less clear if that
> > correspondence can be maintained. Say if the code executed by half
> > the threads in a warp becomes physically separated from the code
> > executed by the other half of the threads in a warp due to some loop
> > optimisation, we can no longer easily determine where that warp will
> > reconverge, and certain other operations (relying on coherent warps
> > -- e.g. CTA synchronisation) become impossible. A similar issue
> > exists for warps within a CTA.
> > 
> > So, essentially -- I don't know how "late" loop lowering would
> > interact with:
> > 
> > (a) Maintaining a CFG that will work with PTX.
> > 
> > (b) Predication for worker-single and/or vector-single modes
> > (actually all currently-proposed schemes have problems with proper
> > representation of data-dependencies for variables and
> > compiler-generated temporaries between predicated regions.)
> 
> I don't understand why lowering the way you suggest helps here at all.
> In the proposed scheme, you essentially have whole function
> in e.g. worker-single or vector-single mode, which you need to be
> able to handle properly in any case, because users can write such
> routines themselves.  And then you can have a loop in such a function
> that has some special attribute, a hint that it is desirable to
> vectorize it (for PTX the PTX way) or use vector-single mode for it
> in a worker-single function.  So, the special pass then of course
> needs to handle all the needed broadcasting and reduction required to
> change the mode from e.g. worker-single to vector-single, but the
> convergence points still would be either on the boundary of such
> loops to be vectorized or parallelized, or wherever else they appear
> in normal vector-single or worker-single functions (around the calls
> to certainly calls?).

I think most of my concerns are centred around loops (with the markings
you suggest) that might be split into parts: if that cannot happen for
loops that are annotated as you describe, maybe things will work out OK.

(Apologies for my ignorance here, this isn't a part of the compiler
that I know anything about.)

Julian
Bernd Schmidt June 22, 2015, 3:18 p.m. UTC | #9
On 06/22/2015 04:24 PM, Jakub Jelinek wrote:
> I don't understand why lowering the way you suggest helps here at all.
> In the proposed scheme, you essentially have whole function
> in e.g. worker-single or vector-single mode, which you need to be able to
> handle properly in any case, because users can write such routines
> themselves.  And then you can have a loop in such a function that
> has some special attribute, a hint that it is desirable to vectorize it
> (for PTX the PTX way) or use vector-single mode for it in a worker-single
> function.

You can have a hint that it is desirable, but not a hint that it is 
correct (because passes in between may invalidate that). The OpenACC 
directives guarantee to the compiler that the program can be transformed 
into a parallel form. If we lose them early we must then rely on our 
analysis which may not be strong enough to prove that the loop can be 
parallelized. If we make these transformations early enough, while we 
still have the OpenACC directives, we can guarantee that we do exactly 
what the programmer specified.


Bernd
Nathan Sidwell June 22, 2015, 4:08 p.m. UTC | #10
On 06/22/15 11:18, Bernd Schmidt wrote:

> You can have a hint that it is desirable, but not a hint that it is correct
> (because passes in between may invalidate that). The OpenACC directives
> guarantee to the compiler that the program can be transformed into a parallel
> form. If we lose them early we must then rely on our analysis which may not be
> strong enough to prove that the loop can be parallelized. If we make these
> transformations early enough, while we still have the OpenACC directives, we can
> guarantee that we do exactly what the programmer specified.

How does this differ from OpenMP's need to preserve parallelism on a parallel 
loop?  Is it more than the reconvergence issue?

nathan
Jakub Jelinek June 22, 2015, 4:20 p.m. UTC | #11
On Mon, Jun 22, 2015 at 12:08:36PM -0400, Nathan Sidwell wrote:
> On 06/22/15 11:18, Bernd Schmidt wrote:
> 
> >You can have a hint that it is desirable, but not a hint that it is correct
> >(because passes in between may invalidate that). The OpenACC directives
> >guarantee to the compiler that the program can be transformed into a parallel
> >form. If we lose them early we must then rely on our analysis which may not be
> >strong enough to prove that the loop can be parallelized. If we make these
> >transformations early enough, while we still have the OpenACC directives, we can
> >guarantee that we do exactly what the programmer specified.
> 
> How does this differ from openmp's needs to preserve parallelism on a
> parallel loop?  Is it more than the reconvergence issue?

OpenMP has a significantly different execution model: a parallel block in
OpenMP is run by a certain number of threads (the initial thread (the one
encountering that region) and then, depending on clauses and library
decisions, perhaps others), with a barrier at the end of the region, and
afterwards only the initial thread continues again.
So, an OpenMP parallel is implemented as a library call, taking the outlined
function built from the parallel's body as one of its arguments, and the body
is executed by the initial thread and perhaps others.
An OpenMP worksharing loop is just coordination between the threads in the
team - which thread takes which subset of the loop's iterations - optionally
followed by a barrier.  An OpenMP simd loop is a loop that has
certain properties guaranteed by the user and can be vectorized.
In contrast to this, OpenACC spawns all the threads/CTAs upfront, and then
idles on some of them until there is work for them.
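
Schematically, for the parallel-as-a-library-call part (the outlined
function name and the data record below are compiler-generated; this is
just a sketch of the shape):

  /*   #pragma omp parallel
       { body; }

     becomes roughly

       struct .omp_data_s { ...captured variables... } .omp_data_o;

       static void main._omp_fn.0 (void *data) { body; }

       GOMP_parallel (main._omp_fn.0, &.omp_data_o, 0, 0);

     i.e. the body runs in the initial thread plus whatever threads the
     library decides to start, with an implicit barrier at the end.  */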

	Jakub
Nathan Sidwell June 22, 2015, 4:31 p.m. UTC | #12
On 06/22/15 12:20, Jakub Jelinek wrote:

> OpenMP worksharing loop is just coordination between the threads in the
> team, which thread takes which subset of the loop's iterations, and
> optionally followed by a barrier.  OpenMP simd loop is a loop that has
> certain properties guaranteed by the user and can be vectorized.
> In contrast to this, OpenACC spawns all the threads/CTAs upfront, and then
> idles on some of them until there is work for them.

correct.  I expressed my question poorly.  What I mean is that in OpenMP, a loop 
that is parallelizable (by user decree, I guess [*]) should not be transformed 
such that it is no longer parallelizable.

This seems to me to be a common requirement of both languages.  How one gets 
parallel threads of execution to the body of the loop is a different question.

nathan

[*] For ones where the compiler needs to detect parallelizability, it's 
preferable that it doesn't do something earlier to force serializability.
Julian Brown June 22, 2015, 5:48 p.m. UTC | #13
On Mon, 22 Jun 2015 16:24:56 +0200
Jakub Jelinek <jakub@redhat.com> wrote:

> On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> > One problem is that (at least on the GPU hardware we've considered
> > so far) we're somewhat constrained in how much control we have over
> > how the underlying hardware executes code: it's possible to draw up
> > a scheme where OpenACC source-level control-flow semantics are
> > reflected directly in the PTX assembly output (e.g. to say "all
> > threads in a CTA/warp will be coherent after such-and-such a
> > loop"), and lowering OpenACC directives quite early seems to make
> > that relatively tractable. (Even if the resulting code is
> > relatively un-optimisable due to the abnormal edges inserted to
> > make sure that the CFG doesn't become "ill-formed".)
> > 
> > If arbitrary optimisations are done between OMP-lowering time and
> > somewhere around vectorisation (say), it's less clear if that
> > correspondence can be maintained. Say if the code executed by half
> > the threads in a warp becomes physically separated from the code
> > executed by the other half of the threads in a warp due to some loop
> > optimisation, we can no longer easily determine where that warp will
> > reconverge, and certain other operations (relying on coherent warps
> > -- e.g. CTA synchronisation) become impossible. A similar issue
> > exists for warps within a CTA.
> > 
> > So, essentially -- I don't know how "late" loop lowering would
> > interact with:
> > 
> > (a) Maintaining a CFG that will work with PTX.
> > 
> > (b) Predication for worker-single and/or vector-single modes
> > (actually all currently-proposed schemes have problems with proper
> > representation of data-dependencies for variables and
> > compiler-generated temporaries between predicated regions.)
> 
> I don't understand why lowering the way you suggest helps here at all.
> In the proposed scheme, you essentially have whole function
> in e.g. worker-single or vector-single mode, which you need to be
> able to handle properly in any case, because users can write such
> routines themselves.

In vector-single or worker-single mode, divergence of threads within a
warp or a CTA is controlled by broadcasting the controlling expression
of conditional branches to the set of "inactive" threads, so each of
those follows along with the active thread. So you only get
potentially-problematic thread divergence when workers or vectors are
operating in partitioned mode.

So, for instance, a made-up example:

#pragma acc parallel
{
  #pragma acc loop gang
  for (i = 0; i < N; i++)
  {
    #pragma acc loop worker
    for (j = 0; j < M; j++)
    {
      if (j < M / 2)
        /* stmt 1 */
      else
        /* stmt 2 */
    }

    /* reconvergence point: thread barrier */

    [...]
  }
}

Here "stmt 1" and "stmt 2" execute in worker-partitioned, vector-single
mode. With "early lowering", the reconvergence point can be
inserted at the end of the loop, and abnormal edges (etc.) can be used
to ensure that the CFG does not get changed in such a way that there is
no longer a unique point at which the loop threads reconverge.

With "late lowering", it's no longer obvious to me if that can still be
done.

Julian
Jakub Jelinek June 22, 2015, 6:27 p.m. UTC | #14
On Mon, Jun 22, 2015 at 06:48:10PM +0100, Julian Brown wrote:
> In vector-single or worker-single mode, divergence of threads within a
> warp or a CTA is controlled by broadcasting the controlling expression
> of conditional branches to the set of "inactive" threads, so each of
> those follows along with the active thread. So you only get
> potentially-problematic thread divergence when workers or vectors are
> operating in partitioned mode.
> 
> So, for instance, a made-up example:
> 
> #pragma acc parallel
> {
>   #pragma acc loop gang
>   for (i = 0; i < N; i++)
>   {
>     #pragma acc loop worker
>     for (j = 0; j < M; j++)
>     {
>       if (j < M / 2)
>         /* stmt 1 */
>       else
>         /* stmt 2 */
>     }
> 
>     /* reconvergence point: thread barrier */
> 
>     [...]
>   }
> }
> 
> Here "stmt 1" and "stmt 2" execute in worker-partitioned, vector-single
> mode. With "early lowering", the reconvergence point can be
> inserted at the end of the loop, and abnormal edges (etc.) can be used
> to ensure that the CFG does not get changed in such a way that there is
> no longer a unique point at which the loop threads reconverge.
> 
> With "late lowering", it's no longer obvious to me if that can still be
> done.

Why?  The loop still has an exit edge (if there is no break/return/throw out of
the loop which I bet is not allowed), so you just insert the reconvergence
point at the exit edge from the loop.
For the "late lowering", I said it is up for benchmarking/investigation
where it would be best placed, it doesn't have to be after the loop passes,
there are plenty of optimization passes even before those.  But once you turn
many of the SSA_NAMEs in a function into (ab) ssa vars, many optimizations
just give up.
And, if you really want to avoid certain loop optimizations, you have always
the possibility to e.g. wrap certain statement in the loop in internal
function (e.g. the loop condition) or something similar to make the passes
more careful about those loops and make it easier to lower it later.
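
E.g. something like the following, where .ACC_LOOP_COND is a made-up
internal function name used only to illustrate the idea:

  /* Before:
       if (i_5 < n_3) goto body; else goto exit;
     after wrapping the controlling expression:
       _7 = .ACC_LOOP_COND (i_5 < n_3);
       if (_7 != 0) goto body; else goto exit;
     Passes that don't know the internal function have to treat it
     conservatively, so the loop keeps its shape until the accel side
     lowering pass strips the wrapper again.  */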

	Jakub
Bernd Schmidt June 24, 2015, 1:11 p.m. UTC | #15
On 06/19/2015 03:45 PM, Jakub Jelinek wrote:

> If the loop remains in the IL (isn't optimized away as unreachable or
> isn't removed, e.g. as a non-loop - say if it contains a noreturn call),
> the flags on struct loop should be still there.  For the loop clauses
> (reduction always, and private/lastprivate if addressable etc.) for
> OpenMP simd / Cilk+ simd we use special arrays indexed by internal
> functions, which then during vectorization are shrunk (but in theory could
> be expanded too) to the right vectorization factor if vectorized, of course
> accesses within the loop vectorized using SIMD, and if not vectorized,
> shrunk to 1 element.

I'd appreciate it if you could describe that mechanism in more detail. As 
far as I can tell it is very poorly commented and documented in the 
code. I mean, it doesn't even follow the minimal coding standards of 
describing function inputs:

/* Helper function of lower_rec_input_clauses, used for #pragma omp simd
    privatization.  */

static bool
lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int &max_vf,
			      tree &idx, tree &lane, tree &ivar, tree &lvar)


Bernd
Jakub Jelinek June 24, 2015, 1:53 p.m. UTC | #16
On Wed, Jun 24, 2015 at 03:11:04PM +0200, Bernd Schmidt wrote:
> On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
> 
> >If the loop remains in the IL (isn't optimized away as unreachable or
> >isn't removed, e.g. as a non-loop - say if it contains a noreturn call),
> >the flags on struct loop should be still there.  For the loop clauses
> >(reduction always, and private/lastprivate if addressable etc.) for
> >OpenMP simd / Cilk+ simd we use special arrays indexed by internal
> >functions, which then during vectorization are shrunk (but in theory could
> >be expanded too) to the right vectorization factor if vectorized, of course
> >accesses within the loop vectorized using SIMD, and if not vectorized,
> >shrunk to 1 element.
> 
> I'd appreciate if you could describe that mechanism in more detail. As far
> as I can tell it is very poorly commented and documented in the code. I
> mean, it doesn't even follow the minimal coding standards of describing
> function inputs:
> 
> /* Helper function of lower_rec_input_clauses, used for #pragma omp simd
>    privatization.  */
> 
> static bool
> lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int &max_vf,
> 			      tree &idx, tree &lane, tree &ivar, tree &lvar)

Here is the theory behind it:
https://gcc.gnu.org/ml/gcc-patches/2013-04/msg01661.html
In the end it is using internal functions instead of uglified builtins.
I'd suggest you look at some of the libgomp.c/simd*.c tests, say
with -O2 -mavx2 -fdump-tree-{omplower,ssa,ifcvt,vect,optimized}
to see how it is lowered and expanded.  I assume #pragma omp simd roughly
corresponds to #pragma acc loop vector, maxvf for PTX vectorization is
supposedly 32 (warp size).  For SIMD vectorization, if the vectorization
fails, the arrays are shrunk to 1 element, otherwise they are shrunk to the
vectorization factor, and, if they aren't really addressable, they are
later optimized using FRE and other memory optimizations so that they
don't touch memory unless really needed.
For the PTX style vectorization (parallelization between threads in a warp),
I'd say you would always shrink to 1 element again, but such variables would
be local to each of the threads in the warp (or another possibility is
shared arrays of size 32 indexed by %tid.x & 31), while addressable variables
without such magic type would be shared among all threads; non-addressable
variables (SSA_NAMEs) depending on where they are used.
You'd need to transform reductions (which are right now represented as
another loop, from 0 to an internal function, so easily recognizable) into
the PTX reductions.  Also, lastprivate is now an access to the array using
the last lane internal function; dunno what that corresponds to in PTX
(perhaps also a reduction where all but the thread executing the last
iteration or in, say, 0 and the remaining thread ors in the lastprivate value).
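
For the reduction part, a sketch of the shape of the epilogue after
omplower (again simplified, not a real dump):

  /* Source:
       #pragma omp simd reduction(+:sum)
       for (i = 0; i < n; i++) sum += a[i];

     After omplower the per-lane partials live in an "omp simd array"
     and the reduction is the easily recognizable epilogue loop:

       _vf = .GOMP_SIMD_VF (simduid.0);
       for (_ix = 0; _ix < _vf; _ix++)
         sum = sum + sum.arr[_ix];

     A PTX lowering pass could replace that epilogue with a warp-level
     reduction (shuffles or shared memory) instead of the serial loop.  */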

	Jakub

Patch

Index: gcc/cgraphunit.c
===================================================================
--- gcc/cgraphunit.c	(revision 224547)
+++ gcc/cgraphunit.c	(working copy)
@@ -2171,6 +2171,23 @@  ipa_passes (void)
       execute_ipa_pass_list (passes->all_small_ipa_passes);
       if (seen_error ())
 	return;
+
+      if (g->have_offload)
+	{
+	  extern void write_offload_lto ();
+	  section_name_prefix = OFFLOAD_SECTION_NAME_PREFIX;
+	  write_offload_lto ();
+	}
+    }
+  bool do_local_opts = !in_lto_p;
+#ifdef ACCEL_COMPILER
+  do_local_opts = true;
+#endif
+  if (do_local_opts)
+    {
+      execute_ipa_pass_list (passes->all_local_opt_passes);
+      if (seen_error ())
+	return;
     }
 
   /* This extra symtab_remove_unreachable_nodes pass tends to catch some
@@ -2182,7 +2199,7 @@  ipa_passes (void)
   if (symtab->state < IPA_SSA)
     symtab->state = IPA_SSA;
 
-  if (!in_lto_p)
+  if (do_local_opts)
     {
       /* Generate coverage variables and constructors.  */
       coverage_finish ();
@@ -2285,6 +2302,14 @@  symbol_table::compile (void)
   if (seen_error ())
     return;
 
+#ifdef ACCEL_COMPILER
+  {
+    cgraph_node *node;
+    FOR_EACH_DEFINED_FUNCTION (node)
+      node->get_untransformed_body ();
+  }
+#endif
+
 #ifdef ENABLE_CHECKING
   symtab_node::verify_symtab_nodes ();
 #endif
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 224547)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -1171,18 +1171,42 @@  nvptx_section_from_addr_space (addr_spac
     }
 }
 
-/* Determine whether DECL goes into .const or .global.  */
+/* Determine the address space DECL lives in.  */
 
-const char *
-nvptx_section_for_decl (const_tree decl)
+static addr_space_t
+nvptx_addr_space_for_decl (const_tree decl)
 {
+  if (decl == NULL_TREE || TREE_CODE (decl) == FUNCTION_DECL)
+    return ADDR_SPACE_GENERIC;
+
+  if (lookup_attribute ("oacc ganglocal", DECL_ATTRIBUTES (decl)) != NULL_TREE)
+    return ADDR_SPACE_SHARED;
+
   bool is_const = (CONSTANT_CLASS_P (decl)
 		   || TREE_CODE (decl) == CONST_DECL
 		   || TREE_READONLY (decl));
   if (is_const)
-    return ".const";
+    return ADDR_SPACE_CONST;
 
-  return ".global";
+  return ADDR_SPACE_GLOBAL;
+}
+
+/* Return a ptx string representing the address space for a variable DECL.  */
+
+const char *
+nvptx_section_for_decl (const_tree decl)
+{
+  switch (nvptx_addr_space_for_decl (decl))
+    {
+    case ADDR_SPACE_CONST:
+      return ".const";
+    case ADDR_SPACE_SHARED:
+      return ".shared";
+    case ADDR_SPACE_GLOBAL:
+      return ".global";
+    default:
+      gcc_unreachable ();
+    }
 }
 
 /* Look for a SYMBOL_REF in ADDR and return the address space to be used
@@ -1196,17 +1220,7 @@  nvptx_addr_space_from_address (rtx addr)
   if (GET_CODE (addr) != SYMBOL_REF)
     return ADDR_SPACE_GENERIC;
 
-  tree decl = SYMBOL_REF_DECL (addr);
-  if (decl == NULL_TREE || TREE_CODE (decl) == FUNCTION_DECL)
-    return ADDR_SPACE_GENERIC;
-
-  bool is_const = (CONSTANT_CLASS_P (decl)
-		   || TREE_CODE (decl) == CONST_DECL
-		   || TREE_READONLY (decl));
-  if (is_const)
-    return ADDR_SPACE_CONST;
-
-  return ADDR_SPACE_GLOBAL;
+  return nvptx_addr_space_for_decl (SYMBOL_REF_DECL (addr));
 }
 
 /* Machinery to output constant initializers.  */
Index: gcc/gimple-pretty-print.c
===================================================================
--- gcc/gimple-pretty-print.c	(revision 224547)
+++ gcc/gimple-pretty-print.c	(working copy)
@@ -1175,11 +1175,12 @@  dump_gimple_omp_for (pretty_printer *buf
       dump_gimple_fmt (buffer, spc, flags, " >,");
       for (i = 0; i < gimple_omp_for_collapse (gs); i++)
 	dump_gimple_fmt (buffer, spc, flags,
-			 "%+%T, %T, %T, %s, %T,%n",
+			 "%+%T, %T, %T, %s, %s, %T,%n",
 			 gimple_omp_for_index (gs, i),
 			 gimple_omp_for_initial (gs, i),
 			 gimple_omp_for_final (gs, i),
 			 get_tree_code_name (gimple_omp_for_cond (gs, i)),
+			 get_tree_code_name (gimple_omp_for_incr_code (gs, i)),
 			 gimple_omp_for_incr (gs, i));
       dump_gimple_fmt (buffer, spc, flags, "PRE_BODY <%S>%->",
 		       gimple_omp_for_pre_body (gs));
@@ -1259,6 +1260,20 @@  dump_gimple_omp_for (pretty_printer *buf
 	  dump_generic_node (buffer, gimple_omp_for_index (gs, i), spc,
 			     flags, false);
 	  pp_string (buffer, " = ");
+	  dump_generic_node (buffer, gimple_omp_for_index (gs, i), spc,
+			     flags, false);
+	  switch (gimple_omp_for_incr_code (gs, i))
+	    {
+	    case POINTER_PLUS_EXPR:
+	    case PLUS_EXPR:
+	      pp_plus (buffer);
+	      break;
+	    case MINUS_EXPR:
+	      pp_minus (buffer);
+	      break;
+	    default:
+	      gcc_unreachable ();
+	    }
 	  dump_generic_node (buffer, gimple_omp_for_incr (gs, i), spc,
 			     flags, false);
 	  pp_right_paren (buffer);
Index: gcc/gimple-streamer-in.c
===================================================================
--- gcc/gimple-streamer-in.c	(revision 224547)
+++ gcc/gimple-streamer-in.c	(working copy)
@@ -176,6 +176,7 @@  input_gimple_stmt (struct lto_input_bloc
       }
       /* Fallthru  */
 
+    case GIMPLE_OMP_ENTRY_END:
     case GIMPLE_ASSIGN:
     case GIMPLE_CALL:
     case GIMPLE_RETURN:
@@ -225,6 +226,7 @@  input_gimple_stmt (struct lto_input_bloc
 
     case GIMPLE_NOP:
     case GIMPLE_PREDICT:
+    case GIMPLE_OMP_RETURN:
       break;
 
     case GIMPLE_TRANSACTION:
@@ -232,6 +234,42 @@  input_gimple_stmt (struct lto_input_bloc
 				    stream_read_tree (ib, data_in));
       break;
 
+    case GIMPLE_OMP_FOR:
+      {
+	gomp_for *for_stmt = as_a <gomp_for *> (stmt);
+	gimple_omp_for_set_clauses (for_stmt, stream_read_tree (ib, data_in));
+	size_t collapse = streamer_read_hwi (ib);
+	for_stmt->collapse = collapse;
+	for_stmt->iter = ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse);
+	for (size_t i = 0; i < collapse; i++)
+	  {
+	    gimple_omp_for_set_cond (stmt, i, streamer_read_enum (ib, tree_code,
+							       MAX_TREE_CODES));
+	    gimple_omp_for_set_incr_code (stmt, i, streamer_read_enum (ib, tree_code,
+								       MAX_TREE_CODES));
+	    gimple_omp_for_set_index (stmt, i, stream_read_tree (ib, data_in));
+	    gimple_omp_for_set_initial (stmt, i, stream_read_tree (ib, data_in));
+	    gimple_omp_for_set_final (stmt, i, stream_read_tree (ib, data_in));
+	    gimple_omp_for_set_incr (stmt, i, stream_read_tree (ib, data_in));
+	  }
+      }
+      break;
+
+    case GIMPLE_OMP_CONTINUE:
+      {
+	gomp_continue *cont_stmt = as_a <gomp_continue *> (stmt);
+	gimple_omp_continue_set_control_def (cont_stmt, stream_read_tree (ib, data_in));
+	gimple_omp_continue_set_control_use (cont_stmt, stream_read_tree (ib, data_in));
+      }
+      break;
+
+    case GIMPLE_OMP_TARGET:
+      {
+	gomp_target *tgt_stmt = as_a <gomp_target *> (stmt);
+	gimple_omp_target_set_clauses (tgt_stmt, stream_read_tree (ib, data_in));
+      }
+      break;
+
     default:
       internal_error ("bytecode stream: unknown GIMPLE statement tag %s",
 		      lto_tag_name (tag));
@@ -239,9 +277,9 @@  input_gimple_stmt (struct lto_input_bloc
 
   /* Update the properties of symbols, SSA names and labels associated
      with STMT.  */
-  if (code == GIMPLE_ASSIGN || code == GIMPLE_CALL)
+  if (code == GIMPLE_ASSIGN || code == GIMPLE_CALL || code == GIMPLE_OMP_CONTINUE)
     {
-      tree lhs = gimple_get_lhs (stmt);
+      tree lhs = gimple_op (stmt, 0);
       if (lhs && TREE_CODE (lhs) == SSA_NAME)
 	SSA_NAME_DEF_STMT (lhs) = stmt;
     }
@@ -257,7 +295,16 @@  input_gimple_stmt (struct lto_input_bloc
 	    SSA_NAME_DEF_STMT (op) = stmt;
 	}
     }
-
+  else if (code == GIMPLE_OMP_FOR)
+    {
+      gomp_for *for_stmt = as_a <gomp_for *> (stmt);
+      for (unsigned i = 0; i < gimple_omp_for_collapse (for_stmt); i++)
+	{
+	  tree op = gimple_omp_for_index (for_stmt, i);
+	  if (TREE_CODE (op) == SSA_NAME)
+	    SSA_NAME_DEF_STMT (op) = stmt;
+	}
+    }
   /* Reset alias information.  */
   if (code == GIMPLE_CALL)
     gimple_call_reset_alias_info (as_a <gcall *> (stmt));
Index: gcc/gimple-streamer-out.c
===================================================================
--- gcc/gimple-streamer-out.c	(revision 224547)
+++ gcc/gimple-streamer-out.c	(working copy)
@@ -147,6 +147,7 @@  output_gimple_stmt (struct output_block
       }
       /* Fallthru  */
 
+    case GIMPLE_OMP_ENTRY_END:
     case GIMPLE_ASSIGN:
     case GIMPLE_CALL:
     case GIMPLE_RETURN:
@@ -201,6 +202,7 @@  output_gimple_stmt (struct output_block
 
     case GIMPLE_NOP:
     case GIMPLE_PREDICT:
+    case GIMPLE_OMP_RETURN:
       break;
 
     case GIMPLE_TRANSACTION:
@@ -211,6 +213,45 @@  output_gimple_stmt (struct output_block
       }
       break;
 
+    case GIMPLE_OMP_FOR:
+      {
+	gomp_for *for_stmt = as_a <gomp_for *> (stmt);
+	stream_write_tree (ob, gimple_omp_for_clauses (for_stmt), true);
+	size_t collapse_count = gimple_omp_for_collapse (for_stmt);
+	streamer_write_hwi (ob, collapse_count);
+	for (size_t i = 0; i < collapse_count; i++)
+	  {
+	    streamer_write_enum (ob->main_stream, tree_code, MAX_TREE_CODES,
+				 gimple_omp_for_cond (for_stmt, i));
+	    streamer_write_enum (ob->main_stream, tree_code, MAX_TREE_CODES,
+				 gimple_omp_for_incr_code (for_stmt, i));
+	    stream_write_tree (ob, gimple_omp_for_index (for_stmt, i), true);
+	    stream_write_tree (ob, gimple_omp_for_initial (for_stmt, i), true);
+	    stream_write_tree (ob, gimple_omp_for_final (for_stmt, i), true);
+	    stream_write_tree (ob, gimple_omp_for_incr (for_stmt, i), true);
+	  }
+	/* No need to write out the pre-body, it's empty by the time we
+	   get here.  */
+      }
+      break;
+
+    case GIMPLE_OMP_CONTINUE:
+      {
+	gomp_continue *cont_stmt = as_a <gomp_continue *> (stmt);
+	stream_write_tree (ob, gimple_omp_continue_control_def (cont_stmt),
+			   true);
+	stream_write_tree (ob, gimple_omp_continue_control_use (cont_stmt),
+			   true);
+      }
+      break;
+
+    case GIMPLE_OMP_TARGET:
+      {
+	gomp_target *tgt_stmt = as_a <gomp_target *> (stmt);
+	stream_write_tree (ob, gimple_omp_target_clauses (tgt_stmt), true);
+      }
+      break;
+
     default:
       gcc_unreachable ();
     }
Index: gcc/gimple.c
===================================================================
--- gcc/gimple.c	(revision 224547)
+++ gcc/gimple.c	(working copy)
@@ -855,9 +855,11 @@  gimple_build_debug_source_bind_stat (tre
 /* Build a GIMPLE_OMP_ENTRY_END statement.  */
 
 gimple
-gimple_build_omp_entry_end (void)
+gimple_build_omp_entry_end (tree var)
 {
-  return gimple_alloc (GIMPLE_OMP_ENTRY_END, 0);
+  gimple t = gimple_alloc (GIMPLE_OMP_ENTRY_END, 1);
+  gimple_set_op (t, 0, var);
+  return t;
 }
 
 
@@ -890,13 +892,14 @@  gomp_for *
 gimple_build_omp_for (gimple_seq body, int kind, tree clauses, size_t collapse,
 		      gimple_seq pre_body)
 {
-  gomp_for *p = as_a <gomp_for *> (gimple_alloc (GIMPLE_OMP_FOR, 0));
+  int nops = collapse * 4;
+  gomp_for *p = as_a <gomp_for *> (gimple_alloc (GIMPLE_OMP_FOR, nops));
   if (body)
     gimple_omp_set_body (p, body);
   gimple_omp_for_set_clauses (p, clauses);
   gimple_omp_for_set_kind (p, kind);
   p->collapse = collapse;
-  p->iter =  ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse);
+  p->iter = ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse);
 
   if (pre_body)
     gimple_omp_for_set_pre_body (p, pre_body);
@@ -1011,7 +1014,7 @@  gomp_continue *
 gimple_build_omp_continue (tree control_def, tree control_use)
 {
   gomp_continue *p
-    = as_a <gomp_continue *> (gimple_alloc (GIMPLE_OMP_CONTINUE, 0));
+    = as_a <gomp_continue *> (gimple_alloc (GIMPLE_OMP_CONTINUE, 2));
   gimple_omp_continue_set_control_def (p, control_def);
   gimple_omp_continue_set_control_use (p, control_use);
   return p;
Index: gcc/gimple.def
===================================================================
--- gcc/gimple.def	(revision 224547)
+++ gcc/gimple.def	(working copy)
@@ -225,11 +225,11 @@  DEFGSCODE(GIMPLE_OMP_ATOMIC_STORE, "gimp
 
 /* GIMPLE_OMP_CONTINUE marks the location of the loop or sections
    iteration in partially lowered OpenMP code.  */
-DEFGSCODE(GIMPLE_OMP_CONTINUE, "gimple_omp_continue", GSS_OMP_CONTINUE)
+DEFGSCODE(GIMPLE_OMP_CONTINUE, "gimple_omp_continue", GSS_WITH_OPS)
 
 /* GIMPLE_OMP_ENTRY_END marks the end of the unpredicated entry block
    into an offloaded region.  */
-DEFGSCODE(GIMPLE_OMP_ENTRY_END, "gimple_omp_entry_end", GSS_BASE)
+DEFGSCODE(GIMPLE_OMP_ENTRY_END, "gimple_omp_entry_end", GSS_WITH_OPS)
 
 /* GIMPLE_OMP_CRITICAL <NAME, BODY> represents
 
Index: gcc/gimple.h
===================================================================
--- gcc/gimple.h	(revision 224547)
+++ gcc/gimple.h	(working copy)
@@ -301,7 +301,7 @@  struct GTY((tag("GSS_CALL")))
 /* OMP statements.  */
 
 struct GTY((tag("GSS_OMP")))
-  gimple_statement_omp : public gimple_statement_base
+  gimple_statement_omp : public gimple_statement_with_ops_base
 {
   /* [ WORD 1-6 ] : base class */
 
@@ -520,20 +520,8 @@  struct GTY((tag("GSS_OMP_CRITICAL")))
 
 
 struct GTY(()) gimple_omp_for_iter {
-  /* Condition code.  */
-  enum tree_code cond;
-
-  /* Index variable.  */
-  tree index;
-
-  /* Initial value.  */
-  tree initial;
-
-  /* Final value.  */
-  tree final;
-
-  /* Increment.  */
-  tree incr;
+  /* Condition code and increment code.  */
+  enum tree_code cond, incr;
 };
 
 /* GIMPLE_OMP_FOR */
@@ -556,6 +544,12 @@  struct GTY((tag("GSS_OMP_FOR")))
   /* [ WORD 11 ]
      Pre-body evaluated before the loop body begins.  */
   gimple_seq pre_body;
+
+  /* [ WORD 12 ]
+     Operand vector.  NOTE!  This must always be the last field
+     of this structure.  In particular, this means that this
+     structure cannot be embedded inside another one.  */
+  tree GTY((length ("%h.num_ops"))) op[1];
 };
 
 
@@ -581,10 +575,6 @@  struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -655,16 +645,10 @@  struct GTY((tag("GSS_OMP_SECTIONS")))
    Note: This does not inherit from gimple_statement_omp, because we
          do not need the body field.  */
 
-struct GTY((tag("GSS_OMP_CONTINUE")))
-  gomp_continue : public gimple_statement_base
+struct GTY((tag("GSS_WITH_OPS")))
+  gomp_continue : public gimple_statement_with_ops
 {
-  /* [ WORD 1-6 ] : base class */
-
-  /* [ WORD 7 ]  */
-  tree control_def;
-
-  /* [ WORD 8 ]  */
-  tree control_use;
+  /* no additional fields; this uses the layout for GSS_WITH_OPS. */
 };
 
 /* GIMPLE_OMP_SINGLE, GIMPLE_OMP_TEAMS */
@@ -1356,7 +1340,7 @@  gimple gimple_build_omp_taskgroup (gimpl
 gomp_continue *gimple_build_omp_continue (tree, tree);
 gimple gimple_build_omp_ordered (gimple_seq);
 gimple gimple_build_omp_return (bool);
-gimple gimple_build_omp_entry_end ();
+gimple gimple_build_omp_entry_end (tree);
 gomp_sections *gimple_build_omp_sections (gimple_seq, tree);
 gimple gimple_build_omp_sections_switch (void);
 gomp_single *gimple_build_omp_single (gimple_seq, tree);
@@ -1853,7 +1837,10 @@  gimple_init_singleton (gimple g)
 static inline bool
 gimple_has_ops (const_gimple g)
 {
-  return gimple_code (g) >= GIMPLE_COND && gimple_code (g) <= GIMPLE_RETURN;
+  return ((gimple_code (g) >= GIMPLE_COND && gimple_code (g) <= GIMPLE_RETURN)
+	  || gimple_code (g) == GIMPLE_OMP_FOR
+	  || gimple_code (g) == GIMPLE_OMP_ENTRY_END
+	  || gimple_code (g) == GIMPLE_OMP_CONTINUE);
 }
 
 template <>
@@ -4559,6 +4546,27 @@  gimple_omp_for_set_cond (gimple gs, size
   omp_for_stmt->iter[i].cond = cond;
 }
 
+/* Return the increment code associated with the OMP_FOR statement GS.  */
+
+static inline enum tree_code
+gimple_omp_for_incr_code (const_gimple gs, size_t i)
+{
+  const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs);
+  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
+  return omp_for_stmt->iter[i].incr;
+}
+
+
+/* Set INCR to be the increment code for the OMP_FOR statement GS.  */
+
+static inline void
+gimple_omp_for_set_incr_code (gimple gs, size_t i, enum tree_code incr)
+{
+  gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
+  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
+  omp_for_stmt->iter[i].incr = incr;
+}
+
 
 /* Return the index variable for the OMP_FOR statement GS.  */
 
@@ -4567,7 +4575,7 @@  gimple_omp_for_index (const_gimple gs, s
 {
   const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs);
   gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return omp_for_stmt->iter[i].index;
+  return gimple_op (gs, i);
 }
 
 
@@ -4578,7 +4586,7 @@  gimple_omp_for_index_ptr (gimple gs, siz
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
   gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return &omp_for_stmt->iter[i].index;
+  return gimple_op_ptr (gs, i);
 }
 
 
@@ -4588,8 +4596,9 @@  static inline void
 gimple_omp_for_set_index (gimple gs, size_t i, tree index)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  omp_for_stmt->iter[i].index = index;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  gimple_set_op (gs, i, index);
 }
 
 
@@ -4599,8 +4608,9 @@  static inline tree
 gimple_omp_for_initial (const_gimple gs, size_t i)
 {
   const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return omp_for_stmt->iter[i].initial;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op (gs, i + c);
 }
 
 
@@ -4610,8 +4620,9 @@  static inline tree *
 gimple_omp_for_initial_ptr (gimple gs, size_t i)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return &omp_for_stmt->iter[i].initial;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op_ptr (gs, i + c);
 }
 
 
@@ -4621,8 +4632,9 @@  static inline void
 gimple_omp_for_set_initial (gimple gs, size_t i, tree initial)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  omp_for_stmt->iter[i].initial = initial;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  gimple_set_op (gs, i + c, initial);
 }
 
 
@@ -4632,8 +4644,9 @@  static inline tree
 gimple_omp_for_final (const_gimple gs, size_t i)
 {
   const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return omp_for_stmt->iter[i].final;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op (gs, i + c * 2);
 }
 
 
@@ -4643,8 +4656,9 @@  static inline tree *
 gimple_omp_for_final_ptr (gimple gs, size_t i)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return &omp_for_stmt->iter[i].final;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op_ptr (gs, i + c * 2);
 }
 
 
@@ -4654,8 +4668,9 @@  static inline void
 gimple_omp_for_set_final (gimple gs, size_t i, tree final)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  omp_for_stmt->iter[i].final = final;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  gimple_set_op (gs, i + c * 2, final);
 }
 
 
@@ -4665,8 +4680,9 @@  static inline tree
 gimple_omp_for_incr (const_gimple gs, size_t i)
 {
   const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return omp_for_stmt->iter[i].incr;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op (gs, i + c * 3);
 }
 
 
@@ -4676,8 +4692,9 @@  static inline tree *
 gimple_omp_for_incr_ptr (gimple gs, size_t i)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  return &omp_for_stmt->iter[i].incr;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  return gimple_op_ptr (gs, i + c * 3);
 }
 
 
@@ -4687,8 +4704,9 @@  static inline void
 gimple_omp_for_set_incr (gimple gs, size_t i, tree incr)
 {
   gomp_for *omp_for_stmt = as_a <gomp_for *> (gs);
-  gcc_gimple_checking_assert (i < omp_for_stmt->collapse);
-  omp_for_stmt->iter[i].incr = incr;
+  size_t c = omp_for_stmt->collapse;
+  gcc_gimple_checking_assert (i < c);
+  gimple_set_op (gs, i + c * 3, incr);
 }
 
 
@@ -5248,25 +5266,6 @@  gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
@@ -5446,7 +5445,7 @@  gimple_omp_atomic_load_rhs_ptr (gomp_ato
 static inline tree
 gimple_omp_continue_control_def (const gomp_continue *cont_stmt)
 {
-  return cont_stmt->control_def;
+  return gimple_op (cont_stmt, 0);
 }
 
 /* The same as above, but return the address.  */
@@ -5454,7 +5453,7 @@  gimple_omp_continue_control_def (const g
 static inline tree *
 gimple_omp_continue_control_def_ptr (gomp_continue *cont_stmt)
 {
-  return &cont_stmt->control_def;
+  return gimple_op_ptr (cont_stmt, 0);
 }
 
 /* Set the definition of the control variable in a GIMPLE_OMP_CONTINUE.  */
@@ -5462,7 +5461,7 @@  gimple_omp_continue_control_def_ptr (gom
 static inline void
 gimple_omp_continue_set_control_def (gomp_continue *cont_stmt, tree def)
 {
-  cont_stmt->control_def = def;
+  gimple_set_op (cont_stmt, 0, def);
 }
 
 
@@ -5471,7 +5470,7 @@  gimple_omp_continue_set_control_def (gom
 static inline tree
 gimple_omp_continue_control_use (const gomp_continue *cont_stmt)
 {
-  return cont_stmt->control_use;
+  return gimple_op (cont_stmt, 1);
 }
 
 
@@ -5480,7 +5479,7 @@  gimple_omp_continue_control_use (const g
 static inline tree *
 gimple_omp_continue_control_use_ptr (gomp_continue *cont_stmt)
 {
-  return &cont_stmt->control_use;
+  return gimple_op_ptr (cont_stmt, 1);
 }
 
 
@@ -5489,7 +5488,7 @@  gimple_omp_continue_control_use_ptr (gom
 static inline void
 gimple_omp_continue_set_control_use (gomp_continue *cont_stmt, tree use)
 {
-  cont_stmt->control_use = use;
+  gimple_set_op (cont_stmt, 1, use);
 }
 
 /* Return a pointer to the body for the GIMPLE_TRANSACTION statement
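
(For reference, a standalone sketch of the operand numbering the new gimple.h
accessors encode once gomp_for carries its iteration trees as ordinary gimple
operands; this is plain C, not GCC internals, and the helper name is made up.)

/* For a loop nest collapsed to C dimensions, dimension I keeps its index
   variable at operand I, the initial value at I + C, the final value at
   I + 2*C and the step at I + 3*C; the increment tree code is stored
   separately via gimple_omp_for_set_incr_code.  */
#include <stdio.h>

enum omp_for_slot { SLOT_INDEX, SLOT_INITIAL, SLOT_FINAL, SLOT_INCR };

static unsigned int
omp_for_operand (unsigned int collapse, unsigned int i, enum omp_for_slot kind)
{
  return i + collapse * (unsigned int) kind;
}

int
main (void)
{
  unsigned int collapse = 2;
  for (unsigned int i = 0; i < collapse; i++)
    printf ("dim %u: index=%u initial=%u final=%u incr=%u\n", i,
	    omp_for_operand (collapse, i, SLOT_INDEX),
	    omp_for_operand (collapse, i, SLOT_INITIAL),
	    omp_for_operand (collapse, i, SLOT_FINAL),
	    omp_for_operand (collapse, i, SLOT_INCR));
  return 0;
}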
Index: gcc/gimplify.c
===================================================================
--- gcc/gimplify.c	(revision 224547)
+++ gcc/gimplify.c	(working copy)
@@ -7582,12 +7582,15 @@  gimplify_omp_for (tree *expr_p, gimple_s
   for (i = 0; i < TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)); i++)
     {
       t = TREE_VEC_ELT (OMP_FOR_INIT (for_stmt), i);
-      gimple_omp_for_set_index (gfor, i, TREE_OPERAND (t, 0));
+      tree idxvar = TREE_OPERAND (t, 0);
+      gimple_omp_for_set_index (gfor, i, idxvar);
       gimple_omp_for_set_initial (gfor, i, TREE_OPERAND (t, 1));
       t = TREE_VEC_ELT (OMP_FOR_COND (for_stmt), i);
       gimple_omp_for_set_cond (gfor, i, TREE_CODE (t));
       gimple_omp_for_set_final (gfor, i, TREE_OPERAND (t, 1));
       t = TREE_VEC_ELT (OMP_FOR_INCR (for_stmt), i);
+      t = TREE_OPERAND (t, 1);
+      gimple_omp_for_set_incr_code (gfor, i, TREE_CODE (t));
       gimple_omp_for_set_incr (gfor, i, TREE_OPERAND (t, 1));
     }
 
Index: gcc/gsstruct.def
===================================================================
--- gcc/gsstruct.def	(revision 224547)
+++ gcc/gsstruct.def	(working copy)
@@ -42,12 +42,11 @@  DEFGSSTRUCT(GSS_EH_ELSE, geh_else, false
 DEFGSSTRUCT(GSS_WCE, gimple_statement_wce, false)
 DEFGSSTRUCT(GSS_OMP, gimple_statement_omp, false)
 DEFGSSTRUCT(GSS_OMP_CRITICAL, gomp_critical, false)
-DEFGSSTRUCT(GSS_OMP_FOR, gomp_for, false)
+DEFGSSTRUCT(GSS_OMP_FOR, gomp_for, true)
 DEFGSSTRUCT(GSS_OMP_PARALLEL_LAYOUT, gimple_statement_omp_parallel_layout, false)
 DEFGSSTRUCT(GSS_OMP_TASK, gomp_task, false)
 DEFGSSTRUCT(GSS_OMP_SECTIONS, gomp_sections, false)
 DEFGSSTRUCT(GSS_OMP_SINGLE_LAYOUT, gimple_statement_omp_single_layout, false)
-DEFGSSTRUCT(GSS_OMP_CONTINUE, gomp_continue, false)
 DEFGSSTRUCT(GSS_OMP_ATOMIC_LOAD, gomp_atomic_load, false)
 DEFGSSTRUCT(GSS_OMP_ATOMIC_STORE_LAYOUT, gomp_atomic_store, false)
 DEFGSSTRUCT(GSS_TRANSACTION, gtransaction, false)
Index: gcc/ipa-inline-analysis.c
===================================================================
--- gcc/ipa-inline-analysis.c	(revision 224547)
+++ gcc/ipa-inline-analysis.c	(working copy)
@@ -4122,10 +4122,12 @@  inline_generate_summary (void)
 {
   struct cgraph_node *node;
 
+#ifndef ACCEL_COMPILER
   /* When not optimizing, do not bother to analyze.  Inlining is still done
      because edge redirection needs to happen there.  */
   if (!optimize && !flag_generate_lto && !flag_generate_offload && !flag_wpa)
     return;
+#endif
 
   if (!inline_summaries)
     inline_summaries = (inline_summary_t*) inline_summary_t::create_ggc (symtab);
Index: gcc/lto/lto.c
===================================================================
--- gcc/lto/lto.c	(revision 224547)
+++ gcc/lto/lto.c	(working copy)
@@ -3115,8 +3115,10 @@  read_cgraph_and_symbols (unsigned nfiles
   /* Read the IPA summary data.  */
   if (flag_ltrans)
     ipa_read_optimization_summaries ();
+#ifndef ACCEL_COMPILER
   else
     ipa_read_summaries ();
+#endif
 
   for (i = 0; all_file_decl_data[i]; i++)
     {
Index: gcc/lto-streamer-out.c
===================================================================
--- gcc/lto-streamer-out.c	(revision 224547)
+++ gcc/lto-streamer-out.c	(working copy)
@@ -1800,27 +1800,32 @@  output_ssa_names (struct output_block *o
 {
   unsigned int i, len;
 
-  len = vec_safe_length (SSANAMES (fn));
-  streamer_write_uhwi (ob, len);
-
-  for (i = 1; i < len; i++)
+  if (cfun->gimple_df)
     {
-      tree ptr = (*SSANAMES (fn))[i];
+      len = vec_safe_length (SSANAMES (fn));
+      streamer_write_uhwi (ob, len);
 
-      if (ptr == NULL_TREE
-	  || SSA_NAME_IN_FREE_LIST (ptr)
-	  || virtual_operand_p (ptr))
-	continue;
+      for (i = 1; i < len; i++)
+	{
+	  tree ptr = (*SSANAMES (fn))[i];
 
-      streamer_write_uhwi (ob, i);
-      streamer_write_char_stream (ob->main_stream,
-				  SSA_NAME_IS_DEFAULT_DEF (ptr));
-      if (SSA_NAME_VAR (ptr))
-	stream_write_tree (ob, SSA_NAME_VAR (ptr), true);
-      else
-	/* ???  This drops SSA_NAME_IDENTIFIER on the floor.  */
-	stream_write_tree (ob, TREE_TYPE (ptr), true);
+	  if (ptr == NULL_TREE
+	      || SSA_NAME_IN_FREE_LIST (ptr)
+	      || virtual_operand_p (ptr))
+	    continue;
+
+	  streamer_write_uhwi (ob, i);
+	  streamer_write_char_stream (ob->main_stream,
+				      SSA_NAME_IS_DEFAULT_DEF (ptr));
+	  if (SSA_NAME_VAR (ptr))
+	    stream_write_tree (ob, SSA_NAME_VAR (ptr), true);
+	  else
+	    /* ???  This drops SSA_NAME_IDENTIFIER on the floor.  */
+	    stream_write_tree (ob, TREE_TYPE (ptr), true);
+	}
     }
+  else
+    streamer_write_zero (ob);
 
   streamer_write_zero (ob);
 }
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 224547)
+++ gcc/omp-low.c	(working copy)
@@ -110,7 +110,7 @@  along with GCC; see the file COPYING3.
 #include "gomp-constants.h"
 #include "gimple-pretty-print.h"
 #include "set"
-
+#include "output.h"
 
 /* Lowering of OMP parallel and workshare constructs proceeds in two
    phases.  The first phase scans the function looking for OMP statements
@@ -597,17 +597,17 @@  extract_omp_for_data (gomp_for *for_stmt
 	}
 
       t = gimple_omp_for_incr (for_stmt, i);
-      gcc_assert (TREE_OPERAND (t, 0) == var);
-      switch (TREE_CODE (t))
+      enum tree_code incr_code = gimple_omp_for_incr_code (for_stmt, i);
+      switch (incr_code)
 	{
 	case PLUS_EXPR:
-	  loop->step = TREE_OPERAND (t, 1);
+	  loop->step = t;
 	  break;
 	case POINTER_PLUS_EXPR:
-	  loop->step = fold_convert (ssizetype, TREE_OPERAND (t, 1));
+	  loop->step = fold_convert (ssizetype, t);
 	  break;
 	case MINUS_EXPR:
-	  loop->step = TREE_OPERAND (t, 1);
+	  loop->step = t;
 	  loop->step = fold_build1_loc (loc,
 				    NEGATE_EXPR, TREE_TYPE (loop->step),
 				    loop->step);
@@ -9721,12 +9721,21 @@  loop_get_oacc_kernels_region_entry (stru
     }
 }
 
+static bool
+was_offloaded_p (tree fn)
+{
+#ifdef ACCEL_COMPILER
+  return true;
+#endif
+  struct cgraph_node *node = cgraph_node::get (fn);
+  return node->offloadable;
+}
+
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
 static void
 expand_omp_target (struct omp_region *region)
 {
-  basic_block entry_bb, exit_bb, new_bb;
   struct function *child_cfun;
   tree child_fn, block, t;
   gimple_stmt_iterator gsi;
@@ -9736,12 +9745,33 @@  expand_omp_target (struct omp_region *re
   bool offloaded, data_region;
   bool do_emit_library_call = true;
   bool do_splitoff = true;
+  bool already_offloaded = was_offloaded_p (current_function_decl);
 
   entry_stmt = as_a <gomp_target *> (last_stmt (region->entry));
+  location_t entry_loc = gimple_location (entry_stmt);
 
-  new_bb = region->entry;
+  basic_block new_bb = region->entry;
+  basic_block entry_bb = region->entry;
+  basic_block exit_bb = region->exit;
+  basic_block entry_succ_bb = single_succ (entry_bb);
 
-  offloaded = is_gimple_omp_offloaded (entry_stmt);
+  if (already_offloaded)
+    {
+      gsi = gsi_for_stmt (entry_stmt);
+      gsi_remove (&gsi, true);
+
+      gsi = gsi_last_bb (exit_bb);
+      gcc_assert (!gsi_end_p (gsi)
+		  && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN);
+      gsi_remove (&gsi, true);
+
+      gsi = gsi_last_bb (entry_succ_bb);
+      if (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ENTRY_END)
+	gsi_remove (&gsi, true);
+      return;
+    }
+
+  offloaded = !already_offloaded && is_gimple_omp_offloaded (entry_stmt);
   switch (gimple_omp_target_kind (entry_stmt))
     {
     case GF_OMP_TARGET_KIND_REGION:
@@ -9773,9 +9803,6 @@  expand_omp_target (struct omp_region *re
   if (child_cfun != NULL)
     gcc_checking_assert (!child_cfun->cfg);
 
-  entry_bb = region->entry;
-  exit_bb = region->exit;
-
   if (gimple_omp_target_kind (entry_stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS)
     {
       if (!gimple_in_ssa_p (cfun))
@@ -9814,13 +9841,7 @@  expand_omp_target (struct omp_region *re
 	}
     }
 
-  basic_block entry_succ_bb = single_succ (entry_bb);
-  if (offloaded && !gimple_in_ssa_p (cfun))
-    {
-      gsi = gsi_last_bb (entry_succ_bb);
-      if (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ENTRY_END)
-	gsi_remove (&gsi, true);
-    }
+  tree data_arg = gimple_omp_target_data_arg (entry_stmt);
 
   if (offloaded
       && do_splitoff)
@@ -9840,7 +9861,6 @@  expand_omp_target (struct omp_region *re
 	 a function call that has been inlined, the original PARM_DECL
 	 .OMP_DATA_I may have been converted into a different local
 	 variable.  In which case, we need to keep the assignment.  */
-      tree data_arg = gimple_omp_target_data_arg (entry_stmt);
       if (data_arg)
 	{
 	  gimple_stmt_iterator gsi;
@@ -9923,8 +9943,12 @@  expand_omp_target (struct omp_region *re
       stmt = gsi_stmt (gsi);
       gcc_assert (stmt
 		  && gimple_code (stmt) == gimple_code (entry_stmt));
+      gsi_prev (&gsi);
+      stmt = gsi_stmt (gsi);
       e = split_block (entry_bb, stmt);
+#if 0
       gsi_remove (&gsi, true);
+#endif
       entry_bb = e->dest;
       single_succ_edge (entry_bb)->flags = EDGE_FALLTHRU;
 
@@ -9932,11 +9956,16 @@  expand_omp_target (struct omp_region *re
       if (exit_bb)
 	{
 	  gsi = gsi_last_bb (exit_bb);
+	  gimple ompret = gsi_stmt (gsi);
 	  gcc_assert (!gsi_end_p (gsi)
-		      && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN);
+		      && gimple_code (ompret) == GIMPLE_OMP_RETURN);
 	  stmt = gimple_build_return (NULL);
 	  gsi_insert_after (&gsi, stmt, GSI_SAME_STMT);
+#if 0
 	  gsi_remove (&gsi, true);
+#endif
+	  edge e1 = split_block (exit_bb, ompret);
+	  exit_bb = e1->dest;
 
 	  /* A vuse in single_succ (exit_bb) may use a vdef from the region
 	     which is about to be split off.  Mark the vdef for renaming.  */
@@ -9955,6 +9984,9 @@  expand_omp_target (struct omp_region *re
       else
 	block = gimple_block (entry_stmt);
 
+      /* Make sure we don't try to copy these.  */
+      gimple_omp_target_set_child_fn (entry_stmt, NULL);
+      gimple_omp_target_set_data_arg (entry_stmt, NULL);
       new_bb = move_sese_region_to_fn (child_cfun, entry_bb, exit_bb, block);
       if (exit_bb)
 	single_succ_edge (new_bb)->flags = EDGE_FALLTHRU;
@@ -9979,6 +10011,8 @@  expand_omp_target (struct omp_region *re
 
       /* Inform the callgraph about the new function.  */
       DECL_STRUCT_FUNCTION (child_fn)->curr_properties = cfun->curr_properties;
+      DECL_STRUCT_FUNCTION (child_fn)->curr_properties &= ~PROP_gimple_eomp;
+
       cgraph_node::add_new_function (child_fn, true);
       cgraph_node::get (child_fn)->parallelized_function = 1;
 
@@ -10088,7 +10122,7 @@  expand_omp_target (struct omp_region *re
       clause_loc = OMP_CLAUSE_LOCATION (c);
     }
   else
-    clause_loc = gimple_location (entry_stmt);
+    clause_loc = entry_loc;
 
   /* Ensure 'device' is of the correct type.  */
   device = fold_convert_loc (clause_loc, integer_type_node, device);
@@ -10147,7 +10181,7 @@  expand_omp_target (struct omp_region *re
 
   gsi = gsi_last_bb (new_bb);
   t = gimple_omp_target_data_arg (entry_stmt);
-  if (t == NULL)
+  if (data_arg == NULL)
     {
       t1 = size_zero_node;
       t2 = build_zero_cst (ptr_type_node);
@@ -10156,11 +10190,11 @@  expand_omp_target (struct omp_region *re
     }
   else
     {
-      t1 = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (TREE_VEC_ELT (t, 1))));
+      t1 = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (TREE_VEC_ELT (data_arg, 1))));
       t1 = size_binop (PLUS_EXPR, t1, size_int (1));
-      t2 = build_fold_addr_expr (TREE_VEC_ELT (t, 0));
-      t3 = build_fold_addr_expr (TREE_VEC_ELT (t, 1));
-      t4 = build_fold_addr_expr (TREE_VEC_ELT (t, 2));
+      t2 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 0));
+      t3 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 1));
+      t4 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 2));
     }
 
   gimple g;
@@ -10209,8 +10243,7 @@  expand_omp_target (struct omp_region *re
 
 	/* Default values for num_gangs, num_workers, and vector_length.  */
 	t_num_gangs = t_num_workers = t_vector_length
-	  = fold_convert_loc (gimple_location (entry_stmt),
-			      integer_type_node, integer_one_node);
+	  = fold_convert_loc (entry_loc, integer_type_node, integer_one_node);
 	/* ..., but if present, use the value specified by the respective
 	   clause, making sure that are of the correct type.  */
 	c = find_omp_clause (clauses, OMP_CLAUSE_NUM_GANGS);
@@ -10241,8 +10274,7 @@  expand_omp_target (struct omp_region *re
 	int t_wait_idx;
 
 	/* Default values for t_async.  */
-	t_async = fold_convert_loc (gimple_location (entry_stmt),
-				    integer_type_node,
+	t_async = fold_convert_loc (entry_loc, integer_type_node,
 				    build_int_cst (integer_type_node,
 						   GOMP_ASYNC_SYNC));
 	/* ..., but if present, use the value specified by the respective
@@ -10257,8 +10289,7 @@  expand_omp_target (struct omp_region *re
 	/* Save the index, and... */
 	t_wait_idx = args.length ();
 	/* ... push a default value.  */
-	args.quick_push (fold_convert_loc (gimple_location (entry_stmt),
-					   integer_type_node,
+	args.quick_push (fold_convert_loc (entry_loc, integer_type_node,
 					   integer_zero_node));
 	c = find_omp_clause (clauses, OMP_CLAUSE_WAIT);
 	if (c)
@@ -10279,8 +10310,7 @@  expand_omp_target (struct omp_region *re
 	    /* Now that we know the number, replace the default value.  */
 	    args.ordered_remove (t_wait_idx);
 	    args.quick_insert (t_wait_idx,
-			       fold_convert_loc (gimple_location (entry_stmt),
-						 integer_type_node,
+			       fold_convert_loc (entry_loc, integer_type_node,
 						 build_int_cst (integer_type_node, n)));
 	  }
       }
@@ -10290,7 +10320,7 @@  expand_omp_target (struct omp_region *re
     }
 
   g = gimple_build_call_vec (builtin_decl_explicit (start_ix), args);
-  gimple_set_location (g, gimple_location (entry_stmt));
+  gimple_set_location (g, entry_loc);
   gsi_insert_before (&gsi, g, GSI_SAME_STMT);
   if (!offloaded)
     {
@@ -10310,6 +10340,23 @@  expand_omp_target (struct omp_region *re
     update_ssa (TODO_update_ssa_only_virtuals);
 }
 
+static bool
+expand_region_inner_p (omp_region *region)
+{
+  if (!region->inner)
+    return false;
+
+  if (region->type != GIMPLE_OMP_TARGET)
+    return true;
+  if (was_offloaded_p (current_function_decl))
+    return true;
+
+  gomp_target *entry_stmt = as_a <gomp_target *> (last_stmt (region->entry));
+  bool offloaded = is_gimple_omp_offloaded (entry_stmt);
+
+  return !offloaded || !is_gimple_omp_oacc (entry_stmt);
+}
+
 /* Expand the parallel region tree rooted at REGION.  Expansion
    proceeds in depth-first order.  Innermost regions are expanded
    first.  This way, parallel regions that require a new function to
@@ -10340,8 +10387,7 @@  expand_omp (struct omp_region *region)
       if (region->type == GIMPLE_OMP_FOR
 	  && gimple_omp_for_combined_p (last_stmt (region->entry)))
 	inner_stmt = last_stmt (region->inner->entry);
-     
-      if (region->inner)
+      if (expand_region_inner_p (region))
 	expand_omp (region->inner);
 
       saved_location = input_location;
@@ -10439,7 +10485,9 @@  find_omp_target_region_data (struct omp_
     region->gwv_this |= MASK_WORKER;
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= MASK_VECTOR;
-  region->broadcast_array = gimple_omp_target_broadcast_array (stmt);
+  basic_block entry_succ = single_succ (region->entry);
+  gimple ee_stmt = last_stmt (entry_succ);
+  region->broadcast_array = gimple_op (ee_stmt, 0);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10666,6 +10714,7 @@  generate_vector_broadcast (tree dest_var
 	conv1 = gimple_build_assign (casted_var, NOP_EXPR, var);
 
       gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING);
+      retval = conv1;
     }
 
   tree decl = builtin_decl_explicit (fn);
@@ -10709,19 +10758,21 @@  generate_oacc_broadcast (omp_region *reg
   omp_region *parent = enclosing_target_region (region);
 
   tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE);
-  tree ptr = create_tmp_var (build_pointer_type (elttype));
-  gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR,
+  tree ptrtype = build_pointer_type (elttype);
+  tree ptr1 = make_ssa_name (ptrtype);
+  tree ptr2 = make_ssa_name (ptrtype);
+  gassign *cast1 = gimple_build_assign (ptr1, NOP_EXPR,
 				       parent->broadcast_array);
   gsi_insert_after (&where, cast1, GSI_NEW_STMT);
-  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var);
+  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr1), var);
   gsi_insert_after (&where, st, GSI_NEW_STMT);
 
   gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
 
-  gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR,
+  gassign *cast2 = gimple_build_assign (ptr2, NOP_EXPR,
 					parent->broadcast_array);
   gsi_insert_after (&where, cast2, GSI_NEW_STMT);
-  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr));
+  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr2));
   gsi_insert_after (&where, ld, GSI_NEW_STMT);
 
   gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
@@ -10735,7 +10786,8 @@  generate_oacc_broadcast (omp_region *reg
    the bits MASK_VECTOR and/or MASK_WORKER.  */
 
 static void
-make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
+make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask,
+		       bool set_dominator)
 {
   basic_block cond_bb = true_edge->src;
   
@@ -10747,7 +10799,7 @@  make_predication_test (edge true_edge, b
   if (mask & MASK_VECTOR)
     {
       gimple call = gimple_build_call (decl, 1, integer_zero_node);
-      vvar = create_tmp_var (unsigned_type_node);
+      vvar = make_ssa_name (unsigned_type_node);
       comp_var = vvar;
       gimple_call_set_lhs (call, vvar);
       gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
@@ -10755,14 +10807,14 @@  make_predication_test (edge true_edge, b
   if (mask & MASK_WORKER)
     {
       gimple call = gimple_build_call (decl, 1, integer_one_node);
-      wvar = create_tmp_var (unsigned_type_node);
+      wvar = make_ssa_name (unsigned_type_node);
       comp_var = wvar;
       gimple_call_set_lhs (call, wvar);
       gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
     }
   if (wvar && vvar)
     {
-      comp_var = create_tmp_var (unsigned_type_node);
+      comp_var = make_ssa_name (unsigned_type_node);
       gassign *ior = gimple_build_assign (comp_var, BIT_IOR_EXPR, wvar, vvar);
       gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT);
     }
@@ -10782,6 +10834,9 @@  make_predication_test (edge true_edge, b
   basic_block false_abnorm_bb = split_edge (e);
   edge abnorm_edge = single_succ_edge (false_abnorm_bb);
   abnorm_edge->flags |= EDGE_ABNORMAL;
+
+  if (set_dominator)
+    set_immediate_dominator (CDI_DOMINATORS, skip_dest_bb, cond_bb);
 }
 
 /* Apply OpenACC predication to basic block BB which is in
@@ -10791,6 +10846,8 @@  make_predication_test (edge true_edge, b
 static void
 predicate_bb (basic_block bb, struct omp_region *parent, int mask)
 {
+  bool set_dominator = true;
+
   /* We handle worker-single vector-partitioned loops by jumping
      around them if not in the controlling worker.  Don't insert
      unnecessary (and incorrect) predication.  */
@@ -10816,8 +10873,8 @@  predicate_bb (basic_block bb, struct omp
 
   if (gimple_code (stmt) == GIMPLE_COND)
     {
-      tree cond_var = create_tmp_var (boolean_type_node);
-      tree broadcast_cond = create_tmp_var (boolean_type_node);
+      tree cond_var = make_ssa_name (boolean_type_node);
+      tree broadcast_cond = make_ssa_name (boolean_type_node);
       gassign *asgn = gimple_build_assign (cond_var,
 					   gimple_cond_code (stmt),
 					   gimple_cond_lhs (stmt),
@@ -10830,30 +10887,36 @@  predicate_bb (basic_block bb, struct omp
 						   mask);
 
       edge e = split_block (bb, splitpoint);
+      set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src);
       e->flags = EDGE_ABNORMAL;
       skip_dest_bb = e->dest;
 
       gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
 				 broadcast_cond, boolean_true_node);
+      update_stmt (stmt);
     }
   else if (gimple_code (stmt) == GIMPLE_SWITCH)
     {
       gswitch *sstmt = as_a <gswitch *> (stmt);
       tree var = gimple_switch_index (sstmt);
-      tree new_var = create_tmp_var (TREE_TYPE (var));
+      tree new_var = make_ssa_name (TREE_TYPE (var));
 
+#if 0
       gassign *asgn = gimple_build_assign (new_var, var);
       gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
       gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
+#endif
+      gsi_prev (&gsi);
       gimple splitpoint = generate_oacc_broadcast (parent, new_var, var,
-						   gsi_asgn, mask);
+						   gsi, mask);
 
       edge e = split_block (bb, splitpoint);
+      set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src);
       e->flags = EDGE_ABNORMAL;
       skip_dest_bb = e->dest;
 
       gimple_switch_set_index (sstmt, new_var);
+      update_stmt (stmt);
     }
   else if (is_gimple_omp (stmt))
     {
@@ -10876,6 +10939,7 @@  predicate_bb (basic_block bb, struct omp
 	      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
 	      gsi_prev (&head_gsi);
 	      edge e0 = split_block (bb, gsi_stmt (head_gsi));
+	      set_immediate_dominator (CDI_DOMINATORS, e0->dest, e0->src);
 	      int mask2 = mask;
 	      if (code == GIMPLE_OMP_FOR)
 		mask2 &= ~MASK_VECTOR;
@@ -10885,7 +10949,7 @@  predicate_bb (basic_block bb, struct omp
 		     so we just need to make one branch around the
 		     entire loop.  */
 		  inner->entry = e0->dest;
-		  make_predication_test (e0, skip_dest_bb, mask2);
+		  make_predication_test (e0, skip_dest_bb, mask2, true);
 		  return;
 		}
 	      basic_block for_block = e0->dest;
@@ -10896,9 +10960,9 @@  predicate_bb (basic_block bb, struct omp
 	      edge e2 = split_block (for_block, split_stmt);
 	      basic_block bb2 = e2->dest;
 
-	      make_predication_test (e0, bb2, mask);
+	      make_predication_test (e0, bb2, mask, true);
 	      make_predication_test (single_pred_edge (bb3), skip_dest_bb,
-				     mask2);
+				     mask2, true);
 	      inner->entry = bb3;
 	      return;
 	    }
@@ -10917,6 +10981,7 @@  predicate_bb (basic_block bb, struct omp
 	  if (!split_stmt)
 	    return;
 	  edge e = split_block (bb, split_stmt);
+	  set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src);
 	  skip_dest_bb = e->dest;
 	  if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE)
 	    {
@@ -10945,6 +11010,8 @@  predicate_bb (basic_block bb, struct omp
 	gsi_prev (&gsi);
       if (gsi_stmt (gsi) == 0)
 	return;
+      if (get_immediate_dominator (CDI_DOMINATORS, skip_dest_bb) != bb)
+	set_dominator = false;
     }
 
   if (skip_dest_bb != NULL)
@@ -10952,24 +11019,31 @@  predicate_bb (basic_block bb, struct omp
       gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
       gsi_prev (&head_gsi);
       edge e2 = split_block (bb, gsi_stmt (head_gsi));
-      make_predication_test (e2, skip_dest_bb, mask);
+      set_immediate_dominator (CDI_DOMINATORS, e2->dest, e2->src);
+      make_predication_test (e2, skip_dest_bb, mask, set_dominator);
     }
 }
 
 /* Walk the dominator tree starting at BB to collect basic blocks in
    WORKLIST which need OpenACC vector predication applied to them.  */
 
-static void
+static bool
 find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist)
 {
+  bool ret = false;
   struct omp_region *parent = *bb_region_map->get (bb);
   if (required_predication_mask (parent) != 0)
-    worklist.safe_push (bb);
+    {
+      worklist.safe_push (bb);
+      ret = true;
+    }
+
   basic_block son;
   for (son = first_dom_son (CDI_DOMINATORS, bb);
        son;
        son = next_dom_son (CDI_DOMINATORS, son))
-    find_predicatable_bbs (son, worklist);
+    ret |= find_predicatable_bbs (son, worklist);
+  return ret;
 }
 
 /* Apply OpenACC vector predication to all basic blocks.  HEAD_BB is the
@@ -10979,7 +11053,9 @@  static void
 predicate_omp_regions (basic_block head_bb)
 {
   vec<basic_block> worklist = vNULL;
-  find_predicatable_bbs (head_bb, worklist);
+  if (!find_predicatable_bbs (head_bb, worklist))
+    return;
+
   int i;
   basic_block bb;
   FOR_EACH_VEC_ELT (worklist, i, bb)
@@ -10988,6 +11064,11 @@  predicate_omp_regions (basic_block head_
       int mask = required_predication_mask (region);
       predicate_bb (bb, region, mask);
     }
+  free_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_DOMINATORS);
+  mark_virtual_operands_for_renaming (cfun);
+  update_ssa (TODO_update_ssa);
+  verify_ssa (true, true);
 }
 
 /* USE and GET sets for variable broadcasting.  */
@@ -11176,7 +11257,8 @@  oacc_broadcast (basic_block entry_bb, ba
 
   /* Currently, subroutines aren't supported.  */
   gcc_assert (!lookup_attribute ("oacc function",
-				 DECL_ATTRIBUTES (current_function_decl)));
+				 DECL_ATTRIBUTES (current_function_decl))
+	      || was_offloaded_p (current_function_decl));
 
   /* Populate live_in.  */
   oacc_populate_live_in (entry_bb, region);
@@ -11236,7 +11318,7 @@  oacc_broadcast (basic_block entry_bb, ba
 	  gsi_prev (&gsi);
 	  edge e2 = split_block (entry_bb, gsi_stmt (gsi));
 	  e2->flags |= EDGE_ABNORMAL;
-	  make_predication_test (e2, dest_bb, mask);
+	  make_predication_test (e2, dest_bb, mask, true);
 
 	  /* Update entry_bb.  */
 	  entry_bb = dest_bb;
@@ -11249,7 +11331,7 @@  oacc_broadcast (basic_block entry_bb, ba
 /* Main entry point for expanding OMP-GIMPLE into runtime calls.  */
 
 static unsigned int
-execute_expand_omp (void)
+execute_expand_omp (bool first)
 {
   bb_region_map = new hash_map<basic_block, omp_region *>;
 
@@ -11264,7 +11346,8 @@  execute_expand_omp (void)
 	  fprintf (dump_file, "\n");
 	}
 
-      predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+      if (!first)
+	predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
 
       remove_exit_barriers (root_omp_region);
 
@@ -11317,9 +11400,10 @@  public:
       if (!gate)
 	return 0;
 
-      return execute_expand_omp ();
+      return execute_expand_omp (true);
     }
 
+  opt_pass * clone () { return new pass_expand_omp (m_ctxt); }
 }; // class pass_expand_omp
 
 } // anon namespace
@@ -11400,9 +11484,9 @@  public:
     }
   virtual unsigned int execute (function *)
     {
-      unsigned res = execute_expand_omp ();
+      unsigned res = execute_expand_omp (false);
       release_dangling_ssa_names ();
-      return res;
+      return res | TODO_update_ssa;
     }
   opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
@@ -12562,7 +12646,7 @@  lower_omp_for (gimple_stmt_iterator *gsi
       if (!is_gimple_min_invariant (*rhs_p))
 	*rhs_p = get_formal_tmp_var (*rhs_p, &body);
 
-      rhs_p = &TREE_OPERAND (gimple_omp_for_incr (stmt, i), 1);
+      rhs_p = gimple_omp_for_incr_ptr (stmt, i);
       if (!is_gimple_min_invariant (*rhs_p))
 	*rhs_p = get_formal_tmp_var (*rhs_p, &body);
     }
@@ -13547,7 +13631,7 @@  lower_omp_target (gimple_stmt_iterator *
 
   if (offloaded)
     {
-      gimple_seq_add_stmt (&new_body, gimple_build_omp_entry_end ());
+      gimple_seq_add_stmt (&new_body, gimple_build_omp_entry_end (ctx->worker_sync_elt));
       if (has_reduction)
 	{
 	  gimple_seq_add_seq (&irlist, tgt_body);
@@ -13583,7 +13667,6 @@  lower_omp_target (gimple_stmt_iterator *
   gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT);
 
   gimple_omp_target_set_ganglocal_size (stmt, sz);
-  gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt);
   pop_gimplify_context (NULL);
 }
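
(The worker-single handling above, in generate_oacc_broadcast, reduces to a
store/barrier/load/barrier handshake through the region's broadcast_array
slot.  Below is a rough standalone model of that handshake, written with host
OpenMP rather than the PTX builtins the patch actually emits; the slot name
and thread count are made up.)

/* Model only: the controlling thread publishes a value through a shared
   slot (standing in for ctx->worker_sync_elt), a barrier makes the store
   visible, every thread reloads the value, and a second barrier keeps the
   slot from being reused before all readers are done.  */
#include <stdio.h>
#include <omp.h>

int
main (void)
{
  volatile int broadcast_slot = 0;

#pragma omp parallel num_threads (4)
  {
    int v;
    if (omp_get_thread_num () == 0)
      broadcast_slot = 42;	/* store by the single active thread */
#pragma omp barrier		/* first threadbarrier */
    v = broadcast_slot;		/* reload in every thread */
#pragma omp barrier		/* second threadbarrier */
    printf ("thread %d sees %d\n", omp_get_thread_num (), v);
  }
  return 0;
}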
 
Index: gcc/pass_manager.h
===================================================================
--- gcc/pass_manager.h	(revision 224547)
+++ gcc/pass_manager.h	(working copy)
@@ -28,6 +28,7 @@  struct register_pass_info;
 #define GCC_PASS_LISTS \
   DEF_PASS_LIST (all_lowering_passes) \
   DEF_PASS_LIST (all_small_ipa_passes) \
+  DEF_PASS_LIST (all_local_opt_passes) \
   DEF_PASS_LIST (all_regular_ipa_passes) \
   DEF_PASS_LIST (all_late_ipa_passes) \
   DEF_PASS_LIST (all_passes)
@@ -82,6 +83,7 @@  public:
   /* The root of the compilation pass tree, once constructed.  */
   opt_pass *all_passes;
   opt_pass *all_small_ipa_passes;
+  opt_pass *all_local_opt_passes;
   opt_pass *all_lowering_passes;
   opt_pass *all_regular_ipa_passes;
   opt_pass *all_late_ipa_passes;
Index: gcc/passes.c
===================================================================
--- gcc/passes.c	(revision 224547)
+++ gcc/passes.c	(working copy)
@@ -454,8 +454,12 @@  public:
   /* opt_pass methods: */
   virtual bool gate (function *)
     {
-      /* Don't bother doing anything if the program has errors.  */
-      return (!seen_error () && !in_lto_p);
+      if (seen_error ())
+	return false;
+#ifdef ACCEL_COMPILER
+      return true;
+#endif
+      return !in_lto_p;
     }
 
 }; // class pass_local_optimization_passes
@@ -952,6 +956,7 @@  pass_manager::dump_passes () const
 
   dump_pass_list (all_lowering_passes, 1);
   dump_pass_list (all_small_ipa_passes, 1);
+  dump_pass_list (all_local_opt_passes, 1);
   dump_pass_list (all_regular_ipa_passes, 1);
   dump_pass_list (all_late_ipa_passes, 1);
   dump_pass_list (all_passes, 1);
@@ -1463,6 +1468,8 @@  pass_manager::register_pass (struct regi
   if (!success || all_instances)
     success |= position_pass (pass_info, &all_small_ipa_passes);
   if (!success || all_instances)
+    success |= position_pass (pass_info, &all_local_opt_passes);
+  if (!success || all_instances)
     success |= position_pass (pass_info, &all_regular_ipa_passes);
   if (!success || all_instances)
     success |= position_pass (pass_info, &all_late_ipa_passes);
@@ -1515,9 +1522,10 @@  pass_manager::register_pass (struct regi
    If we are optimizing, compile is then invoked:
 
    compile ()
-       ipa_passes () 			-> all_small_ipa_passes
+       ipa_passes () 			-> all_small_ipa_passes,
+					   all_local_opt_passes
 					-> Analysis of all_regular_ipa_passes
-	* possible LTO streaming at copmilation time *
+	* possible LTO streaming at compilation time *
 					-> Execution of all_regular_ipa_passes
 	* possible LTO streaming at link time *
 					-> all_late_ipa_passes
@@ -1541,8 +1549,8 @@  pass_manager::operator delete (void *ptr
 }
 
 pass_manager::pass_manager (context *ctxt)
-: all_passes (NULL), all_small_ipa_passes (NULL), all_lowering_passes (NULL),
-  all_regular_ipa_passes (NULL),
+: all_passes (NULL), all_small_ipa_passes (NULL), all_local_opt_passes (NULL),
+  all_lowering_passes (NULL), all_regular_ipa_passes (NULL),
   all_late_ipa_passes (NULL), passes_by_id (NULL), passes_by_id_size (0),
   m_ctxt (ctxt)
 {
@@ -1592,6 +1600,7 @@  pass_manager::pass_manager (context *ctx
   /* Register the passes with the tree dump code.  */
   register_dump_files (all_lowering_passes);
   register_dump_files (all_small_ipa_passes);
+  register_dump_files (all_local_opt_passes);
   register_dump_files (all_regular_ipa_passes);
   register_dump_files (all_late_ipa_passes);
   register_dump_files (all_passes);
@@ -2463,24 +2472,15 @@  ipa_write_summaries_1 (lto_symtab_encode
   lto_delete_out_decl_state (state);
 }
 
-/* Write out summaries for all the nodes in the callgraph.  */
-
-void
-ipa_write_summaries (void)
+static lto_symtab_encoder_t
+build_symtab_encoder (void)
 {
-  lto_symtab_encoder_t encoder;
+  lto_symtab_encoder_t encoder = lto_symtab_encoder_new (false);
   int i, order_pos;
   varpool_node *vnode;
   struct cgraph_node *node;
   struct cgraph_node **order;
 
-  if ((!flag_generate_lto && !flag_generate_offload) || seen_error ())
-    return;
-
-  select_what_to_stream ();
-
-  encoder = lto_symtab_encoder_new (false);
-
   /* Create the callgraph set in the same order used in
      cgraph_expand_all_functions.  This mostly facilitates debugging,
      since it causes the gimple file to be processed in the same order
@@ -2515,10 +2515,50 @@  ipa_write_summaries (void)
   FOR_EACH_DEFINED_VARIABLE (vnode)
     if (vnode->need_lto_streaming)
       lto_set_symtab_encoder_in_partition (encoder, vnode);
+  free (order);
+  return encoder;
+}
 
+/* Write out summaries for all the nodes in the callgraph.  */
+
+void
+ipa_write_summaries (void)
+{
+  if ((!flag_generate_lto && !flag_generate_offload) || seen_error ())
+    return;
+
+  select_what_to_stream ();
+  lto_symtab_encoder_t encoder = build_symtab_encoder ();
   ipa_write_summaries_1 (compute_ltrans_boundary (encoder));
+}
 
-  free (order);
+void
+write_offload_lto (void)
+{
+  if (!flag_generate_offload || seen_error ())
+    return;
+
+  lto_stream_offload_p = true;
+
+  select_what_to_stream ();
+  lto_symtab_encoder_t encoder = build_symtab_encoder ();
+  encoder = compute_ltrans_boundary (encoder);
+
+  struct lto_out_decl_state *state = lto_new_out_decl_state ();
+  state->symtab_node_encoder = encoder;
+
+  lto_output_init_mode_table ();
+  lto_push_out_decl_state (state);
+
+  gcc_assert (!flag_wpa);
+
+  write_lto ();
+
+  gcc_assert (lto_get_out_decl_state () == state);
+  lto_pop_out_decl_state ();
+  lto_delete_out_decl_state (state);
+
+  lto_stream_offload_p = false;
 }
 
 /* Same as execute_pass_list but assume that subpasses of IPA passes
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	(revision 224547)
+++ gcc/passes.def	(working copy)
@@ -60,6 +60,10 @@  along with GCC; see the file COPYING3.
       NEXT_PASS (pass_early_warn_uninitialized);
       NEXT_PASS (pass_nothrow);
   POP_INSERT_PASSES ()
+  TERMINATE_PASS_LIST ()
+
+  /* Local optimization passes.  */
+  INSERT_PASSES_AFTER (all_local_opt_passes)
 
   NEXT_PASS (pass_chkp_instrumentation_passes);
   PUSH_INSERT_PASSES_WITHIN (pass_chkp_instrumentation_passes)
@@ -70,6 +74,7 @@  along with GCC; see the file COPYING3.
 
   NEXT_PASS (pass_local_optimization_passes);
   PUSH_INSERT_PASSES_WITHIN (pass_local_optimization_passes)
+      NEXT_PASS (pass_expand_omp_ssa);
       NEXT_PASS (pass_fixup_cfg);
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_inline_parameters);
Index: gcc/ssa-iterators.h
===================================================================
--- gcc/ssa-iterators.h	(revision 224547)
+++ gcc/ssa-iterators.h	(working copy)
@@ -609,17 +609,21 @@  op_iter_init (ssa_op_iter *ptr, gimple s
     {
       switch (gimple_code (stmt))
 	{
-	  case GIMPLE_ASSIGN:
-	  case GIMPLE_CALL:
-	    ptr->numops = 1;
-	    break;
-	  case GIMPLE_ASM:
-	    ptr->numops = gimple_asm_noutputs (as_a <gasm *> (stmt));
-	    break;
-	  default:
-	    ptr->numops = 0;
-	    flags &= ~(SSA_OP_DEF | SSA_OP_VDEF);
-	    break;
+	case GIMPLE_ASSIGN:
+	case GIMPLE_CALL:
+	case GIMPLE_OMP_CONTINUE:
+	  ptr->numops = 1;
+	  break;
+	case GIMPLE_ASM:
+	  ptr->numops = gimple_asm_noutputs (as_a <gasm *> (stmt));
+	  break;
+	case GIMPLE_OMP_FOR:
+	  ptr->numops = gimple_omp_for_collapse (stmt);
+	  break;
+	default:
+	  ptr->numops = 0;
+	  flags &= ~(SSA_OP_DEF | SSA_OP_VDEF);
+	  break;
 	}
     }
   ptr->uses = (flags & (SSA_OP_USE|SSA_OP_VUSE)) ? gimple_use_ops (stmt) : NULL;
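
(A standalone illustration of what the ssa-iterators.h change above and the
tree-ssa-operands.c change further down teach the operand scanner; again
plain C rather than GCC internals.)

/* For a GIMPLE_OMP_FOR collapsed to C dimensions the first C operands (the
   index variables) are definitions and the remaining 3*C operands (initial,
   final, step) are uses; for GIMPLE_OMP_CONTINUE operand 0 is the control
   definition and operand 1 the control use.  */
#include <stdio.h>

int
main (void)
{
  unsigned int collapse = 2;
  unsigned int nops = 4 * collapse;
  for (unsigned int op = 0; op < nops; op++)
    printf ("GIMPLE_OMP_FOR op %u: %s\n", op, op < collapse ? "def" : "use");
  printf ("GIMPLE_OMP_CONTINUE op 0: def\n");
  printf ("GIMPLE_OMP_CONTINUE op 1: use\n");
  return 0;
}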
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	(revision 224547)
+++ gcc/tree-cfg.c	(working copy)
@@ -6649,6 +6649,7 @@  move_stmt_r (gimple_stmt_iterator *gsi_p
 
     case GIMPLE_OMP_RETURN:
     case GIMPLE_OMP_CONTINUE:
+    case GIMPLE_OMP_ENTRY_END:
       break;
     default:
       if (is_gimple_omp (stmt))
@@ -6659,7 +6660,7 @@  move_stmt_r (gimple_stmt_iterator *gsi_p
 	     function.  */
 	  bool save_remap_decls_p = p->remap_decls_p;
 	  p->remap_decls_p = false;
-	  *handled_ops_p = true;
+	  //	  *handled_ops_p = true;
 
 	  walk_gimple_seq_mod (gimple_omp_body_ptr (stmt), move_stmt_r,
 			       move_stmt_op, wi);
Index: gcc/tree-into-ssa.c
===================================================================
--- gcc/tree-into-ssa.c	(revision 224547)
+++ gcc/tree-into-ssa.c	(working copy)
@@ -2442,6 +2442,7 @@  pass_build_ssa::execute (function *fun)
 	SET_SSA_NAME_VAR_OR_IDENTIFIER (name, DECL_NAME (decl));
     }
 
+  verify_ssa (false, true);
   return 0;
 }
 
Index: gcc/tree-nested.c
===================================================================
--- gcc/tree-nested.c	(revision 224547)
+++ gcc/tree-nested.c	(working copy)
@@ -673,14 +673,8 @@  walk_gimple_omp_for (gomp_for *for_stmt,
       wi.is_lhs = false;
       walk_tree (gimple_omp_for_final_ptr (for_stmt, i), callback_op,
 		 &wi, NULL);
-
-      t = gimple_omp_for_incr (for_stmt, i);
-      gcc_assert (BINARY_CLASS_P (t));
-      wi.val_only = false;
-      walk_tree (&TREE_OPERAND (t, 0), callback_op, &wi, NULL);
-      wi.val_only = true;
-      wi.is_lhs = false;
-      walk_tree (&TREE_OPERAND (t, 1), callback_op, &wi, NULL);
+      walk_tree (gimple_omp_for_incr_ptr (for_stmt, i), callback_op,
+		 &wi, NULL);
     }
 
   seq = gsi_seq (wi.gsi);
Index: gcc/tree-ssa-operands.c
===================================================================
--- gcc/tree-ssa-operands.c	(revision 224547)
+++ gcc/tree-ssa-operands.c	(working copy)
@@ -942,11 +942,18 @@  parse_ssa_operands (struct function *fn,
       append_vuse (gimple_vop (fn));
       goto do_default;
 
+    case GIMPLE_OMP_FOR:
+      start = gimple_omp_for_collapse (stmt);
+      for (i = 0; i < start; i++)
+	get_expr_operands (fn, stmt, gimple_op_ptr (stmt, i), opf_def);
+      goto do_default;
+
     case GIMPLE_CALL:
       /* Add call-clobbered operands, if needed.  */
       maybe_add_call_vops (fn, as_a <gcall *> (stmt));
       /* FALLTHRU */
 
+    case GIMPLE_OMP_CONTINUE:
     case GIMPLE_ASSIGN:
       get_expr_operands (fn, stmt, gimple_op_ptr (stmt, 0), opf_def);
       start = 1;