
[8/8] Do simple omp lowering for no address taken var

Message ID 5473171F.2040708@mentor.com
State New

Commit Message

Tom de Vries Nov. 24, 2014, 11:31 a.m. UTC
On 24-11-14 12:28, Tom de Vries wrote:
> On 17-11-14 11:13, Richard Biener wrote:
>> On Sat, 15 Nov 2014, Tom de Vries wrote:
>>
>>> >On 15-11-14 13:14, Tom de Vries wrote:
>>>> > >Hi,
>>>> > >
>>>> > >I'm submitting a patch series with initial support for the oacc kernels
>>>> > >directive.
>>>> > >
>>>> > >The patch series uses pass_parallelize_loops to implement parallelization of
>>>> > >loops in the oacc kernels region.
>>>> > >
>>>> > >The patch series consists of these 8 patches:
>>>> > >...
>>>> > >      1  Expand oacc kernels after pass_build_ealias
>>>> > >      2  Add pass_oacc_kernels
>>>> > >      3  Add pass_ch_oacc_kernels to pass_oacc_kernels
>>>> > >      4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
>>>> > >      5  Add pass_loop_im to pass_oacc_kernels
>>>> > >      6  Add pass_ccp to pass_oacc_kernels
>>>> > >      7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
>>>> > >      8  Do simple omp lowering for no address taken var
>>>> > >...
>>> >
>>> >This patch lowers integer variables that do not have their address taken
>>> >to local variables.  We use a copy at region entry and exit to copy the
>>> >value in and out.
>>> >
>>> >In the context of reduction handling in a kernels region, this allows the
>>> >parloops reduction analysis to recognize the reduction, even after oacc
>>> >lowering has been done in pass_lower_omp.
>>> >
>>> >In more detail, without this patch, the omp_data_i loads and stores are
>>> >generated in place (in this case, in the loop):
>>> >...
>>> >                 {
>>> >                   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
>>> >                   {
>>> >                     unsigned intD.9 iD.2146;
>>> >
>>> >                     iD.2146 = 0;
>>> >                     goto <D.2207>;
>>> >                     <D.2208>:
>>> >                     D.2216 = .omp_data_iD.2201->cD.2203;
>>> >                     c.9D.2176 = *D.2216;
>>> >                     D.2177 = (long unsigned intD.10) iD.2146;
>>> >                     D.2178 = D.2177 * 4;
>>> >                     D.2179 = c.9D.2176 + D.2178;
>>> >                     D.2180 = *D.2179;
>>> >                     D.2217 = .omp_data_iD.2201->sumD.2205;
>>> >                     D.2218 = *D.2217;
>>> >                     D.2217 = .omp_data_iD.2201->sumD.2205;
>>> >                     D.2219 = D.2180 + D.2218;
>>> >                     *D.2217 = D.2219;
>>> >                     iD.2146 = iD.2146 + 1;
>>> >                     <D.2207>:
>>> >                     if (iD.2146 <= 524287) goto <D.2208>; else goto <D.2209>;
>>> >                     <D.2209>:
>>> >                   }
>>> >...
>>> >
>>> >With this patch, the omp_data_i load and store for sum are generated at entry
>>> >and exit:
>>> >...
>>> >                 {
>>> >                   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
>>> >                   D.2216 = .omp_data_iD.2201->sumD.2205;
>>> >                   sumD.2206 = *D.2216;
>>> >                   {
>>> >                     unsigned intD.9 iD.2146;
>>> >
>>> >                     iD.2146 = 0;
>>> >                     goto <D.2207>;
>>> >                     <D.2208>:
>>> >                     D.2217 = .omp_data_iD.2201->cD.2203;
>>> >                     c.9D.2176 = *D.2217;
>>> >                     D.2177 = (long unsigned intD.10) iD.2146;
>>> >                     D.2178 = D.2177 * 4;
>>> >                     D.2179 = c.9D.2176 + D.2178;
>>> >                     D.2180 = *D.2179;
>>> >                     sumD.2206 = D.2180 + sumD.2206;
>>> >                     iD.2146 = iD.2146 + 1;
>>> >                     <D.2207>:
>>> >                     if (iD.2146 <= 524287) goto <D.2208>; else goto <D.2209>;
>>> >                     <D.2209>:
>>> >                   }
>>> >                   *D.2216 = sumD.2206;
>>> >                   #pragma omp return
>>> >                 }
>>> >...
>>> >
>>> >
>>> >So, without the patch the reduction operation looks like this:
>>> >...
>>> >     *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
>>> >...
>>> >
>>> >And with this patch the reduction operation is simply:
>>> >...
>>> >     sumD.2206 = sumD.2206 + x;
>>> >...
>>> >
>>> >OK for trunk?
>> I presume the reason you are trying to do that here is that otherwise
>> it happens too late?  What you do is what loop store motion would
>> do.
>
> Richard,
>
> Thanks for the hint. I've built a reduction example:
> ...
> void __attribute__((noinline))
> f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned int n)
> {
>    unsigned int i;
>    for (i = 0; i < n; ++i)
>      *sum += a[i];
> }
> ...
> and observed that store motion of the *sum store is done by pass_loop_im,
> provided the *sum load is taken out of the loop by pass_pre first.
>
> So alternatively, we could use pass_pre and pass_loop_im to achieve the same
> effect.
>
> When trying out adding pass_pre as a part of the pass group pass_oacc_kernels, I
> found that also pass_copyprop was required to get parloops to recognize the
> reduction.
>
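For reference, the scalar shape that pass_pre plus pass_loop_im's store motion produce from the example above can be sketched by hand in C (an illustration of the intended end state, not actual compiler output):

```c
#include <assert.h>

/* Hand-written sketch of f after PRE hoists the *sum load and loop
   store motion sinks the *sum store: the loop body reduces into a
   local scalar, which is the shape the parloops reduction analysis
   recognizes.  */
void __attribute__((noinline))
f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum,
   unsigned int n)
{
  unsigned int s = *sum;	/* load hoisted out of the loop */
  unsigned int i;

  for (i = 0; i < n; ++i)
    s += a[i];			/* plain scalar reduction */

  *sum = s;			/* store sunk below the loop */
}
```

This is the same shape the "with this patch" dump above reaches via the copy-in/copy-out lowering.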

Attached patch adds pass_copyprop to pass group pass_oacc_kernels.

Bootstrapped and reg-tested in the same way as before.

OK for trunk?

Thanks,
- Tom

Comments

Richard Biener Nov. 24, 2014, 12:12 p.m. UTC | #1
On Mon, 24 Nov 2014, Tom de Vries wrote:

> [...]
> Attached patch adds pass_copyprop to pass group pass_oacc_kernels.

Hum, you are gobbling up very many passes here.  In this case copyprop
will also perform trivial constant propagation so maybe it's enough
to replace ccp by copyprop.  Or go the full way and add a FRE pass.

Richard.
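Richard's point that copy propagation subsumes the trivial constant propagation needed here can be illustrated with a toy sketch (hypothetical code, not GCC's tree-ssa-copy.c):

```c
#include <assert.h>

#define NVARS 8

/* Toy forward walk over straight-line code: each statement either
   assigns a constant or copies another variable.  Chasing each copy
   to its source is copy propagation, and a copy chain rooted at a
   constant falls out as trivial constant propagation for free.  */
enum kind { ASSIGN_CONST, COPY };
struct stmt
{
  enum kind kind;
  int dest;
  int src;			/* variable index, or the constant */
};

static int origin[NVARS];	/* oldest variable each var is a copy of */
static int value[NVARS];	/* known constant value, or -1 for unknown */

static void
propagate (const struct stmt *s, int n)
{
  int i, v;

  for (v = 0; v < NVARS; ++v)
    {
      origin[v] = v;
      value[v] = -1;
    }

  for (i = 0; i < n; ++i)
    if (s[i].kind == ASSIGN_CONST)
      {
	origin[s[i].dest] = s[i].dest;
	value[s[i].dest] = s[i].src;
      }
    else
      {
	origin[s[i].dest] = origin[s[i].src];	/* copy propagation */
	value[s[i].dest] = value[s[i].src];	/* const prop for free */
      }
}
```

Running this on `x0 = 5; x1 = x0; x2 = x1;` resolves x2 both to x0 and to the constant 5 in a single pass, which is why swapping pass_ccp for pass_copy_prop can be enough in a case like this one.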
Tom de Vries Nov. 24, 2014, 6:22 p.m. UTC | #2
On 24-11-14 13:12, Richard Biener wrote:
>> [...]
>> Attached patch adds pass_copyprop to pass group pass_oacc_kernels.
>
> Hum, you are gobbling up very many passes here.  In this case copyprop
> will also perform trivial constant propagation so maybe it's enough
> to replace ccp by copyprop.  Or go the full way and add a FRE pass.
>

Yep, replacing ccp by copyprop seems to work well enough.

I'll repost once bootstrap and reg-test are done.

Thanks,
- Tom

Patch

2014-11-23  Tom de Vries  <tom@codesourcery.com>

	* passes.def: Add pass_copy_prop to pass group pass_oacc_kernels.
	* tree-ssa-copy.c (stmt_may_generate_copy): Handle .omp_data_i init
	conservatively.
---
 gcc/passes.def      | 1 +
 gcc/tree-ssa-copy.c | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/gcc/passes.def b/gcc/passes.def
index 3a7b096..8c663b0 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -95,6 +95,7 @@  along with GCC; see the file COPYING3.  If not see
 	      NEXT_PASS (pass_tree_loop_init);
 	      NEXT_PASS (pass_lim);
 	      NEXT_PASS (pass_ccp);
+	      NEXT_PASS (pass_copy_prop);
 	      NEXT_PASS (pass_tree_loop_done);
 	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_expand_omp_ssa);
diff --git a/gcc/tree-ssa-copy.c b/gcc/tree-ssa-copy.c
index 7c22c5e..d6eb7a7 100644
--- a/gcc/tree-ssa-copy.c
+++ b/gcc/tree-ssa-copy.c
@@ -55,6 +55,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-scalar-evolution.h"
 #include "tree-ssa-dom.h"
 #include "tree-ssa-loop-niter.h"
+#include "omp-low.h"
 
 
 /* This file implements the copy propagation pass and provides a
@@ -110,6 +111,9 @@  stmt_may_generate_copy (gimple stmt)
   if (gimple_has_volatile_ops (stmt))
     return false;
 
+  if (gimple_stmt_omp_data_i_init_p (stmt))
+    return false;
+
   /* Statements with loads and/or stores will never generate a useful copy.  */
   if (gimple_vuse (stmt))
     return false;
-- 
1.9.1