[PATCH, 8/8] Do simple omp lowering for no address taken var

Mon Nov 24 12:42:00 GMT 2014

On Mon, 24 Nov 2014, Tom de Vries wrote:

> On 24-11-14 12:28, Tom de Vries wrote:
> > On 17-11-14 11:13, Richard Biener wrote:
> > > On Sat, 15 Nov 2014, Tom de Vries wrote:
> > > 
> > > > >On 15-11-14 13:14, Tom de Vries wrote:
> > > > > > >Hi,
> > > > > > >
> > > > > > >I'm submitting a patch series with initial support for the oacc
> > > > > kernels
> > > > > > >directive.
> > > > > > >
> > > > > > >The patch series uses pass_parallelize_loops to implement
> > > > > parallelization of
> > > > > > >loops in the oacc kernels region.
> > > > > > >
> > > > > > >The patch series consists of these 8 patches:
> > > > > > >...
> > > > > > >      1  Expand oacc kernels after pass_build_ealias
> > > > > > >      2  Add pass_oacc_kernels
> > > > > > >      3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> > > > > > >      4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> > > > > > >      5  Add pass_loop_im to pass_oacc_kernels
> > > > > > >      6  Add pass_ccp to pass_oacc_kernels
> > > > > > >      7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> > > > > > >      8  Do simple omp lowering for no address taken var
> > > > > > >...
> > > > >
> > > > >This patch lowers integer variables that do not have their address
> > > > taken as
> > > > >local variable.  We use a copy at region entry and exit to copy the
> > > > value in
> > > > >and out.
> > > > >
> > > > >In the context of reduction handling in a kernels region, this allows
> > > > the
> > > > >parloops reduction analysis to recognize the reduction, even after oacc
> > > > >lowering has been done in pass_lower_omp.
> > > > >
> > > > >In more detail, without this patch, the omp_data_i load and stores are
> > > > >generated in place (in this case, in the loop):
> > > > >...
> > > > >                 {
> > > > >                   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
> > > > >                   {
> > > > >                     unsigned intD.9 iD.2146;
> > > > >
> > > > >                     iD.2146 = 0;
> > > > >                     goto <D.2207>;
> > > > >                     <D.2208>:
> > > > >                     D.2216 = .omp_data_iD.2201->cD.2203;
> > > > >                     c.9D.2176 = *D.2216;
> > > > >                     D.2177 = (long unsigned intD.10) iD.2146;
> > > > >                     D.2178 = D.2177 * 4;
> > > > >                     D.2179 = c.9D.2176 + D.2178;
> > > > >                     D.2180 = *D.2179;
> > > > >                     D.2217 = .omp_data_iD.2201->sumD.2205;
> > > > >                     D.2218 = *D.2217;
> > > > >                     D.2217 = .omp_data_iD.2201->sumD.2205;
> > > > >                     D.2219 = D.2180 + D.2218;
> > > > >                     *D.2217 = D.2219;
> > > > >                     iD.2146 = iD.2146 + 1;
> > > > >                     <D.2207>:
> > > > >                     if (iD.2146 <= 524287) goto <D.2208>; else goto
> > > > <D.2209>;
> > > > >                     <D.2209>:
> > > > >                   }
> > > > >...
> > > > >
> > > > >With this patch, the omp_data_i load and stores for sum are generated
> > > > at entry
> > > > >and exit:
> > > > >...
> > > > >                 {
> > > > >                   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
> > > > >                   D.2216 = .omp_data_iD.2201->sumD.2205;
> > > > >                   sumD.2206 = *D.2216;
> > > > >                   {
> > > > >                     unsigned intD.9 iD.2146;
> > > > >
> > > > >                     iD.2146 = 0;
> > > > >                     goto <D.2207>;
> > > > >                     <D.2208>:
> > > > >                     D.2217 = .omp_data_iD.2201->cD.2203;
> > > > >                     c.9D.2176 = *D.2217;
> > > > >                     D.2177 = (long unsigned intD.10) iD.2146;
> > > > >                     D.2178 = D.2177 * 4;
> > > > >                     D.2179 = c.9D.2176 + D.2178;
> > > > >                     D.2180 = *D.2179;
> > > > >                     sumD.2206 = D.2180 + sumD.2206;
> > > > >                     iD.2146 = iD.2146 + 1;
> > > > >                     <D.2207>:
> > > > >                     if (iD.2146 <= 524287) goto <D.2208>; else goto
> > > > <D.2209>;
> > > > >                     <D.2209>:
> > > > >                   }
> > > > >                   *D.2216 = sumD.2206;
> > > > >                   #pragma omp return
> > > > >                 }
> > > > >...
> > > > >
> > > > >
> > > > >So, without the patch the reduction operation looks like this:
> > > > >...
> > > > >     *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205)
> > > > + x
> > > > >...
> > > > >
> > > > >And with this patch the reduction operation is simply:
> > > > >...
> > > > >     sumD.2206 = sumD.2206 + x:
> > > > >...
> > > > >
> > > > >OK for trunk?
> > > I presume the reason you are trying to do that here is that otherwise
> > > it happens too late?  What you do is what loop store motion would
> > > do.
> > 
> > Richard,
> > 
> > Thanks for the hint. I've built a reduction example:
> > ...
> > void __attribute__((noinline))
> > f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned
> > int n)
> > {
> >    unsigned int i;
> >    for (i = 0; i < n; ++i)
> >      *sum += a[i];
> > }...
> > and observed that store motion of the *sum store is done by pass_loop_im,
> > provided the *sum load is taken out of the the loop by pass_pre first.
> > 
> > So alternatively, we could use pass_pre and pass_loop_im to achieve the same
> > effect.
> > 
> > When trying out adding pass_pre as a part of the pass group
> > pass_oacc_kernels, I
> > found that also pass_copyprop was required to get parloops to recognize the
> > reduction.
> > 
> 
> Attached patch adds pass_copyprop to pass group pass_oacc_kernels.

Hum, you are gobbling up very many passes here.  In this case copyprop
will also perform trivial constant propagation so maybe it's enough
to replace ccp by copyprop.  Or go the full way and add a FRE pass.

Richard.