[Bug tree-optimization/88760] GCC unrolling is suboptimal

rguenther at suse dot de gcc-bugzilla@gcc.gnu.org
Thu Jan 24 12:29:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #18 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 24 Jan 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #17 from ktkachov at gcc dot gnu.org ---
> I played around with the source to do some conservative 2x manual unrolling in
> the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA
> loops). This was to try out Richard's suggestion in #c10 about
> unrolling for forming load-pairs, and also to break the accumulator dependency.
> 
> So the initial testcase now became:
> unsigned int *colnums;
> double *val;
> 
> struct foostruct
> {
>   unsigned int rows;
>   unsigned int *colnums;
>   unsigned int *rowstart;
> };
> 
> struct foostruct *cols;
> 
> void
> foo (double * __restrict__ dst, const double *__restrict__ src)
> {
>   const unsigned int n_rows = cols->rows;
>   const double *val_ptr = &val[cols->rowstart[0]];
>   const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];  
> 
>   double *dst_ptr = dst;
>   for (unsigned int row=0; row<n_rows; ++row)
>     {
>       double s = 0.;
>       const double *const val_end_of_row = &val[cols->rowstart[row+1]];
>       __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;
> 
>       if (diff & 1) // Peel the odd iteration.
>         s += *val_ptr++ * src[*colnum_ptr++];
> 
>       double s1 = 0.; // Second accumulator
>       while (val_ptr != val_end_of_row)
>         {
>           s += val_ptr[0] * src[colnum_ptr[0]];
>           s1 += val_ptr[1] * src[colnum_ptr[1]];
>           val_ptr += 2;
>           colnum_ptr += 2;
>         }
>       *dst_ptr++ = s + s1;
>     }
> }
> 
> This transformed the initial loop from:
> .L4:
>         ldr     w3, [x7, x2, lsl 2]
>         cmp     x6, x2
>         ldr     d2, [x5, x2, lsl 3]
>         add     x2, x2, 1
>         ldr     d1, [x1, x3, lsl 3]
>         fmadd   d0, d2, d1, d0
>         bne     .L4
> 
> into:
> .L5:
>         ldp     w6, w5, [x3] // LDP
>         add     x3, x3, 8
>         ldp     d5, d3, [x2]  // LDP
>         add     x2, x2, 16
>         ldr     d4, [x1, x6, lsl 3]
>         cmp     x4, x2
>         ldr     d2, [x1, x5, lsl 3]
>         fmadd   d0, d5, d4, d0
>         fmadd   d1, d3, d2, d1
>         bne     .L5
> 
> In parest itself a few of the loops transformed this way did not form LDPs due
> to unrelated LDP-forming inefficiencies, but most did.
> This transformation gave a 3% improvement on a Cortex-A72. There are more
> similar loops in the 3rd, 4th and 5th hottest functions in that benchmark, so I
> suspect if we do something intelligent there as well we'll get even more
> sizeable gains.
> 
> So rather than solving general "unrolling", how about we break this down into
> two desirable transformations:
> 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> suggestion)

If that helps, sure (I'd have guessed uarchs split load-multiple
instructions into separate load micro-ops anyway, but possibly it
still avoids load-port contention?)

> 2) Unrolling and breaking accumulator dependencies.

IIRC the RTL unroller can already do this (as a side effect, not as
the main cost motivation), guarded by an extra switch.
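
The extra switch is presumably -fvariable-expansion-in-unroller (an
assumption on my part; the text above only says "an extra switch").
A minimal reduction kernel to experiment with, not taken from the PR:

double
reduce (const double *a, const double *b, unsigned int n)
{
  double s = 0.;
  for (unsigned int i = 0; i < n; ++i)
    s += a[i] * b[i];  /* one serial FMA dependency chain */
  return s;
}

Compiling with
  gcc -O3 -funroll-loops -fvariable-expansion-in-unroller
should make the unroller copy 's' into independent partial sums that
are reduced again after the loop, breaking the accumulator dependency.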

> I think more general unrolling and the peeling associated with it can be
> discussed independently of 1) and 2) once we collect more data on more
> microarchitectures.

I think both of these can be "implemented" on the RTL unroller
side.
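
Concretely, doing 1) and 2) there would have to produce the
equivalent of the following source-level sketch (a hand restatement
of the manual transform from comment #17, applied to a plain dot
product; names are illustrative):

double
dot (const double *a, const double *b, unsigned int n)
{
  double s0 = 0., s1 = 0.;  /* two independent FMA chains, 2) */
  unsigned int i = 0;
  if (n & 1)                /* peel the odd iteration */
    {
      s0 += a[0] * b[0];
      i = 1;
    }
  for (; i < n; i += 2)
    {
      /* a[i]/a[i+1] and b[i]/b[i+1] are adjacent, so each pair
         of loads can be merged into a single ldp, 1).  */
      s0 += a[i] * b[i];
      s1 += a[i + 1] * b[i + 1];
    }
  return s0 + s1;
}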

