[Bug tree-optimization/88760] GCC unrolling is suboptimal
rguenther at suse dot de
gcc-bugzilla@gcc.gnu.org
Thu Jan 24 12:29:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #18 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 24 Jan 2019, ktkachov at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #17 from ktkachov at gcc dot gnu.org ---
> I played around with the source to do some conservative 2x manual unrolling in
> the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA
> loops). This was to try out Richard's suggestion in #c10 about
> unrolling for forming load-pairs, and also to break the accumulator dependency.
>
> So the initial testcase now became:
> unsigned int *colnums;
> double *val;
>
> struct foostruct
> {
> unsigned int rows;
> unsigned int *colnums;
> unsigned int *rowstart;
> };
>
> struct foostruct *cols;
>
> void
> foo (double * __restrict__ dst, const double *__restrict__ src)
> {
> const unsigned int n_rows = cols->rows;
> const double *val_ptr = &val[cols->rowstart[0]];
> const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];
>
> double *dst_ptr = dst;
> for (unsigned int row=0; row<n_rows; ++row)
> {
> double s = 0.;
> const double *const val_end_of_row = &val[cols->rowstart[row+1]];
> __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;
>
> if (diff & 1) // Peel the odd iteration.
> s += *val_ptr++ * src[*colnum_ptr++];
>
> double s1 = 0.; // Second accumulator
> while (val_ptr != val_end_of_row)
> {
> s += val_ptr[0] * src[colnum_ptr[0]];
> s1 += val_ptr[1] * src[colnum_ptr[1]];
> val_ptr += 2;
> colnum_ptr += 2;
> }
> *dst_ptr++ = s + s1;
> }
> }
>
> This transformed the initial loop from:
> .L4:
> ldr w3, [x7, x2, lsl 2]
> cmp x6, x2
> ldr d2, [x5, x2, lsl 3]
> add x2, x2, 1
> ldr d1, [x1, x3, lsl 3]
> fmadd d0, d2, d1, d0
> bne .L4
>
> into:
> .L5:
> ldp w6, w5, [x3] // LDP
> add x3, x3, 8
> ldp d5, d3, [x2] // LDP
> add x2, x2, 16
> ldr d4, [x1, x6, lsl 3]
> cmp x4, x2
> ldr d2, [x1, x5, lsl 3]
> fmadd d0, d5, d4, d0
> fmadd d1, d3, d2, d1
> bne .L5
>
> In parest itself a few of the loops transformed this way did not form LDPs due
> to unrelated LDP-forming inefficiencies, but most did.
> This transformation gave a 3% improvement on a Cortex-A72. There are more
> similar loops in the 3rd, 4th and 5th hottest functions in that benchmark, so I
> suspect if we do something intelligent there as well we'll get even more
> sizeable gains.
>
> So rather than solving general "unrolling", how about we break this down into
> two desirable transformations:
> 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> suggestion)
If that helps, sure (I'd have guessed uarchs are going to split
load-multiple into separate loads anyway, but perhaps it avoids
load-port contention?)
> 2) Unrolling and breaking accumulator dependencies.
IIRC the RTL unroller can already do this (as a side-effect, not as
the main cost motivation), guarded with an extra switch.
> I think more general unrolling and the peeling associated with it can be
> discussed independently of 1) and 2) once we collect more data on more
> microarchitectures.
I think both of these can be "implemented" on the RTL unroller
side.