One of the hot loops in 510.parest_r from SPEC2017 can be approximated through:

unsigned int *colnums;
double *val;

struct foostruct
{
  unsigned int rows;
  unsigned int *colnums;
  unsigned int *rowstart;
};

struct foostruct *cols;

void
foo (double *dst, const double *src)
{
  const unsigned int n_rows = cols->rows;
  const double *val_ptr = &val[cols->rowstart[0]];
  const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];

  double *dst_ptr = dst;
  for (unsigned int row=0; row<n_rows; ++row)
    {
      double s = 0.;
      const double *const val_end_of_row = &val[cols->rowstart[row+1]];
      while (val_ptr != val_end_of_row)
        s += *val_ptr++ * src[*colnum_ptr++];

      *dst_ptr++ = s;
    }
}

At -Ofast -mcpu=cortex-a57 on aarch64 GCC generates a tight FMA loop:

.L4:
        ldr     w3, [x7, x2, lsl 2]
        cmp     x6, x2
        ldr     d2, [x5, x2, lsl 3]
        add     x2, x2, 1
        ldr     d1, [x1, x3, lsl 3]
        fmadd   d0, d2, d1, d0
        bne     .L4

LLVM unrolls the loop more intelligently:

.LBB0_8:                                // %vector.body
                                        // Parent Loop BB0_2 Depth=1
                                        // =>  This Inner Loop Header: Depth=2
        ldp     w21, w22, [x20, #-8]
        ldr     d5, [x1, x21, lsl #3]
        ldp     d3, d4, [x7, #-16]
        ldr     d6, [x1, x22, lsl #3]
        ldp     w21, w22, [x20], #16
        fmadd   d2, d6, d4, d2
        fmadd   d1, d5, d3, d1
        ldr     d5, [x1, x21, lsl #3]
        ldr     d6, [x1, x22, lsl #3]
        add     x5, x5, #4              // =4
        adds    x19, x19, #2            // =2
        ldp     d3, d4, [x7], #32
        fmadd   d1, d5, d3, d1
        fmadd   d2, d6, d4, d2
        b.ne    .LBB0_8

With -funroll-loops GCC does do unrolling, but it does it differently:

<snip>
        ands    x12, x11, 7
        beq     .L70
        cmp     x12, 1
        beq     .L55
        cmp     x12, 2
        beq     .L57
        cmp     x12, 3
        beq     .L59
        cmp     x12, 4
        beq     .L61
        cmp     x12, 5
        beq     .L63
        cmp     x12, 6
        bne     .L72
.L65:
        ldr     w14, [x4, x2, lsl 2]
        ldr     d3, [x3, x2, lsl 3]
        add     x2, x2, 1
        ldr     d4, [x1, x14, lsl 3]
        fmadd   d0, d3, d4, d0
.L63:
        ldr     w5, [x4, x2, lsl 2]
        ldr     d5, [x3, x2, lsl 3]
        add     x2, x2, 1
        ldr     d6, [x1, x5, lsl 3]
        fmadd   d0, d5, d6, d0
.L61:
        ldr     w9, [x4, x2, lsl 2]
        ldr     d7, [x3, x2, lsl 3]
        add     x2, x2, 1
        ldr     d16, [x1, x9, lsl 3]
        fmadd   d0, d7, d16, d0
<snip>

On the whole of 510.parest_r this makes LLVM about 6% faster than GCC on Cortex-A57.

Perhaps this can be used as a motivating testcase to move the GCC unrolling discussions forward?
So LLVM unrolls 4 times while GCC (always) unrolls 8 times. The unrolled body from GCC (x86_64 this time) is

.L4:
        movl    (%rdx), %ecx
        vmovsd  (%rax), %xmm8
        addq    $32, %rdx
        addq    $64, %rax
        vmovsd  -56(%rax), %xmm9
        vmovsd  -48(%rax), %xmm10
        vfmadd231sd     (%rsi,%rcx,8), %xmm8, %xmm0
        movl    -28(%rdx), %ecx
        vmovsd  -40(%rax), %xmm11
        vmovsd  -32(%rax), %xmm12
        vfmadd231sd     (%rsi,%rcx,8), %xmm9, %xmm0
        movl    -24(%rdx), %ecx
        vmovsd  -24(%rax), %xmm13
        vmovsd  -16(%rax), %xmm14
        vfmadd231sd     (%rsi,%rcx,8), %xmm10, %xmm0
        movl    -20(%rdx), %ecx
        vmovsd  -8(%rax), %xmm15
        vfmadd231sd     (%rsi,%rcx,8), %xmm11, %xmm0
        movl    -16(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm12, %xmm0
        movl    -12(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm13, %xmm0
        movl    -8(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm14, %xmm0
        movl    -4(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm15, %xmm0
        cmpq    %rax, %r9
        jne     .L4

and what you quoted is the prologue. You didn't quote LLVM's prologue, but if I read my clang's output correctly it uses a loop there. (Is there something like -fdump-tree-optimized for clang?)

Our RTL unroller cannot do a loopy prologue; it always has this jump-into-peeled-copies scheme. Using --param max-unroll-times=4 produces

.L4:
        movl    (%rdx), %ecx
        vmovsd  (%rax), %xmm2
        addq    $16, %rdx
        addq    $32, %rax
        vmovsd  -24(%rax), %xmm3
        vmovsd  -16(%rax), %xmm4
        vfmadd231sd     (%rsi,%rcx,8), %xmm2, %xmm0
        movl    -12(%rdx), %ecx
        vmovsd  -8(%rax), %xmm5
        vfmadd231sd     (%rsi,%rcx,8), %xmm3, %xmm0
        movl    -8(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm4, %xmm0
        movl    -4(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm5, %xmm0
        cmpq    %rax, %r8
        jne     .L4

which is nearly equivalent to clang's variant?
Created attachment 45386 [details]
aarch64-llvm output with -Ofast -mcpu=cortex-a57

I'm attaching the full LLVM aarch64 output.

The output you quoted is with -funroll-loops. If that's not given, GCC doesn't seem to unroll by default at all (on aarch64 or x86_64 from my testing).

Is there anything we can do to make the default unrolling a bit more aggressive?
(In reply to ktkachov from comment #2)
> Is there anything we can do to make the default unrolling a bit more
> aggressive?

Well, the RTL loop unroller is not enabled by default at any optimization level (unless you are using FDO). There are also related flags that are not enabled (-fsplit-ivs-in-unroller and -fvariable-expansion-in-unroller).

The RTL loop unroller is simply not good at estimating the benefit of unrolling (which is also why you usually see it unrolling --param max-unroll-times times), and the tunables it has are not very well tuned across targets. Micha did quite extensive benchmarking (on x86_64) which shows that the cases where unrolling is profitable are rare and the reasons are often hard to understand. That's of course in the context of CPUs having caches of pre-decoded/fused/etc. instructions to optimize issue, which makes peeled prologues expensive, as well as even more special caches for small loops avoiding further frontend costs. Not sure if Arm archs have any of this.

I generally don't believe in unrolling as a separately profitable transform. Rather, unrolling could be done as part of another transform (vectorization is the best example). For something still done on RTL that would then include scheduling, which is where the best cost estimates should be available (and if you do this post-reload then you even have a very good idea of register pressure). This is also why I think a standalone unrolling phase belongs on RTL, since I don't see a good way of estimating cost/benefit on GIMPLE (see how difficult it is to cost vectorization vs. non-vectorization there).
(In reply to ktkachov from comment #2)
> Is there anything we can do to make the default unrolling a bit more
> aggressive?

I don't think the RTL unroller works at all. It doesn't have the right settings and doesn't understand how to unroll, so we always get inefficient and bloated code.

To do unrolling correctly it has to be integrated at the tree level - for example when vectorization isn't possible/beneficial, unrolling might still be a good idea.
(In reply to Wilco from comment #4)
> To do unrolling correctly it has to be integrated at the tree level - for
> example when vectorization isn't possible/beneficial, unrolling might still
> be a good idea.

To add some numbers to the conversation, the gain LLVM gets from default unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017.

This clearly shows there is huge potential from unrolling, *if* we can teach GCC to unroll properly like LLVM. That means early unrolling, using good default settings and using a trailing loop rather than inefficient peeling.
(In reply to Wilco from comment #5)
> This clearly shows there is huge potential from unrolling, *if* we can teach
> GCC to unroll properly like LLVM. That means early unrolling, using good
> default settings and using a trailing loop rather than inefficient peeling.

I don't see why this cannot be done on RTL, where we have vastly more information about whether there are execution resources that can be used by unrolling. Note we also want unrolling to interleave instructions so as not to rely on pre-reload scheduling, which in turn means having a good eye on register pressure (again something not very well handled on GIMPLE).
(In reply to rguenther@suse.de from comment #6)
> I don't see why this cannot be done on RTL, where we have vastly more
> information about whether there are execution resources that can be
> used by unrolling. Note we also want unrolling to interleave
> instructions so as not to rely on pre-reload scheduling, which in turn
> means having a good eye on register pressure (again something not very
> well handled on GIMPLE).

The main issue is that other loop optimizations are done on tree, so things like addressing modes, loop invariants and CSEs are run on the non-unrolled version. Then when we unroll in RTL we end up with very non-optimal code. A typical unrolled loop starts like this:

        add     x13, x2, 1
        add     x14, x2, 2
        add     x11, x2, 3
        add     x10, x2, 4
        ldr     w30, [x4, x13, lsl 2]
        add     x9, x2, 5
        add     x5, x2, 6
        add     x12, x2, 7
        ldr     d23, [x3, x2, lsl 3]
        ... rest of unrolled loop

So basically it decides to create a new induction variable for every unrolled copy in the loop. This often leads to spills just because it creates way too many redundant addressing instructions. It also blocks scheduling between iterations since the alias optimization doesn't appear to understand simple constant differences between indices. So unrolling should definitely be done at a high level just like vectorization.
btw, looks like ICC vectorises this as well as unrolling:

..B1.14:
        movl      (%rcx,%rbx,4), %r15d
        vmovsd    (%rdi,%r15,8), %xmm2
        movl      4(%rcx,%rbx,4), %r15d
        vmovhpd   (%rdi,%r15,8), %xmm2, %xmm3
        movl      8(%rcx,%rbx,4), %r15d
        vfmadd231pd (%r10,%rbx,8), %xmm3, %xmm0
        vmovsd    (%rdi,%r15,8), %xmm4
        movl      12(%rcx,%rbx,4), %r15d
        vmovhpd   (%rdi,%r15,8), %xmm4, %xmm5
        movl      16(%rcx,%rbx,4), %r15d
        vfmadd231pd 16(%r10,%rbx,8), %xmm5, %xmm1
        vmovsd    (%rdi,%r15,8), %xmm6
        movl      20(%rcx,%rbx,4), %r15d
        vmovhpd   (%rdi,%r15,8), %xmm6, %xmm7
        movl      24(%rcx,%rbx,4), %r15d
        vfmadd231pd 32(%r10,%rbx,8), %xmm7, %xmm0
        vmovsd    (%rdi,%r15,8), %xmm8
        movl      28(%rcx,%rbx,4), %r15d
        vmovhpd   (%rdi,%r15,8), %xmm8, %xmm9
        vfmadd231pd 48(%r10,%rbx,8), %xmm9, %xmm1
        addq      $8, %rbx
        cmpq      %r14, %rbx
        jb        ..B1.14

Is that something GCC could reasonably do?
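For reference, a rough C-with-intrinsics sketch of what ICC is doing here: each pair of indexed loads from src is packed into one V2DF register (vmovsd + vmovhpd) and fed to a packed FMA, alternating between two vector accumulators. This is only an illustration - the names are made up, it processes 4 elements per iteration rather than ICC's 8, omits the remainder loop, and relies on FP reassociation (fine under -Ofast as used in this PR) and on x86_64 FMA3:

#include <immintrin.h>

/* Sums val[i] * src[colnum[i]] for i in [0, n); n is assumed to be a
   multiple of 4 here.  Compile with -mfma.  */
static double
dot_gather_v2df (const double *val, const unsigned int *colnum,
                 const double *src, unsigned long n)
{
  __m128d acc0 = _mm_setzero_pd ();
  __m128d acc1 = _mm_setzero_pd ();
  for (unsigned long i = 0; i < n; i += 4)
    {
      /* Pack src[colnum[i]] and src[colnum[i+1]] into one V2DF.  */
      __m128d s0 = _mm_loadh_pd (_mm_load_sd (&src[colnum[i]]),
                                 &src[colnum[i + 1]]);
      __m128d s1 = _mm_loadh_pd (_mm_load_sd (&src[colnum[i + 2]]),
                                 &src[colnum[i + 3]]);
      /* Packed FMAs into two independent accumulators.  */
      acc0 = _mm_fmadd_pd (_mm_loadu_pd (&val[i]), s0, acc0);
      acc1 = _mm_fmadd_pd (_mm_loadu_pd (&val[i + 2]), s1, acc1);
    }
  /* Horizontal reduction of the two accumulators.  */
  acc0 = _mm_add_pd (acc0, acc1);
  return _mm_cvtsd_f64 (_mm_add_sd (acc0, _mm_unpackhi_pd (acc0, acc0)));
}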
(In reply to ktkachov from comment #8)
> btw, looks like ICC vectorises this as well as unrolling:
> [...]
> Is that something GCC could reasonably do?

GCC could choose a larger vectorization factor, yes. The longer epilogue could be vectorized with the same vector size again then.
FWIW, I agree that pure unrolling doesn't feel like a gimple-level optimisation. Whether it's a win or not depends on whether the unrolled loop will make better use of the microarchitecture. The problem isn't just that that's hard to decide at the gimple level, but that the result can't be represented directly in gimple. AIUI there's no real significance to the schedule of gimple statements (beyond ensuring valid SSA and functional correctness). This is different from vectorisation and ivopts, which can represent the benefit of the transformation directly in gimple (using vector ops and TARGET_MEM_REFs respectively).

As Kyrill pointed out off-list, LLVM does the unrolling in the vectoriser rather than a separate unrolling pass. (Use -mllvm -print-after-all to see this.)

I think for AArch64 we can view LDP and STP as 2-element vector loads and stores that have zero-cost insertion and extraction. So converting:

        ldr     x0, [...]
        add     x0, x0, 1
        str     x0, [...]

into:

        ldp     x0, x1, [...]
        add     x0, x0, 1
        add     x1, x1, 1
        stp     x0, x1, [...]

is IMO genuine vectorisation. The LDPs and STPs are effectively scalar IFN_LOAD_LANES and IFN_STORE_LANES, although we could also represent them as single-element (V1) vector ops instead if that seems more consistent. Vectorising operations other than loads and stores would simply involve duplicating the statements VF times.
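At the source level, a made-up example (not from parest) of the transformation being described might look like this; the comments show the AArch64 accesses the backend could then merge:

/* Original scalar loop: one load, one add, one store per iteration.  */
void
inc_all (unsigned long *x, unsigned long n)
{
  for (unsigned long i = 0; i < n; i++)
    x[i] += 1;                      /* ldr; add; str */
}

/* "Vectorised" at VF=2 in the sense described above: the two adjacent
   accesses can become a single LDP/STP, the add is simply duplicated,
   and a scalar loop handles the odd element.  */
void
inc_all_vf2 (unsigned long *x, unsigned long n)
{
  unsigned long i = 0;
  for (; i + 2 <= n; i += 2)
    {
      unsigned long a = x[i];       /* ldp a, b, [x + 8*i] */
      unsigned long b = x[i + 1];
      x[i] = a + 1;                 /* stp a+1, b+1, [x + 8*i] */
      x[i + 1] = b + 1;
    }
  for (; i < n; i++)
    x[i] += 1;
}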
Thank you all for the input.

Just to add a bit of data. I've instrumented 510.parest_r to count the number of loop iterations to get a feel for how much of the unrolled loop is spent in the actual unrolled part rather than the prologue/peeled part. Overall, the hot function itself is entered 290M times. The distribution of loop iteration counts is:

Frequency   iter:
 92438870   36
 87028560   54
 20404571   24
 17312960   62
 14237184   72
 13403904   108
  7574437   102
  7574420   70
  5564881   40
  4328249   64
  4328240   46
  3142656   48
  2666496   124
  1248176   8
  1236641   16
  1166592   204
  1166592   140
  1134392   4
   857088   80
   666624   92
   666624   128
   618320   30
   613056   1
   234464   2
   190464   32
    95232   60
    84476   20
    48272   10
     6896   5

So the two most common iteration counts are 36 and 54. For an 8x unrolled loop that's 4 and 6 iterations spent in the prologue, with 4 and 6 times going around the 8x unrolled loop respectively.

As an experiment I hacked the AArch64 assembly of the function generated with -funroll-loops to replace the peeled prologue version with a simple non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms: >7%.

So beyond the vectorisation point Richard S. made above, maybe it's worth considering replacing the peeled prologue with a simple loop instead? Or at least add that as a distinct unrolling strategy and work to come up with an analysis that would allow us to choose one over the other?
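For clarity, the shape being compared against the jump-into-peeled-copies prologue is roughly the following (a minimal sketch of the inner loop only, with made-up names; 4x rather than the 8x that -funroll-loops uses, and with the leftover iterations handled by a plain scalar loop after the main body rather than peeled before it):

static double
dot_unrolled4 (const double *val, const unsigned int *colnum,
               const double *src, unsigned long n)
{
  double s = 0.;
  unsigned long i = 0;
  for (; i + 4 <= n; i += 4)        /* unrolled main loop */
    {
      s += val[i]     * src[colnum[i]];
      s += val[i + 1] * src[colnum[i + 1]];
      s += val[i + 2] * src[colnum[i + 2]];
      s += val[i + 3] * src[colnum[i + 3]];
    }
  for (; i < n; i++)                /* at most 3 remainder iterations */
    s += val[i] * src[colnum[i]];
  return s;
}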
(In reply to ktkachov from comment #11) > > As an experiment I hacked the AArch64 assembly of the function generated > with -funroll-loops to replace the peeled prologue version with a simple > non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms: > >7%. > > So beyond the vectorisation point Richard S. made above, maybe it's worth > considering replacing the peeled prologue with a simple loop instead? > Or at least add that as a distinct unrolling strategy and work to come up > with an analysis that would allow us to choose one over the other? Upon reflection I think I may have bungled up the assembly hacking (the changes I made may not be equivalent to the source). I'll redo that experiment soon, so please disregard that part for now. The iteration count distribution numbers are still valid though.
(In reply to ktkachov from comment #11)
> So beyond the vectorisation point Richard S. made above, maybe it's worth
> considering replacing the peeled prologue with a simple loop instead?
> Or at least add that as a distinct unrolling strategy and work to come up
> with an analysis that would allow us to choose one over the other?

Patches welcome ;)  Usually the peeling is done to improve branch prediction on the prologue/epilogue.
(In reply to rguenther@suse.de from comment #13)
> Usually the peeling is done to improve branch prediction on the
> prologue/epilogue.

Modern branch predictors do much better on a loop than with this kind of code:

        ands    x12, x11, 7
        beq     .L70
        cmp     x12, 1
        beq     .L55
        cmp     x12, 2
        beq     .L57
        cmp     x12, 3
        beq     .L59
        cmp     x12, 4
        beq     .L61
        cmp     x12, 5
        beq     .L63
        cmp     x12, 6
        bne     .L72

That's way too many branches close together, so most predictors will hit the maximum branches-per-fetch-block limit and not predict them.

If you wanted to peel it would have to be like:

if (n & 4)
  do 4 iterations
if (n & 2)
  do 2 iterations
if (n & 1)
  do 1 iteration

However that's still way too much code explosion for little or no gain. The only case where this makes sense is in a handwritten memcpy.
(In reply to Wilco from comment #14)
> If you wanted to peel it would have to be like:
>
> if (n & 4)
>   do 4 iterations
> if (n & 2)
>   do 2 iterations
> if (n & 1)
>   do 1 iteration

Classic peeling would be

do 1 iteration
if (n == 1) exit
do 1 iteration
if (n == 2) exit
...

which is what I referred to for branch prediction. Your & prompts me to a way to do something similar to Duff's device, turning the loop into a nest.

head:
  if (n == 0) exit;
  <1 iteration>
  if (n & 1)
    n -= 1, goto head;
  <1 iteration>
  if (n & 2)
    n -= 2, goto head;
  <2 iterations>
  if (n & 4)
    n -= 4, goto head;
  <4 iterations>
  n -= 8, goto head;

The inner loop tests should become well-predicted quickly.

But as always - more branches make it more likely that one is mispredicted. For a single non-unrolled loop you usually only get the actual exit mispredicted.
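To make the control flow concrete, a compilable version of that nest on a simple accumulation ("one iteration" here is s += a[i++]); this only illustrates the shape, it is not a proposed transformation:

double
sum_nest (const double *a, unsigned long n)
{
  double s = 0.;
  unsigned long i = 0;
head:
  if (n == 0)
    return s;
  s += a[i++];                      /* 1 iteration */
  if (n & 1) { n -= 1; goto head; }
  s += a[i++];                      /* 1 iteration */
  if (n & 2) { n -= 2; goto head; }
  s += a[i++]; s += a[i++];         /* 2 iterations */
  if (n & 4) { n -= 4; goto head; }
  s += a[i++]; s += a[i++];         /* 4 iterations */
  s += a[i++]; s += a[i++];
  n -= 8;
  goto head;
}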
(In reply to rguenther@suse.de from comment #15)
> Your & prompts me to a way to do something similar to Duff's device,
> turning the loop into a nest.
> [...]
> But as always - more branches make it more likely that one is
> mispredicted. For a single non-unrolled loop you usually only
> get the actual exit mispredicted.

Yes, overlapping the branches for the tail loop and the main loop will result in more mispredictions. And there are still 4 branches for an 8x unrolled loop, blocking optimizations and scheduling. So Duff's device is always inefficient - the above loop is much faster like this:

if (n & 1)
  do 1 iteration
if (n & 2)
  do 2 iterations
if (n >= 4)
  do 4 iterations while ((n -= 4) > 0)
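For comparison, the same simple accumulation written with this structure (a sketch only): the two bit tests peel the 1-3 leftover iterations up front, after which the do-while runs a whole number of 4-iteration blocks.

double
sum_peel_unroll4 (const double *a, unsigned long n)
{
  double s = 0.;
  unsigned long i = 0;
  if (n & 1)
    { s += a[i]; i += 1; n -= 1; }
  if (n & 2)
    { s += a[i]; s += a[i + 1]; i += 2; n -= 2; }
  if (n >= 4)
    do
      {
        s += a[i]; s += a[i + 1]; s += a[i + 2]; s += a[i + 3];
        i += 4;
      }
    while ((n -= 4) > 0);
  return s;
}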
I played around with the source to do some conservative 2x manual unrolling in the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA loops). This was to try out Richard's suggestion in #c10 about unrolling for forming load-pairs, and also to break the accumulator dependency.

So the initial testcase now became:

unsigned int *colnums;
double *val;

struct foostruct
{
  unsigned int rows;
  unsigned int *colnums;
  unsigned int *rowstart;
};

struct foostruct *cols;

void
foo (double * __restrict__ dst, const double *__restrict__ src)
{
  const unsigned int n_rows = cols->rows;
  const double *val_ptr = &val[cols->rowstart[0]];
  const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];

  double *dst_ptr = dst;
  for (unsigned int row=0; row<n_rows; ++row)
    {
      double s = 0.;
      const double *const val_end_of_row = &val[cols->rowstart[row+1]];
      __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;

      if (diff & 1)  // Peel the odd iteration.
        s += *val_ptr++ * src[*colnum_ptr++];

      double s1 = 0.;  // Second accumulator
      while (val_ptr != val_end_of_row)
        {
          s += val_ptr[0] * src[colnum_ptr[0]];
          s1 += val_ptr[1] * src[colnum_ptr[1]];
          val_ptr += 2;
          colnum_ptr += 2;
        }
      *dst_ptr++ = s + s1;
    }
}

This transformed the initial loop from:

.L4:
        ldr     w3, [x7, x2, lsl 2]
        cmp     x6, x2
        ldr     d2, [x5, x2, lsl 3]
        add     x2, x2, 1
        ldr     d1, [x1, x3, lsl 3]
        fmadd   d0, d2, d1, d0
        bne     .L4

into:

.L5:
        ldp     w6, w5, [x3]    // LDP
        add     x3, x3, 8
        ldp     d5, d3, [x2]    // LDP
        add     x2, x2, 16
        ldr     d4, [x1, x6, lsl 3]
        cmp     x4, x2
        ldr     d2, [x1, x5, lsl 3]
        fmadd   d0, d5, d4, d0
        fmadd   d1, d3, d2, d1
        bne     .L5

In parest itself a few of the loops transformed this way did not form LDPs due to unrelated LDP-forming inefficiencies, but most did. This transformation gave a 3% improvement on a Cortex-A72. There are more similar loops in the 3rd, 4th and 5th hottest functions in that benchmark, so I suspect if we do something intelligent there as well we'll get even more sizeable gains.

So rather than solving general "unrolling", how about we break this down into two desirable transformations:
1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's suggestion).
2) Unrolling and breaking accumulator dependencies.

I think more general unrolling and the peeling associated with it can be discussed independently of 1) and 2) once we collect more data on more microarchitectures.
(In reply to ktkachov from comment #17)
> So rather than solving general "unrolling", how about we break this down into
> two desirable transformations:
> 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> suggestion).

If that helps, sure (I'd have guessed uarchs are going to split load-multiple into separate loads, but eventually it avoids load-port contention?)

> 2) Unrolling and breaking accumulator dependencies.

IIRC RTL unrolling can do this (as a side-effect, not as the main cost motivation), guarded with an extra switch.

> I think more general unrolling and the peeling associated with it can be
> discussed independently of 1) and 2) once we collect more data on more
> microarchitectures.

I think both of these can be "implemented" on the RTL unroller side.
(In reply to rguenther@suse.de from comment #18)
> > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > suggestion).
>
> If that helps, sure (I'd have guessed uarchs are going to split
> load-multiple into separate loads, but eventually it avoids
> load-port contention?)

Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a 128-bit LDP in a single cycle (see the Optimization Guide).

> > 2) Unrolling and breaking accumulator dependencies.
>
> IIRC RTL unrolling can do this (as a side-effect, not as the main cost
> motivation), guarded with an extra switch.
>
> > I think more general unrolling and the peeling associated with it can be
> > discussed independently of 1) and 2) once we collect more data on more
> > microarchitectures.
>
> I think both of these can be "implemented" on the RTL unroller side.

You still need dependence analysis, alias info and ivopts to run again. The goal is to remove the increment of the index, use efficient addressing modes (base+imm) and allow scheduling to move instructions between iterations. I don't believe the RTL unroller supports any of this today.
(In reply to Wilco from comment #19)
> You still need dependence analysis, alias info and ivopts to run again. The
> goal is to remove the increment of the index, use efficient addressing modes
> (base+imm) and allow scheduling to move instructions between iterations. I
> don't believe the RTL unroller supports any of this today.

There's no way to encode load-multiple on GIMPLE that wouldn't be awkward to other GIMPLE optimizers.

Yes, the RTL unroller supports scheduling (sched runs after unrolling) and scheduling can do dependence analysis. Yes, the RTL unroller does _not_ do dependence analysis at the moment, so if you want to know beforehand whether you can interleave iterations you have to actually perform dependence analysis. Of course you can do that on RTL. And of course you can do IVOPTs on RTL (yes, we don't do that at the moment).

Note I'm not opposed to having IVOPTs on GIMPLE itself perform unrolling (I know Bin was against this given IVOPTs is already so complex) and do an accumulator breakup. But I don't see how to do the load-multiple thing (yes, you could represent it as a vector load plus N element extracts on GIMPLE, and it would be easy to teach SLP vectorization to perform this transform on its own if it is really profitable - which I doubt you can reasonably argue before RA, let alone on GIMPLE).
(In reply to rguenther@suse.de from comment #20)
> There's no way to encode load-multiple on GIMPLE that wouldn't be
> awkward to other GIMPLE optimizers.

I don't think anyone wants LDP/STP directly in GIMPLE - that doesn't seem useful. We don't even form LDP until quite late in RTL.

The key to forming LDP/STP is using base+imm addressing modes and having correct alias info (so loads/stores from different iterations can be interleaved and then combined into LDP/STP). The main thing a backend would need to do is tune address costs to take future LDP formation into account (and yes, the existing cost models need to be improved anyway).

> Yes, the RTL unroller supports scheduling (sched runs after unrolling)
> and scheduling can do dependence analysis. Yes, the RTL unroller
> does _not_ do dependence analysis at the moment, so if you want to
> know beforehand whether you can interleave iterations you have to
> actually perform dependence analysis. Of course you can do that
> on RTL. And of course you can do IVOPTs on RTL (yes, we don't do that
> at the moment).

Sure, we *could* duplicate all high-level loop optimizations to work on RTL. However, is that worth the effort given we have them already at the tree level?

> Note I'm not opposed to having IVOPTs on GIMPLE itself perform
> unrolling (I know Bin was against this given IVOPTs is already
> so complex) and do an accumulator breakup. But I don't see how
> to do the load-multiple thing (yes, you could represent it
> as a vector load plus N element extracts on GIMPLE, and it
> would be easy to teach SLP vectorization to perform this
> transform on its own if it is really profitable - which I
> doubt you can reasonably argue before RA, let alone on GIMPLE).

Let's forget about load-multiple in GIMPLE. Kyrill's example shows that unrolling at the high level means the existing loop optimizations and analyses work as expected, and we end up with good addressing modes, LDPs and interleaving of different iterations. With the existing RTL unroller this just isn't happening.
Some more experiments...

Unrolling 4x in a similar way to my previous example and not splitting the accumulator (separate issue):

unsigned int *colnums;
double *val;

struct foostruct
{
  unsigned int rows;
  unsigned int *colnums;
  unsigned int *rowstart;
};

struct foostruct *cols;

void
foo (double * __restrict__ dst, const double *__restrict__ src)
{
  const unsigned int n_rows = cols->rows;
  const double *val_ptr = &val[cols->rowstart[0]];
  const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];

  double *dst_ptr = dst;
  for (unsigned int row=0; row<n_rows; ++row)
    {
      double s = 0.;
      const double *const val_end_of_row = &val[cols->rowstart[row+1]];
      __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;

      if (diff & 1)
        {
          s += *val_ptr++ * src[*colnum_ptr++];
          diff--;
        }

      if (diff & 2)
        {
          s += val_ptr[0] * src[colnum_ptr[0]];
          s += val_ptr[1] * src[colnum_ptr[1]];
          val_ptr += 2;
          colnum_ptr += 2;
        }

      while (val_ptr != val_end_of_row)
        {
          s += val_ptr[0] * src[colnum_ptr[0]];
          s += val_ptr[1] * src[colnum_ptr[1]];
          s += val_ptr[2] * src[colnum_ptr[2]];
          s += val_ptr[3] * src[colnum_ptr[3]];
          val_ptr += 4;
          colnum_ptr += 4;
        }
      *dst_ptr++ = s;
    }
}

helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%) improvement on parest, and about 5.3% on a more aggressive CPU. I tried unrolling 8x in a similar manner and that was not faster than 4x on either target.

Note that perf profiling shows that the loads are what's hot in these loops, not the FMAs themselves:

 4.41 │1b8:   ldp   w3, w4, [x0]
 5.85 │       ldp   d3, d4, [x2]
      │       add   x2, x2, #0x20
 3.79 │       ldur  d5, [x2, #-16]
 2.82 │       ldr   d0, [x1, x4, lsl #3]
 2.53 │       ldr   d2, [x1, x3, lsl #3]
 2.10 │       ldp   w4, w3, [x0, #8]
      │       add   x0, x0, #0x10
 0.00 │       cmp   x5, x0
      │       fmul  d0, d0, d4
 4.73 │       ldr   d4, [x1, x4, lsl #3]
      │       fmadd d0, d3, d2, d0
 2.01 │       ldur  d3, [x2, #-8]
 2.54 │       ldr   d2, [x1, x3, lsl #3]
      │       fmadd d0, d5, d4, d0
      │       fmadd d0, d3, d2, d0
      │       fadd  d1, d1, d0
(In reply to ktkachov from comment #22)
> helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%)
> improvement on parest, and about 5.3% on a more aggressive CPU.
> I tried unrolling 8x in a similar manner and that was not faster than 4x on
> either target.

The 4x unrolled version has 19 instructions (and micro-ops) vs 7*4 for the non-unrolled version, a significant reduction (without LDP it would be 21 vs 28). There is potential to use 2 more LDPs and use load+writeback, which would make it 15 vs 28 instructions, so close to a 2x reduction in instructions.
On some (many?) targets it would be good to unroll all loops with a small body (not containing calls etc.) at -O2 already, some small number of times (2 or 4 maybe).
Well, very small loops should be unrolled more than slightly larger ones. So perhaps if the loop body is only 3 or 4 instructions it should be unrolled four times, but above that perhaps only twice. Some consideration should be given, however, to whether the target micro-architecture has looping buffers, as unrolling beyond the limit of such structures will hurt performance.
Yeah, and it probably should be a param (that different targets can default differently, per CPU probably). On most Power CPUs all loops take a minimum number of cycles per iteration (say, three), but that translates to a lot of instructions (say, >10). Small loops should probably be unrolled at -O2 already as well. Maybe fixed length loops only?
(In reply to Segher Boessenkool from comment #26) > Yeah, and it probably should be a param (that different targets can default > differently, per CPU probably). On most Power CPUs all loops take a minimum > number of cycles per iteration (say, three), but that translates to a lot of > instructions (say, >10). > > Small loops should probably be unrolled at -O2 already as well. Maybe fixed > length loops only? Agreed, there is no reason not to unroll (or vectorize) with -O2 given several other compilers do (and beat GCC as a result). It's best to limit unrolling to small loops and only 2x unless it's 1-2 instructions.
For these kinds of small loops, it would be acceptable to unroll in GIMPLE, because register pressure and instruction cost may not be major concerns; just like the "cunroll" and "cunrolli" passes (complete unrolling), which are also done at O2.
(In reply to Jiu Fu Guo from comment #28)
> For these kinds of small loops, it would be acceptable to unroll in GIMPLE,
> because register pressure and instruction cost may not be major concerns;
> just like the "cunroll" and "cunrolli" passes (complete unrolling), which are
> also done at O2.

Absolutely, unrolling is a high-level optimization like vectorization. Note the existing unroller doesn't care about register pressure or costs, but a high-level unroller can certainly take this into account by not unrolling as aggressively as the existing unroller does.
(In reply to Wilco from comment #29)
> Absolutely, unrolling is a high-level optimization like vectorization.

To expose ILP? I'd call that low-level though ;)

If it exposes data reuse then I'd call it high-level - and at that level we already have passes like predictive commoning or unroll-and-jam doing exactly that. Or vectorization.

We've shown through data that unrolling without a good idea of CPU pipeline details is a loss on x86_64. This further hints at it being low-level.
Gimple passes know a lot about machine details, too. Irrespective of if this is "low-level" or "high-level", it should be done earlier than it is now. It should either be done right after expand, or somewhere before it. Doing it in gimple seems easiest?
(In reply to Segher Boessenkool from comment #31) > Gimple passes know a lot about machine details, too. > > Irrespective of if this is "low-level" or "high-level", it should be done > earlier than it is now. It should either be done right after expand, or > somewhere before it. Doing it in gimple seems easiest? Expand is way too late. Unrolling is high-level because it relies on other high-level optimizations - for example to get addressing and induction variables right. Without that it's always a loss.
(In reply to Segher Boessenkool from comment #31)
> Irrespective of if this is "low-level" or "high-level", it should be done
> earlier than it is now. It should either be done right after expand, or
> somewhere before it. Doing it in gimple seems easiest?

Sure, it's easiest in GIMPLE. But I'd say it shouldn't be done earlier, it should be done correctly - the RTL unroller simply unrolls everything without any good costing. That is what needs to be fixed. Other than that, where we do cunroll on GIMPLE is a good spot I guess (it needs to be done _before_ IVOPTs at least, though I wanted to push that one later as well).
(In reply to rguenther@suse.de from comment #30)
> We've shown through data that unrolling without a good idea of CPU
> pipeline details is a loss on x86_64. This further hints at it
> being low-level.

That was using the RTL unroller and existing settings, right? Those settings are not sane - is 16x unrolling still the default?!? If so, that doesn't surprise me at all.

Note also it's dangerous to assume x86 behaves exactly like Arm, POWER or other ISAs. Arm cores are very wide nowadays (6+ wide) and have supported 2 memory accesses per cycle (loads and/or stores) for years.
(In reply to Wilco from comment #34)
> That was using the RTL unroller and existing settings, right? Those settings
> are not sane - is 16x unrolling still the default?!? If so, that doesn't
> surprise me at all.

We actually did measurements with restricting that to 2x, 3x, 4x, ... plus selectively enabling some extra unroller features (-fsplit-ivs-in-unroller, -fvariable-expansion-in-unroller). Quite a big testing matrix, but nothing was an obvious win.

> Note also it's dangerous to assume x86 behaves exactly like Arm, POWER
> or other ISAs. Arm cores are very wide nowadays (6+ wide) and have
> supported 2 memory accesses per cycle (loads and/or stores) for years.

Sure.
(In reply to Wilco from comment #32)
> Expand is way too late. Unrolling is high-level because it relies on other
> high-level optimizations - for example to get addressing and induction
> variables right. Without that it's always a loss.

I know I've said it before, but I really want to have a "GIMPLE unroller" integrated with a pass that can compute unrolling benefit. One candidate was of course IVOPTs, which, albeit already quite complex, has register pressure estimates and should have an idea how unrolling affects these.

Computing unrolling benefit on its own would require some pipeline modeling on GIMPLE. So the other obvious candidate would be the RTL scheduler - here I think SMS does "unrolling".
-- If it is done in RTL it should really be done earlier; it doesn't get all the optimisations it should right now.

-- Unrolling small loops more aggressively (at -O2 even) perhaps needs to be done at a different spot in the pipeline then? It needs no cost estimate (other than some simple params).
(In reply to Segher Boessenkool from comment #37)
> -- Unrolling small loops more aggressively (at -O2 even) perhaps needs to be
> done at a different spot in the pipeline then? It needs no cost estimate
> (other than some simple params).

Well. Even unrolling a two-stmt loop just once can be detrimental on x86 CPUs, and whether it is beneficial depends a lot on code alignment, which we cannot really track well without introducing artificial code padding everywhere. But yes, this can be turned off easily on x86. It's just that the heuristics are not as simple as you might think.
For a small loop (1-2 stmts), in GIMPLE and RTL form it would be around 5-10 instructions: 2-4 insns per stmt, plus ~4 insns for the index.

With the current unroller, here are some statistics from SPEC2017:

Using --param max-unrolled-insns=12, there are ~3000 small loops in total that could be unrolled, and ~40 of these small loops are located in hot functions.

Using --param max-unrolled-insns=16, there are ~11000 small loops in total that could be unrolled, and ~230 of these small loops are located in hot functions.

Using --param max-unrolled-insns=20, there are ~15000 small loops in total that could be unrolled, and ~570 of these small loops are located in hot functions.

Using --param max-unrolled-insns=24, there are ~18000 small loops in total that could be unrolled, and ~680 of these small loops are located in hot functions.

If max-unrolled-insns < 16, just a few small loops in hot functions are unrolled; it may not be very valuable.
(In reply to Jiu Fu Guo from comment #39)
> If max-unrolled-insns < 16, just a few small loops in hot functions are
> unrolled; it may not be very valuable.

So 12, if two times unrolled, is already 6 insns; minus IV update and compare-and-branch (assuming a single pattern) that's 4 insns. On GIMPLE I'd already call this large since eventual memory loads and stores would be separate - so there it would be ~16 instead of 12.

I think the better approach is to identify the cases where unrolling would help, and on which (sub-)architectures, and prepare testcases for them. I guess the times where our default unroll factor of 8 (if it fits the size limits) is a good idea are long gone; I'd expect ILP to stop improving much earlier (depending on the set of operations). For ILP you also want to do interleaving of the unrolled iterations, so I point to SMS again here (SMS suffers from the fact that loop dependence info is weak on RTL, but it uses the scheduler model of the target).
For code:

subroutine foo (i, i1, block)
  integer :: i, i1
  integer :: block(9, 9, 9)
  block(i:9,1,i1) = block(i:9,1,i1) - 10
end subroutine foo

"-funroll-loops --param max-unroll-times=2 --param max-unrolled-insns=20" could help to improve some run time (~10% on ppcle).

main:
  do n = 0, N
    do i = 1, 9
      do j = 1, 9
        call foo (i, j, block)
      end do
    end do
  end do
Author: guojiufu
Date: Mon Oct 28 05:23:24 2019
New Revision: 277501

URL: https://gcc.gnu.org/viewcvs?rev=277501&root=gcc&view=rev
Log:
rs6000: Enable limited unrolling at -O2

In PR88760, there are a few discussions about improving or tuning the unroller for targets. And we would agree to enable the unroller for small loops at O2 first. And we could see performance improvement (~10%) for the below code:
```
subroutine foo (i, i1, block)
  integer :: i, i1
  integer :: block(9, 9, 9)
  block(i:9,1,i1) = block(i:9,1,i1) - 10
end subroutine foo
```
This kind of code occurs a few times in the exchange2 benchmark.

Similar C code:
```
for (i = 0; i < n; i++)
  arr[i] = arr[i] - 10;
```

On powerpcle, for O2, enable -funroll-loops and limit PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20; we can see >2% overall improvement for SPEC2017.

This patch is only for rs6000, in which we see visible performance improvement.

gcc/
2019-10-25  Jiufu Guo  <guojiufu@linux.ibm.com>

	PR tree-optimization/88760
	* config/rs6000/rs6000-common.c (rs6000_option_optimization_table):
	Enable -funroll-loops for -O2 and above.
	* config/rs6000/rs6000.c (rs6000_option_override_internal): Set
	PARAM_MAX_UNROLL_TIMES to 2 and PARAM_MAX_UNROLLED_INSNS to 20, and
	do not turn on web and rngreg implicitly, if the unroller is not
	explicitly enabled.

gcc.testsuite/
2019-10-25  Jiufu Guo  <guojiufu@linux.ibm.com>

	PR tree-optimization/88760
	* gcc.target/powerpc/small-loop-unroll.c: New test.
	* c-c++-common/tsan/thread_leak2.c: Update test.
	* gcc.dg/pr59643.c: Update test.
	* gcc.target/powerpc/loop_align.c: Update test.
	* gcc.target/powerpc/ppc-fma-1.c: Update test.
	* gcc.target/powerpc/ppc-fma-2.c: Update test.
	* gcc.target/powerpc/ppc-fma-3.c: Update test.
	* gcc.target/powerpc/ppc-fma-4.c: Update test.
	* gcc.target/powerpc/pr78604.c: Update test.

Added:
    trunk/gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/common/config/rs6000/rs6000-common.c
    trunk/gcc/config/rs6000/rs6000.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/c-c++-common/tsan/thread_leak2.c
    trunk/gcc/testsuite/gcc.dg/pr59643.c
    trunk/gcc/testsuite/gcc.target/powerpc/loop_align.c
    trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c
    trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c
    trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c
    trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c
    trunk/gcc/testsuite/gcc.target/powerpc/pr78604.c
Author: guojiufu
Date: Mon Nov 11 06:30:38 2019
New Revision: 278034

URL: https://gcc.gnu.org/viewcvs?rev=278034&root=gcc&view=rev
Log:
rs6000: Refine small loop unroll in loop_unroll_adjust hook

In this patch, a loop unroll adjust hook is introduced for powerpc. We can do target-related heuristic adjustments in this hook. In this patch, -funroll-loops is enabled for small loops at O2 and above, with an option -munroll-only-small-loops to guard the small-loop unrolling, and it works fine with -flto.

gcc/
2019-11-11  Jiufu Guo  <guojiufu@linux.ibm.com>

	PR tree-optimization/88760
	* gcc/config/rs6000/rs6000.opt (-munroll-only-small-loops): New option.
	* gcc/common/config/rs6000/rs6000-common.c
	(rs6000_option_optimization_table) [OPT_LEVELS_2_PLUS_SPEED_ONLY]:
	Turn on -funroll-loops and -munroll-only-small-loops.
	[OPT_LEVELS_ALL]: Turn off -fweb and -frename-registers.
	* config/rs6000/rs6000.c (rs6000_option_override_internal): Remove
	set of PARAM_MAX_UNROLL_TIMES and PARAM_MAX_UNROLLED_INSNS.
	Turn off -munroll-only-small-loops for explicit -funroll-loops.
	(TARGET_LOOP_UNROLL_ADJUST): Add loop unroll adjust hook.
	(rs6000_loop_unroll_adjust): Define it.  Use -munroll-only-small-loops.

gcc.testsuite/
2019-11-11  Jiufu Guo  <guojiufu@linux.ibm.com>

	PR tree-optimization/88760
	* gcc.dg/pr59643.c: Update back to r277550.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/common/config/rs6000/rs6000-common.c
    trunk/gcc/config/rs6000/rs6000.c
    trunk/gcc/config/rs6000/rs6000.opt
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/pr59643.c