Bug 88760

Summary: GCC unrolling is suboptimal
Product: gcc                          Reporter: ktkachov
Component: tree-optimization          Assignee: Not yet assigned to anyone <unassigned>
Status: NEW
Severity: normal                      CC: bill.schmidt, dimhen, dje, hjl.tools, hp, pinskia, rguenth, rsandifo, segher
Priority: P3                          Keywords: missed-optimization
Version: 9.0
Target Milestone: ---
Last reconfirmed: 2019-01-16
Bug Blocks: 26163
Attachments: aarch64-llvm output with -Ofast -mcpu=cortex-a57

Description ktkachov 2019-01-08 17:09:04 UTC
One of the hot loops in 510.parest_r from SPEC2017 can be approximated through:
unsigned int *colnums;
double *val;

struct foostruct
{
  unsigned int rows;
  unsigned int *colnums;
  unsigned int *rowstart;
};

struct foostruct *cols;

void
foo (double *dst, const double *src)
{
  const unsigned int n_rows = cols->rows;
  const double *val_ptr = &val[cols->rowstart[0]];
  const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];  

  double *dst_ptr = dst;
  for (unsigned int row=0; row<n_rows; ++row)
    {
      double s = 0.;
      const double *const val_end_of_row = &val[cols->rowstart[row+1]];
      while (val_ptr != val_end_of_row)
        s += *val_ptr++ * src[*colnum_ptr++];
      *dst_ptr++ = s;
    }
}


At -Ofast -mcpu=cortex-a57 on aarch64 GCC generates a tight FMA loop:
.L4:
        ldr     w3, [x7, x2, lsl 2]
        cmp     x6, x2
        ldr     d2, [x5, x2, lsl 3]
        add     x2, x2, 1
        ldr     d1, [x1, x3, lsl 3]
        fmadd   d0, d2, d1, d0
        bne     .L4


LLVM unrolls the loop more intelligently:
.LBB0_8:                                // %vector.body
                                        //   Parent Loop BB0_2 Depth=1
                                        // =>  This Inner Loop Header: Depth=2
        ldp     w21, w22, [x20, #-8]
        ldr     d5, [x1, x21, lsl #3]
        ldp     d3, d4, [x7, #-16]
        ldr     d6, [x1, x22, lsl #3]
        ldp     w21, w22, [x20], #16
        fmadd   d2, d6, d4, d2
        fmadd   d1, d5, d3, d1
        ldr     d5, [x1, x21, lsl #3]
        ldr     d6, [x1, x22, lsl #3]
        add     x5, x5, #4              // =4
        adds    x19, x19, #2            // =2
        ldp     d3, d4, [x7], #32
        fmadd   d1, d5, d3, d1
        fmadd   d2, d6, d4, d2
        b.ne    .LBB0_8


With -funroll-loops GCC does do unrolling, but it does it differently:
<snip>
        ands    x12, x11, 7
        beq     .L70
        cmp     x12, 1
        beq     .L55
        cmp     x12, 2
        beq     .L57
        cmp     x12, 3
        beq     .L59
        cmp     x12, 4
        beq     .L61
        cmp     x12, 5
        beq     .L63
        cmp     x12, 6
        bne     .L72
.L65:
        ldr     w14, [x4, x2, lsl 2]
        ldr     d3, [x3, x2, lsl 3]
        add     x2, x2, 1
        ldr     d4, [x1, x14, lsl 3]
        fmadd   d0, d3, d4, d0
.L63:
        ldr     w5, [x4, x2, lsl 2]
        ldr     d5, [x3, x2, lsl 3]
        add     x2, x2, 1
        ldr     d6, [x1, x5, lsl 3]
        fmadd   d0, d5, d6, d0
.L61:
        ldr     w9, [x4, x2, lsl 2]
        ldr     d7, [x3, x2, lsl 3]
        add     x2, x2, 1
        ldr     d16, [x1, x9, lsl 3]
        fmadd   d0, d7, d16, d0
<snip>

On the whole of 510.parest_r this makes LLVM about 6% faster than GCC on Cortex-A57.

Perhaps this can be used as a motivating testcase to move the GCC unrolling discussions forward?
Comment 1 Richard Biener 2019-01-09 10:12:10 UTC
So LLVM unrolls 4 times while GCC (always) unrolls 8 times.  The unrolled body
for GCC (x86_64 this time) is

.L4:
        movl    (%rdx), %ecx
        vmovsd  (%rax), %xmm8
        addq    $32, %rdx
        addq    $64, %rax
        vmovsd  -56(%rax), %xmm9
        vmovsd  -48(%rax), %xmm10
        vfmadd231sd     (%rsi,%rcx,8), %xmm8, %xmm0
        movl    -28(%rdx), %ecx
        vmovsd  -40(%rax), %xmm11
        vmovsd  -32(%rax), %xmm12
        vfmadd231sd     (%rsi,%rcx,8), %xmm9, %xmm0
        movl    -24(%rdx), %ecx
        vmovsd  -24(%rax), %xmm13
        vmovsd  -16(%rax), %xmm14
        vfmadd231sd     (%rsi,%rcx,8), %xmm10, %xmm0
        movl    -20(%rdx), %ecx
        vmovsd  -8(%rax), %xmm15
        vfmadd231sd     (%rsi,%rcx,8), %xmm11, %xmm0
        movl    -16(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm12, %xmm0
        movl    -12(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm13, %xmm0
        movl    -8(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm14, %xmm0
        movl    -4(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm15, %xmm0
        cmpq    %rax, %r9
        jne     .L4

and what you quoted is the prologue.  You didn't quote LLVM's prologue,
but if I read my clang's output correctly it uses a loop there.
(is there sth like -fdump-tree-optimized for clang?)

Our RTL unroller cannot do a loopy prologue but it always has this
jump-into peeled copies thing.  Using --param max-unroll-times=4
produces

.L4:
        movl    (%rdx), %ecx
        vmovsd  (%rax), %xmm2
        addq    $16, %rdx
        addq    $32, %rax
        vmovsd  -24(%rax), %xmm3
        vmovsd  -16(%rax), %xmm4
        vfmadd231sd     (%rsi,%rcx,8), %xmm2, %xmm0
        movl    -12(%rdx), %ecx
        vmovsd  -8(%rax), %xmm5
        vfmadd231sd     (%rsi,%rcx,8), %xmm3, %xmm0
        movl    -8(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm4, %xmm0
        movl    -4(%rdx), %ecx
        vfmadd231sd     (%rsi,%rcx,8), %xmm5, %xmm0
        cmpq    %rax, %r8
        jne     .L4

which is nearly equivalent to clang's variant?
Comment 2 ktkachov 2019-01-09 10:28:56 UTC
Created attachment 45386 [details]
aarch64-llvm output with -Ofast -mcpu=cortex-a57

I'm attaching the full LLVM aarch64 output.

The output you quoted is with -funroll-loops. If that's not given, GCC doesn't seem to unroll by default at all (on aarch64 or x86_64 from my testing).

Is there anything we can do to make the default unrolling a bit more aggressive?
Comment 3 Richard Biener 2019-01-09 10:47:15 UTC
(In reply to ktkachov from comment #2)
> Created attachment 45386 [details]
> aarch64-llvm output with -Ofast -mcpu=cortex-a57
> 
> I'm attaching the full LLVM aarch64 output.
> 
> The output you quoted is with -funroll-loops. If that's not given, GCC
> doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> testing).
> 
> Is there anything we can do to make the default unrolling a bit more
> aggressive?

Well, the RTL loop unroller is not enabled by default at any
optimization level (unless you are using FDO).  There are also
related flags that are not enabled (-fsplit-ivs-in-unroller and
-fvariable-expansion-in-unroller).

The RTL loop unroller is simply not good at estimating the benefit
of unrolling (which is also why you usually see it unroll
--param max-unroll-times times), and the tunables it has are
not very well tuned across targets.

Micha did quite extensive benchmarking (on x86_64) which shows that
the cases where unrolling is profitable are rare and the reason
is often hard to understand.

That's of course in the context of CPUs having caches of
pre-decoded/fused/etc. instructions to optimize issue, which
makes peeled prologues expensive, as well as even more specialized
caches for small loops that avoid further frontend costs.

Not sure if Arm microarchitectures have any of this.

I generally don't believe in unrolling as a separately profitable
transform.  Rather unrolling could be done as part of another
transform (vectorization is the best example).  For sth still
done on RTL that would then include scheduling which is where
the best cost estimates should be available (and if you do
this post-reload then you even have a very good idea of
register pressure).  This is also why I think a standalone
unrolling phase belongs on RTL since I don't see a good way
of estimating cost/benefit on GIMPLE (see how difficult it is
to cost vectorization vs. non-vectorization there).
Comment 4 Wilco 2019-01-09 13:02:56 UTC
(In reply to ktkachov from comment #2)
> Created attachment 45386 [details]
> aarch64-llvm output with -Ofast -mcpu=cortex-a57
> 
> I'm attaching the full LLVM aarch64 output.
> 
> The output you quoted is with -funroll-loops. If that's not given, GCC
> doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> testing).
> 
> Is there anything we can do to make the default unrolling a bit more
> aggressive?

I don't think the RTL unroller works at all. It doesn't have the right settings, and doesn't understand how to unroll, so we always get inefficient and bloated code.

To do unrolling correctly it has to be integrated at tree level - for example when vectorization isn't possible/beneficial, unrolling might still be a good idea.
Comment 5 Wilco 2019-01-09 13:35:03 UTC
(In reply to Wilco from comment #4)
> (In reply to ktkachov from comment #2)
> > Created attachment 45386 [details]
> > aarch64-llvm output with -Ofast -mcpu=cortex-a57
> > 
> > I'm attaching the full LLVM aarch64 output.
> > 
> > The output you quoted is with -funroll-loops. If that's not given, GCC
> > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> > testing).
> > 
> > Is there anything we can do to make the default unrolling a bit more
> > aggressive?
> 
> I don't think the RTL unroller works at all. It doesn't have the right
> settings, and doesn't understand how to unroll, so we always get inefficient
> and bloated code.
> 
> To do unrolling correctly it has to be integrated at tree level - for
> example when vectorization isn't possible/beneficial, unrolling might still
> be a good idea.

To add some numbers to the conversation, the gain LLVM gets from default unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017.

This clearly shows there is huge potential from unrolling, *if* we can teach GCC to unroll properly like LLVM. That means early unrolling, using good default settings and using a trailing loop rather than inefficient peeling.
Comment 6 rguenther@suse.de 2019-01-09 13:38:22 UTC
On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #5 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to Wilco from comment #4)
> > (In reply to ktkachov from comment #2)
> > > Created attachment 45386 [details]
> > > aarch64-llvm output with -Ofast -mcpu=cortex-a57
> > > 
> > > I'm attaching the full LLVM aarch64 output.
> > > 
> > > The output you quoted is with -funroll-loops. If that's not given, GCC
> > > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> > > testing).
> > > 
> > > Is there anything we can do to make the default unrolling a bit more
> > > aggressive?
> > 
> > I don't think the RTL unroller works at all. It doesn't have the right
> > settings, and doesn't understand how to unroll, so we always get inefficient
> > and bloated code.
> > 
> > To do unrolling correctly it has to be integrated at tree level - for
> > example when vectorization isn't possible/beneficial, unrolling might still
> > be a good idea.
> 
> To add some numbers to the conversation, the gain LLVM gets from default
> unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017.
> 
> This clearly shows there is huge potential from unrolling, *if* we can teach
> GCC to unroll properly like LLVM. That means early unrolling, using good
> default settings and using a trailing loop rather than inefficient peeling.

I don't see why this cannot be done on RTL, where we have vastly more
information about whether there are execution resources that can be
used by unrolling.  Note we also want unrolling to interleave
instructions so as not to rely on pre-reload scheduling, which in turn
means keeping a good eye on register pressure (again sth not very well
handled on GIMPLE).
Comment 7 Wilco 2019-01-09 13:59:39 UTC
(In reply to rguenther@suse.de from comment #6)
> On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > 
> > --- Comment #5 from Wilco <wilco at gcc dot gnu.org> ---
> > (In reply to Wilco from comment #4)
> > > (In reply to ktkachov from comment #2)
> > > > Created attachment 45386 [details]
> > > > aarch64-llvm output with -Ofast -mcpu=cortex-a57
> > > > 
> > > > I'm attaching the full LLVM aarch64 output.
> > > > 
> > > > The output you quoted is with -funroll-loops. If that's not given, GCC
> > > > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> > > > testing).
> > > > 
> > > > Is there anything we can do to make the default unrolling a bit more
> > > > aggressive?
> > > 
> > > I don't think the RTL unroller works at all. It doesn't have the right
> > > settings, and doesn't understand how to unroll, so we always get inefficient
> > > and bloated code.
> > > 
> > > To do unrolling correctly it has to be integrated at tree level - for
> > > example when vectorization isn't possible/beneficial, unrolling might still
> > > be a good idea.
> > 
> > To add some numbers to the conversation, the gain LLVM gets from default
> > unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017.
> > 
> > This clearly shows there is huge potential from unrolling, *if* we can teach
> > GCC to unroll properly like LLVM. That means early unrolling, using good
> > default settings and using a trailing loop rather than inefficient peeling.
> 
> I don't see why this cannot be done on RTL where we have vastly more
> information of whether there are execution resources that can be
> used by unrolling.  Note we also want unrolling to interleave
> instructions to not rely on pre-reload scheduling which in turn means
> having a good eye on register pressure (again sth not very well handled
> on GIMPLE)

The main issue is that the other loop optimizations are done at the tree level, so things like addressing modes, loop invariants and CSE are computed on the non-unrolled version. Then when we unroll in RTL we end up with very suboptimal code. A typical unrolled loop starts like this:

	add	x13, x2, 1
	add	x14, x2, 2
	add	x11, x2, 3
	add	x10, x2, 4
	ldr	w30, [x4, x13, lsl 2]
	add	x9, x2, 5
	add	x5, x2, 6
	add	x12, x2, 7
	ldr	d23, [x3, x2, lsl 3]
        ... rest of unrolled loop

So basically it decides to create a new induction variable for every unrolled copy in the loop. This often leads to spills just because it creates way too many redundant addressing instructions. It also blocks scheduling between iterations since the alias optimization doesn't appear to understand simple constant differences between indices.

So unrolling should definitely be done at a high level just like vectorization.
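
For contrast, here is a rough sketch (mine, not compiler output) of the shape a tree-level 4x unroll would ideally leave after ivopts: one index, constant offsets and a single increment per trip, using the names from the testcase (val, colnums, src). For brevity it assumes the trip count n is a multiple of 4, with the remainder handled elsewhere:

static double
row_dot_4x (const double *val, const unsigned int *colnums,
            const double *src, unsigned int n)
{
  /* Assumes n % 4 == 0; remainder iterations handled separately.  */
  double s = 0.;
  for (unsigned int i = 0; i < n; i += 4)
    {
      s += val[i]     * src[colnums[i]];      /* base + constant offsets, */
      s += val[i + 1] * src[colnums[i + 1]];  /* no per-copy induction    */
      s += val[i + 2] * src[colnums[i + 2]];  /* variables                */
      s += val[i + 3] * src[colnums[i + 3]];
    }
  return s;
}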
Comment 8 ktkachov 2019-01-15 11:08:33 UTC
btw it looks like ICC vectorises this as well as unrolling it:
..B1.14:                        
        movl      (%rcx,%rbx,4), %r15d                          
        vmovsd    (%rdi,%r15,8), %xmm2                          
        movl      4(%rcx,%rbx,4), %r15d                         
        vmovhpd   (%rdi,%r15,8), %xmm2, %xmm3                   
        movl      8(%rcx,%rbx,4), %r15d                         
        vfmadd231pd (%r10,%rbx,8), %xmm3, %xmm0                 
        vmovsd    (%rdi,%r15,8), %xmm4                          
        movl      12(%rcx,%rbx,4), %r15d                        
        vmovhpd   (%rdi,%r15,8), %xmm4, %xmm5                   
        movl      16(%rcx,%rbx,4), %r15d                        
        vfmadd231pd 16(%r10,%rbx,8), %xmm5, %xmm1               
        vmovsd    (%rdi,%r15,8), %xmm6                          
        movl      20(%rcx,%rbx,4), %r15d                        
        vmovhpd   (%rdi,%r15,8), %xmm6, %xmm7                   
        movl      24(%rcx,%rbx,4), %r15d                        
        vfmadd231pd 32(%r10,%rbx,8), %xmm7, %xmm0               
        vmovsd    (%rdi,%r15,8), %xmm8                          
        movl      28(%rcx,%rbx,4), %r15d                        
        vmovhpd   (%rdi,%r15,8), %xmm8, %xmm9                   
        vfmadd231pd 48(%r10,%rbx,8), %xmm9, %xmm1               
        addq      $8, %rbx                                      
        cmpq      %r14, %rbx                                    
        jb        ..B1.14 

Is that something GCC could reasonably do?
Comment 9 rguenther@suse.de 2019-01-15 11:19:37 UTC
On Tue, 15 Jan 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #8 from ktkachov at gcc dot gnu.org ---
> btw looks likes ICC vectorises this as well as unrolling:
> ..B1.14:                        
>         movl      (%rcx,%rbx,4), %r15d                          
>         vmovsd    (%rdi,%r15,8), %xmm2                          
>         movl      4(%rcx,%rbx,4), %r15d                         
>         vmovhpd   (%rdi,%r15,8), %xmm2, %xmm3                   
>         movl      8(%rcx,%rbx,4), %r15d                         
>         vfmadd231pd (%r10,%rbx,8), %xmm3, %xmm0                 
>         vmovsd    (%rdi,%r15,8), %xmm4                          
>         movl      12(%rcx,%rbx,4), %r15d                        
>         vmovhpd   (%rdi,%r15,8), %xmm4, %xmm5                   
>         movl      16(%rcx,%rbx,4), %r15d                        
>         vfmadd231pd 16(%r10,%rbx,8), %xmm5, %xmm1               
>         vmovsd    (%rdi,%r15,8), %xmm6                          
>         movl      20(%rcx,%rbx,4), %r15d                        
>         vmovhpd   (%rdi,%r15,8), %xmm6, %xmm7                   
>         movl      24(%rcx,%rbx,4), %r15d                        
>         vfmadd231pd 32(%r10,%rbx,8), %xmm7, %xmm0               
>         vmovsd    (%rdi,%r15,8), %xmm8                          
>         movl      28(%rcx,%rbx,4), %r15d                        
>         vmovhpd   (%rdi,%r15,8), %xmm8, %xmm9                   
>         vfmadd231pd 48(%r10,%rbx,8), %xmm9, %xmm1               
>         addq      $8, %rbx                                      
>         cmpq      %r14, %rbx                                    
>         jb        ..B1.14 
> 
> Is that something GCC could reasonably do?

GCC could choose a larger vectorization factor, yes.
The longer epilogue could then be vectorized again with
the same vector size.
Comment 10 Richard Sandiford 2019-01-16 10:09:47 UTC
FWIW, I agree that pure unrolling doesn't feel like a gimple-level
optimisation.  Whether it's a win or not depends on whether the
unrolled loop will make better use of the microarchitecture.
The problem isn't just that that's hard to decide at the gimple level,
but that the result can't be represented directly in gimple.  AIUI
there's no real significance to the schedule of gimple statements
(beyond ensuring valid SSA and functional correctness).

This is different from vectorisation and ivopts, which can represent
the benefit of the transformation directly in gimple (using vector
ops and TARGET_MEM_REFs respectively).

As Kyrill pointed out off-list, LLVM does the unrolling in the vectoriser
rather than a separate unrolling pass.  (Use -mllvm -print-after-all
to see this.)

I think for AArch64 we can view LDP and STP as 2-element vector loads
and stores that have zero-cost insertion and extraction.  So converting:

      ldr     x0, [...]
      add     x0, x0, 1
      str     x0, [...]

into:

      ldp     x0, x1, [...]
      add     x0, x0, 1
      add     x1, x1, 1
      stp     x0, x1, [...]

is IMO genuine vectorisation.  The LDPs and STPs are effectively
scalar IFN_LOAD_LANES and IFN_STORE_LANES, although we could also
represent them as single-element (V1) vector ops instead if that
seems more consistent.

Vectorising operations other than loads and stores would simply
involve duplicating the statements VF times.
Comment 11 ktkachov 2019-01-16 17:45:53 UTC
Thank you all for the input.

Just to add a bit of data.
I've instrumented 510.parest_r to count the number of loop iterations to get a feel for how much time is spent in the actual unrolled part rather than in the prologue/peeled part. Overall, the hot function itself is entered 290M times. The distribution of loop iteration counts is:

Frequency  Iterations:
92438870  36
87028560  54
20404571  24
17312960  62
14237184  72
13403904  108
7574437   102
7574420   70
5564881   40
4328249   64
4328240   46
3142656   48
2666496   124
1248176   8
1236641   16
1166592   204
1166592   140
1134392   4
 857088   80
 666624   92
 666624   128
 618320   30
 613056   1
 234464   2
 190464   32
  95232   60
  84476   20
  48272   10
   6896   5

So the two most common iteration counts are 36 and 54. For an 8x unrolled loop that's 4 and 6 iterations spent in the prologue (36 = 4 + 4x8, 54 = 6 + 6x8), with 4 and 6 trips around the 8x unrolled loop respectively.

As an experiment I hacked the AArch64 assembly of the function generated with -funroll-loops to replace the peeled prologue version with a simple non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms: >7%.

So beyond the vectorisation point Richard S. made above, maybe it's worth considering replacing the peeled prologue with a simple loop instead?
Or at least add that as a distinct unrolling strategy and work to come up with an analysis that would allow us to choose one over the other?
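
For illustration, a minimal sketch of the suggested strategy (a plain dot-product-style stand-in, not the parest loop): run the n % 8 leftover iterations in a simple scalar loop and keep a single 8x-unrolled main body, instead of jumping into peeled copies:

static double
unroll_with_loop_prologue (const double *a, const double *b, unsigned int n)
{
  double s = 0.;
  unsigned int i = 0;

  /* Prologue: plain loop for the n % 8 leftover iterations.  */
  for (; i < (n & 7); i++)
    s += a[i] * b[i];

  /* Main body: 8 iterations per trip, n / 8 trips.  */
  for (; i < n; i += 8)
    {
      s += a[i]     * b[i];
      s += a[i + 1] * b[i + 1];
      s += a[i + 2] * b[i + 2];
      s += a[i + 3] * b[i + 3];
      s += a[i + 4] * b[i + 4];
      s += a[i + 5] * b[i + 5];
      s += a[i + 6] * b[i + 6];
      s += a[i + 7] * b[i + 7];
    }
  return s;
}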
Comment 12 ktkachov 2019-01-16 18:28:53 UTC
(In reply to ktkachov from comment #11)
> 
> As an experiment I hacked the AArch64 assembly of the function generated
> with -funroll-loops to replace the peeled prologue version with a simple
> non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms:
> >7%.
> 
> So beyond the vectorisation point Richard S. made above, maybe it's worth
> considering replacing the peeled prologue with a simple loop instead?
> Or at least add that as a distinct unrolling strategy and work to come up
> with an analysis that would allow us to choose one over the other?

Upon reflection I think I may have bungled up the assembly hacking (the changes I made may not be equivalent to the source). I'll redo that experiment soon, so please disregard that part for now. The iteration count distribution numbers are still valid though.
Comment 13 rguenther@suse.de 2019-01-17 08:37:06 UTC
On Wed, 16 Jan 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #11 from ktkachov at gcc dot gnu.org ---
> Thank you all for the input.
> 
> Just to add a bit of data.
> I've instrumented 510.parest_r to count the number of loop iterations to get a
> feel for how much of the unrolled loop is spent in the actual unrolled part
> rather than the prologue/peeled part. Overall, the hot function itself is
> entered 290M times. The distribution of loop iteration counts is:
> 
> Frequency iter:
> 92438870  36
> 87028560  54
> 20404571  24
> 17312960  62
> 14237184  72
> 13403904  108
> 7574437   102
> 7574420   70
> 5564881   40
> 4328249   64
> 4328240   46
> 3142656   48
> 2666496   124
> 1248176   8
> 1236641   16
> 1166592   204
> 1166592   140
> 1134392   4
>  857088   80
>  666624   92
>  666624   128
>  618320   30
>  613056   1
>  234464   2
>  190464   32
>   95232   60
>   84476   20
>   48272   10
>    6896   5
> 
> So the two most common iteration counts are 36 and 54. For an 8x unrolled loop
> that's 4 and 6 iterations spent in the prologue with 4 and 6 times going around
> the 8x unrolled loop respectively.
> 
> As an experiment I hacked the AArch64 assembly of the function generated with
> -funroll-loops to replace the peeled prologue version with a simple
> non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms: >7%.
> 
> So beyond the vectorisation point Richard S. made above, maybe it's worth
> considering replacing the peeled prologue with a simple loop instead?
> Or at least add that as a distinct unrolling strategy and work to come up with
> an analysis that would allow us to choose one over the other?

Patches welcome ;)

Usually the peeling is done to improve branch prediction on the
prologue/epilogue.
Comment 14 Wilco 2019-01-17 13:24:55 UTC
(In reply to rguenther@suse.de from comment #13)

> Usually the peeling is done to improve branch prediction on the
> prologue/epilogue.

Modern branch predictors do much better on a loop than with this kind of code:

        ands    x12, x11, 7
        beq     .L70
        cmp     x12, 1
        beq     .L55
        cmp     x12, 2
        beq     .L57
        cmp     x12, 3
        beq     .L59
        cmp     x12, 4
        beq     .L61
        cmp     x12, 5
        beq     .L63
        cmp     x12, 6
        bne     .L72

That's way too many branches close together so most predictors will hit the maximum branches per fetch block limit and not predict them.

If you wanted to peel it would have to be like:

if (n & 4)
  do 4 iterations
if (n & 2)
  do 2 iterations
if (n & 1)
  do 1 iteration

However that's still way too much code explosion for little or no gain. The only case where this makes sense is in a handwritten memcpy.
Comment 15 rguenther@suse.de 2019-01-17 13:38:39 UTC
On Thu, 17 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #14 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #13)
> 
> > Usually the peeling is done to improve branch prediction on the
> > prologue/epilogue.
> 
> Modern branch predictors do much better on a loop than with this kind of code:
> 
>         ands    x12, x11, 7
>         beq     .L70
>         cmp     x12, 1
>         beq     .L55
>         cmp     x12, 2
>         beq     .L57
>         cmp     x12, 3
>         beq     .L59
>         cmp     x12, 4
>         beq     .L61
>         cmp     x12, 5
>         beq     .L63
>         cmp     x12, 6
>         bne     .L72
> 
> That's way too many branches close together so most predictors will hit the
> maximum branches per fetch block limit and not predict them.
> 
> If you wanted to peel it would have to be like:
> 
> if (n & 4)
>   do 4 iterations
> if (n & 2)
>   do 2 iterations
> if (n & 1)
>   do 1 iteration
> 
> However that's still way too much code explosion for little or no gain. The
> only case where this makes sense is in a handwritten memcpy.

Classic peeling would be

  do 1 iteration
  if (n == 1) exit
  do 1 iteration
  if (n == 2) exit
...

which is what I referred to for branch prediction.  Your & prompts me
to a way to do sth similar to Duff's device, turning the loop into a nest.

 head:
   if (n == 0) exit;
   <1 iteration>
   if (n & 1)
     n -= 1, goto head;
   <1 iteration>
   if (n & 2)
     n -= 2, goto head;
   <2 iterations>
   if (n & 4)
     n -= 4, goto head;
   <4 iterations>
   n -= 8, goto head;

the inner loop tests should become well-predicted quickly.

But as always - more branches make it more likely that one is
mispredicted.  For a single non-unrolled loop you usually only
get the actual exit mispredicted.
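
For concreteness, a C sketch of that nest (just a sketch, with a
dot-product-style body standing in for the <N iterations> blocks):

static double
nest_unroll (const double *a, const double *b, unsigned int n)
{
  double s = 0.;
  unsigned int i = 0;

head:
  if (n == 0)
    return s;
  s += a[i] * b[i]; i++;                  /* 1 iteration */
  if (n & 1) { n -= 1; goto head; }
  s += a[i] * b[i]; i++;                  /* 1 iteration */
  if (n & 2) { n -= 2; goto head; }
  s += a[i] * b[i]; i++;                  /* 2 iterations */
  s += a[i] * b[i]; i++;
  if (n & 4) { n -= 4; goto head; }
  s += a[i] * b[i]; i++;                  /* 4 iterations */
  s += a[i] * b[i]; i++;
  s += a[i] * b[i]; i++;
  s += a[i] * b[i]; i++;
  n -= 8; goto head;
}

Each pass subtracts exactly the number of iterations it executed, so the
total work is still n iterations.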
Comment 16 Wilco 2019-01-17 14:08:26 UTC
(In reply to rguenther@suse.de from comment #15)

> which is what I refered to for branch prediction.  Your & prompts me
> to a way to do sth similar as duffs device, turning the loop into a nest.
> 
>  head:
>    if (n == 0) exit;
>    <1 iteration>
>    if (n & 1)
>      n -= 1, goto head;
>    <1 iteration>
>    if (n & 2)
>      n -= 2, goto head;
>    <2 iterations>
>    if (n & 4)
>      n -= 4, goto head;
>    <4 iterations>
>    n -= 8, goto head;
> 
> the inner loop tests should become well-predicted quickly.
> 
> But as always - more branches make it more likely that one is
> mispredicted.  For a single non-unrolled loop you usually only
> get the actual exit mispredicted.

Yes, overlapping the branches for the tail loop and the main loop will result in more mispredictions. And there are still 4 branches for an 8x unrolled loop, blocking optimizations and scheduling. So Duff's device is always inefficient - the above loop is much faster like this:

if (n & 1)
 do 1 iteration
if (n & 2)
 do 2 iterations
if (n >= 4)
 do
  4 iterations
 while ((n -= 4) > 0)
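
A C sketch of that structure (a sketch only, with a dot-product-style body; the main loop runs on a separate multiple-of-4 count since the peeled iterations above do not decrement n):

static double
peel_then_unroll (const double *a, const double *b, unsigned int n)
{
  double s = 0.;
  unsigned int i = 0;

  if (n & 1)                           /* do 1 iteration */
    { s += a[i] * b[i]; i++; }
  if (n & 2)                           /* do 2 iterations */
    {
      s += a[i] * b[i];
      s += a[i + 1] * b[i + 1];
      i += 2;
    }

  unsigned int rem = n & ~3u;          /* multiple-of-4 trip count */
  if (rem)
    do
      {
        s += a[i]     * b[i];
        s += a[i + 1] * b[i + 1];
        s += a[i + 2] * b[i + 2];
        s += a[i + 3] * b[i + 3];
        i += 4;
      }
    while ((rem -= 4) > 0);
  return s;
}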
Comment 17 ktkachov 2019-01-24 12:22:24 UTC
I played around with the source to do some conservative 2x manual unrolling in the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA loops). This was to try out Richard's suggestion in #c10 about unrolling for forming load-pairs, and also to break the accumulator dependency.

So the initial testcase now became:
unsigned int *colnums;
double *val;

struct foostruct
{
  unsigned int rows;
  unsigned int *colnums;
  unsigned int *rowstart;
};

struct foostruct *cols;

void
foo (double * __restrict__ dst, const double *__restrict__ src)
{
  const unsigned int n_rows = cols->rows;
  const double *val_ptr = &val[cols->rowstart[0]];
  const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];  

  double *dst_ptr = dst;
  for (unsigned int row=0; row<n_rows; ++row)
    {
      double s = 0.;
      const double *const val_end_of_row = &val[cols->rowstart[row+1]];
      __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;

      if (diff & 1) // Peel the odd iteration.
        s += *val_ptr++ * src[*colnum_ptr++];

      double s1 = 0.; // Second accumulator
      while (val_ptr != val_end_of_row)
        {
          s += val_ptr[0] * src[colnum_ptr[0]];
          s1 += val_ptr[1] * src[colnum_ptr[1]];
          val_ptr += 2;
          colnum_ptr += 2;
        }
      *dst_ptr++ = s + s1;
    }
}

This transformed the initial loop from:
.L4:
        ldr     w3, [x7, x2, lsl 2]
        cmp     x6, x2
        ldr     d2, [x5, x2, lsl 3]
        add     x2, x2, 1
        ldr     d1, [x1, x3, lsl 3]
        fmadd   d0, d2, d1, d0
        bne     .L4

into:
.L5:
        ldp     w6, w5, [x3] // LDP
        add     x3, x3, 8
        ldp     d5, d3, [x2]  // LDP
        add     x2, x2, 16
        ldr     d4, [x1, x6, lsl 3]
        cmp     x4, x2
        ldr     d2, [x1, x5, lsl 3]
        fmadd   d0, d5, d4, d0
        fmadd   d1, d3, d2, d1
        bne     .L5

In parest itself a few of the loops transformed this way did not form LDPs due to unrelated LDP-forming inefficiencies, but most did.
This transformation gave a 3% improvement on a Cortex-A72. There are more similar loops in the 3rd, 4th and 5th hottest functions in that benchmark, so I suspect if we do something intelligent there as well we'll get even more sizeable gains.

So rather than solving general "unrolling", how about we break this down into two desirable transformations:
1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's suggestion)
2) Unrolling and breaking accumulator dependencies.

I think more general unrolling and the peeling associated with it can be discussed independently of 1) and 2) once we collect more data on more microarchitectures.
Comment 18 rguenther@suse.de 2019-01-24 12:29:01 UTC
On Thu, 24 Jan 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #17 from ktkachov at gcc dot gnu.org ---
> I played around with the source to do some conservative 2x manual unrolling in
> the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA
> loops). This was to try out Richard's thinking suggestion in #c10 about
> unrolling for forming load-pairs, and also to break the accumulator dependency.
> 
> So the initial testcase now became:
> unsigned int *colnums;
> double *val;
> 
> struct foostruct
> {
>   unsigned int rows;
>   unsigned int *colnums;
>   unsigned int *rowstart;
> };
> 
> struct foostruct *cols;
> 
> void
> foo (double * __restrict__ dst, const double *__restrict__ src)
> {
>   const unsigned int n_rows = cols->rows;
>   const double *val_ptr = &val[cols->rowstart[0]];
>   const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];  
> 
>   double *dst_ptr = dst;
>   for (unsigned int row=0; row<n_rows; ++row)
>     {
>       double s = 0.;
>       const double *const val_end_of_row = &val[cols->rowstart[row+1]];
>       __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;
> 
>       if (diff & 1) // Peel the odd iteration.
>         s += *val_ptr++ * src[*colnum_ptr++];
> 
>       double s1 = 0.; // Second accumulator
>       while (val_ptr != val_end_of_row)
>         {
>           s += val_ptr[0] * src[colnum_ptr[0]];
>           s1 += val_ptr[1] * src[colnum_ptr[1]];
>           val_ptr += 2;
>           colnum_ptr += 2;
>         }
>       *dst_ptr++ = s + s1;
>     }
> }
> 
> This transformed the initial loop from:
> .L4:
>         ldr     w3, [x7, x2, lsl 2]
>         cmp     x6, x2
>         ldr     d2, [x5, x2, lsl 3]
>         add     x2, x2, 1
>         ldr     d1, [x1, x3, lsl 3]
>         fmadd   d0, d2, d1, d0
>         bne     .L4
> 
> into:
> .L5:
>         ldp     w6, w5, [x3] // LDP
>         add     x3, x3, 8
>         ldp     d5, d3, [x2]  // LDP
>         add     x2, x2, 16
>         ldr     d4, [x1, x6, lsl 3]
>         cmp     x4, x2
>         ldr     d2, [x1, x5, lsl 3]
>         fmadd   d0, d5, d4, d0
>         fmadd   d1, d3, d2, d1
>         bne     .L5
> 
> In parest itself a few of the loops transformed this way did not form LDPs due
> to unrelated LDP-forming inefficiencies but the most did.
> This transformation gave a 3% improvement on a Cortex-A72. There are more
> similar loops in the 3rd, 4th and 5th hottest functions in that benchmark, so I
> suspect if we do something intelligent there as well we'll get even more
> sizeable gains.
> 
> So rather than solving general "unrolling", how about we break this down into
> two desirable transformations:
> 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> suggestion)

If that helps, sure (I'd have guessed uarchs are going to split
load-multiple into separate loads, but eventually it avoids
load-port contention?)

> 2) Unrolling and breaking accumulator dependencies.

IIRC RTL unrolling can do this (as side-effect, not as main
cost motivation) guarded with an extra switch.

> I think more general unrolling and the peeling associated with it can be
> discussed independently of 1) and 2) once we collect more data on more
> microarchitectures.

I think both of these can be "implemented" on the RTL unroller
side.
Comment 19 Wilco 2019-01-24 13:27:11 UTC
(In reply to rguenther@suse.de from comment #18)

> > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > suggestion)
> 
> If that helps, sure (I'd have guessed uarchs are going to split
> load-multiple into separate loads, but eventually it avoids
> load-port contention?)

Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a 128-bit LDP in a single cycle (see Optimization Guide).

> > 2) Unrolling and breaking accumulator dependencies.
> 
> IIRC RTL unrolling can do this (as side-effect, not as main
> cost motivation) guarded with an extra switch.
> 
> > I think more general unrolling and the peeling associated with it can be
> > discussed independently of 1) and 2) once we collect more data on more
> > microarchitectures.
> 
> I think both of these can be "implemented" on the RTL unroller
> side.

You still need dependence analysis, alias info, ivopt to run again. The goal is to remove the increment of the index, use efficient addressing modes (base+imm) and allow scheduling to move instructions between iterations. I don't believe the RTL unroller supports any of this today.
Comment 20 rguenther@suse.de 2019-01-24 13:40:56 UTC
On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #19 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #18)
> 
> > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > > suggestion)
> > 
> > If that helps, sure (I'd have guessed uarchs are going to split
> > load-multiple into separate loads, but eventually it avoids
> > load-port contention?)
> 
> Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a
> 128-bit LDP in a single cycle (see Optimization Guide).
> 
> > > 2) Unrolling and breaking accumulator dependencies.
> > 
> > IIRC RTL unrolling can do this (as side-effect, not as main
> > cost motivation) guarded with an extra switch.
> > 
> > > I think more general unrolling and the peeling associated with it can be
> > > discussed independently of 1) and 2) once we collect more data on more
> > > microarchitectures.
> > 
> > I think both of these can be "implemented" on the RTL unroller
> > side.
> 
> You still need dependence analysis, alias info, ivopt to run again. The goal is
> to remove the increment of the index, use efficient addressing modes (base+imm)
> and allow scheduling to move instructions between iterations. I don't believe
> the RTL unroller supports any of this today.

There's no way to encode load-multiple on GIMPLE that wouldn't be
awkward to other GIMPLE optimizers.

Yes, the RTL unroller supports scheduling (sched runs after unrolling)
and scheduling can do dependence analysis.  Yes, the RTL unroller
does _not_ do dependence analysis at the moment, so if you want to
know beforehand whether you can interleave iterations you have to
actually perform dependence analysis.  Of course you can do that
on RTL.  And of course you can do IVOPTs on RTL (yes, we don't do that
at the moment).

Note I'm not opposed to have IVOPTs on GIMPLE itself perform
unrolling (I know Bin was against this given IVOPTs is already
so comples) and a do accumulator breakup.  But I don't see how
to do the load-multiple thing (yes, you could represent it
as a vector load plus N element extracts on GIMPLE and it
would be easy to teach SLP vectorization to perform this
transform on its own if it is really profitable - which I
doubt you can reasonably argue before RA, let alone on GIMPLE).
Comment 21 Wilco 2019-01-24 14:18:19 UTC
(In reply to rguenther@suse.de from comment #20)
> On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > 
> > --- Comment #19 from Wilco <wilco at gcc dot gnu.org> ---
> > (In reply to rguenther@suse.de from comment #18)
> > 
> > > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > > > suggestion)
> > > 
> > > If that helps, sure (I'd have guessed uarchs are going to split
> > > load-multiple into separate loads, but eventually it avoids
> > > load-port contention?)
> > 
> > Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a
> > 128-bit LDP in a single cycle (see Optimization Guide).
> > 
> > > > 2) Unrolling and breaking accumulator dependencies.
> > > 
> > > IIRC RTL unrolling can do this (as side-effect, not as main
> > > cost motivation) guarded with an extra switch.
> > > 
> > > > I think more general unrolling and the peeling associated with it can be
> > > > discussed independently of 1) and 2) once we collect more data on more
> > > > microarchitectures.
> > > 
> > > I think both of these can be "implemented" on the RTL unroller
> > > side.
> > 
> > You still need dependence analysis, alias info, ivopt to run again. The goal is
> > to remove the increment of the index, use efficient addressing modes (base+imm)
> > and allow scheduling to move instructions between iterations. I don't believe
> > the RTL unroller supports any of this today.
> 
> There's no way to encode load-multiple on GIMPLE that wouldn't be
> awkward to other GIMPLE optimizers.

I don't think anyone want LDP/STP directly in GIMPLE - that doesn't seem useful. We don't even form LDP until quite late in RTL. The key to forming LDP/STP is using base+imm addressing modes and having correct alias info (so loads/stores from different iterations can be interleaved and then combined into LDP/STP). The main thing a backend would need to do is tune address costs to take future LDP formation into account (and yes, the existing cost models need to be improved anyway).

> Yes, the RTL unroller supports scheduling (sched runs after unrolling)
> and scheduling can do dependence analysis.  Yes, the RTL unroller
> does _not_ do dependence analysis at the moment, so if you want to
> know beforehand whether you can interleave iterations you have to
> actually perform dependence analysis.  Of course you can do that
> on RTL.  And of course you can do IVOPTs on RTL (yes, we don't do that
> at the moment).

Sure we *could* duplicate all high-level loop optimizations to work on RTL. However is that worth the effort given we have them already at tree level?

> Note I'm not opposed to have IVOPTs on GIMPLE itself perform
> unrolling (I know Bin was against this given IVOPTs is already
> so comples) and a do accumulator breakup.  But I don't see how
> to do the load-multiple thing (yes, you could represent it
> as a vector load plus N element extracts on GIMPLE and it
> would be easy to teach SLP vectorization to perform this
> transform on its own if it is really profitable - which I
> doubt you can reasonably argue before RA, let alone on GIMPLE).

Let's forget about load-multiple in GIMPLE. Kyrill's example shows that unrolling at the high level means the existing loop optimizations and analysis work as expected and we end up with good addressing modes, LDPs and interleaving of different iterations. With the existing RTL unroller this just isn't happening.
Comment 22 ktkachov 2019-01-24 17:54:08 UTC
Some more experiments...
Unrolling 4x in a similar way to my previous example and not splitting the accumulator (separate issue):

unsigned int *colnums;
double *val;

struct foostruct
{
  unsigned int rows;
  unsigned int *colnums;
  unsigned int *rowstart;
};

struct foostruct *cols;

void
foo (double * __restrict__ dst, const double *__restrict__ src)
{
  const unsigned int n_rows = cols->rows;
  const double *val_ptr = &val[cols->rowstart[0]];
  const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];  

  double *dst_ptr = dst;
  for (unsigned int row=0; row<n_rows; ++row)
    {
      double s = 0.;
      const double *const val_end_of_row = &val[cols->rowstart[row+1]];
      __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;
      if (diff & 1)
      {
        s += *val_ptr++ * src[*colnum_ptr++];
        diff--;
      }
      if (diff & 2)
        {
          s += val_ptr[0] * src[colnum_ptr[0]];
          s += val_ptr[1] * src[colnum_ptr[1]];
          val_ptr += 2;
          colnum_ptr += 2;
        }
      while (val_ptr != val_end_of_row)
        {
          s += val_ptr[0] * src[colnum_ptr[0]];
          s += val_ptr[1] * src[colnum_ptr[1]];
          s += val_ptr[2] * src[colnum_ptr[2]];
          s += val_ptr[3] * src[colnum_ptr[3]];
          val_ptr += 4;
          colnum_ptr += 4;
        }
      *dst_ptr++ = s;
    }
}

helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%) improvement on parest, and about 5.3% on a more aggressive CPU.
I tried unrolling 8x in a similar manner and that was not faster than 4x on either target.

Note that perf profiling shows that the loads are what's hot in these loops, not the FMAs themselves:
  4.41 │1b8:   ldp    w3, w4, [x0]
  5.85 │       ldp    d3, d4, [x2]
       │       add    x2, x2, #0x20
  3.79 │       ldur   d5, [x2, #-16]
  2.82 │       ldr    d0, [x1, x4, lsl #3]
  2.53 │       ldr    d2, [x1, x3, lsl #3]
  2.10 │       ldp    w4, w3, [x0, #8]
       │       add    x0, x0, #0x10
  0.00 │       cmp    x5, x0
       │       fmul   d0, d0, d4
  4.73 │       ldr    d4, [x1, x4, lsl #3]
       │       fmadd  d0, d3, d2, d0
  2.01 │       ldur   d3, [x2, #-8]
  2.54 │       ldr    d2, [x1, x3, lsl #3]
       │       fmadd  d0, d5, d4, d0
       │       fmadd  d0, d3, d2, d0
       │       fadd   d1, d1, d0
Comment 23 Wilco 2019-01-25 14:42:31 UTC
(In reply to ktkachov from comment #22)

> helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%)
> improvement on parest, and about 5.3% on a more aggressive CPU.
> I tried unrolling 8x in a similar manner and that was not faster than 4x on
> either target.

The 4x unrolled version has 19 instructions (and micro-ops) vs 7*4 = 28 for the non-unrolled version, a significant reduction (without LDP it would be 21 vs 28). There is potential to use 2 more LDPs and load+writeback addressing, which would make it 15 vs 28 instructions, so close to a 2x reduction in instruction count.
Comment 24 Segher Boessenkool 2019-09-25 16:40:47 UTC
On some (many?) targets it would be good to unroll all loops with a small body
(not containing calls etc.) at -O2 already, some small number of times (2 or 4
maybe).
Comment 25 Richard Earnshaw 2019-09-27 12:33:27 UTC
Well, very small loops should be unrolled more than slightly larger ones.  So perhaps if the loop body is only 3 or 4 instructions it should be unrolled four times, but above that perhaps only twice.

Some consideration should be given, however, to whether the target micro-architecture has loop buffers, as unrolling beyond the limit of such structures will hurt performance.
Comment 26 Segher Boessenkool 2019-09-27 16:54:15 UTC
Yeah, and it probably should be a param (that different targets can default
differently, per CPU probably).  On most Power CPUs all loops take a minimum
number of cycles per iteration (say, three), but that translates to a lot of
instructions (say, >10).

Small loops should probably be unrolled at -O2 already as well.  Maybe fixed
length loops only?
Comment 27 Wilco 2019-10-10 09:40:17 UTC
(In reply to Segher Boessenkool from comment #26)
> Yeah, and it probably should be a param (that different targets can default
> differently, per CPU probably).  On most Power CPUs all loops take a minimum
> number of cycles per iteration (say, three), but that translates to a lot of
> instructions (say, >10).
> 
> Small loops should probably be unrolled at -O2 already as well.  Maybe fixed
> length loops only?

Agreed, there is no reason not to unroll (or vectorize) with -O2 given several other compilers do (and beat GCC as a result). It's best to limit unrolling to small loops and only 2x unless it's 1-2 instructions.
Comment 28 Jiu Fu Guo 2019-10-11 02:52:10 UTC
For these kinds of small loops, it would be acceptable to unroll in GIMPLE, because register pressure and instruction cost may not be major concerns; just like the "cunroll" and "cunrolli" passes (complete unrolling), which are also done at -O2.
Comment 29 Wilco 2019-10-11 09:39:08 UTC
(In reply to Jiu Fu Guo from comment #28)
> For these kind of small loops, it would be acceptable to unroll in GIMPLE,
> because register pressure and instruction cost may not be major concerns;
> just  like "cunroll" and "cunrolli" passes (complete unroll) which also been
> done at O2.

Absolutely, unrolling is a high-level optimization like vectorization. Note the existing unroller doesn't care about register pressure or costs, but a high-level unroller can certainly take this into account by not unrolling as aggressively as the existing unroller does.
Comment 30 rguenther@suse.de 2019-10-11 10:29:37 UTC
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #29 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to Jiu Fu Guo from comment #28)
> > For these kind of small loops, it would be acceptable to unroll in GIMPLE,
> > because register pressure and instruction cost may not be major concerns;
> > just  like "cunroll" and "cunrolli" passes (complete unroll) which also been
> > done at O2.
> 
> Absolutely, unrolling is a high-level optimization like vectorization.

To expose ILP?  I'd call that low-level though ;)

If it exposes data reuse then I'd call it high-level - and at that level
we already have passes like predictive commoning or unroll-and-jam doing
exactly that.  Or vectorization.

We've shown through data that unrolling without a good idea of CPU
pipeline details is a loss on x86_64.  This further hints at it
being low-level.
Comment 31 Segher Boessenkool 2019-10-11 10:45:59 UTC
Gimple passes know a lot about machine details, too.

Irrespective of if this is "low-level" or "high-level", it should be done
earlier than it is now.  It should either be done right after expand, or
somewhere before it.  Doing it in gimple seems easiest?
Comment 32 Wilco 2019-10-11 10:50:20 UTC
(In reply to Segher Boessenkool from comment #31)
> Gimple passes know a lot about machine details, too.
> 
> Irrespective of if this is "low-level" or "high-level", it should be done
> earlier than it is now.  It should either be done right after expand, or
> somewhere before it.  Doing it in gimple seems easiest?

Expand is way too late. Unrolling is high-level because it relies on other high-level optimizations - for example to get addressing and induction variables right. Without that it's always a loss.
Comment 33 rguenther@suse.de 2019-10-11 10:58:18 UTC
On Fri, 11 Oct 2019, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #31 from Segher Boessenkool <segher at gcc dot gnu.org> ---
> Gimple passes know a lot about machine details, too.
> 
> Irrespective of if this is "low-level" or "high-level", it should be done
> earlier than it is now.  It should either be done right after expand, or
> somewhere before it.  Doing it in gimple seems easiest?

Sure it's easiest in GIMPLE.  But I'd say it shouldn't be done earlier,
it should be done correctly - the RTL unroller simply unrolls everything
without any good costing.  That is what needs to be fixed.

Other than that, where we do cunroll on GIMPLE is a good spot I guess
(need to do it _before_ IVOPTs at least, though I wanted to push that
one later as well).
Comment 34 Wilco 2019-10-11 11:20:02 UTC
(In reply to rguenther@suse.de from comment #30)
> On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > 
> > --- Comment #29 from Wilco <wilco at gcc dot gnu.org> ---
> > (In reply to Jiu Fu Guo from comment #28)
> > > For these kind of small loops, it would be acceptable to unroll in GIMPLE,
> > > because register pressure and instruction cost may not be major concerns;
> > > just  like "cunroll" and "cunrolli" passes (complete unroll) which also been
> > > done at O2.
> > 
> > Absolutely, unrolling is a high-level optimization like vectorization.
> 
> To expose ILP?  I'd call that low-level though ;)
> 
> If it exposes data reuse then I'd call it high-level - and at that level
> we already have passes like predictive commoning or unroll-and-jam doing
> exactly that.  Or vectorization.
> 
> We've shown though data that unrolling without a good idea on CPU
> pipeline details is a loss on x86_64.  This further hints at it
> being low-level.

That was using the RTL unroller and existing settings right? Those settings are not sane - is 16x unrolling still the default?!? If so, that doesn't surprise me at all.

Note also it's dangerous to assume x86 behaves exactly like Arm, POWER or other ISAs. Arm cores are very wide nowadays (6+ wide) and have supported 2 memory accesses per cycle (loads and/or stores) for years.
Comment 35 rguenther@suse.de 2019-10-11 11:43:02 UTC
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #34 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #30)
> > On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > > 
> > > --- Comment #29 from Wilco <wilco at gcc dot gnu.org> ---
> > > (In reply to Jiu Fu Guo from comment #28)
> > > > For these kind of small loops, it would be acceptable to unroll in GIMPLE,
> > > > because register pressure and instruction cost may not be major concerns;
> > > > just  like "cunroll" and "cunrolli" passes (complete unroll) which also been
> > > > done at O2.
> > > 
> > > Absolutely, unrolling is a high-level optimization like vectorization.
> > 
> > To expose ILP?  I'd call that low-level though ;)
> > 
> > If it exposes data reuse then I'd call it high-level - and at that level
> > we already have passes like predictive commoning or unroll-and-jam doing
> > exactly that.  Or vectorization.
> > 
> > We've shown though data that unrolling without a good idea on CPU
> > pipeline details is a loss on x86_64.  This further hints at it
> > being low-level.
> 
> That was using the RTL unroller and existing settings right? Those settings are
> not sane - is 16x unrolling still the default?!? If so, that doesn't surprise
> me at all.

We actually did measurements with restricting that to 2x, 3x, 4x, ...
plus selectively enabling some extra unroller features
(fsplit-ivs-in-unroller, fvariable-expansion-in-unroller).  Quite a big
testing matrix but then nothing was an obvious win.

> Note also it's dangerous to assume x86 behaves exactly like Arm, POWER 
> or other ISAs. Arm cores are very wide nowadays (6+ wide) and have 
> supported 2 memory accesses per cycle (loads and/or stores) for years.

Sure.
Comment 36 rguenther@suse.de 2019-10-11 11:45:31 UTC
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #32 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to Segher Boessenkool from comment #31)
> > Gimple passes know a lot about machine details, too.
> > 
> > Irrespective of if this is "low-level" or "high-level", it should be done
> > earlier than it is now.  It should either be done right after expand, or
> > somewhere before it.  Doing it in gimple seems easiest?
> 
> Expand is way too late. Unrolling is high-level because it relies on other
> high-level optimizations - for example to get addressing and induction
> variables right. Without that it's always a loss.

I know I've said it before but I really want to have a "GIMPLE unroller"
integrated with a pass that can compute unrolling benefit.  One candidate
was of course IVOPTs which, albeit already quite complex, has
register pressure estimates and should have an idea of how unrolling
affects them.

Computing the unrolling benefit on its own would require some pipeline
modeling on GIMPLE.  So the other obvious candidate would be
the RTL scheduler - here I think SMS does "unrolling".
Comment 37 Segher Boessenkool 2019-10-11 11:54:43 UTC
-- If it is done in RTL it should really be done earlier, it doesn't get all
optimisations it should right now.

-- Unrolling small loops more aggressively (at -O2 even) perhaps needs to be
done at a different spot in the pipeline then?  It needs no cost estimate
(other than some simple params).
Comment 38 rguenther@suse.de 2019-10-11 12:06:21 UTC
On Fri, 11 Oct 2019, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #37 from Segher Boessenkool <segher at gcc dot gnu.org> ---
> -- If it is done in RTL it should really be done earlier, it doesn't get all
> optimisations it should right now.
> 
> -- Unrolling small loops more aggressively (at -O2 even) perhaps needs to be
> done at a different spot in the pipeline then?  It needs no cost estimate
> (other than some simple params).

Well.  Even unrolling a two-stmt loop just once can be detrimental on
x86 CPUs, and whether it is beneficial depends a lot on code alignment,
which we cannot really track well w/o introducing artificial code
padding everywhere.  But yes, this can be turned off easily on x86;
it's just that the heuristics are not as simple as you might think.
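
To make the "two-stmt loop" case concrete, here is a hand-written C sketch (function names and bodies are made up for illustration, not taken from any benchmark): the original loop, and what unrolling it just once - i.e. by a factor of 2 with a scalar remainder - corresponds to.
```
/* A hypothetical loop with two statements in its body.  */
void
two_stmt (double *a, double *c, const double *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      a[i] += b[i];
      c[i] -= b[i];
    }
}

/* The same loop unrolled 2x by hand: a doubled body plus a scalar
   remainder iteration for odd n.  Whether this pays off depends on
   the points about issue width and code alignment discussed above.  */
void
two_stmt_unroll2 (double *a, double *c, const double *b, int n)
{
  int i = 0;
  for (; i + 1 < n; i += 2)
    {
      a[i] += b[i];
      c[i] -= b[i];
      a[i + 1] += b[i + 1];
      c[i + 1] -= b[i + 1];
    }
  for (; i < n; i++)
    {
      a[i] += b[i];
      c[i] -= b[i];
    }
}
```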
Comment 39 Jiu Fu Guo 2019-10-12 09:26:46 UTC
For a small loop (1-2 stmts), in GIMPLE and RTL form it would be around 5-10 instructions: 2-4 insns per stmt, plus ~4 insns for the index handling.

With the current unroller, here are some statistics on SPEC2017.
Using --param max-unrolled-insns=12, ~3000 small loops could be unrolled completely, and ~40 of these small loops are located in hot functions.

Using --param max-unrolled-insns=16, ~11000 small loops could be unrolled completely, and ~230 of these small loops are located in hot functions.

Using --param max-unrolled-insns=20, ~15000 small loops could be unrolled completely, and ~570 of these small loops are located in hot functions.

Using --param max-unrolled-insns=24, ~18000 small loops could be unrolled completely, and ~680 of these small loops are located in hot functions.


If max-unrolled-insns < 16, only a few small loops in hot functions get unrolled, so this may not be very valuable.
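
For reference, a hypothetical example of the kind of one-statement loop these counts are about (the name and body are made up; it mirrors the "similar C code" shown later in comment 42):
```
/* One statement in the body: a load, a subtract and a store, plus the
   index increment, compare and branch - roughly the 5-10 insn range
   mentioned above, so a 2x unroll should still fit well under
   --param max-unrolled-insns=20.  */
void
small_loop (int *arr, int n)
{
  for (int i = 0; i < n; i++)
    arr[i] -= 10;
}
```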
Comment 40 rguenther@suse.de 2019-10-14 10:47:53 UTC
On Sat, 12 Oct 2019, guojiufu at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #39 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
> For small loop (1-2 stmts), in forms of GIMPLE and RTL, it would be around 5-10
> instructions: 2-4 insns per stmt, ~4 insns for idx.
> 
> With current unroller, here is a statistic on spec2017. 
> Using --param max-unrolled-insns=12, there are ~3000 small loops could be
> unrolled totally, and ~40 of these small loops are located in hot-functions.
> 
> Using --param max-unrolled-insns=16, there are ~11000 small loops could be
> unrolled totally, and ~230 of these small loops are located in hot-functions.
> 
> Using --param max-unrolled-insns=20, there are ~15000 small loops could be
> unrolled totally, and ~570 of these small loops are located in hot-functions.
> 
> Using --param max-unrolled-insns=24, there are ~18000 small loops could be
> unrolled totally, and ~680 of these small loops are located in hot-functions.
> 
> 
> if max-unrolled-insns<16, just few small loops are unrolled for hot-functions;
> it may be not very valuable.

So a limit of 12, with two-times unrolling, already means a 6-insn loop
body; minus the IV update and compare-and-branch (assuming a single
pattern) that's 4 insns.  On GIMPLE I'd already call this large, since
eventual memory loads and stores would be separate - so there it would
be ~16 instead of 12.

I think the better approach is to identify the cases where unrolling
would help, and on which (sub-)architectures, and prepare testcases
for them.

I guess the times when our default unroll factor of 8 (if it fits the
size limits) was a good idea are long gone; I'd expect ILP
to stop improving much earlier (depending on the set of operations).
For ILP you also want to interleave the unrolled iterations,
so I point to SMS again here (SMS suffers from the fact that
loop dependence info is weak on RTL, but it does use the scheduler
model of the target).
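
To illustrate what interleaving the unrolled iterations buys for a reduction, here is a hand-written C sketch of 2x unrolling with the accumulator split into two partial sums (the kind of transformation -fvariable-expansion-in-unroller is meant to do; the reassociation is only valid under -Ofast/-ffast-math-style rules):
```
double
dot (const double *a, const double *b, int n)
{
  double s0 = 0.0, s1 = 0.0;
  int i = 0;

  /* Two independent accumulator chains: the two multiply-adds per
     unrolled iteration no longer serialise on a single accumulator,
     so a wide core can issue them in parallel.  */
  for (; i + 1 < n; i += 2)
    {
      s0 += a[i] * b[i];
      s1 += a[i + 1] * b[i + 1];
    }

  double s = s0 + s1;
  for (; i < n; i++)   /* scalar remainder */
    s += a[i] * b[i];
  return s;
}
```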
Comment 41 Jiu Fu Guo 2019-10-22 09:47:48 UTC
For this code:

  subroutine foo (i, i1, block)
    integer :: i, i1
    integer :: block(9, 9, 9)
    block(i:9,1,i1) = block(i:9,1,i1) - 10
  end subroutine foo

"-funroll-loops  --param max-unroll-times=2 --param max-unrolled-insns=20" could help to improve some run time.(~10% on ppcle)

main:
  do n = 0, N
     do i = 1, 9
        do j = 1, 9                                                                                                               
           call foo (i, j, block)
        end do
     end do
  end do
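
For readers more used to C, a rough C equivalent of the subroutine above (assuming Fortran's column-major layout and 1-based indices; the function name is made up) shows the inner loop the unroller actually sees:
```
/* block(i:9, 1, i1) in the Fortran above corresponds to
   block[i1-1][0][i-1 .. 8] with a C array of the same shape.  */
void
foo_c (int i, int i1, int block[9][9][9])
{
  for (int k = i - 1; k < 9; k++)
    block[i1 - 1][0][k] -= 10;
}
```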
Comment 42 Jiu Fu Guo 2019-10-28 05:23:56 UTC
Author: guojiufu
Date: Mon Oct 28 05:23:24 2019
New Revision: 277501

URL: https://gcc.gnu.org/viewcvs?rev=277501&root=gcc&view=rev
Log:
rs6000: Enable limited unrolling at -O2

In PR88760, there is some discussion about improving or tuning the
unroller for targets, and there was agreement to enable the unroller for
small loops at O2 first.  We see a performance improvement (~10%) for the
code below:
```
  subroutine foo (i, i1, block)
    integer :: i, i1
    integer :: block(9, 9, 9)
    block(i:9,1,i1) = block(i:9,1,i1) - 10
  end subroutine foo

```
This kind of code occurs a few times in the exchange2 benchmark.

Similar C code:
```
  for (i = 0; i < n; i++)
    arr[i] = arr[i] - 10;
```

On powerpcle, at O2, enabling -funroll-loops and limiting
PARAM_MAX_UNROLL_TIMES to 2 and PARAM_MAX_UNROLLED_INSNS to 20, we see a
>2% overall improvement on SPEC2017.

This patch is only for rs6000, where we see a visible performance improvement.

gcc/
2019-10-25  Jiufu Guo  <guojiufu@linux.ibm.com>	    

	PR tree-optimization/88760
	* config/rs6000/rs6000-common.c (rs6000_option_optimization_table):
	Enable -funroll-loops for -O2 and above.
	* config/rs6000/rs6000.c (rs6000_option_override_internal): Set
	PARAM_MAX_UNROLL_TIMES to 2 and PARAM_MAX_UNROLLED_INSNS to 20, and
	do not turn on web and rngreg implicitly, if the unroller is not
	explicitly enabled.
	
gcc.testsuite/
2019-10-25  Jiufu Guo  <guojiufu@linux.ibm.com>

	PR tree-optimization/88760
	* gcc.target/powerpc/small-loop-unroll.c: New test.
	* c-c++-common/tsan/thread_leak2.c: Update test.
	* gcc.dg/pr59643.c: Update test.
	* gcc.target/powerpc/loop_align.c: Update test.
	* gcc.target/powerpc/ppc-fma-1.c: Update test.
	* gcc.target/powerpc/ppc-fma-2.c: Update test.
	* gcc.target/powerpc/ppc-fma-3.c: Update test.
	* gcc.target/powerpc/ppc-fma-4.c: Update test.
	* gcc.target/powerpc/pr78604.c: Update test.

Added:
    trunk/gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/common/config/rs6000/rs6000-common.c
    trunk/gcc/config/rs6000/rs6000.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/c-c++-common/tsan/thread_leak2.c
    trunk/gcc/testsuite/gcc.dg/pr59643.c
    trunk/gcc/testsuite/gcc.target/powerpc/loop_align.c
    trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c
    trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c
    trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c
    trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c
    trunk/gcc/testsuite/gcc.target/powerpc/pr78604.c
Comment 43 Jiu Fu Guo 2019-11-11 06:31:09 UTC
Author: guojiufu
Date: Mon Nov 11 06:30:38 2019
New Revision: 278034

URL: https://gcc.gnu.org/viewcvs?rev=278034&root=gcc&view=rev
Log:

rs6000: Refine small loop unroll in loop_unroll_adjust hook

In this patch, a loop unroll adjust hook is introduced for powerpc, so
that target-related heuristic adjustments can be done in this hook.
-funroll-loops is enabled for small loops at O2 and above, with an
option -munroll-only-small-loops to guard the small-loop unrolling, and
it works fine with -flto.


gcc/
2019-11-11  Jiufu Guo  <guojiufu@linux.ibm.com>

	PR tree-optimization/88760
	* gcc/config/rs6000/rs6000.opt (-munroll-only-small-loops): New option.
	* gcc/common/config/rs6000/rs6000-common.c
	(rs6000_option_optimization_table) [OPT_LEVELS_2_PLUS_SPEED_ONLY]:
	Turn on -funroll-loops and -munroll-only-small-loops.
	[OPT_LEVELS_ALL]: Turn off -fweb and -frename-registers.
	* config/rs6000/rs6000.c (rs6000_option_override_internal): Remove
	set of PARAM_MAX_UNROLL_TIMES and PARAM_MAX_UNROLLED_INSNS.
	Turn off -munroll-only-small-loops for explicit -funroll-loops.
	(TARGET_LOOP_UNROLL_ADJUST): Add loop unroll adjust hook.
	(rs6000_loop_unroll_adjust): Define it.  Use -munroll-only-small-loops.

gcc.testsuite/
2019-11-11  Jiufu Guo  <guojiufu@linux.ibm.com>

	PR tree-optimization/88760
	* gcc.dg/pr59643.c: Update back to r277550.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/common/config/rs6000/rs6000-common.c
    trunk/gcc/config/rs6000/rs6000.c
    trunk/gcc/config/rs6000/rs6000.opt
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/pr59643.c