This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Useless statement in loop latch looks like performance regression


On Fri, Jan 10, 2014 at 2:17 PM, Paulo Matos <pmatos@broadcom.com> wrote:
> Hello,
>
> I have been investigating a few loops and while looking at:
> void matrix_mul_vect(unsigned int N, int *C, short *A, short *B) {
>  unsigned int i,j;
>  for (i=0; i<N; i++) {
>   C[i]=0;
>   for (j=0; j<N; j++) {
>    C[i]+=A[i*N+j] * B[j];
>   }
>  }
> }
>
> I noticed GCC is generating worst code now (trunk and 4.8) than in 4.5.
> Using x86_64 as an example (compiled with O2 and gcc 4.8):
> matrix_mul_vect:
> .LFB0:
>         .cfi_startproc
>         pushq   %rbp
>         .cfi_def_cfa_offset 16
>         .cfi_offset 6, -16
>         xorl    %eax, %eax
>         xorl    %ebp, %ebp
>         testl   %edi, %edi
>         pushq   %rbx
>         .cfi_def_cfa_offset 24
>         .cfi_offset 3, -24
>         je      .L1
>         .p2align 4,,10
>         .p2align 3
> .L7:
>         movq    %rcx, %r9
>         leal    (%rax,%rdi), %ebx
>         xorl    %r10d, %r10d
>         jmp     .L4
>         .p2align 4,,10
>         .p2align 3
> .L10:
>         movl    %r8d, %r10d
> .L4:
>         movl    %eax, %r8d
>         movswl  (%r9), %r11d
>         addl    $1, %eax
>         movswl  (%rdx,%r8,2), %r8d
>         addq    $2, %r9
>         imull   %r11d, %r8d
>         addl    %r10d, %r8d
>         cmpl    %ebx, %eax
>         jne     .L10
>         movl    %r8d, (%rsi,%rbp,4)
>         addq    $1, %rbp
>         cmpl    %ebp, %edi
>         ja      .L7
> .L1:
>         popq    %rbx
>         .cfi_def_cfa_offset 16
>         popq    %rbp
>         .cfi_def_cfa_offset 8
>         ret
>         .cfi_endproc
> .LFE0:
>
>
> For 4.5:
> matrix_mul_vect:
> .LFB0:
>         .cfi_startproc
>         pushq   %r13
>         .cfi_def_cfa_offset 16
>         testl   %edi, %edi
>         pushq   %r12
>         .cfi_def_cfa_offset 24
>         pushq   %rbp
>         .cfi_def_cfa_offset 32
>         pushq   %rbx
>         .cfi_def_cfa_offset 40
>         je      .L1
>         .cfi_offset 3, -40
>         .cfi_offset 6, -32
>         .cfi_offset 12, -24
>         .cfi_offset 13, -16
>         leal    -1(%rdi), %ebx
>         xorl    %r12d, %r12d
>         xorl    %ebp, %ebp
>         addq    $1, %rbx
>         leaq    0(,%rbx,4), %r13
>         addq    %rbx, %rbx
>         .p2align 4,,10
>         .p2align 3
> .L4:
>         movl    $0, (%rsi,%rbp)
>         movl    %r12d, %r9d
>         xorl    %eax, %eax
>         xorl    %r10d, %r10d
>         .p2align 4,,10
>         .p2align 3
> .L3:
>         mov     %r9d, %r8d
>         movswl  (%rcx,%rax), %r11d
>         addq    $2, %rax
>         movswl  (%rdx,%r8,2), %r8d
>         addl    $1, %r9d
>         imull   %r11d, %r8d
>         addl    %r8d, %r10d
>         cmpq    %rbx, %rax
>         jne     .L3
>         movl    %r10d, (%rsi,%rbp)
>         addq    $4, %rbp
>         addl    %edi, %r12d
>         cmpq    %r13, %rbp
>         jne     .L4
> .L1:
>         popq    %rbx
>         .cfi_def_cfa_offset 32
>         popq    %rbp
>         .cfi_def_cfa_offset 24
>         popq    %r12
>         .cfi_def_cfa_offset 16
>         popq    %r13
>         .cfi_def_cfa_offset 8
>         ret
>         .cfi_endproc
> .LFE0:
>
>
> The problem seems to be, at least partially, due to the creation of an extra basic block that ends up being a loop latch with a useless insn. This can be seen in the code output by gcc 4.8, label .L10.
> Before register allocation blocks 4 and 5 look like (gcc 4.8):
> ;; basic block 4, loop depth 2, count 0, freq 9100, maybe hot
> ;;  prev block 3, next block 5, flags: (RTL, MODIFIED)
> ;;  pred:       3 [100.0%]  (FALLTHRU)
> ;;              5 [100.0%]  (DFS_BACK)
> ;; bb 4 artificial_defs: { }
> ;; bb 4 artificial_uses: { u18(6){ }u19(7){ }u20(16){ }u21(20){ }}
> ;; lr  in        6 [bp] 7 [sp] 16 [argp] 20 [frame] 85 91 97 98 101 102 103 104 105
> ;; lr  use       6 [bp] 7 [sp] 16 [argp] 20 [frame] 85 97 98 101 104
> ;; lr  def       17 [flags] 85 96 97 106 108 109
> (code_label 61 36 50 4 4 "" [1 uses])
> (note 50 61 51 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 51 50 52 4 (set (reg:DI 106 [ D.3277 ])
>         (zero_extend:DI (reg:SI 85 [ ivtmp.7 ]))) coremark.c:142 139 {*zero_extendsidi2_rex64}
>      (nil))
> (insn 52 51 53 4 (set (reg:SI 109 [ D.3280 ])
>         (sign_extend:SI (mem:HI (plus:DI (mult:DI (reg:DI 106 [ D.3277 ])
>                         (const_int 2 [0x2]))
>                     (reg/v/f:DI 104 [ A ])) [4 *_18+0 S2 A16]))) coremark.c:142 153 {extendhisi2}
>      (expr_list:REG_DEAD (reg:DI 106 [ D.3277 ])
>         (nil)))
> (insn 53 52 54 4 (set (reg:SI 108 [ D.3280 ])
>         (sign_extend:SI (mem:HI (reg:DI 97 [ ivtmp.8 ]) [4 MEM[base: _63, offset: 0B]+0 S2 A16]))) coremark.c:142 153 {extendhisi2}
>      (nil))
> (insn 54 53 55 4 (parallel [
>             (set (reg:SI 109 [ D.3280 ])
>                 (mult:SI (reg:SI 109 [ D.3280 ])
>                     (reg:SI 108 [ D.3280 ])))
>             (clobber (reg:CC 17 flags))
>         ]) coremark.c:142 353 {*mulsi3_1}
>      (expr_list:REG_DEAD (reg:SI 108 [ D.3280 ])
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> (insn 55 54 56 4 (parallel [
>             (set (reg:SI 96 [ D.3280 ])
>                 (plus:SI (reg:SI 109 [ D.3280 ])
>                     (reg:SI 101 [ D.3283 ])))
>             (clobber (reg:CC 17 flags))
>         ]) coremark.c:142 273 {*addsi_1}
>      (expr_list:REG_DEAD (reg:SI 109 [ D.3280 ])
>         (expr_list:REG_DEAD (reg:SI 101 [ D.3283 ])
>             (expr_list:REG_UNUSED (reg:CC 17 flags)
>                 (nil)))))
> (insn 56 55 57 4 (parallel [
>             (set (reg:SI 85 [ ivtmp.7 ])
>                 (plus:SI (reg:SI 85 [ ivtmp.7 ])
>                     (const_int 1 [0x1])))
>             (clobber (reg:CC 17 flags))
>         ]) 273 {*addsi_1}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (insn 57 56 58 4 (parallel [
>             (set (reg:DI 97 [ ivtmp.8 ])
>                 (plus:DI (reg:DI 97 [ ivtmp.8 ])
>                     (const_int 2 [0x2])))
>             (clobber (reg:CC 17 flags))
>         ]) 274 {*adddi_1}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (insn 58 57 59 4 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg:SI 85 [ ivtmp.7 ])
>             (reg:SI 98 [ D.3281 ]))) coremark.c:141 7 {*cmpsi_1}
>      (nil))
> (jump_insn 59 58 60 4 (set (pc)
>         (if_then_else (eq (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref 64)
>             (pc))) coremark.c:141 621 {*jcc_1}
>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>         (expr_list:REG_BR_PROB (const_int 900 [0x384])
>             (nil)))
>  -> 64)
> ;;  succ:       5 [91.0%]  (FALLTHRU)
> ;;              6 [9.0%]  (LOOP_EXIT)
> ;; lr  out       6 [bp] 7 [sp] 16 [argp] 20 [frame] 85 91 96 97 98 102 103 104 105
>
> ;; basic block 5, loop depth 2, count 0, freq 8281, maybe hot
> ;;  prev block 4, next block 6, flags: (RTL, MODIFIED)
> ;;  pred:       4 [91.0%]  (FALLTHRU)
> ;; bb 5 artificial_defs: { }
> ;; bb 5 artificial_uses: { u35(6){ }u36(7){ }u37(16){ }u38(20){ }}
> ;; lr  in        6 [bp] 7 [sp] 16 [argp] 20 [frame] 85 91 96 97 98 102 103 104 105
> ;; lr  use       6 [bp] 7 [sp] 16 [argp] 20 [frame] 96
> ;; lr  def       101
> (note 60 59 34 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
> (insn 34 60 84 5 (set (reg:SI 101 [ D.3283 ])
>         (reg:SI 96 [ D.3280 ])) coremark.c:141 89 {*movsi_internal}
>      (expr_list:REG_DEAD (reg:SI 96 [ D.3280 ])
>         (nil)))
> (jump_insn 84 34 85 5 (set (pc)
>         (label_ref 61)) 659 {jump}
>      (nil)
>  -> 61)
> ;;  succ:       4 [100.0%]  (DFS_BACK)
> ;; lr  out       6 [bp] 7 [sp] 16 [argp] 20 [frame] 85 91 97 98 101 102 103 104 105
>
> However, registers 96 and 101 only show up in block 4 here:
> (insn 55 54 56 4 (parallel [
>             (set (reg:SI 96 [ D.3280 ])
>                 (plus:SI (reg:SI 109 [ D.3280 ])
>                     (reg:SI 101 [ D.3283 ])))
>             (clobber (reg:CC 17 flags))
>         ]) coremark.c:142 273 {*addsi_1}
>      (expr_list:REG_DEAD (reg:SI 109 [ D.3280 ])
>         (expr_list:REG_DEAD (reg:SI 101 [ D.3283 ])
>             (expr_list:REG_UNUSED (reg:CC 17 flags)
>                 (nil)))))
>
> I don't really understand why GCC is generating this extra basic block however it looks like a regression. Block 5 could disappear and insn 55 rewritten as (replaced reg101 by reg96):
>
>             (set (reg:SI 96 [ D.3280 ])
>                 (plus:SI (reg:SI 109 [ D.3280 ])
>                     (reg:SI 96 [ D.3283 ])))
>
> It is worse on my port due to several technical constraints. For example, zero overhead loops can only be generated if there are no internal branches, which means that any loop without an empty latch can't be transformed into a zero overhead loop, which is the case with this example. So, loops with empty latches are desired as much as possible. Once GCC starts creating code like this performance on my port drops drastically.
>
> What are your thoughts?
>
> Note: I just noticed this doesn't happen in trunk anymore. Any idea of what went into trunk to fix this?

Most likely changes to SSA coalescing at out-of-SSA time like

2013-09-26  Richard Biener  <rguenther@suse.de>

        * tree-ssa-live.c (var_map_base_init): Handle SSA names with
        DECL_IGNORED_P base VAR_DECLs like anonymous SSA names.
        (loe_visit_block): Use gcc_checking_assert.
        * tree-ssa-coalesce.c (create_outofssa_var_map): Use
        gimple_assign_ssa_name_copy_p.
        (gimple_can_coalesce_p): Adjust according to the var_map_base_init
        change.

and an earlier patch by Jeff Law.

Richard.

> Paulo Matos
>
>


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]