[Bug tree-optimization/88492] SLP optimization generates ugly code

Fri Jul 12 11:02:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Hao Liu <hliu at amperecomputing dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hliu at amperecomputing dot com

--- Comment #4 from Hao Liu <hliu at amperecomputing dot com> ---
It seems Richard Biener's patch (r272843) removes the redundant load/store.
The commit message for r272843 is as follows:
>     2019-07-01  Richard Biener  <rguenther@suse.de>
>    
>            * tree-ssa-sccvn.c (class pass_fre): Add may_iterate
>            pass parameter.
>            (pass_fre::execute): Honor it.
>            * passes.def: Adjust pass_fre invocations to allow iterating,
>            add non-iterating pass_fre before late threading/dom.
>    
>            * gcc.dg/tree-ssa/pr77445-2.c: Adjust.

Compiling Jiangning's test case with "gcc -O3" now generates the following code:

  test_slp:
  .LFB0:
        .cfi_startproc
        adrp    x1, .LC0
        ldr     q0, [x0]
        ldr     q1, [x1, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b}, v1.16b
        uxtl    v1.8h, v0.8b
        uxtl2   v0.8h, v0.16b
        uxtl    v4.4s, v1.4h
        uxtl    v2.4s, v0.4h
        uxtl2   v0.4s, v0.8h
        uxtl2   v1.4s, v1.8h
        dup     s21, v4.s[0]
        dup     s22, v2.s[1]
        dup     s3, v0.s[1]
        dup     s6, v1.s[0]
        dup     s23, v4.s[1]
        dup     s16, v2.s[0]
        add     v3.2s, v3.2s, v22.2s
        dup     s20, v0.s[0]
        dup     s17, v1.s[1]
        dup     s5, v0.s[2]
        fmov    w0, s3
        add     v3.2s, v6.2s, v21.2s
        dup     s19, v2.s[2]
        add     v17.2s, v17.2s, v23.2s
        dup     s7, v4.s[2]
        fmov    w1, s3
        add     v3.2s, v16.2s, v20.2s
        dup     s18, v1.s[2]
        fmov    w3, s17
        dup     s2, v2.s[3]
        fmov    w2, s3
        add     v3.2s, v5.2s, v19.2s
        dup     s0, v0.s[3]
        dup     s4, v4.s[3]
        add     w0, w0, w3
        dup     s1, v1.s[3]
        fmov    w3, s3
        add     v3.2s, v7.2s, v18.2s
        add     v0.2s, v2.2s, v0.2s
        add     w1, w1, w2
        add     w0, w0, w1
        fmov    w2, s3
        add     w3, w3, w2
        fmov    w2, s0
        add     v0.2s, v1.2s, v4.2s
        add     w0, w0, w3
        fmov    w1, s0
        add     w1, w2, w1
        add     w0, w0, w1
        ret

Although SLP still generates SIMD code, it looks much better than the previous
code with memory loads/stores. Performance is expected to be better, as the
redundant loads/stores are gone.
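
For readers without the attachment at hand, the sketch below shows the shape of
input that leads to code like the above. It is a hypothetical reduction, not
the actual test case from this bug: only the function name test_slp comes from
the assembly, and the temporary tmp and the column order are assumptions. The
function widens 16 bytes into a 4x4 temporary with the middle two columns
swapped (which would correspond to the tbl permute) and then sums every
element. Stores to and reloads from a temporary like tmp are the kind of
memory traffic that can become redundant once the loops are vectorized.

```c
/* Hypothetical reduction with the same shape as the discussed test case;
   NOT the source attached to the bug.  The name tmp and the column
   order are assumptions made for illustration.  */
int
test_slp (unsigned char *b)
{
  unsigned int tmp[4][4];
  int sum = 0;

  /* Widen 16 bytes into a 4x4 temporary, swapping columns 1 and 2.  */
  for (int i = 0; i < 4; i++, b += 4)
    {
      tmp[i][0] = b[0];
      tmp[i][2] = b[1];
      tmp[i][1] = b[2];
      tmp[i][3] = b[3];
    }

  /* Sum all 16 widened elements, one column at a time.  */
  for (int i = 0; i < 4; i++)
    sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];

  return sum;
}
```

Since every element is summed, the result does not depend on the permutation;
the column swap only affects how the vectorizer has to shuffle the loaded bytes.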