int x[4];
void foo(int i)
{
  i = (i+1) & 31;
  x[0] = (x[0] << i) + i;
  x[1] = (x[1] << i) + i;
  x[2] = (x[2] << i) + i;
  x[3] = (x[3] << i) + i;
}

For this testcase we're not correctly assessing that we leave the scalar
computation for (i+1) & 31 around for the generated vectorized shift by a
scalar argument.  Since we're in need of the vector of { i, ... } for the
vectorized add we ideally should simply use lane zero for the shift operand
rather than the original SSA name as we do.  Currently:

  _22 = {i_14(D), i_14(D), i_14(D), i_14(D)};
  vect_cst__23 = _22;
  vect_cst__24 = { 1, 1, 1, 1 };
  vect__1.6_25 = vect_cst__23 + vect_cst__24;
  vect_cst__26 = { 31, 31, 31, 31 };
  vect_i_15.7_27 = vect__1.6_25 & vect_cst__26;
  _1 = i_14(D) + 1;
  i_15 = _1 & 31;
  vect__2.5_21 = MEM <vector(4) int> [(int *)&x];
  vect__3.8_28 = vect__2.5_21 << i_15;
  vect__4.9_29 = vect__3.8_28 + vect_i_15.7_27;
  MEM <vector(4) int> [(int *)&x] = vect__4.9_29;
  return;

where arguably we should have done the (i+1) & 31 compute with scalars and
broadcast the result.  Assembly:

foo:
.LFB0:
        .cfi_startproc
        leal    1(%rdi), %eax
        movdqa  x(%rip), %xmm1
        movd    %edi, %xmm3
        andl    $31, %eax
        pshufd  $0, %xmm3, %xmm0
        paddd   .LC0(%rip), %xmm0
        movq    %rax, %xmm2
        pand    .LC1(%rip), %xmm0
        pslld   %xmm2, %xmm1
        paddd   %xmm1, %xmm0
        movaps  %xmm0, x(%rip)
        ret

which is really bad.  Even with that optimization simulated by using 'i' as
provided by the function argument we generate

foo:
.LFB0:
        .cfi_startproc
        movdqa  x(%rip), %xmm1
        movslq  %edi, %rax
        movd    %edi, %xmm3
        movq    %rax, %xmm2
        pshufd  $0, %xmm3, %xmm0
        pslld   %xmm2, %xmm1
        paddd   %xmm1, %xmm0
        movaps  %xmm0, x(%rip)
        ret

so RTL optimizations do not recover here either.  combine sees

(insn 8 3 9 2 (set (reg:DI 89 [ i ])
        (sign_extend:DI (reg/v:SI 86 [ i ]))) "t.c":5:16 147 {*extendsidi2_rex64}
     (nil))
(insn 10 9 11 2 (set (reg:V4SI 90 [ vect__2.6 ])
        (ashift:V4SI (reg:V4SI 91 [ MEM <vector(4) int> [(int *)&x] ])
            (reg:DI 89 [ i ]))) "t.c":5:16 3739 {ashlv4si3}
     (expr_list:REG_DEAD (reg:V4SI 91 [ MEM <vector(4) int> [(int *)&x] ])
        (expr_list:REG_DEAD (reg:DI 89 [ i ])
            (nil))))
(insn 11 10 12 2 (set (reg:V4SI 92)
        (vec_duplicate:V4SI (reg/v:SI 86 [ i ]))) "t.c":5:22 5169 {*vec_dupv4si}
     (expr_list:REG_DEAD (reg/v:SI 86 [ i ])
        (nil)))
(insn 12 11 13 2 (set (reg:V4SI 93 [ vect__3.7 ])
        (plus:V4SI (reg:V4SI 90 [ vect__2.6 ])
            (reg:V4SI 92))) "t.c":5:22 3545 {*addv4si3}

so the opportunity to "back-propagate" the vec_duplicate for the shift isn't
apparent - nor would it ever consider such a thing.  So we probably should try
to fix this in the vectorizer.  IIRC this support for non-external/constant
scalar shift args is reasonably new (g:49eab32e6e7).  Also if there's a
vector-vector shift we should probably prefer that if we already have a vector.
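For reference, a minimal sketch (SSE2 intrinsics; foo_ideal, amt and v are
made-up names, not anything the compiler emits) of the kind of sequence we
would like to end up with: the shift amount is computed once as a scalar,
used directly for the vector-by-scalar shift, and broadcast only for the add.

/* Sketch only, not generated code.  */
#include <emmintrin.h>

int x[4];

void foo_ideal (int i)
{
  int amt = (i + 1) & 31;                           /* scalar compute, once */
  __m128i v = _mm_loadu_si128 ((__m128i *) x);      /* load x[0..3] */
  v = _mm_sll_epi32 (v, _mm_cvtsi32_si128 (amt));   /* vector << scalar amt */
  v = _mm_add_epi32 (v, _mm_set1_epi32 (amt));      /* + broadcast of amt */
  _mm_storeu_si128 ((__m128i *) x, v);
}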
Mine.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:c78907d514d65483c7ddfb4cb1f5c57f23da73d9

commit r11-1633-gc78907d514d65483c7ddfb4cb1f5c57f23da73d9
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jun 24 15:49:00 2020 +0200

    tree-optimization/95866 - avoid vectorizing uniform SLP subgraphs

    This avoids vectorizing SLP subgraphs that just compute uniform
    operations on all-same operands.  That fixes the less interesting
    (but most embarrassing) part of the testcase in the PR.  On the way
    it also fixed a missing matches[0] reset in the last refactoring
    touching that place.

    2020-06-24  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/95866
            * tree-vect-slp.c (vect_slp_tree_uniform_p): New.
            (vect_build_slp_tree_2): Properly reset matches[0],
            ignore uniform constants.

            * gcc.dg/vect/bb-slp-pr95866.c is a separate testcase; this
            commit adds:
            * gcc.target/i386/pr95866-1.c: New testcase.
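As a rough illustration of what "uniform" means here (this is not the
tree-vect-slp.c code; struct lane and group_is_uniform are hypothetical): an
SLP group is uniform when every lane repeats the same operation on identical
operands, like the four (i+1) & 31 computations above, so vectorizing it
would only replicate one scalar result across lanes.

/* Hypothetical sketch, not the GCC implementation.  */
#include <stdbool.h>
#include <string.h>

struct lane { int opcode; int op0; int op1; };      /* made-up lane record */

static bool
group_is_uniform (const struct lane *lanes, int n)
{
  for (int i = 1; i < n; i++)
    if (memcmp (&lanes[i], &lanes[0], sizeof lanes[0]) != 0)
      return false;                                 /* some lane differs */
  return true;                                      /* all lanes identical */
}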
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:86ce59b4f05d8f68ec4d9a14a7732acb370412db

commit r11-1647-g86ce59b4f05d8f68ec4d9a14a7732acb370412db
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jun 25 11:21:20 2020 +0200

    tree-optimization/95866 - avoid using scalar ops for vectorized shift

    This avoids using the original scalar SSA operand when vectorizing
    a shift with a vectorized shift operand where we know all vector
    components have the same value and thus we can use a vector by
    scalar shift.  Using the scalar SSA operand causes a possibly long
    chain of scalar computation to be retained so it's better to simply
    extract lane zero from the available vectorized shift operand.

    2020-06-25  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/95866
            * tree-vect-stmts.c (vectorizable_shift): Reject incompatible
            vectorized shift operands.  For scalar shifts use lane zero
            of a vectorized shift operand.

            * gcc.dg/vect/bb-slp-pr95866.c: New testcase.
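Roughly, the idea of the fix, illustrated with SSE2 intrinsics rather than
the actual vectorizer change (shift_and_add and vamt are made-up names): when
the vector { amt, amt, amt, amt } already exists for the add, take lane zero
from it for the vector-by-scalar shift instead of referencing the original
scalar SSA name, so no separate scalar chain has to stay live.

/* Sketch only, not the GCC change itself.  */
#include <emmintrin.h>

static __m128i
shift_and_add (__m128i v, __m128i vamt)
{
  int amt0 = _mm_cvtsi128_si32 (vamt);              /* extract lane zero */
  v = _mm_sll_epi32 (v, _mm_cvtsi32_si128 (amt0));  /* vector << scalar */
  return _mm_add_epi32 (v, vamt);                   /* add reuses the vector */
}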
Now fixed.