Bug 97351 - gcc.dg/vect/bb-slp-subgroups-3.c bad vectorization with AVX
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 11.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2020-10-09 10:54 UTC by Richard Biener
Modified: 2020-10-09 10:58 UTC

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Richard Biener 2020-10-09 10:54:44 UTC
int __attribute__((__aligned__(8))) a[8];
int __attribute__((__aligned__(8))) b[8];

void
test ()
{
    a[0] = b[0] + 1;
    a[1] = b[1] + 2;
    a[2] = b[2] + 3;
    a[3] = b[3] + 4;
    a[4] = b[0] * 3;
    a[5] = b[2] * 4;
    a[6] = b[4] * 5;
    a[7] = b[6] * 7;
}

should be vectorized using V4SI vectors in two SLP groups, so that we
vectorize not only the store but also the loads and the add.  With
-mavx2 we instead get only the store vectorized (even with cost
modeling enabled) because we try the full eight-lane group with
vector(8) int first.
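
For illustration, the intended result can be written by hand with GCC's
generic vector extensions.  This is only a sketch of the desired shape,
not compiler output: the names v4si, test_scalar and test_two_groups are
made up here, and the loads and stores go through memcpy because the
arrays are only guaranteed to be 8-byte aligned.

```c
#include <string.h>

/* Four ints per vector, i.e. what the vectorizer calls V4SI.  */
typedef int v4si __attribute__((vector_size(16)));

int __attribute__((__aligned__(8))) a[8];
int __attribute__((__aligned__(8))) b[8];

/* Scalar reference, identical to the testcase.  */
void
test_scalar (void)
{
    a[0] = b[0] + 1;
    a[1] = b[1] + 2;
    a[2] = b[2] + 3;
    a[3] = b[3] + 4;
    a[4] = b[0] * 3;
    a[5] = b[2] * 4;
    a[6] = b[4] * 5;
    a[7] = b[6] * 7;
}

/* Hand-written version using two four-lane groups.  */
void
test_two_groups (void)
{
    v4si lo, hi;

    /* Two vector loads cover b[0..3] and b[4..7]; memcpy keeps the
       accesses valid for under-aligned data.  */
    memcpy (&lo, &b[0], sizeof lo);
    memcpy (&hi, &b[4], sizeof hi);

    /* Group 1: a[0..3] = b[0..3] + { 1, 2, 3, 4 }.  */
    v4si g1 = lo + (v4si) { 1, 2, 3, 4 };

    /* Group 2: a[4..7] = { b[0], b[2], b[4], b[6] } * { 3, 4, 5, 7 }.
       The even lanes are picked by element access here; a compiler
       would presumably express this as a permute of the two loads.  */
    v4si even = (v4si) { lo[0], lo[2], hi[0], hi[2] };
    v4si g2 = even * (v4si) { 3, 4, 5, 7 };

    /* Two vector stores replace the eight scalar stores.  */
    memcpy (&a[0], &g1, sizeof g1);
    memcpy (&a[4], &g2, sizeof g2);
}
```

Both functions write the same values to a[] for any contents of b[];
the second one just makes the two-group structure explicit.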

It might be possible to guide SLP splitting during the SLP build,
similar to how we try commuting operands.  So when we figure

/home/rguenther/src/gcc3/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c:12:10: note:   Build SLP for _9 = _1 * 3;
/home/rguenther/src/gcc3/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c:12:10: note:   get vectype for scalar type (group size 8): int
/home/rguenther/src/gcc3/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c:12:10: note:   vectype: vector(8) int
/home/rguenther/src/gcc3/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c:12:10: note:   nunits = 8
/home/rguenther/src/gcc3/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c:12:10: missed:   Build SLP failed: different operation in stmt _9 = _1 * 3;
/home/rguenther/src/gcc3/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c:12:10: missed:   original stmt _2 = _1 + 1;

and see that the parent op (the store in this case) cannot be commuted,
we can check whether matches[] divides the vector (subject to some
constraints) and whether the other lanes with matches[] == false form a
valid SLP operand (we know the == true ones likely would).  The results
would then be concatenated via a permute node.
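
A minimal sketch of such a check, assuming the constraint is simply that
the matching lanes form a leading run and that both halves are a multiple
of the smaller vector's lane count.  The helper slp_split_point and its
parameters are hypothetical and not part of GCC; the real constraints are
left open above.

```c
#include <stdbool.h>

/* Hypothetical helper, not GCC code.  Given the per-lane MATCHES[]
   result of a failed SLP build over GROUP_SIZE lanes, return the lane
   index at which the group could be split into a matching head and a
   non-matching tail, or 0 if no such split exists.  NUNITS is the lane
   count of the smaller vector type (4 for V4SI).  */
unsigned
slp_split_point (const bool *matches, unsigned group_size, unsigned nunits)
{
    unsigned k = 0;

    /* Length of the leading run of lanes that matched lane 0.  */
    while (k < group_size && matches[k])
        k++;

    /* Every remaining lane must be a mismatch, otherwise the matching
       lanes are interleaved and a single split point does not help.  */
    for (unsigned i = k; i < group_size; i++)
        if (matches[i])
            return 0;

    /* Need a non-empty head and a non-empty tail.  */
    if (k == 0 || k == group_size)
        return 0;

    /* Assumed constraint: both parts fill whole vectors.  */
    if (nunits == 0 || k % nunits != 0 || (group_size - k) % nunits != 0)
        return 0;

    return k;
}
```

For the testcase the build fails with matches[] true for the four adds
and false for the four multiplies, so with nunits == 4 this sketch would
return 4.  The tail would still have to be built as its own SLP operand
before the two halves can be concatenated.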

This should eventually also replace the splitting done in SLP instance
analysis (though splitting stores might still be necessary there).

The other option is to somehow tackle this with vector size iteration,
doing multiple analyses and comparing cost/benefit, though it is hard
not to compare apples and oranges since the amount of code vectorized
will usually differ (in contrast to loop vectorization).