Bug 61194 - [4.9/5/6/7 Regression] vectorization failed with "bit-precision arithmetic not supported" even if conversion to int is requested
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.9.0
Importance: P2 normal
Target Milestone: 7.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2014-05-15 12:09 UTC by vincenzo Innocente
Modified: 2016-06-08 14:00 UTC (History)

See Also:
Host:
Target:
Build:
Known to work: 4.7.0
Known to fail:
Last reconfirmed: 2014-05-15 00:00:00


Attachments
patch (1.45 KB, patch)
2014-05-16 09:18 UTC, Richard Biener
patch fixing the regression (961 bytes, patch)
2014-05-16 11:45 UTC, Richard Biener

Description vincenzo Innocente 2014-05-15 12:09:12 UTC
z[i] = ( (x[i]>0) & (w[i]<0)) ? z[i] : y[i];
produces

bit-precision arithmetic not supported.
note: not vectorized: relevant stmt not supported: _6 = _5 > 0.0;

while integer bitwise operators are vectorized at will
 z[i] = ( k[i] & j[i] ) ? z[i] : y[i];

I tried to force conversion to int in many ways (see below) such as 
z[i] = ( int(x[i]>0) & int(w[i]<0)) ? z[i] : y[i];
succeeding only with the quite expensive
z[i] = (2==(int(x[i]>0) + int(w[i]<0))) ? z[i] : y[i];

is there a way to force gcc to use integer w/o using arithmetic operators?

c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond.cc -msse4.2 -fopt-info-vec -fno-tree-slp-vectorize  -fopenmp
cat cond.cc
float x[1024];
float y[1024];
float z[1024];
float w[1024];

int k[1024];
int j[1024];


void bar() {
  for (int i=0; i<1024; ++i)
    z[i] = ( (x[i]>0) & (w[i]<0)) ? z[i] : y[i];
}


void barMP() {
#pragma omp simd
  for (int i=0; i<1024; ++i)
    z[i] = ( int(x[i]>0) & int(w[i]<0)) ? z[i] : y[i];
}


void barInt() {
  for (int i=0; i<1024; ++i)
    z[i] = ( int(x[i]>0) & int(w[i]<0)) ? z[i] : y[i];
}

void barInt0() {
  for (int i=0; i<1024; ++i)
    z[i] = ( (0+int(x[i]>0)) & (0+int(w[i]<0)) ) ? z[i] : y[i];
}


void barPlus() {
  for (int i=0; i<1024; ++i)
    z[i] = (2==(int(x[i]>0) + int(w[i]<0))) ? z[i] : y[i];
}


void foo() {
  for (int i=0; i<1024; ++i)
    z[i] = ( k[i] & j[i] ) ? z[i] : y[i];
}


void foo2() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0; j[i] = w[i]<0;
  }
}

void bar2() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0; j[i] = w[i]<0;
    z[i] = ( k[i] & j[i]) ? z[i] : y[i];
 }
}
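
The variants above differ only in how the condition is spelled, so they should all select the same value. A minimal sketch that checks this for one scalar lane; the helper names (sel_bar, sel_barInt, sel_barPlus, variants_agree) are hypothetical and not part of cond.cc:

```cpp
// Hypothetical helpers (not in the report): one scalar lane of bar, barInt and barPlus.
inline float sel_bar(float x, float w, float z, float y) {
  return ((x > 0) & (w < 0)) ? z : y;               // bool & bool, as in bar()
}

inline float sel_barInt(float x, float w, float z, float y) {
  return (int(x > 0) & int(w < 0)) ? z : y;         // explicit int casts, as in barInt()
}

inline float sel_barPlus(float x, float w, float z, float y) {
  return (2 == (int(x > 0) + int(w < 0))) ? z : y;  // arithmetic form, as in barPlus()
}

// Returns true when the three spellings agree for every sign combination of x and w.
inline bool variants_agree() {
  const float vals[] = {-1.0f, 0.0f, 1.0f};
  for (float x : vals) {
    for (float w : vals) {
      const float a = sel_bar(x, w, 10.0f, 20.0f);
      const float b = sel_barInt(x, w, 10.0f, 20.0f);
      const float c = sel_barPlus(x, w, 10.0f, 20.0f);
      if (a != b || b != c) return false;
    }
  }
  return true;
}
```

Since the results are identical, any difference in vectorization between these functions comes from how the compiler represents the condition internally, not from the semantics of the source.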
Comment 1 vincenzo Innocente 2014-05-15 14:20:11 UTC
what I find quite absurd is that
void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0;
    k[i] &=  w[i]<y[i];
//    z[i] = (k[i]) ? z[i] : y[i];
 }
}
vectorizes and
void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0;
    k[i] &=  w[i]<y[i];
    z[i] = (k[i]) ? z[i] : y[i];
 }
}
does not vectorize with gcc 4.9.0

This is a regression w.r.t. 4.7.0
compiled as
c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond.cc -msse4.2 -ftree-vectorizer-verbose=1
that produced
Z4barXv:
.LFB1:
        .cfi_startproc
        xorps   %xmm4, %xmm4
        xorl    %eax, %eax
        pxor    %xmm3, %xmm3
        movdqa  .LC1(%rip), %xmm5
        .p2align 4,,10
        .p2align 3
.L9:
        movaps  y(%rax), %xmm2
        movaps  %xmm4, %xmm1
        movaps  w(%rax), %xmm0
        cmpltps x(%rax), %xmm1
        cmpltps %xmm2, %xmm0
        pand    %xmm5, %xmm0
        pand    %xmm1, %xmm0
        movaps  z(%rax), %xmm1
        movdqa  %xmm0, k(%rax)
        pcmpeqd %xmm3, %xmm0
        blendvps        %xmm0, %xmm2, %xmm1
        movaps  %xmm1, z(%rax)
        addq    $16, %rax
        cmpq    $4096, %rax
        jne     .L9
        rep
        ret
        .cfi_endproc
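
For readers less used to the SSE listing, here is a rough per-lane model of the loop body above in scalar C++. It is only a sketch: the function names are hypothetical, and it assumes that the constant .LC1 (not shown in the listing) holds the integer 1 in every lane.

```cpp
#include <cstdint>

struct LaneResult {
  std::int32_t k;  // value stored to k[i]
  float z;         // value stored to z[i]
};

// One 32-bit lane of the vector loop, following the instruction order of the listing.
inline LaneResult lane(float x, float w, float y, float z) {
  const std::uint32_t m_x = (0.0f < x) ? 0xFFFFFFFFu : 0u;      // cmpltps x(%rax), %xmm1
  const std::uint32_t m_w = (w < y) ? 0xFFFFFFFFu : 0u;         // cmpltps %xmm2, %xmm0
  const std::uint32_t k = (m_w & 1u) & m_x;                     // pand .LC1 (assumed all 1), pand %xmm1
  const std::uint32_t is_zero = (k == 0u) ? 0xFFFFFFFFu : 0u;   // pcmpeqd against the zero register
  const float out = (is_zero >> 31) ? y : z;                    // blendvps picks by the mask sign bit
  return {static_cast<std::int32_t>(k), out};
}

// Straight transcription of the source loop body, for comparison.
inline LaneResult lane_reference(float x, float w, float y, float z) {
  int k = x > 0;
  k &= w < y;
  return {k, k ? z : y};
}
```

The model makes the two mask widths visible: the comparisons produce all-ones masks, the store to k needs 0 or 1, and the blend needs a full mask again, which is why the compare against zero is re-issued before blendvps.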
Comment 2 vincenzo Innocente 2014-05-15 14:28:59 UTC
new test code
cat cond0.cc
float x[1024];
float y[1024];
float z[1024];
float w[1024];

int k[1024];

void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = (x[i]>0) & (w[i]<y[i]);
    z[i] = (k[i]) ? z[i] : y[i];
 }
}
c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond0.cc -msse4.2 -ftree-vectorizer-verbose=1

Analyzing loop at cond0.cc:9

cond0.cc:9: note: vect_recog_bool_pattern: detected: 
cond0.cc:9: note: pattern recognized: patt_23 = (int) patt_24;

cond0.cc:9: note: additional pattern stmt: patt_25 = _7 < _8 ? 1 : 0;

cond0.cc:9: note: additional pattern stmt: patt_24 = _5 > 0.0 ? patt_25 : 0;


Vectorizing loop at cond0.cc:9

cond0.cc:9: note: LOOP VECTORIZED.
cond0.cc:8: note: vectorized 1 loops in function.
c++ -v
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/sw/lcg/contrib/gcc/4.8.1/x86_64-slc6-gcc48-opt/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.8.1/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /build/vdiez/gcc-4.8.1/configure --prefix=/build/vdiez/gcc-4.8.1-installation --with-mpfr=/afs/cern.ch/sw/lcg/external/mpfr/3.1.2/x86_64-slc6-gcc48-opt --with-gmp=/afs/cern.ch/sw/lcg/external/gmp/5.1.1/x86_64-slc6-gcc48-opt --with-mpc=/afs/cern.ch/sw/lcg/external/mpc/1.0.1/x86_64-slc6-gcc48-opt --enable-libstdcxx-time --enable-lto --with-isl=/afs/cern.ch/sw/lcg/external/isl/0.11.1/x86_64-slc6-gcc48-opt --with-cloog=/afs/cern.ch/sw/lcg/external/cloog/0.18.0/x86_64-slc6-gcc48-opt --enable-languages=c,c++,fortran,go
Thread model: posix
gcc version 4.8.1 (GCC) 


c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond0.cc -msse4.2 -fopt-info-vec-missed

cond0.cc:9:3: note: misalign = 0 bytes of ref x[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref w[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref y[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref k[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref z[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref z[i_18]
cond0.cc:9:3: note: virtual phi. skip.
cond0.cc:9:3: note: num. args = 4 (not unary/binary/ternary op).
cond0.cc:9:3: note: not ssa-name.
cond0.cc:9:3: note: use not simple.
cond0.cc:9:3: note: bit-precision arithmetic not supported.
cond0.cc:9:3: note: not vectorized: relevant stmt not supported: _6 = _5 > 0.0;

cond0.cc:9:3: note: bad operation or unsupported loop bound.
Vincenzos-MacBook-Pro-2:vectorize innocent$ c++ -v
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-apple-darwin13.1.0/4.10.0/lto-wrapper
Target: x86_64-apple-darwin13.1.0
Configured with: ../gcc-trunk/configure --disable-multilib --disable-bootstrap --disable-libitm --enable-languages=c,c++,fortran,lto --disable-libsanitizer --enable-lto
Thread model: posix
gcc version 4.10.0 20140430 (experimental) [trunk revision 209930] (GCC)
Comment 3 Richard Biener 2014-05-15 14:58:01 UTC
I see on trunk after if-conversion

  _6 = _5 > 0.0;
  _9 = _7 < _8;
  _10 = _9 & _6;
  _11 = (int) _10;
  k[i_18] = _11;
  iftmp.0_13 = z[i_18];
  iftmp.0_2 = _10 ? iftmp.0_13 : _8;
  z[i_18] = iftmp.0_2;

so what happens is that we do have "bit-precision" arithmetic with the
bitwise and.

This is a regression because of the way we lower comparisons now I guess.

I will have a look.
Comment 4 Richard Biener 2014-05-15 15:06:25 UTC
Actually the vectorizer punts on the comparisons themselves.  The pattern recognizer handles some of them as

  patt_10 = _4 > 0.0 ? 1 : 0;

but not those feeding the BIT expressions which would need to be widened then
(though they are supported as bit-precision).
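
To illustrate what widening means here, the sketch below contrasts the 1-bit form with an int form built from "compare ? 1 : 0" statements, in the shape of the pattern statements in the 4.8.1 dump from comment 2. It is a scalar illustration only; the function names are hypothetical and this is not the vectorizer's code.

```cpp
// Bit-precision form: two 1-bit bools combined with a bitwise and, then converted.
inline int and_bit_precision(float a, float b, float c) {
  const bool t6 = a > 0.0f;
  const bool t9 = b < c;
  const bool t10 = t9 & t6;
  return static_cast<int>(t10);
}

// Widened form: every value is already an int, expressed through selects.
inline int and_widened(float a, float b, float c) {
  const int p25 = (b < c) ? 1 : 0;      // like patt_25 = _7 < _8 ? 1 : 0
  const int p24 = (a > 0.0f) ? p25 : 0; // like patt_24 = _5 > 0.0 ? patt_25 : 0
  return p24;
}

// Returns true when both forms agree over a small grid of inputs.
inline bool forms_match() {
  const float vals[] = {-1.0f, 0.0f, 1.0f};
  for (float a : vals)
    for (float b : vals)
      for (float c : vals)
        if (and_bit_precision(a, b, c) != and_widened(a, b, c)) return false;
  return true;
}
```

The widened form only uses operations on full-width integers, which map directly onto vector compares and selects.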
Comment 5 vincenzo Innocente 2014-05-15 15:35:53 UTC
of course, if you could make
z[i] = ( (x[i]>0) & (w[i]<0)) ? z[i] : y[i];
vectorize, that would be even better!
Comment 6 Richard Biener 2014-05-16 09:18:40 UTC
Created attachment 32803 [details]
patch

Like this.
Comment 7 vincenzo Innocente 2014-05-16 09:51:35 UTC
great!

the original version (that vectorized in 4.8.1)
void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = (x[i]>0) & (w[i]<y[i]);
    z[i] = (k[i]) ? z[i] : y[i];
 }
}

does not vectorize yet.

On the other hand I am very happy to see
void bar() {
  for (int i=0; i<1024; ++i) {
    auto c = ( (x[i]>0) & (w[i]<y[i])) | (y[i]>0.5f);
    z[i] = c ? y[i] : z[i];
 }
}
vectorized.
if (c) z[i] = y[i];
does not, even with -ftree-loop-if-convert-stores;
not a real issue as far as I am concerned
Comment 8 rguenther@suse.de 2014-05-16 10:30:58 UTC
On Fri, 16 May 2014, vincenzo.innocente at cern dot ch wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61194
> 
> --- Comment #7 from vincenzo Innocente <vincenzo.innocente at cern dot ch> ---
> great!
> 
> the original version (that vectorized in 4.8.1)
> void barX() {
>   for (int i=0; i<1024; ++i) {
>     k[i] = (x[i]>0) & (w[i]<y[i]);
>     z[i] = (k[i]) ? z[i] : y[i];
>  }
> }
> 
> does not vectorize yet.

That's because we hit

check_bool_pattern (var=<ssa_name 0x7ffff6c36e10>, loop_vinfo=0x1f3e900, 
    bb_vinfo=0x0)
    at /space/rguenther/src/svn/trunk/gcc/tree-vect-patterns.c:2596
2596                               &dt))
...
2605      if (!has_single_use (def))
2606        return false;

because

  _5 = x[i_18];
  _6 = _5 > 0.0;
  _7 = w[i_18];
  _8 = y[i_18];
  _9 = _7 < _8;
  _10 = _9 & _6;
  _11 = (int) _10;
  k[i_18] = _11;
  iftmp.0_13 = z[i_18];
  iftmp.0_2 = _10 ? iftmp.0_13 : _8;

thus we have CSEd the load from k and propagated from the
conversion.  VRP does this:

   _11 = (int) _10;
-  k[i_1] = _11;
-  if (_11 != 0)
+  k[i_18] = _11;
+  if (_10 != 0)

and -fno-tree-vrp "fixes" the regression.  If k were of type
_Bool then it likely wouldn't vectorize with 4.8 either.

The vectorizer cannot handle multi-uses of a pattern part
(in this case it's the start which would be doable, but it's
far from trivial ...).  That said,

static float x[1024];
static float y[1024];
static float z[1024];
static float w[1024];

static _Bool k[1024];

void __attribute__((noinline,noclone)) barX()
{
  int i;
  for (i=0; i<1024; ++i) {
      k[i] = (x[i]>0) & (w[i]<y[i]);
      z[i] = (k[i]) ? z[i] : y[i];
  }
}

is not vectorized even in 4.8 for the cited reason.

> On the other hand I am very happy to see
> void bar() {
>   for (int i=0; i<1024; ++i) {
>     auto c = ( (x[i]>0) & (w[i]<y[i])) | (y[i]>0.5f);
>     z[i] = c ? y[i] : z[i];
>  }
> }
> vectorized
> if (c) z[i] = y[i];
> does not even with -ftree-loop-if-convert-stores
> not a real issue at least for what I am concerned

I think it doesn't introduce data races unless you
also specify --param allow-store-data-races=1.
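
As background on why stores matter here: turning "if (c) z[i] = y[i];" into a select followed by an unconditional store writes elements the source never wrote, which another thread could observe. A minimal sketch of the difference, using a hypothetical CountedFloat type that counts writes (all names here are illustrative assumptions, not GCC internals):

```cpp
#include <cstddef>

// Hypothetical cell type that records how often it is written.
struct CountedFloat {
  float value = 0.0f;
  int writes = 0;
  void store(float v) { value = v; ++writes; }
};

// Conditional store: only elements where c[i] is true are written.
inline void cond_store(CountedFloat* z, const float* y, const bool* c, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    if (c[i]) z[i].store(y[i]);
}

// Select-then-store: every element is written; unselected ones get their old value back.
inline void select_store(CountedFloat* z, const float* y, const bool* c, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    z[i].store(c[i] ? y[i] : z[i].value);
}

// Sums the write counters of the first n cells.
inline int total_writes(const CountedFloat* z, std::size_t n) {
  int total = 0;
  for (std::size_t i = 0; i < n; ++i) total += z[i].writes;
  return total;
}
```

Both loops leave the same values behind in a single-threaded run; they differ only in how many elements are touched, which is the part that matters for store data races.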

I also don't see the testcases vectorized when using
&& instead of &.

If not already there, these warrant (different) bug reports.
Comment 9 Richard Biener 2014-05-16 11:21:43 UTC
Author: rguenth
Date: Fri May 16 11:21:11 2014
New Revision: 210514

URL: http://gcc.gnu.org/viewcvs?rev=210514&root=gcc&view=rev
Log:
2014-05-16  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/61194
	* tree-vect-patterns.c (adjust_bool_pattern): Also handle
	bool patterns ending in a COND_EXPR.

	* gcc.dg/vect/pr61194.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr61194.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-patterns.c
Comment 10 Richard Biener 2014-05-16 11:45:38 UTC
Created attachment 32805 [details]
patch fixing the regression

This would fix the regression (also without the previous patch?)
Comment 11 Richard Biener 2014-05-16 12:10:28 UTC
(In reply to Richard Biener from comment #10)
> Created attachment 32805 [details]
> patch fixing the regression
> 
> This would fix the regression (also without the previous patch?)

It does, on the 4.9 branch at least, for

static float x[1024];
static float y[1024];
static float z[1024];
static float w[1024];

static int k[1024];

void __attribute__((noinline,noclone)) barX()
{
  int i;
  for (i=0; i<1024; ++i)
    {
      k[i] = x[i]>0;
      k[i] &=  w[i]<y[i];
      z[i] = (k[i]) ? z[i] : y[i];
    }
}

but it doesn't change the outcome of the big testcase in the original report.
It does together with the other patch though:

> g++-4.9 t.C -Ofast -ftree-loop-if-convert-stores  -fopt-info-vec -B. -fopenmp
t.C:11:5: note: loop vectorized
t.C:19:23: note: loop vectorized
t.C:24:5: note: loop vectorized
t.C:29:5: note: loop vectorized
t.C:35:5: note: loop vectorized
t.C:41:5: note: loop vectorized
t.C:47:5: note: loop vectorized

bar2 still not vectorized there.

But with 4.7 I see the same as with 4.8 and 4.9:

35: LOOP VECTORIZED.
41: LOOP VECTORIZED.
47: LOOP VECTORIZED.

so where exactly does the "regression" part appear for you?  Is that only
for the code in comment#1?
Comment 12 Richard Biener 2014-05-16 12:15:58 UTC
void bar2() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0; j[i] = w[i]<0;
    z[i] = ( k[i] & j[i]) ? z[i] : y[i];
 }
}

has similar issues (non-single-uses due to CSE and propagating from the
conversion sources):

  _5 = x[i_20];
  _6 = _5 > 0.0;
  _7 = (int) _6;
  k[i_20] = _7;
  _9 = w[i_20];
  _10 = _9 < 0.0;
  _11 = (int) _10;
  j[i_20] = _11;
  _18 = _10 & _6;
  iftmp.0_14 = z[i_20];
  iftmp.0_15 = y[i_20];
  iftmp.0_2 = _18 ? iftmp.0_14 : iftmp.0_15;
  z[i_20] = iftmp.0_2;

This is generally caused by optimizing code to use smaller precisions.  So
I think we need a more general solution for this than just the 2nd patch
I attached (which I won't pursue - I figure the first one would be way
more useful as it results in the same result for your initial large testcase
where the 2nd patch doesn't make a difference).
Comment 13 vincenzo Innocente 2014-05-16 12:20:16 UTC
I confirm that with the last patch the regression is also gone in a more complex actual application I had.

The regression concerns only comments 2 and 3.

all the other cases in comment 1 were various attempts of mine to see if anything had changed that allowed vectorization using a different syntax.
I am happy that now they all vectorize (except bar2...)

when I wrote the original test case in 2011, I introduced the int vector to make it vectorize (most probably I also submitted a bug report on the subject)
Comment 14 vincenzo Innocente 2014-05-16 12:25:52 UTC
as long as future patches make the code in comments 1 and 2 (and bar) vectorize, that is fine with me.
if it ends up vectorizing with "bool" instead of "int" as well, even better.
(I am not sure that bit/byte handling is really more efficient in SSE and AVX than plain 32-bit int)
Comment 15 Marc Glisse 2014-05-17 12:23:43 UTC
Seems related to PR 57328.
Comment 16 Jakub Jelinek 2014-07-16 13:27:11 UTC
GCC 4.9.1 has been released.
Comment 17 Jakub Jelinek 2014-10-30 10:37:25 UTC
GCC 4.9.2 has been released.
Comment 18 Richard Biener 2014-12-01 12:04:30 UTC
There is a proposed patch to if-conversion that solves the multiple-use issue by
duplicating the involved statements (ugh).
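
A source-level picture of what duplicating the statements would mean, as a sketch only: the actual patch works on GIMPLE inside if-conversion and is not shown in this report, and a compiler is free to merge the duplicated comparisons again. The function names are hypothetical.

```cpp
// Shared form: the bool c has two uses (the int store and the select),
// which is the multi-use situation described in comment 8.
inline void barx_shared(const float* x, const float* w, const float* y,
                        float* z, int* k, int n) {
  for (int i = 0; i < n; ++i) {
    const bool c = (x[i] > 0) & (w[i] < y[i]);
    k[i] = c;
    z[i] = c ? z[i] : y[i];
  }
}

// Duplicated form: each consumer gets its own copy of the comparison chain,
// so every intermediate bool has exactly one use.
inline void barx_duplicated(const float* x, const float* w, const float* y,
                            float* z, int* k, int n) {
  for (int i = 0; i < n; ++i) {
    const bool c_store = (x[i] > 0) & (w[i] < y[i]);
    k[i] = c_store;
    const bool c_select = (x[i] > 0) & (w[i] < y[i]);
    z[i] = c_select ? z[i] : y[i];
  }
}
```

Both functions compute the same k and z; the duplication trades extra comparisons for a shape in which each bool chain ends in a single consumer.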
Comment 19 Jakub Jelinek 2015-06-26 19:56:21 UTC
GCC 4.9.3 has been released.
Comment 20 Richard Biener 2016-06-08 14:00:57 UTC
All functions in the description are vectorized on trunk, as are those from comment#1 and comment#2.  All but bar2 are vectorized with GCC 6 already.

Thus fixed on trunk (or with GCC5/6 with OMP SIMD aka force-vect).