61194 – [4.9/5/6/7 Regression] vectorization failed with "bit-precision arithmetic not supported" even if conversion to int is requested


Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	4.9.0

Importance:	P2 normal
Target Milestone:	7.0
Assignee:	Richard Biener

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	vectorizer

Reported:	2014-05-15 12:09 UTC by vincenzo Innocente
Modified:	2016-06-08 14:00 UTC
CC List:	1 user

See Also:
Host:
Target:
Build:
Known to work:	4.7.0
Known to fail:
Last reconfirmed:	2014-05-15 00:00:00

Attachments
patch (1.45 KB, patch) 2014-05-16 09:18 UTC, Richard Biener
patch fixing the regression (961 bytes, patch) 2014-05-16 11:45 UTC, Richard Biener


Description vincenzo Innocente 2014-05-15 12:09:12 UTC

z[i] = ( (x[i]>0) & (w[i]<0)) ? z[i] : y[i];
produces

bit-precision arithmetic not supported.
note: not vectorized: relevant stmt not supported: _6 = _5 > 0.0;

while integer bitwise operators are vectorized at will
 z[i] = ( k[i] & j[i] ) ? z[i] : y[i];

I tried to force the conversion to int in many ways (see below), such as
z[i] = ( int(x[i]>0) & int(w[i]<0)) ? z[i] : y[i];
but succeeded only with the quite expensive
z[i] = (2==(int(x[i]>0) + int(w[i]<0))) ? z[i] : y[i];

Is there a way to force gcc to use integers without using arithmetic operators?

c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond.cc -msse4.2 -fopt-info-vec -fno-tree-slp-vectorize  -fopenmp
cat cond.cc
float x[1024];
float y[1024];
float z[1024];
float w[1024];

int k[1024];
int j[1024];


void bar() {
  for (int i=0; i<1024; ++i)
    z[i] = ( (x[i]>0) & (w[i]<0)) ? z[i] : y[i];
}


void barMP() {
#pragma omp simd
  for (int i=0; i<1024; ++i)
    z[i] = ( int(x[i]>0) & int(w[i]<0)) ? z[i] : y[i];
}


void barInt() {
  for (int i=0; i<1024; ++i)
    z[i] = ( int(x[i]>0) & int(w[i]<0)) ? z[i] : y[i];
}

void barInt0() {
  for (int i=0; i<1024; ++i)
    z[i] = ( (0+int(x[i]>0)) & (0+int(w[i]<0)) ) ? z[i] : y[i];
}


void barPlus() {
  for (int i=0; i<1024; ++i)
    z[i] = (2==(int(x[i]>0) + int(w[i]<0))) ? z[i] : y[i];
}


void foo() {
  for (int i=0; i<1024; ++i)
    z[i] = ( k[i] & j[i] ) ? z[i] : y[i];
}


void foo2() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0; j[i] = w[i]<0;
  }
}

void bar2() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0; j[i] = w[i]<0;
    z[i] = ( k[i] & j[i]) ? z[i] : y[i];
 }
}
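
The variants above differ only in how the condition is spelled, so they should all select the same elements. A minimal scalar sketch makes that explicit for a single element. The helper names sel_and, sel_cast and sel_plus are hypothetical and not part of cond.cc:

```cpp
// Hypothetical helpers, one per spelling of the condition in cond.cc.
// Each returns zi when x > 0 and w < 0, otherwise yi.
float sel_and(float x, float w, float zi, float yi) {
    return ((x > 0) & (w < 0)) ? zi : yi;              // as in bar()
}

float sel_cast(float x, float w, float zi, float yi) {
    return (int(x > 0) & int(w < 0)) ? zi : yi;        // as in barInt()
}

float sel_plus(float x, float w, float zi, float yi) {
    return (2 == (int(x > 0) + int(w < 0))) ? zi : yi; // as in barPlus()
}
```

Since the three spellings agree on every input, any difference in vectorization comes from how the compiler represents the condition internally, not from the semantics.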

Comment 1 vincenzo Innocente 2014-05-15 14:20:11 UTC

what I find quite absurd is that
void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0;
    k[i] &=  w[i]<y[i];
//    z[i] = (k[i]) ? z[i] : y[i];
 }
}
vectorizes and
void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0;
    k[i] &=  w[i]<y[i];
    z[i] = (k[i]) ? z[i] : y[i];
 }
}
does not with gcc 4.9.0

This is a regression w.r.t. 4.7.0
compiled as
c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond.cc -msse4.2 -ftree-vectorizer-verbose=1
that produced
Z4barXv:
.LFB1:
        .cfi_startproc
        xorps   %xmm4, %xmm4
        xorl    %eax, %eax
        pxor    %xmm3, %xmm3
        movdqa  .LC1(%rip), %xmm5
        .p2align 4,,10
        .p2align 3
.L9:
        movaps  y(%rax), %xmm2
        movaps  %xmm4, %xmm1
        movaps  w(%rax), %xmm0
        cmpltps x(%rax), %xmm1
        cmpltps %xmm2, %xmm0
        pand    %xmm5, %xmm0
        pand    %xmm1, %xmm0
        movaps  z(%rax), %xmm1
        movdqa  %xmm0, k(%rax)
        pcmpeqd %xmm3, %xmm0
        blendvps        %xmm0, %xmm2, %xmm1
        movaps  %xmm1, z(%rax)
        addq    $16, %rax
        cmpq    $4096, %rax
        jne     .L9
        rep
        ret
        .cfi_endproc

Comment 2 vincenzo Innocente 2014-05-15 14:28:59 UTC

new test code
cat cond0.cc
float x[1024];
float y[1024];
float z[1024];
float w[1024];

int k[1024];

void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = (x[i]>0) & (w[i]<y[i]);
    z[i] = (k[i]) ? z[i] : y[i];
 }
}
c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond0.cc -msse4.2 -ftree-vectorizer-verbose=1

Analyzing loop at cond0.cc:9

cond0.cc:9: note: vect_recog_bool_pattern: detected: 
cond0.cc:9: note: pattern recognized: patt_23 = (int) patt_24;

cond0.cc:9: note: additional pattern stmt: patt_25 = _7 < _8 ? 1 : 0;

cond0.cc:9: note: additional pattern stmt: patt_24 = _5 > 0.0 ? patt_25 : 0;


Vectorizing loop at cond0.cc:9

cond0.cc:9: note: LOOP VECTORIZED.
cond0.cc:8: note: vectorized 1 loops in function.
c++ -v
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/sw/lcg/contrib/gcc/4.8.1/x86_64-slc6-gcc48-opt/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.8.1/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /build/vdiez/gcc-4.8.1/configure --prefix=/build/vdiez/gcc-4.8.1-installation --with-mpfr=/afs/cern.ch/sw/lcg/external/mpfr/3.1.2/x86_64-slc6-gcc48-opt --with-gmp=/afs/cern.ch/sw/lcg/external/gmp/5.1.1/x86_64-slc6-gcc48-opt --with-mpc=/afs/cern.ch/sw/lcg/external/mpc/1.0.1/x86_64-slc6-gcc48-opt --enable-libstdcxx-time --enable-lto --with-isl=/afs/cern.ch/sw/lcg/external/isl/0.11.1/x86_64-slc6-gcc48-opt --with-cloog=/afs/cern.ch/sw/lcg/external/cloog/0.18.0/x86_64-slc6-gcc48-opt --enable-languages=c,c++,fortran,go
Thread model: posix
gcc version 4.8.1 (GCC) 


c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond0.cc -msse4.2 -fopt-info-vec-missed

cond0.cc:9:3: note: misalign = 0 bytes of ref x[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref w[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref y[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref k[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref z[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref z[i_18]
cond0.cc:9:3: note: virtual phi. skip.
cond0.cc:9:3: note: num. args = 4 (not unary/binary/ternary op).
cond0.cc:9:3: note: not ssa-name.
cond0.cc:9:3: note: use not simple.
cond0.cc:9:3: note: bit-precision arithmetic not supported.
cond0.cc:9:3: note: not vectorized: relevant stmt not supported: _6 = _5 > 0.0;

cond0.cc:9:3: note: bad operation or unsupported loop bound.
Vincenzos-MacBook-Pro-2:vectorize innocent$ c++ -v
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-apple-darwin13.1.0/4.10.0/lto-wrapper
Target: x86_64-apple-darwin13.1.0
Configured with: ../gcc-trunk/configure --disable-multilib --disable-bootstrap --disable-libitm --enable-languages=c,c++,fortran,lto --disable-libsanitizer --enable-lto
Thread model: posix
gcc version 4.10.0 20140430 (experimental) [trunk revision 209930] (GCC)

Comment 3 Richard Biener 2014-05-15 14:58:01 UTC

I see on trunk after if-conversion

  _6 = _5 > 0.0;
  _9 = _7 < _8;
  _10 = _9 & _6;
  _11 = (int) _10;
  k[i_18] = _11;
  iftmp.0_13 = z[i_18];
  iftmp.0_2 = _10 ? iftmp.0_13 : _8;
  z[i_18] = iftmp.0_2;

so what happens is that we do have "bit-precision" arithmetic with the
bitwise and.

This is a regression because of the way we lower comparisons now I guess.

I will have a look.

Comment 4 Richard Biener 2014-05-15 15:06:25 UTC

Actually the vectorizer punts on the comparisons itself.  The pattern recognizer handles some of them as

  patt_10 = _4 > 0.0 ? 1 : 0;

but not those feeding the BIT expressions which would need to be widened then
(though they are supported as bit-precision).
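
In source terms, the rewrite done by the pattern recognizer (visible as patt_25 and patt_24 in the 4.8.1 dump in comment 2) corresponds roughly to turning each comparison into a full-width integer select, so that the AND never operates on a 1-bit value. A rough sketch of that shape follows. The function name widened_mask is made up for illustration, and this is not literally what GCC emits:

```cpp
// Illustrative only: mirrors the shape of
//   patt_25 = _7 < _8 ? 1 : 0;
//   patt_24 = _5 > 0.0 ? patt_25 : 0;
// using 32-bit ints instead of 1-bit booleans.
int widened_mask(float x, float w, float y) {
    int inner = (w < y) ? 1 : 0;        // plays the role of patt_25
    int outer = (x > 0.0f) ? inner : 0; // plays the role of patt_24
    return outer;                       // same value as (x > 0) & (w < y)
}
```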

Comment 5 vincenzo Innocente 2014-05-15 15:35:53 UTC

Of course, if you can make
z[i] = ( (x[i]>0) & (w[i]<0)) ? z[i] : y[i];
vectorize, that would be even better!

Comment 6 Richard Biener 2014-05-16 09:18:40 UTC

Created attachment 32803 [details]
patch

Like this.

Comment 7 vincenzo Innocente 2014-05-16 09:51:35 UTC

great!

the original version (that vectorized in 4.8.1)
void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = (x[i]>0) & (w[i]<y[i]);
    z[i] = (k[i]) ? z[i] : y[i];
 }
}

does not vectorize yet.

On the other hand I am very happy to see
void bar() {
  for (int i=0; i<1024; ++i) {
    auto c = ( (x[i]>0) & (w[i]<y[i])) | (y[i]>0.5f);
    z[i] = c ? y[i] : z[i];
 }
}
vectorized, whereas
if (c) z[i] = y[i];
does not, even with -ftree-loop-if-convert-stores.
Not a real issue as far as I am concerned.

Comment 8 rguenther@suse.de 2014-05-16 10:30:58 UTC

On Fri, 16 May 2014, vincenzo.innocente at cern dot ch wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61194
> 
> --- Comment #7 from vincenzo Innocente <vincenzo.innocente at cern dot ch> ---
> great!
> 
> the original version (that vectorized in 4.8.1)
> void barX() {
>   for (int i=0; i<1024; ++i) {
>     k[i] = (x[i]>0) & (w[i]<y[i]);
>     z[i] = (k[i]) ? z[i] : y[i];
>  }
> }
> 
> does not vectorize yet.

That's because we hit

check_bool_pattern (var=<ssa_name 0x7ffff6c36e10>, loop_vinfo=0x1f3e900, 
    bb_vinfo=0x0)
    at /space/rguenther/src/svn/trunk/gcc/tree-vect-patterns.c:2596
2596                               &dt))
...
2605      if (!has_single_use (def))
2606        return false;

because

  _5 = x[i_18];
  _6 = _5 > 0.0;
  _7 = w[i_18];
  _8 = y[i_18];
  _9 = _7 < _8;
  _10 = _9 & _6;
  _11 = (int) _10;
  k[i_18] = _11;
  iftmp.0_13 = z[i_18];
  iftmp.0_2 = _10 ? iftmp.0_13 : _8;

thus we have CSEd the load from k and propagated from the
conversion.  VRP does this:

   _11 = (int) _10;
-  k[i_1] = _11;
-  if (_11 != 0)
+  k[i_18] = _11;
+  if (_10 != 0)

and -fno-tree-vrp "fixes" the regression.  If k were of type
_Bool then it likely wouldn't vectorize with 4.8 either.

The vectorizer cannot handle multi-uses of a pattern part
(in this case it's the start which would be doable, but it's
far from trivial ...).  That said,

static float x[1024];
static float y[1024];
static float z[1024];
static float w[1024];

static _Bool k[1024];

void __attribute__((noinline,noclone)) barX()
{
  int i;
  for (i=0; i<1024; ++i) {
      k[i] = (x[i]>0) & (w[i]<y[i]);
      z[i] = (k[i]) ? z[i] : y[i];
  }
}

is not vectorized even in 4.8 for the cited reason.
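
To make the multi-use problem concrete, here is a toy model that counts how many statements read each name in the dump above. It is only an illustration: the Stmt struct and the functions count_uses and barx_body are invented for this sketch and have nothing to do with GCC's internal data structures or its has_single_use implementation.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy statement: the name it defines (empty for a store) and the names it reads.
struct Stmt {
    std::string def;
    std::vector<std::string> uses;
};

// Count how many statements read each name.
std::map<std::string, int> count_uses(const std::vector<Stmt>& body) {
    std::map<std::string, int> n;
    for (const Stmt& s : body)
        for (const std::string& u : s.uses)
            ++n[u];
    return n;
}

// The non-load statements from the dump above, reduced to their operands.
std::vector<Stmt> barx_body() {
    return {
        {"_6", {"_5"}},                             // _6 = _5 > 0.0
        {"_9", {"_7", "_8"}},                       // _9 = _7 < _8
        {"_10", {"_9", "_6"}},                      // _10 = _9 & _6
        {"_11", {"_10"}},                           // _11 = (int) _10
        {"", {"_11"}},                              // k[i_18] = _11
        {"iftmp.0_2", {"_10", "iftmp.0_13", "_8"}}, // _10 ? iftmp.0_13 : _8
    };
}
```

In this model _10 is read twice, once by the conversion feeding the store to k and once by the select, which is the situation the single-use check rejects.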

> On the other hand I am very happy to see
> void bar() {
>   for (int i=0; i<1024; ++i) {
>     auto c = ( (x[i]>0) & (w[i]<y[i])) | (y[i]>0.5f);
>     z[i] = c ? y[i] : z[i];
>  }
> }
> vectorized
> if (c) z[i] = y[i];
> does not even with -ftree-loop-if-convert-stores
> not a real issue at least for what I am concerned

I think it doesn't introduce data races unless you
also specify --param allow-store-data-races=1.

I also don't see the testcases vectorized when using
&& instead of &.

If not already there, these warrant (different) bugreports.
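
For reference, the two store forms discussed here differ in which elements they write. A minimal sketch, with illustrative function names (store_if, store_select) and a pointer-based signature that are not taken from the report:

```cpp
#include <cstddef>

// Conditional store: elements where c[i] is false are never written.
void store_if(float* z, const float* y, const bool* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (c[i]) z[i] = y[i];
}

// Select form: every z[i] is written, possibly with its own old value.
void store_select(float* z, const float* y, const bool* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        z[i] = c[i] ? y[i] : z[i];
}
```

Single-threaded, both leave z in the same state. The select form writes to elements the conditional form never touches, so rewriting the first form into the second can add stores that another thread might observe.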

Comment 9 Richard Biener 2014-05-16 11:21:43 UTC

Author: rguenth
Date: Fri May 16 11:21:11 2014
New Revision: 210514

URL: http://gcc.gnu.org/viewcvs?rev=210514&root=gcc&view=rev
Log:
2014-05-16  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/61194
	* tree-vect-patterns.c (adjust_bool_pattern): Also handle
	bool patterns ending in a COND_EXPR.

	* gcc.dg/vect/pr61194.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr61194.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-patterns.c

Comment 10 Richard Biener 2014-05-16 11:45:38 UTC

Created attachment 32805 [details]
patch fixing the regression

This would fix the regression (also without the previous patch?)

Comment 11 Richard Biener 2014-05-16 12:10:28 UTC

(In reply to Richard Biener from comment #10)
> Created attachment 32805 [details]
> patch fixing the regression
> 
> This would fix the regression (also without the previous patch?)

It does, on the 4.9 branch at least, for

static float x[1024];
static float y[1024];
static float z[1024];
static float w[1024];

static int k[1024];

void __attribute__((noinline,noclone)) barX()
{
  int i;
  for (i=0; i<1024; ++i)
    {
      k[i] = x[i]>0;
      k[i] &=  w[i]<y[i];
      z[i] = (k[i]) ? z[i] : y[i];
    }
}

but it doesn't change the outcome of the big testcase in the original report.
It does together with the other patch though:

> g++-4.9 t.C -Ofast -ftree-loop-if-convert-stores  -fopt-info-vec -B. -fopenmp
t.C:11:5: note: loop vectorized
t.C:19:23: note: loop vectorized
t.C:24:5: note: loop vectorized
t.C:29:5: note: loop vectorized
t.C:35:5: note: loop vectorized
t.C:41:5: note: loop vectorized
t.C:47:5: note: loop vectorized

bar2 still not vectorized there.

But with 4.7 I see the same as with 4.8 and 4.9:

35: LOOP VECTORIZED.
41: LOOP VECTORIZED.
47: LOOP VECTORIZED.

so where exactly does the "regression" part appear for you?  Is that only
for the code in comment#1?

Comment 12 Richard Biener 2014-05-16 12:15:58 UTC

void bar2() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0; j[i] = w[i]<0;
    z[i] = ( k[i] & j[i]) ? z[i] : y[i];
 }
}

has similar issues (non-single-uses due to CSE and propagating from the
conversion sources):

  _5 = x[i_20];
  _6 = _5 > 0.0;
  _7 = (int) _6;
  k[i_20] = _7;
  _9 = w[i_20];
  _10 = _9 < 0.0;
  _11 = (int) _10;
  j[i_20] = _11;
  _18 = _10 & _6;
  iftmp.0_14 = z[i_20];
  iftmp.0_15 = y[i_20];
  iftmp.0_2 = _18 ? iftmp.0_14 : iftmp.0_15;
  z[i_20] = iftmp.0_2;

This is generally caused by optimizing code to use smaller precisions.  So
I think we need a more general solution for this than just the 2nd patch
I attached (which I won't pursue - I figure the first one would be way
more useful as it results in the same result for your initial large testcase
where the 2nd patch doesn't make a difference).

Comment 13 vincenzo Innocente 2014-05-16 12:20:16 UTC

I confirm that with the last patch the regression is also gone in a more complex actual application I had.

The regression concerns only comment 2 and 3.

All the other cases in comment 1 were various attempts of mine to see whether anything had changed that allowed vectorization using a different syntax.
I am happy that now they all vectorize (except bar2...)

When I wrote the original test case in 2011, I introduced the int vector to make it vectorize (most probably I also submitted a bug report on the subject).

Comment 14 vincenzo Innocente 2014-05-16 12:25:52 UTC

Provided that future patches make the code in comment 1 and 2 (and bar) vectorize, that is fine with me.
If it ends up vectorizing with "bool" instead of "int" as well, even better.
(I am not sure that bit/byte handling is really more efficient in sse and avx than plain 32-bit int.)

Comment 15 Marc Glisse 2014-05-17 12:23:43 UTC

Seems related to PR 57328.

Comment 16 Jakub Jelinek 2014-07-16 13:27:11 UTC

GCC 4.9.1 has been released.

Comment 17 Jakub Jelinek 2014-10-30 10:37:25 UTC

GCC 4.9.2 has been released.

Comment 18 Richard Biener 2014-12-01 12:04:30 UTC

There is a proposed patch to if-conversion that solves the multiple-use issue by
duplicating the involved statements (ugh).

Comment 19 Jakub Jelinek 2015-06-26 19:56:21 UTC

GCC 4.9.3 has been released.

Comment 20 Richard Biener 2016-06-08 14:00:57 UTC

All functions in the description are vectorized on trunk, and so are those from comment#1 and comment#2.  All but bar2 are vectorized with GCC 6 already.

Thus fixed on trunk (or with GCC5/6 with OMP SIMD aka force-vect).
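
The OMP SIMD route is the one barMP in the description already uses. A minimal sketch, assuming the file is built with -fopenmp or -fopenmp-simd. The function name select_simd and the pointer-based signature are illustrative and not from the report, and the sketch assumes z does not overlap x, w or y:

```cpp
// With -fopenmp or -fopenmp-simd, the pragma marks the loop as safe to
// vectorize. Without either flag the pragma is ignored and the loop
// still computes the same result.
void select_simd(float* z, const float* x, const float* w,
                 const float* y, int n) {
#pragma omp simd
    for (int i = 0; i < n; ++i)
        z[i] = ((x[i] > 0) & (w[i] < 0)) ? z[i] : y[i];
}
```

Whether a given compiler version actually vectorizes the loop can be checked with -fopt-info-vec, as in the command lines earlier in this report.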