z[i] = ( (x[i]>0) & (w[i]<0)) ? z[i] : y[i];

produces

bit-precision arithmetic not supported.
note: not vectorized: relevant stmt not supported: _6 = _5 > 0.0;

while integer bitwise operators are vectorized at will:

z[i] = ( k[i] & j[i] ) ? z[i] : y[i];

I tried to force the conversion to int in many ways (see below), such as

z[i] = ( int(x[i]>0) & int(w[i]<0)) ? z[i] : y[i];

succeeding only with the quite expensive

z[i] = (2==(int(x[i]>0) + int(w[i]<0))) ? z[i] : y[i];

Is there a way to force gcc to use integers without using arithmetic operators?

c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond.cc -msse4.2 -fopt-info-vec -fno-tree-slp-vectorize -fopenmp

cat cond.cc
float x[1024];
float y[1024];
float z[1024];
float w[1024];

int k[1024];
int j[1024];

void bar() {
  for (int i=0; i<1024; ++i)
    z[i] = ( (x[i]>0) & (w[i]<0)) ? z[i] : y[i];
}

void barMP() {
#pragma omp simd
  for (int i=0; i<1024; ++i)
    z[i] = ( int(x[i]>0) & int(w[i]<0)) ? z[i] : y[i];
}

void barInt() {
  for (int i=0; i<1024; ++i)
    z[i] = ( int(x[i]>0) & int(w[i]<0)) ? z[i] : y[i];
}

void barInt0() {
  for (int i=0; i<1024; ++i)
    z[i] = ( (0+int(x[i]>0)) & (0+int(w[i]<0)) ) ? z[i] : y[i];
}

void barPlus() {
  for (int i=0; i<1024; ++i)
    z[i] = (2==(int(x[i]>0) + int(w[i]<0))) ? z[i] : y[i];
}

void foo() {
  for (int i=0; i<1024; ++i)
    z[i] = ( k[i] & j[i] ) ? z[i] : y[i];
}

void foo2() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0;
    j[i] = w[i]<0;
  }
}

void bar2() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0;
    j[i] = w[i]<0;
    z[i] = ( k[i] & j[i]) ? z[i] : y[i];
  }
}
What I find quite absurd is that

void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0;
    k[i] &= w[i]<y[i];
    // z[i] = (k[i]) ? z[i] : y[i];
  }
}

vectorizes and

void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0;
    k[i] &= w[i]<y[i];
    z[i] = (k[i]) ? z[i] : y[i];
  }
}

does not with gcc 4.9.0.

This is a regression w.r.t. 4.7.0 compiled as

c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond.cc -msse4.2 -ftree-vectorizer-verbose=1

that produced

_Z4barXv:
.LFB1:
        .cfi_startproc
        xorps   %xmm4, %xmm4
        xorl    %eax, %eax
        pxor    %xmm3, %xmm3
        movdqa  .LC1(%rip), %xmm5
        .p2align 4,,10
        .p2align 3
.L9:
        movaps  y(%rax), %xmm2
        movaps  %xmm4, %xmm1
        movaps  w(%rax), %xmm0
        cmpltps x(%rax), %xmm1
        cmpltps %xmm2, %xmm0
        pand    %xmm5, %xmm0
        pand    %xmm1, %xmm0
        movaps  z(%rax), %xmm1
        movdqa  %xmm0, k(%rax)
        pcmpeqd %xmm3, %xmm0
        blendvps        %xmm0, %xmm2, %xmm1
        movaps  %xmm1, z(%rax)
        addq    $16, %rax
        cmpq    $4096, %rax
        jne     .L9
        rep ret
        .cfi_endproc
New test code:

cat cond0.cc
float x[1024];
float y[1024];
float z[1024];
float w[1024];
int k[1024];


void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = (x[i]>0) & (w[i]<y[i]);
    z[i] = (k[i]) ? z[i] : y[i];
  }
}

c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond0.cc -msse4.2 -ftree-vectorizer-verbose=1

Analyzing loop at cond0.cc:9

cond0.cc:9: note: vect_recog_bool_pattern: detected:
cond0.cc:9: note: pattern recognized: patt_23 = (int) patt_24;
cond0.cc:9: note: additional pattern stmt: patt_25 = _7 < _8 ? 1 : 0;
cond0.cc:9: note: additional pattern stmt: patt_24 = _5 > 0.0 ? patt_25 : 0;

Vectorizing loop at cond0.cc:9

cond0.cc:9: note: LOOP VECTORIZED.
cond0.cc:8: note: vectorized 1 loops in function.

c++ -v
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/sw/lcg/contrib/gcc/4.8.1/x86_64-slc6-gcc48-opt/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.8.1/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /build/vdiez/gcc-4.8.1/configure --prefix=/build/vdiez/gcc-4.8.1-installation --with-mpfr=/afs/cern.ch/sw/lcg/external/mpfr/3.1.2/x86_64-slc6-gcc48-opt --with-gmp=/afs/cern.ch/sw/lcg/external/gmp/5.1.1/x86_64-slc6-gcc48-opt --with-mpc=/afs/cern.ch/sw/lcg/external/mpc/1.0.1/x86_64-slc6-gcc48-opt --enable-libstdcxx-time --enable-lto --with-isl=/afs/cern.ch/sw/lcg/external/isl/0.11.1/x86_64-slc6-gcc48-opt --with-cloog=/afs/cern.ch/sw/lcg/external/cloog/0.18.0/x86_64-slc6-gcc48-opt --enable-languages=c,c++,fortran,go
Thread model: posix
gcc version 4.8.1 (GCC)

c++ -Ofast -Wall -fno-tree-slp-vectorize -ftree-loop-if-convert-stores -S cond0.cc -msse4.2 -fopt-info-vec-missed
cond0.cc:9:3: note: misalign = 0 bytes of ref x[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref w[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref y[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref k[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref z[i_18]
cond0.cc:9:3: note: misalign = 0 bytes of ref z[i_18]
cond0.cc:9:3: note: virtual phi. skip.
cond0.cc:9:3: note: num. args = 4 (not unary/binary/ternary op).
cond0.cc:9:3: note: not ssa-name.
cond0.cc:9:3: note: use not simple.
cond0.cc:9:3: note: bit-precision arithmetic not supported.
cond0.cc:9:3: note: not vectorized: relevant stmt not supported: _6 = _5 > 0.0;
cond0.cc:9:3: note: bad operation or unsupported loop bound.

Vincenzos-MacBook-Pro-2:vectorize innocent$ c++ -v
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-apple-darwin13.1.0/4.10.0/lto-wrapper
Target: x86_64-apple-darwin13.1.0
Configured with: ../gcc-trunk/configure --disable-multilib --disable-bootstrap --disable-libitm --enable-languages=c,c++,fortran,lto --disable-libsanitizer --enable-lto
Thread model: posix
gcc version 4.10.0 20140430 (experimental) [trunk revision 209930] (GCC)
I see on trunk after if-conversion

  _6 = _5 > 0.0;
  _9 = _7 < _8;
  _10 = _9 & _6;
  _11 = (int) _10;
  k[i_18] = _11;
  iftmp.0_13 = z[i_18];
  iftmp.0_2 = _10 ? iftmp.0_13 : _8;
  z[i_18] = iftmp.0_2;

so what happens is that we do have "bit-precision" arithmetic with the bitwise and. This is a regression because of the way we lower comparisons now, I guess.

I will have a look.
Actually the vectorizer punts on the comparisons themselves. The pattern recognizer handles some of them as

  patt_10 = _4 > 0.0 ? 1 : 0;

but not those feeding the BIT expressions, which would then need to be widened (though they are supported as bit-precision).
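For readers following along, here is a rough source-level analogue of the nested `cond ? 1 : 0` statements that the 4.8.1 pattern recognizer printed (patt_25 and patt_24 in the log above). The function names and the pointer-based signatures are made up for this illustration, and nothing here claims that either form vectorizes with any particular GCC version; the sketch only shows that the two forms compute the same result.

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;  // same array length as in the report

// Form from the report: two float comparisons combined with a bitwise AND.
void select_bitand(const float* x, const float* w, const float* y, float* z) {
  for (std::size_t i = 0; i < N; ++i)
    z[i] = ((x[i] > 0) & (w[i] < 0)) ? z[i] : y[i];
}

// Hypothetical hand-written analogue of the recognized pattern: each
// comparison becomes an int-valued conditional, nested instead of AND-ed.
void select_nested(const float* x, const float* w, const float* y, float* z) {
  for (std::size_t i = 0; i < N; ++i) {
    const int inner = (w[i] < 0) ? 1 : 0;     // like patt_25
    const int mask  = (x[i] > 0) ? inner : 0; // like patt_24
    z[i] = mask ? z[i] : y[i];
  }
}
```

Both functions leave z[i] untouched where both comparisons hold and copy y[i] otherwise.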
Of course, if you can make

z[i] = ( (x[i]>0) & (w[i]<0)) ? z[i] : y[i];

vectorize as well, that would be even better!
Created attachment 32803
patch

Like this.
great!

The original version (that vectorized in 4.8.1)

void barX() {
  for (int i=0; i<1024; ++i) {
    k[i] = (x[i]>0) & (w[i]<y[i]);
    z[i] = (k[i]) ? z[i] : y[i];
  }
}

does not vectorize yet.

On the other hand I am very happy to see

void bar() {
  for (int i=0; i<1024; ++i) {
    auto c = ( (x[i]>0) & (w[i]<y[i])) | (y[i]>0.5f);
    z[i] = c ? y[i] : z[i];
  }
}

vectorized.

  if (c) z[i] = y[i];

does not, even with -ftree-loop-if-convert-stores. This is not a real issue, at least as far as I am concerned.
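To make the difference between the two spellings concrete, here is a minimal sketch with made-up function names and a precomputed condition array `c` (an assumption for brevity; in the report `c` is computed inside the loop). Single-threaded, both produce the same z, but the select form stores to every element while the conditional form only stores where `c[i]` is true.

```cpp
#include <cstddef>

// Select form: z[i] is written on every iteration, possibly with its own
// old value. This is the shape reported as vectorized above.
void update_select(const bool* c, const float* y, float* z, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    z[i] = c[i] ? y[i] : z[i];
}

// Conditional-store form: z[i] is written only where c[i] is true, so the
// compiler has to turn the guarded store into an unconditional one before
// it can use a vector select.
void update_store(const bool* c, const float* y, float* z, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    if (c[i]) z[i] = y[i];
}
```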
On Fri, 16 May 2014, vincenzo.innocente at cern dot ch wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61194
>
> --- Comment #7 from vincenzo Innocente <vincenzo.innocente at cern dot ch> ---
> great!
>
> the original version (that vectorized in 4.8.1)
> void barX() {
>   for (int i=0; i<1024; ++i) {
>     k[i] = (x[i]>0) & (w[i]<y[i]);
>     z[i] = (k[i]) ? z[i] : y[i];
>   }
> }
>
> does not vectorize yet.

That's because we hit

check_bool_pattern (var=<ssa_name 0x7ffff6c36e10>, loop_vinfo=0x1f3e900, bb_vinfo=0x0)
    at /space/rguenther/src/svn/trunk/gcc/tree-vect-patterns.c:2596
2596                                &dt))
...
2605          if (!has_single_use (def))
2606            return false;

because

  _5 = x[i_18];
  _6 = _5 > 0.0;
  _7 = w[i_18];
  _8 = y[i_18];
  _9 = _7 < _8;
  _10 = _9 & _6;
  _11 = (int) _10;
  k[i_18] = _11;
  iftmp.0_13 = z[i_18];
  iftmp.0_2 = _10 ? iftmp.0_13 : _8;

thus we have CSEd the load from k and propagated from the conversion. VRP does this:

   _11 = (int) _10;
-  k[i_1] = _11;
-  if (_11 != 0)
+  k[i_18] = _11;
+  if (_10 != 0)

and -fno-tree-vrp "fixes" the regression. If k were of type _Bool then it likely wouldn't vectorize with 4.8 either.

The vectorizer cannot handle multi-uses of a pattern part (in this case it's the start which would be doable, but it's far from trivial ...).

That said,

static float x[1024];
static float y[1024];
static float z[1024];
static float w[1024];
static _Bool k[1024];

void __attribute__((noinline,noclone)) barX()
{
  int i;
  for (i=0; i<1024; ++i)
    {
      k[i] = (x[i]>0) & (w[i]<y[i]);
      z[i] = (k[i]) ? z[i] : y[i];
    }
}

is not vectorized even in 4.8 for the cited reason.

> On the other hand I am very happy to see
> void bar() {
>   for (int i=0; i<1024; ++i) {
>     auto c = ( (x[i]>0) & (w[i]<y[i])) | (y[i]>0.5f);
>     z[i] = c ? y[i] : z[i];
>   }
> }
> vectorized
> if (c) z[i] = y[i];
> does not even with -ftree-loop-if-convert-stores
> not a real issue at least for what I am concerned

I think it doesn't introduce data races unless you also specify --param allow-store-data-races=1.
I also don't see the testcases vectorized when using && instead of &. If not already there, these warrant (different) bugreports.
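For reference, a minimal sketch of the two spellings being compared here, with made-up function names and pointer parameters. For side-effect-free comparisons like these the results are identical; the difference is that `&&` short-circuits, so the second comparison (and its load from w) sits under control flow at the source level, whereas `&` always evaluates both operands. Whether either form vectorizes is up to the compiler version in use.

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;  // same array length as in the report

// Bitwise AND: both comparisons are always evaluated, no branch in the source.
void select_with_bitand(const float* x, const float* w, const float* y, float* z) {
  for (std::size_t i = 0; i < N; ++i)
    z[i] = ((x[i] > 0) & (w[i] < 0)) ? z[i] : y[i];
}

// Logical AND: w[i] < 0 is only evaluated when x[i] > 0 holds.
void select_with_logand(const float* x, const float* w, const float* y, float* z) {
  for (std::size_t i = 0; i < N; ++i)
    z[i] = ((x[i] > 0) && (w[i] < 0)) ? z[i] : y[i];
}
```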
Author: rguenth
Date: Fri May 16 11:21:11 2014
New Revision: 210514

URL: http://gcc.gnu.org/viewcvs?rev=210514&root=gcc&view=rev
Log:
2014-05-16  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/61194
	* tree-vect-patterns.c (adjust_bool_pattern): Also handle
	bool patterns ending in a COND_EXPR.

	* gcc.dg/vect/pr61194.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr61194.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-patterns.c
Created attachment 32805
patch fixing the regression

This would fix the regression (also without the previous patch?)
(In reply to Richard Biener from comment #10)
> Created attachment 32805 [details]
> patch fixing the regression
>
> This would fix the regression (also without the previous patch?)

It does, on the 4.9 branch at least, for

static float x[1024];
static float y[1024];
static float z[1024];
static float w[1024];
static int k[1024];

void __attribute__((noinline,noclone)) barX()
{
  int i;
  for (i=0; i<1024; ++i)
    {
      k[i] = x[i]>0;
      k[i] &= w[i]<y[i];
      z[i] = (k[i]) ? z[i] : y[i];
    }
}

but it doesn't change the outcome of the big testcase in the original report. It does together with the other patch though:

> g++-4.9 t.C -Ofast -ftree-loop-if-convert-stores -fopt-info-vec -B. -fopenmp
t.C:11:5: note: loop vectorized
t.C:19:23: note: loop vectorized
t.C:24:5: note: loop vectorized
t.C:29:5: note: loop vectorized
t.C:35:5: note: loop vectorized
t.C:41:5: note: loop vectorized
t.C:47:5: note: loop vectorized

bar2 is still not vectorized there. But with 4.7 I see the same as with 4.8 and 4.9:

35: LOOP VECTORIZED.
41: LOOP VECTORIZED.
47: LOOP VECTORIZED.

so where exactly does the "regression" part appear for you? Is that only for the code in comment#1?
void bar2() {
  for (int i=0; i<1024; ++i) {
    k[i] = x[i]>0;
    j[i] = w[i]<0;
    z[i] = ( k[i] & j[i]) ? z[i] : y[i];
  }
}

has similar issues (non-single-uses due to CSE and propagating from the conversion sources):

  _5 = x[i_20];
  _6 = _5 > 0.0;
  _7 = (int) _6;
  k[i_20] = _7;
  _9 = w[i_20];
  _10 = _9 < 0.0;
  _11 = (int) _10;
  j[i_20] = _11;
  _18 = _10 & _6;
  iftmp.0_14 = z[i_20];
  iftmp.0_15 = y[i_20];
  iftmp.0_2 = _18 ? iftmp.0_14 : iftmp.0_15;
  z[i_20] = iftmp.0_2;

This is generally caused by optimizing code to use smaller precisions. So I think we need a more general solution for this than just the 2nd patch I attached (which I won't pursue - I figure the first one would be way more useful as it results in the same result for your initial large testcase where the 2nd patch doesn't make a difference).
I confirm that with the last patch the regression is gone, also in a more complex actual application I had.

The regression concerns only comments 2 and 3. All the other cases in comment 1 were various attempts of mine to see whether anything had changed that allowed vectorization using a different syntax. I am happy that they now all vectorize (except bar2).

When I wrote the original test case in 2011, I introduced the int vector to make it vectorize (most probably I also submitted a bug report on the subject).
Provided that future patches make the code in comments 1 and 2 (and bar) vectorize, this is fine with me. If it ends up vectorizing with "bool" instead of "int" as well, even better. (I am not sure that bit/byte handling is really more efficient in SSE and AVX than plain 32-bit int.)
Seems related to PR 57328.
GCC 4.9.1 has been released.
GCC 4.9.2 has been released.
There is a proposed patch to if-conversion that solves the multiple-use issue by duplicating the involved statements (ugh).
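As a source-level picture of what "duplicating the involved statements" means for barX, here is a hand-written sketch with made-up function names and pointer parameters. It is only an analogy for the transformation done inside the compiler, not a recommended workaround: GCC is free to CSE the repeated comparisons back into one, which is exactly how the multiple-use situation arose in the first place.

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;  // same array length as in the report

// As in the report: the flag stored to k[i] and the select on z[i] share
// one comparison result, so that result has two uses.
void barX_shared(const float* x, const float* w, const float* y,
                 float* z, int* k) {
  for (std::size_t i = 0; i < N; ++i) {
    k[i] = (x[i] > 0) & (w[i] < y[i]);
    z[i] = k[i] ? z[i] : y[i];
  }
}

// Hand-duplicated analogue: the comparisons are written out twice, once
// for the stored flag and once for the select, so each copy has one use.
void barX_duplicated(const float* x, const float* w, const float* y,
                     float* z, int* k) {
  for (std::size_t i = 0; i < N; ++i) {
    k[i] = (x[i] > 0) & (w[i] < y[i]);
    const bool sel = (x[i] > 0) & (w[i] < y[i]);
    z[i] = sel ? z[i] : y[i];
  }
}
```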
GCC 4.9.3 has been released.
All functions in the description are vectorized on trunk, and so are those from comment#1 and comment#2. All but bar2 are already vectorized with GCC 6. Thus fixed on trunk (or with GCC 5/6 with OMP SIMD aka force-vect).
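For anyone landing here with GCC 5 or 6, the OMP SIMD route mentioned above would look roughly like this for bar2. The function name and the pointer-based signature are made up for the sketch; the pragma only takes effect when compiling with -fopenmp (as in the original report) or -fopenmp-simd, and is otherwise ignored, so the scalar result is the same either way. Note that the pragma is a promise by the programmer that iterations are independent, which holds here as long as the arrays do not overlap.

```cpp
#include <cstddef>

// bar2 from the report with an explicit simd request. The caller must pass
// non-overlapping arrays of at least n elements each.
void bar2_simd(const float* x, const float* w, const float* y,
               float* z, int* k, int* j, std::size_t n) {
#pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    k[i] = x[i] > 0;
    j[i] = w[i] < 0;
    z[i] = (k[i] & j[i]) ? z[i] : y[i];
  }
}
```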