Created attachment 29754
Test case

When compiling the attached file with GCC 4.8.0 on an AVX capable system the main loop isn't vectorized. This is a regression compared with 4.7.2 on the same system (where the loop is fully vectorized).

I apologize for the length of the test case -- smaller examples do not reproduce the behaviour in question.

Output of -save-temps:

gcc-mp-4.8 -v -save-temps -Ofast -march=native -std=c99 -S test.c
Using built-in specs.
COLLECT_GCC=gcc-mp-4.8
Target: x86_64-apple-darwin12
Configured with: ../gcc-4.8-20130321/configure --prefix=/opt/local --build=x86_64-apple-darwin12 --enable-languages=c,c++,objc,obj-c++,fortran,java --libdir=/opt/local/lib/gcc48 --includedir=/opt/local/include/gcc48 --infodir=/opt/local/share/info --mandir=/opt/local/share/man --datarootdir=/opt/local/share/gcc-4.8 --with-local-prefix=/opt/local --with-system-zlib --disable-nls --program-suffix=-mp-4.8 --with-gxx-include-dir=/opt/local/include/gcc48/c++/ --with-gmp=/opt/local --with-mpfr=/opt/local --with-mpc=/opt/local --with-ppl=/opt/local --with-cloog=/opt/local --enable-cloog-backend=isl --disable-cloog-version-check --enable-stage1-checking --disable-multilib --enable-lto --enable-libstdcxx-time --with-as=/opt/local/bin/as --with-ld=/opt/local/bin/ld --with-ar=/opt/local/bin/ar --with-bugurl=https://trac.macports.org/newticket --with-pkgversion='MacPorts gcc48 4.8-20130321_0'
Thread model: posix
gcc version 4.8.0 20130321 (prerelease) (MacPorts gcc48 4.8-20130321_0)
COLLECT_GCC_OPTIONS='-mmacosx-version-min=10.8.3' '-v' '-save-temps' '-Ofast' '-march=native' '-std=c99' '-S'
 /opt/local/libexec/gcc/x86_64-apple-darwin12/4.8.0/cc1 -E -quiet -v -D__DYNAMIC__ test.c -march=corei7-avx -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=3072 -mtune=corei7-avx -fPIC -mmacosx-version-min=10.8.3 -std=c99 -Ofast -fpch-preprocess -o test.i
ignoring nonexistent directory "/opt/local/lib/gcc48/gcc/x86_64-apple-darwin12/4.8.0/../../../../../x86_64-apple-darwin12/include"
#include "..." search starts here:
#include <...> search starts here:
 /opt/local/lib/gcc48/gcc/x86_64-apple-darwin12/4.8.0/include
 /opt/local/include
 /opt/local/lib/gcc48/gcc/x86_64-apple-darwin12/4.8.0/include-fixed
 /usr/include
 /System/Library/Frameworks
 /Library/Frameworks
End of search list.
COLLECT_GCC_OPTIONS='-mmacosx-version-min=10.8.3' '-v' '-save-temps' '-Ofast' '-march=native' '-std=c99' '-S'
 /opt/local/libexec/gcc/x86_64-apple-darwin12/4.8.0/cc1 -fpreprocessed test.i -march=corei7-avx -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=3072 -mtune=corei7-avx -fPIC -quiet -dumpbase test.c -mmacosx-version-min=10.8.3 -auxbase test -Ofast -std=c99 -version -o test.s
GNU C (MacPorts gcc48 4.8-20130321_0) version 4.8.0 20130321 (prerelease) (x86_64-apple-darwin12)
	compiled by GNU C version 4.8.0 20130321 (prerelease), GMP version 5.0.5, MPFR version 3.1.1-p2, MPC version 1.0.1
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
GNU C (MacPorts gcc48 4.8-20130321_0) version 4.8.0 20130321 (prerelease) (x86_64-apple-darwin12)
	compiled by GNU C version 4.8.0 20130321 (prerelease), GMP version 5.0.5, MPFR version 3.1.1-p2, MPC version 1.0.1
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 6291d2010395c7dee8043d72914d31cb
COMPILER_PATH=/opt/local/libexec/gcc/x86_64-apple-darwin12/4.8.0/:/opt/local/libexec/gcc/x86_64-apple-darwin12/4.8.0/:/opt/local/libexec/gcc/x86_64-apple-darwin12/:/opt/local/lib/gcc48/gcc/x86_64-apple-darwin12/4.8.0/:/opt/local/lib/gcc48/gcc/x86_64-apple-darwin12/
LIBRARY_PATH=/opt/local/lib/gcc48/gcc/x86_64-apple-darwin12/4.8.0/:/opt/local/lib/gcc48/gcc/x86_64-apple-darwin12/4.8.0/../../../:/usr/lib/
COLLECT_GCC_OPTIONS='-mmacosx-version-min=10.8.3' '-v' '-save-temps' '-Ofast' '-march=native' '-std=c99' '-S'
t.c:137: note: not vectorized: no vectype for stmt: u ={v} {CLOBBER}; scalar_type: float[5]
t.c:137: note: bad data references.
t.c:137: note: ***** Re-trying analysis with vector size 16
Somewhat reduced testcase:

inline void
bar (const float s[5], float z[3][5])
{
  float a = s[0], b = s[1], c = s[2], d = s[3], e = s[4];
  float f = 1.0f / a;
  float u = f * b, v = f * c, w = f * d;
  float p = 0.4f * (e - 0.5f * (b * u + c * v + d * w));
  z[0][3] = b * w;
  z[1][3] = c * w;
  z[2][3] = d * w + p;
}

void
foo (unsigned long n,
     const float *__restrict u0, const float *__restrict u1,
     const float *__restrict u2, const float *__restrict u3,
     const float *__restrict u4,
     const float *__restrict s0, const float *__restrict s1,
     const float *__restrict s2,
     float *__restrict t3, float *__restrict t4)
{
  unsigned long i;
  for (i = 0; i < n; i++)
    {
      float u[5], f[3][5];
      u[0] = u0[i];
      u[1] = u1[i];
      u[2] = u2[i];
      u[3] = u3[i];
      u[4] = u4[i];
      bar (u, f);
      t3[i] = s0[i] * f[0][3] + s1[i] * f[1][3] + s2[i] * f[2][3];
    }
}
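For comparison, here is a hand-inlined sketch of the same arithmetic that uses only scalars, so the loop body has no block-scoped arrays whose end of lifetime could be marked with a {CLOBBER} statement like the one in the dump above. The name foo_scalar is hypothetical and not part of the testcase, the unused t4 parameter is dropped, and whether any particular GCC release vectorizes this variant has not been checked here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-inlined variant of foo () from the reduced testcase.
   The local arrays u[5] and f[3][5] are replaced by scalars; the
   arithmetic is the same as bar () followed by the t3[i] update.  */
void
foo_scalar (unsigned long n,
            const float *__restrict u0, const float *__restrict u1,
            const float *__restrict u2, const float *__restrict u3,
            const float *__restrict u4,
            const float *__restrict s0, const float *__restrict s1,
            const float *__restrict s2,
            float *__restrict t3)
{
  unsigned long i;

  /* Sanity check added for this sketch only.  */
  assert (t3 != NULL || n == 0);

  for (i = 0; i < n; i++)
    {
      float a = u0[i], b = u1[i], c = u2[i], d = u3[i], e = u4[i];
      float r = 1.0f / a;                       /* "f" in bar ()  */
      float u = r * b, v = r * c, w = r * d;
      float p = 0.4f * (e - 0.5f * (b * u + c * v + d * w));

      /* f[0][3], f[1][3] and f[2][3] from bar (), used directly.  */
      t3[i] = s0[i] * (b * w) + s1[i] * (c * w) + s2[i] * (d * w + p);
    }
}
```

Comparing the vectorizer dump for this variant against the one for foo () may help confirm that the clobbers, rather than the arithmetic, are what blocks vectorization.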
The clobbers are dead and useless btw, but we only remove clobbers from within remove_unused_locals, which doesn't run between IPA inlining and the point right before RTL expansion (rightfully so).

Vectorizing without removing the clobbers requires us to honor them, at least for the placement of aliasing vectorized stores / loads, and also for IV adjustments in case the clobber is a MEM of an SSA name that is loop variant (now possible, but not on the 4.8 branch).

So the simplest solution is to discard all clobbers inside the vectorized loop body.
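The idea of discarding clobbers from a loop body can be sketched on a toy statement list. Everything below (toy_kind, toy_stmt, toy_drop_clobbers) is made up for illustration and has nothing to do with GCC's actual GIMPLE data structures or the eventual patch; it only shows the shape of the filtering step.

```c
#include <assert.h>
#include <stddef.h>

/* Toy statement kinds; illustrative only, not GCC's GIMPLE codes.  */
enum toy_kind { TOY_LOAD, TOY_COMPUTE, TOY_STORE, TOY_CLOBBER };

struct toy_stmt
{
  enum toy_kind kind;
  int id;                       /* Arbitrary tag to identify the statement.  */
};

/* Remove every TOY_CLOBBER from body[0..n), keeping the remaining
   statements in their original order.  Returns the new length.  */
size_t
toy_drop_clobbers (struct toy_stmt *body, size_t n)
{
  size_t i, out = 0;

  assert (body != NULL || n == 0);

  for (i = 0; i < n; i++)
    if (body[i].kind != TOY_CLOBBER)
      body[out++] = body[i];

  return out;
}
```

Since a clobber only marks the end of a variable's lifetime and computes nothing, dropping it does not change the values the remaining statements produce.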
Author: rguenth
Date: Tue May 28 13:36:25 2013
New Revision: 199380

URL: http://gcc.gnu.org/viewcvs?rev=199380&root=gcc&view=rev
Log:
2013-05-28  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56787
	* tree-vect-data-refs.c (vect_analyze_data_refs): Drop clobbers
	from the list of data references.
	* tree-vect-loop.c (vect_determine_vectorization_factor): Skip
	clobbers.
	(vect_analyze_loop_operations): Likewise.
	(vect_transform_loop): Remove clobbers.

	* gcc.dg/vect/pr56787.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr56787.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
    trunk/gcc/tree-vect-loop.c
The new testcase fails on powerpc64-linux, as can be seen in http://gcc.gnu.org/ml/gcc-testresults/2013-06/msg00904.html.
The patch fixed x86_64 but the new testcase fails on PPC64.
GCC 4.8.2 has been released.
Also fails on arm-* btw.
Author: rguenth
Date: Thu Dec  5 09:20:51 2013
New Revision: 205696

URL: http://gcc.gnu.org/viewcvs?rev=205696&root=gcc&view=rev
Log:
2013-12-05  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56787
	* gcc.dg/vect/pr56787.c: Adjust to not require vector float division.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/vect/pr56787.c
Maybe it works now.
(In reply to Richard Biener from comment #10)
> Maybe it works now.

PASSes on arm* now, thanks.
Working on PowerPC also.
The patch cannot be backported easily, so this is not going to be fixed for 4.8.
GCC 4.9.0 has been released.
GCC 4.9.1 has been released.
GCC 4.9.2 has been released.
Fixed for 4.9.3.