Created attachment 23991 [details]
reduced testcase

I just tested GCC 4.6 with -mxop and found that my testShift unit test is miscompiled. I tried hard to reduce the testcase further than what I have now, but it is very fragile, i.e. it produces correct code if I remove unrelated code, such as test calls after the problematic test. Compile the testcase with

  % g++ -mxop -O2 -ftree-vectorize testcase.cpp

to get the error. Let me know if I can help further.
I see no vectorized loop, which makes me suspect if-conversion. Can you check whether -ftree-loop-if-convert alone is enough to trigger the failure?
You also haven't specified your target, whether you compile it as 32-bit or 64-bit, or where exactly it misbehaves.
(In reply to comment #1)
> I see no vectorized loop which makes me suspect if-conversion. Can you
> try if -ftree-loop-if-convert is enough to trigger the failure?

% /opt/gcc-4.6.0/bin/g++ -mxop -O2 -ftree-vectorize xop_miscompile.cpp && ./a.out
FAIL: ┍ at /home/mkretz/src/Vc/tests/arithmetics.cpp:259:
FAIL: │ a[i] (16) == x (8) -> false
FAIL: ┕ testShift<int_v>
FAIL: ┍ at /home/mkretz/src/Vc/tests/arithmetics.cpp:259:
FAIL: │ a[i] (16) == x (8) -> false
FAIL: ┕ testShift<uint_v>

 Testing done. 0 tests passed. 2 tests failed.

% /opt/gcc-4.6.0/bin/g++ -mxop -O2 -ftree-loop-if-convert xop_miscompile.cpp && ./a.out
PASS: testShift<int_v>
PASS: testShift<uint_v>

 Testing done. 2 tests passed. 0 tests failed.

(In reply to comment #2)
> You also haven't specified your target, whether you compile it as 32-bit or
> 64-bit and where exactly it misbehaves.

The target is AMD64 (i.e. Bulldozer/Family 15h). The failure is in line 1724, or more specifically in lines 1288 and 1290: the operator>>= seems to get lost, so that a[i] contains 16 instead of 16 >> i. XOP has an instruction for this shift. For operator<<= (lines 1287 and 1289) GCC correctly translates the code to vpshad.
Managed to reproduce it, looking into it.
The shift corresponding to

  a = Vec(16);
  a >>= shifts;

is actually there, but it is the wrong one: V4SImode = V4SImode >> SImode rather than the V4SImode = V4SImode >> V4SImode which needs to be done. As the first int in shifts is 0, this means %xmm0, containing {16,16,16,16} initially, is shifted down by {0,0,0,0} rather than {0,1,2,3}.
I should mention it is the only vpsrad insn in the testcase. In *.cunroll we have:

  [pr48616.C : 113:76] D.52396_86 = MEM[(const AliasingEntryType * {ref-all})&shifts];
  ...
  [pr48616.C : 113:76] D.52395_89 = MEM[(const AliasingEntryType * {ref-all})&shifts + 4B];
  ...
  [pr48616.C : 113:76] D.52394_92 = MEM[(const AliasingEntryType * {ref-all})&shifts + 8B];
  ...
  [pr48616.C : 113:76] D.52393_95 = MEM[(const AliasingEntryType * {ref-all})&shifts + 12B];
  ...
<L12>:
  [pr48616.C : 1721:5] [pr48616.C : 1721] [pr48616.C : 1721] MEM[(struct Vector *)&a].data = { 68719476752, 68719476752 };
  [pr48616.C : 1288:172] D.52507_129 = MEM[(AliasingEntryType & {ref-all})&a];
  [pr48616.C : 1288:172] D.52506_131 = D.52507_129 >> D.52396_86;
  [pr48616.C : 1288:172] MEM[(AliasingEntryType & {ref-all})&a] = D.52506_131;
  [pr48616.C : 1288:258] D.52504_132 = MEM[(AliasingEntryType & {ref-all})&a + 4];
  [pr48616.C : 1288:258] D.52503_134 = D.52504_132 >> D.52395_89;
  [pr48616.C : 1288:258] MEM[(AliasingEntryType & {ref-all})&a + 4] = D.52503_134;
  [pr48616.C : 1288:344] D.52501_135 = MEM[(AliasingEntryType & {ref-all})&a + 8];
  [pr48616.C : 1288:344] D.52500_137 = D.52501_135 >> D.52394_92;
  [pr48616.C : 1288:344] MEM[(AliasingEntryType & {ref-all})&a + 8] = D.52500_137;
  [pr48616.C : 1288:430] D.52498_138 = MEM[(AliasingEntryType & {ref-all})&a + 12];
  [pr48616.C : 1288:430] D.52497_140 = D.52498_138 >> D.52393_95;
  [pr48616.C : 1288:430] MEM[(AliasingEntryType & {ref-all})&a + 12] = D.52497_140;

*.slp has:

<L12>:
  [pr48616.C : 1721:5] # .MEMD.52319_50 = VDEF <.MEMD.52319_1126>
  [pr48616.C : 1721] [pr48616.C : 1721] MEM[(struct VectorD.34523 *)&aD.38703].dataD.34590 = { 68719476752, 68719476752 };
  # PT = anything
  # ALIGN = 16, MISALIGN = 0
  vect_p.1912D.63297_1085 = &MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703];
  [pr48616.C : 1288:172] # VUSE <.MEMD.52319_50>
  vect_var_.1913D.63298_534 = MEM[(AliasingEntryTypeD.34571 & {ref-all})vect_p.1912D.63297_1085];
  [pr48616.C : 1288:172] # VUSE <.MEMD.52319_50>
  D.52507_129 = MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703];
  [pr48616.C : 1288:172] vect_var_.1914D.63299_575 = vect_var_.1913D.63298_534 >> D.52396_86;
  [pr48616.C : 1288:172] D.52506_131 = D.52507_129 >> D.52396_86;
  # PT = anything
  # ALIGN = 16, MISALIGN = 0
  vect_p.1918D.63303_576 = &MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703];
  [pr48616.C : 1288:258] # VUSE <.MEMD.52319_50>
  D.52504_132 = MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703 + 4];
  [pr48616.C : 1288:258] D.52503_134 = D.52504_132 >> D.52395_89;
  [pr48616.C : 1288:344] # VUSE <.MEMD.52319_50>
  D.52501_135 = MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703 + 8];
  [pr48616.C : 1288:344] D.52500_137 = D.52501_135 >> D.52394_92;
  [pr48616.C : 1288:430] # VUSE <.MEMD.52319_50>
  D.52498_138 = MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703 + 12];
  [pr48616.C : 1288:430] D.52497_140 = D.52498_138 >> D.52393_95;
  [pr48616.C : 1288:430] # .MEMD.52319_580 = VDEF <.MEMD.52319_50>
  MEM[(AliasingEntryTypeD.34571 & {ref-all})vect_p.1918D.63303_576] = vect_var_.1914D.63299_575;
Created attachment 23997 [details]
pr48616.ii.bz2

Preprocessed, for -O2 -ftree-vectorize -mxop -m64.

The bug is during the second vect_schedule_slp call on that testcase, or perhaps in its analysis. The first vect_schedule_slp call is on a bb that reads from shifts and is vectorized correctly. The problem in the second bb is that none of the 4 rhs2s of the shifts have their definition in the current bb, thus vect_is_simple_use does:

  if ((loop && !flow_bb_inside_loop_p (loop, bb))
      || (!loop && bb != BB_VINFO_BB (bb_vinfo))
      || (!loop && gimple_code (*def_stmt) == GIMPLE_PHI))
    *dt = vect_external_def;

and vectorizable_shift assumes that vect_external_def means a vector shift by scalar. Nothing during the analysis checks that, if it assumes a shift by scalar, the shift count is actually the same in all of the to-be-SLP-replaced stmts. In this case, if it remembered the vect_var_ from the defining bb, it could very well just use it as the rhs2 operand of the vector shift, but that assumes the bb in which it is defined has been SLP-vectorized already.
extern void abort (void);
int a[4] __attribute__((aligned (32)));
int b[4] __attribute__((aligned (32)));
int c[4] __attribute__((aligned (32)));
int d[4] __attribute__((aligned (32)));
int e[4] __attribute__((aligned (32)));

__attribute__((noinline, noclone)) void
foo (int i)
{
  a[0] = b[0] << c[0];
  a[1] = b[1] << c[1];
  a[2] = b[2] << c[2];
  a[3] = b[3] << c[3];
  if (i)
    {
      d[0] = e[0] >> c[0];
      d[1] = e[1] >> c[1];
      d[2] = e[2] >> c[2];
      d[3] = e[3] >> c[3];
    }
}

int
main ()
{
  int i;
  int *t;
  for (i = 0; i < 4; i++)
    {
      b[i] = 32;
      c[i] = i;
      e[i] = 32;
    }
  asm volatile ("" : : "r" (b) : "memory");
  asm volatile ("" : : "r" (c) : "memory");
  asm volatile ("" : "=r" (t) : "0" (d) : "memory");
  foo (t != 0);
  for (i = 0; i < 4; i++)
    if (a[i] != (32 << i) || d[i] != (32 >> i))
      abort ();
  return 0;
}

is a reduced testcase for -O2 -ftree-vectorize -mxop. I might try to fix this on Monday, unless Ira beats me to it.
Created attachment 24003 [details]
gcc46-pr48616.patch

This untested patch seems to work for me, though the generated code is in some cases still less than ideal, because it doesn't figure out that it can just use a vector computed in the earlier bb. It would be nice if it could figure that out, or if some later pass could clean that up. Additionally, it doesn't seem to SLP-optimize fn3's first bb, where it very well could use a scalar shift. The testcase will fail without the patch only when testing with -mxop in flags; unfortunately we don't yet have xop and xop_runtime effective targets in testsuite/lib/
See PR48317 for one reason why vector stuff isn't optimized too well.
Author: jakub
Date: Mon Apr 18 06:55:13 2011
New Revision: 172638

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=172638
Log:
	PR tree-optimization/48616
	* tree-vect-stmts.c (vectorizable_shift): If SLP, determine whether
	the shift is by scalar or vector based on whether all SLP scalar
	stmts have the same rhs.

	* gcc.dg/pr48616.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr48616.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-stmts.c
Author: jakub
Date: Mon Apr 18 07:38:11 2011
New Revision: 172640

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=172640
Log:
	PR tree-optimization/48616
	* tree-vect-stmts.c (vectorizable_shift): If SLP, determine whether
	the shift is by scalar or vector based on whether all SLP scalar
	stmts have the same rhs.

	* gcc.dg/pr48616.c: New test.

Added:
    branches/gcc-4_6-branch/gcc/testsuite/gcc.dg/pr48616.c
Modified:
    branches/gcc-4_6-branch/gcc/ChangeLog
    branches/gcc-4_6-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_6-branch/gcc/tree-vect-stmts.c
Fixed.
Thanks! I verified the patch with GCC 4.6.0 and my unit test passes again. Great response time! :)