On Linux/i686, r62283 miscompiled 433.milc in SPEC CPU 2006: gcc -m32 -c -o com_vanilla.o -DSPEC_CPU -DNDEBUG -I. -DFN -DFAST -DCONGRAD_TMP_VECTORS -DDSLASH_TMP_LINKS -O3 -funroll-loops -msse2 -mfpmath=sse -ffast-math -fwhole-program -flto=jobserver -fuse-linker-plugin com_vanilla.c gcc -m32 -c -o io_nonansi.o -DSPEC_CPU -DNDEBUG -I. -DFN -DFAST -DCONGRAD_TMP_VECTORS -DDSLASH_TMP_LINKS -O3 -funroll-loops -msse2 -mfpmath=sse -ffast-math -fwhole-program -flto=jobserver -fuse-linker-plugin io_nonansi.c gcc -m32 -O3 -funroll-loops -msse2 -mfpmath=sse -ffast-math -fwhole-program -flto=jobserver -fuse-linker-plugin control.o f_meas.o gauge_info.o setup.o update.o update_h.o update_u.o layout_hyper.o check_unitarity.o d_plaq4.o gaugefix2.o io_helpers.o io_lat4.o make_lattice.o path_product.o ploop3.o ranmom.o ranstuff.o reunitarize2.o gauge_stuff.o grsource_imp.o mat_invert.o quark_stuff.o rephase.o cmplx.o addmat.o addvec.o clear_mat.o clearvec.o m_amatvec.o m_mat_an.o m_mat_na.o m_mat_nn.o m_matvec.o make_ahmat.o rand_ahmat.o realtr.o s_m_a_mat.o s_m_a_vec.o s_m_s_mat.o s_m_vec.o s_m_mat.o su3_adjoint.o su3_dot.o su3_rdot.o su3_proj.o su3mat_copy.o submat.o subvec.o trace_su3.o uncmp_ahmat.o msq_su3vec.o sub4vecs.o m_amv_4dir.o m_amv_4vec.o m_mv_s_4dir.o m_su2_mat_vec_n.o l_su2_hit_n.o r_su2_hit_a.o m_su2_mat_vec_a.o gaussrand.o byterevn.o m_mat_hwvec.o m_amat_hwvec.o dslash_fn2.o d_congrad5_fn.o com_vanilla.o io_nonansi.o -lm -o milc Running 433.milc ref peak lto default 433.milc: copy 0 non-zero return code (exit code=1, signal=0) WARMUPS COMPLETED Unitarity problem on node 0, site 0, dir 0 tolerance=1.000000e-04 SU3 matrix: 0.991448 0.021247 -0.099836 -0.054012 0.051686 0.032000 0.100302 -0.049616 0.989806 -0.028206 0.027581 -0.078778 -0.059070 0.023841 -0.022247 -0.078163 0.994634 0.006470 repeat in hex: 82a7e05f 9c8421c0 b159fbbd cc2c8301 eca1e2e4 4a88bfbf 7c12f3d5 f7ba11aa 1939e0e5 a49c5e6c 0095ea4b a16c5754 cfce59e0 209d3be4 702a8d28 2fdd37ea d75f9836 f0058a61 ... Unitarity error count exceeded. termination: Wed Apr 29 11:44:07 2015 Termination: status = 1
gcc -m32 -O3 -funroll-loops -msse2 -mfpmath=sse -ffast-math also miscompiled.
What exact revision do you mean? r62283 is quite a bit old...
I suppose r222514.
(In reply to Richard Biener from comment #3) > I suppose r222514. Yes, it is.
Also reproduces with -Ofast -march=bdver2 -m32 so I suppose -O3 -ffast-math -m32 -msse2 should also reproduce it. I wont get to look into this before Monday, any help in bisecting (-fno-tree-vectorize for which file(s) fix it? for which function?) appreciated until then. At least milc is C and not too big ;)
Also fails with the test input.
Building rephase.c:rephase with -fno-tree-slp-vectorize makes it succeed. The loop is basically register int i,j,k,dir; register site *s; for(i=0,s=lattice;i<sites_on_node;i++,s++){ for(dir=0;dir<=3;dir++){ for(j=0;j<3;j++)for(k=0;k<3;k++){ s->link[dir].e[j][k].real *= s->phase[dir]; s->link[dir].e[j][k].imag *= s->phase[dir]; } } } where the inner two loops are unrolled and SLP applies to them. We need to perform swapping on some of the mult operands and then somehow fail to use the correct vectors to build up s->phase[dir] from parts. vect_cst_.35_44 = {_117, _121}; vect_cst_.36_45 = {_110, _114}; vect_cst_.37_46 = {_103, _107}; vect_cst_.38_47 = {_92, _96}; vect_cst_.39_48 = {_85, _89}; vect_cst_.40_52 = {_78, _82}; vect_cst_.41_53 = {_49, _71}; vect_cst_.42_54 = {_31, _60}; vect_cst_.43_55 = {_24, _27}; note how these should be all {_24, _24} but only the first one is correct. I think this is a latent issue since we do the swapping tricks.
Testcase that also fails with -m64. /* { dg-do run } */ /* { dg-additional-options "-O3" } */ /* { dg-require-effective-target vect_double } */ #include "tree-vect.h" extern void abort (void); extern void *malloc (__SIZE_TYPE__); struct site { struct { struct { double real; double imag; } e[3][3]; } link[32]; double phase[32]; } *lattice; int sites_on_node; void rephase (void) { int i,j,k,dir; struct site *s; for(i=0,s=lattice;i<sites_on_node;i++,s++) for(dir=0;dir<32;dir++) for(j=0;j<3;j++)for(k=0;k<3;k++) { s->link[dir].e[j][k].real *= s->phase[dir]; s->link[dir].e[j][k].imag *= s->phase[dir]; } } int main() { int i,j,k; check_vect (); sites_on_node = 1; lattice = malloc (sizeof (struct site) * sites_on_node); for (i = 0; i < 32; ++i) { lattice->phase[i] = i; for (j = 0; j < 3; ++j) for (k = 0; k < 3; ++k) { lattice->link[i].e[j][k].real = 1.0; lattice->link[i].e[j][k].imag = 1.0; __asm__ volatile ("" : : : "memory"); } } rephase (); for (i = 0; i < 32; ++i) for (j = 0; j < 3; ++j) for (k = 0; k < 3; ++k) if (lattice->link[i].e[j][k].real != i || lattice->link[i].e[j][k].imag != i) abort (); return 0; } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "slp1" } } */ /* { dg-final { cleanup-tree-dump "slp1" } } */ /* { dg-final { cleanup-tree-dump "vect" } } */
Testing a fix. The issue is latent with at least constant operands - but it's probably impossible(?) to get bogus operand order on those. In theory if the loads from phase were just in a different BB the issue would reproduce as well. OTOH then GCC 5 swaps in vect_get_and_check_slp_defs which already does swap operands in the IL...
Author: rguenth Date: Mon May 4 13:31:02 2015 New Revision: 222764 URL: https://gcc.gnu.org/viewcvs?rev=222764&root=gcc&view=rev Log: 2015-05-04 Richard Biener <rguenther@suse.de> PR tree-optimization/65935 * tree-vect-slp.c (vect_build_slp_tree): If we swapped operands then make sure to apply that swapping to the IL. * gcc.dg/vect/bb-slp-pr65935.c: New testcase. Added: trunk/gcc/testsuite/gcc.dg/vect/bb-slp-pr65935.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-vect-slp.c
Fixed.