65935 – [6 Regression] 433.milc in SPEC CPU 2006 is miscompiled

Bug 65935 - [6 Regression] 433.milc in SPEC CPU 2006 is miscompiled

Summary: [6 Regression] 433.milc in SPEC CPU 2006 is miscompiled

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	6.0

Importance:	P3 normal
Target Milestone:	6.0
Assignee:	Richard Biener

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-04-30 02:25 UTC by H.J. Lu
Modified:	2015-05-04 13:31 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2015-04-30 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description H.J. Lu 2015-04-30 02:25:25 UTC

On Linux/i686, r62283 miscompiled 433.milc in SPEC CPU 2006:

gcc -m32 -c -o com_vanilla.o -DSPEC_CPU -DNDEBUG -I. -DFN -DFAST -DCONGRAD_TMP_VECTORS -DDSLASH_TMP_LINKS  -O3 -funroll-loops -msse2 -mfpmath=sse -ffast-math -fwhole-program -flto=jobserver -fuse-linker-plugin                com_vanilla.c
gcc -m32 -c -o io_nonansi.o -DSPEC_CPU -DNDEBUG -I. -DFN -DFAST -DCONGRAD_TMP_VECTORS -DDSLASH_TMP_LINKS  -O3 -funroll-loops -msse2 -mfpmath=sse -ffast-math -fwhole-program -flto=jobserver -fuse-linker-plugin                io_nonansi.c
gcc -m32  -O3 -funroll-loops -msse2 -mfpmath=sse -ffast-math -fwhole-program -flto=jobserver -fuse-linker-plugin          control.o f_meas.o gauge_info.o setup.o update.o update_h.o update_u.o layout_hyper.o check_unitarity.o d_plaq4.o gaugefix2.o io_helpers.o io_lat4.o make_lattice.o path_product.o ploop3.o ranmom.o ranstuff.o reunitarize2.o gauge_stuff.o grsource_imp.o mat_invert.o quark_stuff.o rephase.o cmplx.o addmat.o addvec.o clear_mat.o clearvec.o m_amatvec.o m_mat_an.o m_mat_na.o m_mat_nn.o m_matvec.o make_ahmat.o rand_ahmat.o realtr.o s_m_a_mat.o s_m_a_vec.o s_m_s_mat.o s_m_vec.o s_m_mat.o su3_adjoint.o su3_dot.o su3_rdot.o su3_proj.o su3mat_copy.o submat.o subvec.o trace_su3.o uncmp_ahmat.o msq_su3vec.o sub4vecs.o m_amv_4dir.o m_amv_4vec.o m_mv_s_4dir.o m_su2_mat_vec_n.o l_su2_hit_n.o r_su2_hit_a.o m_su2_mat_vec_a.o gaussrand.o byterevn.o m_mat_hwvec.o m_amat_hwvec.o dslash_fn2.o d_congrad5_fn.o com_vanilla.o io_nonansi.o             -lm        -o milc

  Running 433.milc ref peak lto default

433.milc: copy 0 non-zero return code (exit code=1, signal=0)

WARMUPS COMPLETED
Unitarity problem on node 0, site 0, dir 0 tolerance=1.000000e-04
SU3 matrix:
0.991448 0.021247 -0.099836 -0.054012 0.051686 0.032000 
0.100302 -0.049616 0.989806 -0.028206 0.027581 -0.078778 
-0.059070 0.023841 -0.022247 -0.078163 0.994634 0.006470 
repeat in hex:
82a7e05f 9c8421c0 b159fbbd cc2c8301 eca1e2e4 4a88bfbf 
7c12f3d5 f7ba11aa 1939e0e5 a49c5e6c 0095ea4b a16c5754 
cfce59e0 209d3be4 702a8d28 2fdd37ea d75f9836 f0058a61 
...
Unitarity error count exceeded.
termination: Wed Apr 29 11:44:07 2015

Termination: status = 1

Comment 1 H.J. Lu 2015-04-30 04:29:47 UTC

gcc -m32  -O3 -funroll-loops -msse2 -mfpmath=sse -ffast-math also miscompiled.

Comment 2 Jakub Jelinek 2015-04-30 07:31:26 UTC

What exact revision do you mean?  r62283 is quite a bit old...

Comment 3 Richard Biener 2015-04-30 07:40:01 UTC

I suppose r222514.

Comment 4 H.J. Lu 2015-04-30 12:08:00 UTC

(In reply to Richard Biener from comment #3)
> I suppose r222514.

Yes, it is.

Comment 5 Richard Biener 2015-04-30 12:34:16 UTC

Also reproduces with -Ofast -march=bdver2 -m32 so I suppose -O3 -ffast-math -m32 -msse2 should also reproduce it.

I wont get to look into this before Monday, any help in bisecting (-fno-tree-vectorize for which file(s) fix it?  for which function?) appreciated
until then.

At least milc is C and not too big ;)

Comment 6 Richard Biener 2015-04-30 14:48:13 UTC

Also fails with the test input.

Comment 7 Richard Biener 2015-04-30 15:28:54 UTC

Building rephase.c:rephase with -fno-tree-slp-vectorize makes it succeed.

The loop is basically

register int i,j,k,dir;
register site *s;

    for(i=0,s=lattice;i<sites_on_node;i++,s++){
 for(dir=0;dir<=3;dir++){
     for(j=0;j<3;j++)for(k=0;k<3;k++){
  s->link[dir].e[j][k].real *= s->phase[dir];
  s->link[dir].e[j][k].imag *= s->phase[dir];
     }
 }
    }

where the inner two loops are unrolled and SLP applies to them.  We need to perform swapping on some of the mult operands and then somehow fail to use
the correct vectors to build up s->phase[dir] from parts.

  vect_cst_.35_44 = {_117, _121};
  vect_cst_.36_45 = {_110, _114};
  vect_cst_.37_46 = {_103, _107};
  vect_cst_.38_47 = {_92, _96};
  vect_cst_.39_48 = {_85, _89};
  vect_cst_.40_52 = {_78, _82};
  vect_cst_.41_53 = {_49, _71};
  vect_cst_.42_54 = {_31, _60};
  vect_cst_.43_55 = {_24, _27};

note how these should be all {_24, _24} but only the first one is correct.

I think this is a latent issue since we do the swapping tricks.

Comment 8 Richard Biener 2015-05-04 09:24:01 UTC

Testcase that also fails with -m64.

/* { dg-do run } */
/* { dg-additional-options "-O3" } */
/* { dg-require-effective-target vect_double } */

#include "tree-vect.h"

extern void abort (void);
extern void *malloc (__SIZE_TYPE__);

struct site {
    struct {
        struct {
            double real;
            double imag;
        } e[3][3];
    } link[32];
    double phase[32];
} *lattice;
int sites_on_node;

void rephase (void)
{
  int i,j,k,dir;
  struct site *s;
  for(i=0,s=lattice;i<sites_on_node;i++,s++)
    for(dir=0;dir<32;dir++)
      for(j=0;j<3;j++)for(k=0;k<3;k++)
        {
          s->link[dir].e[j][k].real *= s->phase[dir];
          s->link[dir].e[j][k].imag *= s->phase[dir];
        }
}

int main()
{
  int i,j,k;
  check_vect ();
  sites_on_node = 1;
  lattice = malloc (sizeof (struct site) * sites_on_node);
  for (i = 0; i < 32; ++i)
    {
      lattice->phase[i] = i;
      for (j = 0; j < 3; ++j)
        for (k = 0; k < 3; ++k)
          {
            lattice->link[i].e[j][k].real = 1.0;
            lattice->link[i].e[j][k].imag = 1.0;
            __asm__ volatile ("" : : : "memory");
          }
    }
  rephase ();
  for (i = 0; i < 32; ++i)
    for (j = 0; j < 3; ++j)
      for (k = 0; k < 3; ++k)
        if (lattice->link[i].e[j][k].real != i
            || lattice->link[i].e[j][k].imag != i)
          abort ();
  return 0;
}

/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "slp1" } } */
/* { dg-final { cleanup-tree-dump "slp1" } } */
/* { dg-final { cleanup-tree-dump "vect" } } */

Comment 9 Richard Biener 2015-05-04 10:03:30 UTC

Testing a fix.  The issue is latent with at least constant operands - but it's
probably impossible(?) to get bogus operand order on those.  In theory if the
loads from phase were just in a different BB the issue would reproduce as well.
OTOH then GCC 5 swaps in vect_get_and_check_slp_defs which already does swap
operands in the IL...

Comment 10 Richard Biener 2015-05-04 13:31:34 UTC

Author: rguenth
Date: Mon May  4 13:31:02 2015
New Revision: 222764

URL: https://gcc.gnu.org/viewcvs?rev=222764&root=gcc&view=rev
Log:
2015-05-04  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/65935
	* tree-vect-slp.c (vect_build_slp_tree): If we swapped operands
	then make sure to apply that swapping to the IL.

	* gcc.dg/vect/bb-slp-pr65935.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/bb-slp-pr65935.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-slp.c

Comment 11 Richard Biener 2015-05-04 13:31:37 UTC

Fixed.