Bug 48616 - -ftree-vectorize -mxop miscompiles right shift
Summary: -ftree-vectorize -mxop miscompiles right shift
Status: VERIFIED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.6.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-04-15 08:16 UTC by Matthias Kretz (Vir)
Modified: 2014-04-16 18:08 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-04-15 12:31:54


Attachments
reduced testcase (13.20 KB, text/x-c++src)
2011-04-15 08:16 UTC, Matthias Kretz (Vir)
Details
pr48616.ii.bz2 (87.36 KB, application/octet-stream)
2011-04-15 16:00 UTC, Jakub Jelinek
Details
gcc46-pr48616.patch (1.69 KB, patch)
2011-04-15 19:08 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Matthias Kretz (Vir) 2011-04-15 08:16:13 UTC
Created attachment 23991 [details]
reduced testcase

I just tested GCC 4.6 with -mxop and found that my testShift unit test is miscompiled. I tried hard to reduce the testcase further than what I got now, but it is very fragile, i.e. it produces correct code if I remove unrelated code, such as test calls after the problematic test.

Compile the testcase with
% g++ -mxop -O2 -ftree-vectorize testcase.cpp
to get the error.

Let me know if I can help further.
Comment 1 Richard Biener 2011-04-15 10:38:38 UTC
I see no vectorized loop which makes me suspect if-conversion.  Can you
try if -ftree-loop-if-convert is enough to trigger the failure?
Comment 2 Jakub Jelinek 2011-04-15 11:37:33 UTC
You also haven't specified your target, whether you compile it as 32-bit or 64-bit and where exactly it misbehaves.
Comment 3 Matthias Kretz (Vir) 2011-04-15 12:11:35 UTC
(In reply to comment #1)
> I see no vectorized loop which makes me suspect if-conversion.  Can you
> try if -ftree-loop-if-convert is enough to trigger the failure?

% /opt/gcc-4.6.0/bin/g++ -mxop -O2 -ftree-vectorize xop_miscompile.cpp && ./a.out
 FAIL: ┍ at /home/mkretz/src/Vc/tests/arithmetics.cpp:259:
 FAIL: │ a[i] (16) == x (8) -> false
 FAIL: ┕ testShift<int_v>
 FAIL: ┍ at /home/mkretz/src/Vc/tests/arithmetics.cpp:259:
 FAIL: │ a[i] (16) == x (8) -> false
 FAIL: ┕ testShift<uint_v>

 Testing done. 0 tests passed. 2 tests failed.

% /opt/gcc-4.6.0/bin/g++ -mxop -O2 -ftree-loop-if-convert xop_miscompile.cpp && ./a.out
 PASS: testShift<int_v>
 PASS: testShift<uint_v>

 Testing done. 2 tests passed. 0 tests failed.

(In reply to comment #2)
> You also haven't specified your target, whether you compile it as 32-bit or
> 64-bit and where exactly it misbehaves.

target is AMD64 (i.e. Bulldozer/Family 15h)

The failure is in line 1724, or more specifically in lines 1288 and 1290. The operator>>= seems to get lost, so that the a[i] contain 16 instead of 16>>i. XOP has an instruction for this shift. For operator<<= (lines 1287 and 1289) GCC correctly translates the code to vpshad.
Comment 4 Jakub Jelinek 2011-04-15 12:31:54 UTC
Managed to reproduce it, looking into it.
Comment 5 Jakub Jelinek 2011-04-15 13:07:56 UTC
The shift corresponding to
    a = Vec(16);
    a >>= shifts;
is actually there, but a wrong one, V4SImode = V4SImode >> SImode
rather than V4SImode = V4SImode >> V4SImode which needs to be done.  As the first int in shifts is 0, this means %xmm0 containing {16,16,16,16} initially
is shifted down by {0,0,0,0} rather than {0,1,2,3}.
Comment 6 Jakub Jelinek 2011-04-15 13:39:08 UTC
I should mention it is the only vpsrad insn in the testcase.

In *.cunroll we have:

  [pr48616.C : 113:76] D.52396_86 = MEM[(const AliasingEntryType * {ref-all})&shifts];
...
  [pr48616.C : 113:76] D.52395_89 = MEM[(const AliasingEntryType * {ref-all})&shifts + 4B];
...
  [pr48616.C : 113:76] D.52394_92 = MEM[(const AliasingEntryType * {ref-all})&shifts + 8B];
...
  [pr48616.C : 113:76] D.52393_95 = MEM[(const AliasingEntryType * {ref-all})&shifts + 12B];
...

<L12>:
  [pr48616.C : 1721:5] [pr48616.C : 1721] [pr48616.C : 1721] MEM[(struct Vector *)&a].data = { 68719476752, 68719476752 };
  [pr48616.C : 1288:172] D.52507_129 = MEM[(AliasingEntryType & {ref-all})&a];
  [pr48616.C : 1288:172] D.52506_131 = D.52507_129 >> D.52396_86;
  [pr48616.C : 1288:172] MEM[(AliasingEntryType & {ref-all})&a] = D.52506_131;
  [pr48616.C : 1288:258] D.52504_132 = MEM[(AliasingEntryType & {ref-all})&a + 4];
  [pr48616.C : 1288:258] D.52503_134 = D.52504_132 >> D.52395_89;
  [pr48616.C : 1288:258] MEM[(AliasingEntryType & {ref-all})&a + 4] = D.52503_134;
  [pr48616.C : 1288:344] D.52501_135 = MEM[(AliasingEntryType & {ref-all})&a + 8];
  [pr48616.C : 1288:344] D.52500_137 = D.52501_135 >> D.52394_92;
  [pr48616.C : 1288:344] MEM[(AliasingEntryType & {ref-all})&a + 8] = D.52500_137;
  [pr48616.C : 1288:430] D.52498_138 = MEM[(AliasingEntryType & {ref-all})&a + 12];
  [pr48616.C : 1288:430] D.52497_140 = D.52498_138 >> D.52393_95;
  [pr48616.C : 1288:430] MEM[(AliasingEntryType & {ref-all})&a + 12] = D.52497_140;

*.slp has:
<L12>:
  [pr48616.C : 1721:5] # .MEMD.52319_50 = VDEF <.MEMD.52319_1126>
  [pr48616.C : 1721] [pr48616.C : 1721] MEM[(struct VectorD.34523 *)&aD.38703].dataD.34590 = { 68719476752, 68719476752 };
  # PT = anything 
  # ALIGN = 16, MISALIGN = 0
  vect_p.1912D.63297_1085 = &MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703];
  [pr48616.C : 1288:172] # VUSE <.MEMD.52319_50>
  vect_var_.1913D.63298_534 = MEM[(AliasingEntryTypeD.34571 & {ref-all})vect_p.1912D.63297_1085];
  [pr48616.C : 1288:172] # VUSE <.MEMD.52319_50>
  D.52507_129 = MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703];
  [pr48616.C : 1288:172] vect_var_.1914D.63299_575 = vect_var_.1913D.63298_534 >> D.52396_86;
  [pr48616.C : 1288:172] D.52506_131 = D.52507_129 >> D.52396_86;
  # PT = anything 
  # ALIGN = 16, MISALIGN = 0
  vect_p.1918D.63303_576 = &MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703];
  [pr48616.C : 1288:258] # VUSE <.MEMD.52319_50>
  D.52504_132 = MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703 + 4];
  [pr48616.C : 1288:258] D.52503_134 = D.52504_132 >> D.52395_89;
  [pr48616.C : 1288:344] # VUSE <.MEMD.52319_50>
  D.52501_135 = MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703 + 8];
  [pr48616.C : 1288:344] D.52500_137 = D.52501_135 >> D.52394_92;
  [pr48616.C : 1288:430] # VUSE <.MEMD.52319_50>
  D.52498_138 = MEM[(AliasingEntryTypeD.34571 & {ref-all})&aD.38703 + 12];
  [pr48616.C : 1288:430] D.52497_140 = D.52498_138 >> D.52393_95; 
  [pr48616.C : 1288:430] # .MEMD.52319_580 = VDEF <.MEMD.52319_50>
  MEM[(AliasingEntryTypeD.34571 & {ref-all})vect_p.1918D.63303_576] = vect_var_.1914D.63299_575;
Comment 7 Jakub Jelinek 2011-04-15 16:00:43 UTC
Created attachment 23997 [details]
pr48616.ii.bz2

Preprocessed, for -O2 -ftree-vectorize -mxop -m64.

The bug is during the second vect_schedule_slp call on that testcase, or perhaps in its analysis.  The first vect_schedule_slp call is on a bb that reads from
shift and is vectorized correctly.  The problem on the second bb is that all the
4 rhs2s of the shifts don't have the definition in the current bb (thus vect_is_simple_use does:
  if ((loop && !flow_bb_inside_loop_p (loop, bb))
      || (!loop && bb != BB_VINFO_BB (bb_vinfo))
      || (!loop && gimple_code (*def_stmt) == GIMPLE_PHI))
    *dt = vect_external_def;
and vectorizable_shift assumes that vect_external_def means vector shift by scalar and nothing during the analysis checks that if it assumes it is a shift by scalar, the shift count must be actually the same between all the to be slp replaced stmts.
In this case if it remembered the vector_var_ from the defining BB it could very well just use it as the rhs2 operand of the vector shift, but that assumes the bb in which it is defined has been slp vectorized already.
Comment 8 Jakub Jelinek 2011-04-15 16:49:36 UTC
extern void abort (void);
int a[4] __attribute__((aligned (32)));
int b[4] __attribute__((aligned (32)));
int c[4] __attribute__((aligned (32)));
int d[4] __attribute__((aligned (32)));
int e[4] __attribute__((aligned (32)));

__attribute__((noinline, noclone))
void
foo (int i)
{
  a[0] = b[0] << c[0];
  a[1] = b[1] << c[1];
  a[2] = b[2] << c[2];
  a[3] = b[3] << c[3];
  if (i)
    {
      d[0] = e[0] >> c[0];
      d[1] = e[1] >> c[1];
      d[2] = e[2] >> c[2];
      d[3] = e[3] >> c[3];
    }
}

int
main ()
{
  int i;
  int *t;
  for (i = 0; i < 4; i++)
    {
      b[i] = 32;
      c[i] = i;
      e[i] = 32;
    }
  asm volatile ("" : : "r" (b) : "memory");
  asm volatile ("" : : "r" (c) : "memory");
  asm volatile ("" : "=r" (t) : "0" (d) : "memory");
  foo (t != 0);
  for (i = 0; i < 4; i++)
    if (a[i] != (32 << i) || d[i] != (32 >> i))
      abort ();
  return 0;
}

is reduced testcase for -O2 -ftree-vectorize -mxop.  I might try to fix this on Monday, unless Ira beats me to do it.
Comment 9 Jakub Jelinek 2011-04-15 19:08:50 UTC
Created attachment 24003 [details]
gcc46-pr48616.patch

This untested patch seems to work for me, though the generated code is in some cases still less than ideal, because it doesn't figure it can just use a vector computed in the earlier bb.  It would be nice if it could figure that out or some later pass be able to clean that up.

Additionally, it doesn't seem to SLP optimize fn3's first bb, where it very well could use a scalar shift.

The testcase will fail without the patch just when testing with -mxop in flags, we don't have unfortunately yet xop and xop_runtime effective targets in testsuite/lib/
Comment 10 Richard Biener 2011-04-15 20:08:19 UTC
See PR48317 for one reason why vector stuff isn't optimized too well.
Comment 11 Jakub Jelinek 2011-04-18 06:55:17 UTC
Author: jakub
Date: Mon Apr 18 06:55:13 2011
New Revision: 172638

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=172638
Log:
	PR tree-optimization/48616
	* tree-vect-stmts.c (vectorizable_shift): If SLP, determine
	whether the shift is by scalar or vector based on whether all SLP
	scalar stmts have the same rhs.

	* gcc.dg/pr48616.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr48616.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-stmts.c
Comment 12 Jakub Jelinek 2011-04-18 07:38:17 UTC
Author: jakub
Date: Mon Apr 18 07:38:11 2011
New Revision: 172640

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=172640
Log:
	PR tree-optimization/48616
	* tree-vect-stmts.c (vectorizable_shift): If SLP, determine
	whether the shift is by scalar or vector based on whether all SLP
	scalar stmts have the same rhs.

	* gcc.dg/pr48616.c: New test.

Added:
    branches/gcc-4_6-branch/gcc/testsuite/gcc.dg/pr48616.c
Modified:
    branches/gcc-4_6-branch/gcc/ChangeLog
    branches/gcc-4_6-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_6-branch/gcc/tree-vect-stmts.c
Comment 13 Jakub Jelinek 2011-04-18 07:38:55 UTC
Fixed.
Comment 14 Matthias Kretz (Vir) 2011-04-18 17:00:32 UTC
thanks! I verified the patch with GCC 4.6.0 and my unit test is fine again. Great response time! :)