Bug 63341 - [4.8/4.9/5 Regression] Vectorization miscompilation with -mcpu=power7
Summary: [4.8/4.9/5 Regression] Vectorization miscompilation with -mcpu=power7
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.8.3
: P3 normal
Target Milestone: 4.8.4
Assignee: Jakub Jelinek
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-09-23 11:03 UTC by Jakub Jelinek
Modified: 2014-09-25 08:24 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2014-09-23 00:00:00


Attachments
gcc5-pr63341.patch (2.84 KB, patch)
2014-09-23 13:37 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jakub Jelinek 2014-09-23 11:03:22 UTC

    
Comment 1 Jakub Jelinek 2014-09-23 11:15:05 UTC
The following testcase is miscompiled on powerpc64-linux with -O2 -ftree-vectorize -m32 -mcpu=power7 -mtune=power7:
typedef union U { unsigned short s; unsigned char c; } __attribute__((packed)) U;
struct S { char e __attribute__((aligned (16))); U s[32]; };
struct S t = {0, {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8},
  {9}, {10}, {11}, {12}, {13}, {14}, {15}, {16},
  {17}, {18}, {19}, {20}, {21}, {22}, {23}, {24},
  {25}, {26}, {27}, {28}, {29}, {30}, {31}, {32}}};
unsigned short d[32];

int
main ()
{
  int i;
  for (i = 0; i < 32; i++)
    d[i] = t.s[i].s;
  if (__builtin_memcmp (d, t.s, sizeof d))
    __builtin_abort ();
  return 0;
}

The problem as I see is that the loop is vectorized with dr_explicit_realign_optimized, and calls vect_create_data_ref_ptr with offset
of TYPE_VECT_SUBPARTS (type) - 1, which is 7 in this case.  This function
then calls vect_create_addr_base_for_vector_ref which multiplies the offset
by step (2 in this case), and thus adds 14 bytes to the base address.  The
base address is &t + 1 and &t is 16 bytes aligned, so it loads the first part
from (&t + 1) & -16 (correct) and the second part from (&t + 1 + 14) & -16, which is still wrong, because (&t + 15) & -16 is still &t, not &t + 16 we want to use.  The problem is that as the offset given to vect_create_data_ref_ptr is not in bytes, but in number of subparts, after multiplying it by step we end up
with offset that doesn't have the lowest bit(s) set, which is required for this kind of realignment.  Richard, thoughts on this?  Either we'd need the offset always already in bytes and let all the callers ensure it is in bytes, or some flag whether offset is in bytes or units, or some flag to add not just
offset * step, but offset * step + (step - 1), or do that always.
Comment 2 Jakub Jelinek 2014-09-23 11:22:08 UTC
I see non-NULL offset passed to vect_create_data_ref_ptr by vectorizable_store (for negative case), vectorizable_load (for this dr_explicit_realign_optimized),
and to vect_create_addr_base_for_vector_ref in vect_gen_niters_for_prolog_loop (for negative case) and vect_create_cond_for_align_checks (also for negative case).  So guess we'd need to check all these cases what we really want in those cases.
Comment 3 Richard Biener 2014-09-23 11:59:24 UTC
Hum, from the comment on vect_create_data_ref_ptr I would expect offset to be
in units of the _vector_ size ...

   Output:
   1. Declare a new ptr to vector_type, and have it point to the base of the
      data reference (initial addressed accessed by the data reference).
      For example, for vector of type V8HI, the following code is generated:

      v8hi *ap;
      ap = (v8hi *)initial_address;

      if OFFSET is not supplied:
         initial_address = &a[init];
      if OFFSET is supplied:
         initial_address = &a[init + OFFSET];

      Return the initial_address in INITIAL_ADDRESS.

looking at the implementation it expects byte offsets.
Comment 4 Richard Biener 2014-09-23 12:03:06 UTC
Btw, should be reproducible with negative step cases on x86_64 as well, as we do
there

  if (negative)
    offset = size_int (-TYPE_VECTOR_SUBPARTS (vectype) + 1);

?

I suppose interpreting offset in terms of scalar element size is what was
intended.
Comment 5 Richard Biener 2014-09-23 12:08:11 UTC
Note that vect_create_addr_base_for_vector_ref _does_ multiply offset by
the scalar element size (even if it calls it 'step'):

  tree step = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)));
...
  if (offset)
    {
      offset = fold_build2 (MULT_EXPR, sizetype,
                            fold_convert (sizetype, offset), step);

so that would be consistent with the interpretation and all uses I can see.
Comment 6 Jakub Jelinek 2014-09-23 12:40:35 UTC
Another testcase:
typedef union U { unsigned short s; unsigned char c; } __attribute__((packed)) U;
struct S { char e __attribute__((aligned (16))); U s[32]; };
struct S t = {0, {{0x5010}, {0x5111}, {0x5212}, {0x5313}, {0x5414}, {0x5515}, {0x5616}, {0x5717},
		  {0x5818}, {0x5919}, {0x5a1a}, {0x5b1b}, {0x5c1c}, {0x5d1d}, {0x5e1e}, {0x5f1f},
		  {0x6020}, {0x6121}, {0x6222}, {0x6323}, {0x6424}, {0x6525}, {0x6626}, {0x6727},
		  {0x6828}, {0x6929}, {0x6a2a}, {0x6b2b}, {0x6c2c}, {0x6d2d}, {0x6e2e}, {0x6f2f}}};
unsigned short d[32];

int
main ()
{
  int i;
  for (i = 0; i < 32; i++)
    d[i] = t.s[i].s + 4;
  for (i = 0; i < 32; i++)
    if (d[i] != t.s[i].s + 4)
      __builtin_abort ();
    else
      asm volatile ("" : : : "memory");
  return 0;
}

which fails similarly.  For both testcases, if I manually change the addi 9,10,15 instruction to addi 9,10,16, it passes.
That instruction corresponds to the *.vect created:
vectp_t.9_3 = &MEM[(void *)&t + 15B];
and I change it to
vectp_t.9_3 = &MEM[(void *)&t + 16B];
Here is what the vectorizer emits for this second testcase:
  <bb 2>:
  vectp_t.5_19 = &MEM[(void *)&t + 1B];
  vectp_t.5_5 = vectp_t.5_19 & 4294967280B;
  vect__7.3_4 = MEM[(short unsigned int *)vectp_t.5_5];
  vect__7.6_2 = __builtin_altivec_mask_for_load (vectp_t.5_19);
  vectp_t.9_3 = &MEM[(void *)&t + 15B];
  vect_cst_.13_33 = { 4, 4, 4, 4, 4, 4, 4, 4 };
  vectp_d.15_35 = &d;

  <bb 3>:
  # i_24 = PHI <i_10(4), 0(2)>
  # ivtmp_21 = PHI <ivtmp_20(4), 32(2)>
  # vect__7.7_1 = PHI <vect__7.10_31(4), vect__7.3_4(2)>
  # vectp_t.8_28 = PHI <vectp_t.8_29(4), vectp_t.9_3(2)>
  # vectp_d.14_36 = PHI <vectp_d.14_37(4), vectp_d.15_35(2)>
  # ivtmp_9 = PHI <ivtmp_39(4), 0(2)>
  vectp_t.8_30 = vectp_t.8_28 & 4294967280B;
  vect__7.10_31 = MEM[(short unsigned int *)vectp_t.8_30];
  vect__7.11_32 = REALIGN_LOAD <vect__7.7_1, vect__7.10_31, vect__7.6_2>;
  _7 = t.s[i_24].s;
  vect__8.12_34 = vect__7.11_32 + vect_cst_.13_33;
  _8 = _7 + 4;
  MEM[(short unsigned int *)vectp_d.14_36] = vect__8.12_34;
  i_10 = i_24 + 1;
  ivtmp_20 = ivtmp_21 - 1;
  vectp_t.8_29 = vectp_t.8_28 + 16;
  vectp_d.14_37 = vectp_d.14_36 + 16;
  ivtmp_39 = ivtmp_9 + 1;
  if (ivtmp_39 < 4)
    goto <bb 4>;
  else
    goto <bb 5>;
  
  <bb 4>:
  goto <bb 3>;

SSA_NAMEs _1, _31 aren't really used for anything but arguments for REALIGN_LOAD, so as long as the targets that support this (seems only rs6000 and spu)
handle the misaligned units fine (it is about whether __builtin_altivec_mask_for_load computes the right mask for the permutations), I'd say just fixing up the offset should be all that is needed.
Note that at least two of the three negative == true cases I saw on one x86_64 testcase use negative offset and depend on it not to have the low bits set
(so offset -7 must become -14 for V8HImode and not -15).
So, at least as a hack, adding step - 1 to offset in vect_create_addr_base_for_vector_ref if offset is non-NULL and positive might DTRT (because in all the 3 negative cases offset will be negative).
But perhaps cleaner will be a bool flag that the offset is already in bytes or something similar.
Comment 7 Jakub Jelinek 2014-09-23 13:37:29 UTC
Created attachment 33539 [details]
gcc5-pr63341.patch

Untested fix.  Fixes the testcases on ppc32 -mcpu=power7 -mtune=power7 though.
Comment 8 Jakub Jelinek 2014-09-25 08:13:21 UTC
Author: jakub
Date: Thu Sep 25 08:12:49 2014
New Revision: 215583

URL: https://gcc.gnu.org/viewcvs?rev=215583&root=gcc&view=rev
Log:
	PR tree-optimization/63341
	* tree-vectorizer.h (vect_create_data_ref_ptr,
	vect_create_addr_base_for_vector_ref): Add another tree argument
	defaulting to NULL_TREE.
	* tree-vect-data-refs.c (vect_create_data_ref_ptr): Add byte_offset
	argument, pass it down to vect_create_addr_base_for_vector_ref.
	(vect_create_addr_base_for_vector_ref): Add byte_offset argument,
	add that to base_offset too if non-NULL.
	* tree-vect-stmts.c (vectorizable_load): Add byte_offset variable,
	for dr_explicit_realign_optimized set it to vector byte size
	- 1 instead of setting offset, pass byte_offset down to
	vect_create_data_ref_ptr.

	* gcc.dg/vect/pr63341-1.c: New test.
	* gcc.dg/vect/pr63341-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr63341-1.c
    trunk/gcc/testsuite/gcc.dg/vect/pr63341-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
    trunk/gcc/tree-vect-stmts.c
    trunk/gcc/tree-vectorizer.h
Comment 9 Jakub Jelinek 2014-09-25 08:18:03 UTC
Author: jakub
Date: Thu Sep 25 08:17:32 2014
New Revision: 215585

URL: https://gcc.gnu.org/viewcvs?rev=215585&root=gcc&view=rev
Log:
	PR tree-optimization/63341
	* tree-vectorizer.h (vect_create_data_ref_ptr,
	vect_create_addr_base_for_vector_ref): Add another tree argument
	defaulting to NULL_TREE.
	* tree-vect-data-refs.c (vect_create_data_ref_ptr): Add byte_offset
	argument, pass it down to vect_create_addr_base_for_vector_ref.
	(vect_create_addr_base_for_vector_ref): Add byte_offset argument,
	add that to base_offset too if non-NULL.
	* tree-vect-stmts.c (vectorizable_load): Add byte_offset variable,
	for dr_explicit_realign_optimized set it to vector byte size
	- 1 instead of setting offset, pass byte_offset down to
	vect_create_data_ref_ptr.

	* gcc.dg/vect/pr63341-1.c: New test.
	* gcc.dg/vect/pr63341-2.c: New test.

Added:
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr63341-1.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr63341-2.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_9-branch/gcc/tree-vect-data-refs.c
    branches/gcc-4_9-branch/gcc/tree-vect-stmts.c
    branches/gcc-4_9-branch/gcc/tree-vectorizer.h
Comment 10 Jakub Jelinek 2014-09-25 08:19:45 UTC
Author: jakub
Date: Thu Sep 25 08:19:14 2014
New Revision: 215587

URL: https://gcc.gnu.org/viewcvs?rev=215587&root=gcc&view=rev
Log:
	PR tree-optimization/63341
	* tree-vectorizer.h (vect_create_data_ref_ptr,
	vect_create_addr_base_for_vector_ref): Add another tree argument
	defaulting to NULL_TREE.
	* tree-vect-data-refs.c (vect_create_data_ref_ptr): Add byte_offset
	argument, pass it down to vect_create_addr_base_for_vector_ref.
	(vect_create_addr_base_for_vector_ref): Add byte_offset argument,
	add that to base_offset too if non-NULL.
	* tree-vect-stmts.c (vectorizable_load): Add byte_offset variable,
	for dr_explicit_realign_optimized set it to vector byte size
	- 1 instead of setting offset, pass byte_offset down to
	vect_create_data_ref_ptr.

	* gcc.dg/vect/pr63341-1.c: New test.
	* gcc.dg/vect/pr63341-2.c: New test.

Added:
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/vect/pr63341-1.c
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/vect/pr63341-2.c
Modified:
    branches/gcc-4_8-branch/gcc/ChangeLog
    branches/gcc-4_8-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_8-branch/gcc/tree-vect-data-refs.c
    branches/gcc-4_8-branch/gcc/tree-vect-stmts.c
    branches/gcc-4_8-branch/gcc/tree-vectorizer.h
Comment 11 Jakub Jelinek 2014-09-25 08:24:54 UTC
Should be fixed now.