Bug 59354 - [4.8 Regression] Element swizzling produces invalid result with -O3
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.8.2
Importance: P3 normal
Target Milestone: 4.9.3
Assignee: Richard Biener
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2013-11-30 14:25 UTC by Jori Liesenborgs
Modified: 2015-06-23 08:45 UTC
CC List: 3 users

See Also:
Host:
Target:
Build:
Known to work: 4.9.3, 5.0
Known to fail: 4.8.5, 4.9.2
Last reconfirmed: 2015-01-07 00:00:00


Attachments
test code (65.80 KB, text/plain)
2013-11-30 14:25 UTC, Jori Liesenborgs

Description Jori Liesenborgs 2013-11-30 14:25:24 UTC
Created attachment 31339
test code

In the test code, the integer values from sixteen 4x4 blocks (stored in an STL vector) are copied into an unsigned char array holding 64x4 values. With just -O3, this appears to produce an incorrect result.

Some remarks:
 - Adding -mno-sse yields the correct output again.
 - If the vector is replaced by a simple array, the correct result is generated.
 - For 'xBlocks' less than 16, the correct result is also generated.

To generate and run the executable:
g++ -O3 -o test test.ii
./test

Generated output (all four lines should be identical; the first one is wrong):
aaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbccccccccccccccccdddddddddddddddd
aaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnoooopppp
aaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnoooopppp
aaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnoooopppp

GCC version info:
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-slackware-linux/4.8.2/lto-wrapper
Target: x86_64-slackware-linux
Configured with: ../gcc-4.8.2/configure --prefix=/usr --libdir=/usr/lib64 --mandir=/usr/man --infodir=/usr/info --enable-shared --enable-bootstrap --enable-languages=ada,c,c++,fortran,go,java,lto,objc --enable-threads=posix --enable-checking=release --enable-objc-gc --with-system-zlib --with-python-dir=/lib64/python2.7/site-packages --disable-libunwind-exceptions --enable-__cxa_atexit --enable-libssp --enable-lto --with-gnu-ld --verbose --enable-java-home --with-java-home=/usr/lib64/jvm/jre --with-jvm-root-dir=/usr/lib64/jvm --with-jvm-jar-dir=/usr/lib64/jvm/jvm-exports --with-arch-directory=amd64 --with-antlr-jar=/tmp/gcc/antlr-runtime-3.4.jar --enable-java-awt=gtk --disable-gtktest --disable-multilib --target=x86_64-slackware-linux --build=x86_64-slackware-linux --host=x86_64-slackware-linux
Thread model: posix
gcc version 4.8.2 (GCC)

The test system was an 8-core Slackware 14.0 system, with /proc/cpuinfo content like this:
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 58
model name	: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping	: 9
microcode	: 0x17
cpu MHz		: 2550.000
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips	: 6784.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:
Comment 1 Eelis 2015-01-01 13:54:12 UTC
This is not specific to std::vector and not specific to C++.

C testcase:

  #include <stdio.h>
  #include <stdint.h>

  int main()
  {
    uint32_t a[256] = {};
    uint8_t b[1000] = {};

    for(int i = 0; i != 256; ++i)
      a[i] = i % 5;

    for (int z = 0 ; z < 16; z++)
    for (int y = 0 ; y <  4; y++)
    for (int x = 0 ; x <  4; x++)
      b[y * 64 + z*4 + x] = a[z * 16 + y * 4 + x];

    printf("%d\n", b[4]);

    return 0;
  }

With -O3, this prints '4' (wrong) without -mno-sse and '1' (correct) with -mno-sse.
Comment 2 Ville Voutilainen 2015-01-07 20:30:01 UTC
Moving to middle-end.
Comment 3 Uroš Bizjak 2015-01-08 11:04:09 UTC
It looks to me like the cunrolli pass is messing up the element swizzling code.

bisection-friendly C testcase:

--cut here--
void abort (void);

unsigned int a[256];
unsigned char b[256];

int main()
{
  int i, z, x, y;

  for(i = 0; i < 256; i++)
    a[i] = i % 5;

  for (z = 0; z < 16; z++)
    for (y = 0; y < 4; y++)
      for (x = 0; x < 4; x++)
        b[y*64 + z*4 + x] = a[z*16 + y*4 + x];

  if (b[4] != 1)
    abort ();

  return 0;
}
--cut here--

gcc-5 on x86_64-linux-gnu with the above testcase:

~/gcc-build/gcc/cc1 -O3 pr59354.c
Aborted
~/gcc-build/gcc/cc1 -O3 -fdisable-tree-cunrolli pr59354.c
OK

The "working" code produces the following array:

Breakpoint 1, main () at pr59354.c:18
18        if (b[4] != 1)
(gdb) p b
$1 = "\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\002\003\004\000\003\004\000\001"...
(gdb)

while the "non-working" code produces:

(gdb) p b
$1 = "\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001", '\000' <repeats 63 times>
(gdb)
Comment 4 H.J. Lu 2015-01-08 14:52:35 UTC
It is caused by r147829 (the new SLP pass).
Comment 5 H.J. Lu 2015-01-08 14:59:36 UTC
(In reply to H.J. Lu from comment #4)
> It is caused by r147829 (the new SLP pass).

It may just expose a latent bug.
Comment 6 Uroš Bizjak 2015-01-14 07:01:42 UTC
This is a tree vectorizer problem, involving VEC_PACK_TRUNC_EXPR expressions.

Initializing the source array with:

  for (i = 0; i < 256; i++)
    a[i] = i;

we can see the difference in the destination array:

-O2 (correct):

0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51, 64, 65, 66, 67, 80, 81, 82, 83, 96, 97, 98, 99, 112, 113, 114, 115, 128, 129, 130, 131, 144, 145, 146, 147, 160, 161, 162, 163, 176, 177, 178, 179, 192, 193, 194, 195, 208, 209, 210, 211, 224, 225, 226, 227, 240, 241, 242, 243, 4, 5, 6, 7, 20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55, 68, 69, 70, 71, 84, 85, 86, 87, 100, 101, 102, 103, 116, 117, 118, 119, 132, 133, 134, 135, 148, 149, 150, 151, 164, 165, 166, 167, 180, 181, 182, 183, 196, 197, 198, 199, 212, 213, 214, 215, 228, 229, 230, 231, 244, 245, 246, 247, 8, 9, 10, 11, 24, 25, 26, 27, 40, 41, 42, 43, 56, 57, 58, 59, 72, 73, 74, 75, 88, 89, 90, 91, 104, 105, 106, 107, 120, 121, 122, 123, 136, 137, 138, 139, 152, 153, 154, 155, 168, 169, 170, 171, 184, 185, 186, 187, 200, 201, 202, 203, 216, 217, 218, 219, 232, 233, 234, 235, 248, 249, 250, 251, 12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47, 60, 61, 62, 63, 76, 77, 78, 79, 92, 93, 94, 95, 108, 109, 110, 111, 124, 125, 126, 127, 140, 141, 142, 143, 156, 157, 158, 159, 172, 173, 174, 175, 188, 189, 190, 191, 204, 205, 206, 207, 220, 221, 222, 223, 236, 237, 238, 239, 252, 253, 254, 255

-O3 (incorrect):

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 4, 5, 6, 7, 20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55, 68, 69, 70, 71, 84, 85, 86, 87, 100, 101, 102, 103, 116, 117, 118, 119, 132, 133, 134, 135, 148, 149, 150, 151, 164, 165, 166, 167, 180, 181, 182, 183, 196, 197, 198, 199, 212, 213, 214, 215, 228, 229, 230, 231, 244, 245, 246, 247, 8, 9, 10, 11, 24, 25, 26, 27, 40, 41, 42, 43, 56, 57, 58, 59, 72, 73, 74, 75, 88, 89, 90, 91, 104, 105, 106, 107, 120, 121, 122, 123, 136, 137, 138, 139, 152, 153, 154, 155, 168, 169, 170, 171, 184, 185, 186, 187, 200, 201, 202, 203, 216, 217, 218, 219, 232, 233, 234, 235, 248, 249, 250, 251, 12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47, 60, 61, 62, 63, 76, 77, 78, 79, 92, 93, 94, 95, 108, 109, 110, 111, 124, 125, 126, 127, 140, 141, 142, 143, 156, 157, 158, 159, 172, 173, 174, 175, 188, 189, 190, 191, 204, 205, 206, 207, 220, 221, 222, 223, 236, 237, 238, 239, 252, 253, 254, 255

b[4] is already wrong.

This is the beginning of the optimized dump:

  vect__24.6_185 = MEM[(unsigned int *)&a];
  vect__24.7_182 = MEM[(unsigned int *)&a + 16B];
  vect__24.8_176 = MEM[(unsigned int *)&a + 32B];
  vect__24.9_173 = MEM[(unsigned int *)&a + 48B];
  vect__24.10_160 = MEM[(unsigned int *)&a + 64B];
  vect__24.11_158 = MEM[(unsigned int *)&a + 80B];
  vect__24.12_155 = MEM[(unsigned int *)&a + 96B];
  vect__24.13_148 = MEM[(unsigned int *)&a + 112B];
  vect__24.14_145 = MEM[(unsigned int *)&a + 128B];
  vect__24.15_138 = MEM[(unsigned int *)&a + 144B];
  vect__24.16_136 = MEM[(unsigned int *)&a + 160B];
  vect__24.17_133 = MEM[(unsigned int *)&a + 176B];
  vect__24.18_126 = MEM[(unsigned int *)&a + 192B];
  vect__24.19_124 = MEM[(unsigned int *)&a + 208B];
  vect__24.20_111 = MEM[(unsigned int *)&a + 224B];
  vect__24.21_108 = MEM[(unsigned int *)&a + 240B];
  vect__25.23_107 = VEC_PACK_TRUNC_EXPR <vect__24.6_185, vect__24.7_182>;
  vect__25.23_101 = VEC_PACK_TRUNC_EXPR <vect__24.8_176, vect__24.9_173>;
  vect__25.23_100 = VEC_PACK_TRUNC_EXPR <vect__24.10_160, vect__24.11_158>;
  vect__25.23_99 = VEC_PACK_TRUNC_EXPR <vect__24.12_155, vect__24.13_148>;
  vect__25.23_87 = VEC_PACK_TRUNC_EXPR <vect__24.14_145, vect__24.15_138>;
  vect__25.23_86 = VEC_PACK_TRUNC_EXPR <vect__24.16_136, vect__24.17_133>;
  vect__25.23_85 = VEC_PACK_TRUNC_EXPR <vect__24.18_126, vect__24.19_124>;
  vect__25.23_81 = VEC_PACK_TRUNC_EXPR <vect__24.20_111, vect__24.21_108>;
  vect__25.22_80 = VEC_PACK_TRUNC_EXPR <vect__25.23_107, vect__25.23_101>;
  vect__25.22_79 = VEC_PACK_TRUNC_EXPR <vect__25.23_100, vect__25.23_99>;
  vect__25.22_78 = VEC_PACK_TRUNC_EXPR <vect__25.23_87, vect__25.23_86>;
  vect__25.22_77 = VEC_PACK_TRUNC_EXPR <vect__25.23_85, vect__25.23_81>;
  MEM[(unsigned char[256] *)&b] = vect__25.22_80;
  MEM[(unsigned char[256] *)&b + 16B] = vect__25.22_79;
  MEM[(unsigned char[256] *)&b + 32B] = vect__25.22_78;
  MEM[(unsigned char[256] *)&b + 48B] = vect__25.22_77;

As can be seen, the first 64 elements of the destination array are simply the truncated first 64 elements of the source array.
Comment 7 Uroš Bizjak 2015-01-14 07:14:12 UTC
Difference (-O2 vs -O3) of neatly formatted results in a 16x16 array:

--- res-O2.txt  2015-01-14 08:08:39.000000000 +0100
+++ res-O3.txt  2015-01-14 08:09:57.000000000 +0100
@@ -1,7 +1,7 @@
-000, 001, 002, 003, 016, 017, 018, 019, 032, 033, 034, 035, 048, 049, 050, 051,
-064, 065, 066, 067, 080, 081, 082, 083, 096, 097, 098, 099, 112, 113, 114, 115,
-128, 129, 130, 131, 144, 145, 146, 147, 160, 161, 162, 163, 176, 177, 178, 179,
-192, 193, 194, 195, 208, 209, 210, 211, 224, 225, 226, 227, 240, 241, 242, 243,
+000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 012, 013, 014, 015,
+016, 017, 018, 019, 020, 021, 022, 023, 024, 025, 026, 027, 028, 029, 030, 031,
+032, 033, 034, 035, 036, 037, 038, 039, 040, 041, 042, 043, 044, 045, 046, 047,
+048, 049, 050, 051, 052, 053, 054, 055, 056, 057, 058, 059, 060, 061, 062, 063,
 004, 005, 006, 007, 020, 021, 022, 023, 036, 037, 038, 039, 052, 053, 054, 055,
 068, 069, 070, 071, 084, 085, 086, 087, 100, 101, 102, 103, 116, 117, 118, 119,
 132, 133, 134, 135, 148, 149, 150, 151, 164, 165, 166, 167, 180, 181, 182, 183,

The problem is only with the first 64 elements.
Comment 8 Richard Biener 2015-01-14 08:52:20 UTC
(In reply to Uroš Bizjak from comment #3)
> It looks to me that cunrolli pass is messing up element swizzling code.
> 
> bisection-friendly C testcase:
> 
> --cut here--
> void abort (void);
> 
> unsigned int a[256];
> unsigned char b[256];
> 
> int main()
> {
>   int i, z, x, y;
> 
>   for(i = 0; i < 256; i++)
>     a[i] = i % 5;
> 
>   for (z = 0; z < 16; z++)
>     for (y = 0; y < 4; y++)
>       for (x = 0; x < 4; x++)
>         b[y*64 + z*4 + x] = a[z*16 + y*4 + x];
> 
>   if (b[4] != 1)
>     abort ();
> 
>   return 0;
> }
> --cut here--

This testcase works for me on trunk now (maybe one of my recent vectorizer
fixes) but it miscompiles on the 4.9 and 4.8 branches (4.7 seems to work).

Maybe somebody can bisect what fixed it on trunk? (and confirm the bug is
indeed gone on trunk)
Comment 9 Jakub Jelinek 2015-01-14 08:57:35 UTC
I can reproduce it even with r219580.
Comment 10 Ville Voutilainen 2015-01-14 08:59:17 UTC
(In reply to Jakub Jelinek from comment #9)
> I can reproduce it even with r219580.

Likewise. People, remember to use -O3 when reproducing.
Comment 11 rguenther@suse.de 2015-01-14 09:47:33 UTC
On Wed, 14 Jan 2015, ville.voutilainen at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59354
> 
> --- Comment #10 from Ville Voutilainen <ville.voutilainen at gmail dot com> ---
> (In reply to Jakub Jelinek from comment #9)
> > I can reproduce it even with r219580.
> 
> Likewise. People, remember to use -O3 when reproducing.

Will try in a non-dev tree then (and then identify which of the gazillion
local changes I have "fixes" it...)
Comment 12 Richard Biener 2015-01-14 10:49:08 UTC
Ok, we SLP the first block

node
        stmt 0 b[_17] = _24;

        stmt 1 b[_3] = _50;

        stmt 2 b[_93] = _99;

        stmt 3 b[_104] = _110;

node
        stmt 0 _24 = (unsigned char) _23;

        stmt 1 _50 = (unsigned char) _63;

        stmt 2 _99 = (unsigned char) _98;

        stmt 3 _110 = (unsigned char) _109;

node
        stmt 0 _23 = a[_21];

        stmt 1 _63 = a[_64];

        stmt 2 _98 = a[_97];

        stmt 3 _109 = a[_108];

and for all other instances claim the load permutation is not supported.

For the stmts visible above the load permutation _is_ 1:1, but as we need
to gobble up more loads due to the truncation the effective SLP group
we deal with has gaps (bah, the gaps code...)

Thus the underlying issue is that we have

t.c:13:3: note: Detected interleaving of size 16
t.c:13:3: note: Detected interleaving of size 4
t.c:13:3: note: Detected interleaving of size 4
t.c:13:3: note: Detected interleaving of size 4
t.c:13:3: note: Detected interleaving of size 4

with the group-size of the stores (4) determining the SLP group size but
the analysis code being confused by the non-matching group size of the
loads.

In the end we should probably split up the groups if a single group ends up
being referred to in different SLP instances (thus also postpone most of
the dependence-kind tests until after SLP discovery).  Let's see if a simple
workaround doesn't regress anything.

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c (revision 219581)
+++ gcc/tree-vect-slp.c (working copy)
@@ -729,8 +729,11 @@ vect_build_slp_tree_1 (loop_vec_info loo
                 ???  We should enhance this to only disallow gaps
                 inside vectors.  */
               if ((unrolling_factor > 1
-                  && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) == stmt
-                  && GROUP_GAP (vinfo_for_stmt (stmt)) != 0)
+                  && ((GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) == stmt
+                       && GROUP_GAP (vinfo_for_stmt (stmt)) != 0)
+                      /* If the group is split up then GROUP_GAP
+                         isn't correct here, nor is GROUP_FIRST_ELEMENT.  */
+                      || GROUP_SIZE (vinfo_for_stmt (stmt)) > group_size))
                  || (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) != stmt
                      && GROUP_GAP (vinfo_for_stmt (stmt)) != 1))
                 {
Comment 13 Richard Biener 2015-01-14 14:06:41 UTC
Author: rguenth
Date: Wed Jan 14 14:06:07 2015
New Revision: 219603

URL: https://gcc.gnu.org/viewcvs?rev=219603&root=gcc&view=rev
Log:
2015-01-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/59354
	* tree-vect-slp.c (vect_build_slp_tree_1): Treat loads from
	groups larger than the slp group size as having gaps.

	* gcc.dg/vect/pr59354.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr59354.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-slp.c
Comment 14 Richard Biener 2015-01-14 14:09:29 UTC
Fixed on trunk so far.
Comment 15 Richard Biener 2015-02-23 11:14:57 UTC
Author: rguenth
Date: Mon Feb 23 11:14:25 2015
New Revision: 220912

URL: https://gcc.gnu.org/viewcvs?rev=220912&root=gcc&view=rev
Log:
2015-02-23  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2014-11-19  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63844
	* omp-low.c (fixup_child_record_type): Use a restrict qualified
	reference type for the receiver parameter.

	2014-11-27  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/61634
	* tree-vect-slp.c: Include gimple-walk.h.
	(vect_detect_hybrid_slp_stmts): Rewrite to propagate hybrid
	down the SLP tree for one scalar statement.
	(vect_detect_hybrid_slp_1): New walker function.
	(vect_detect_hybrid_slp_2): Likewise.
	(vect_detect_hybrid_slp): Properly handle pattern statements
	in a pre-scan over all loop stmts.

	* gcc.dg/vect/pr61634.c: New testcase.

	2015-01-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/59354
	* tree-vect-slp.c (vect_build_slp_tree_1): Treat loads from
	groups larger than the slp group size as having gaps.

	* gcc.dg/vect/pr59354.c: New testcase.

	2015-02-10  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/64909
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Properly
	pass a scalar-stmt count estimate to the cost model.
	* tree-vect-data-refs.c (vect_peeling_hash_get_lowest_cost): Likewise.

	* gcc.dg/vect/costmodel/x86_64/costmodel-pr64909.c: New testcase.

Added:
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr64909.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr59354.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr61634.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/omp-low.c
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_9-branch/gcc/tree-vect-data-refs.c
    branches/gcc-4_9-branch/gcc/tree-vect-loop.c
    branches/gcc-4_9-branch/gcc/tree-vect-slp.c
Comment 16 Richard Biener 2015-06-23 08:45:30 UTC
Fixed for 4.9.3.