Created attachment 31339 [details]
test code

In the test code, the integer values from sixteen 4x4 blocks (in an STL vector) are copied into an unsigned char array of 64x4 values. Using just -O3 this appears to produce an incorrect result.

Some remarks:
- Adding -mno-sse again yields the right output.
- If the vector is replaced by a simple array, the correct result is generated.
- For 'xBlocks' less than 16, the correct result is also generated.

To generate and run the executable:

g++ -O3 -o test test.ii
./test

Generated output (all four lines should actually be the same):

aaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbccccccccccccccccdddddddddddddddd
aaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnoooopppp
aaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnoooopppp
aaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnoooopppp

GCC version info:

COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-slackware-linux/4.8.2/lto-wrapper
Target: x86_64-slackware-linux
Configured with: ../gcc-4.8.2/configure --prefix=/usr --libdir=/usr/lib64 --mandir=/usr/man --infodir=/usr/info --enable-shared --enable-bootstrap --enable-languages=ada,c,c++,fortran,go,java,lto,objc --enable-threads=posix --enable-checking=release --enable-objc-gc --with-system-zlib --with-python-dir=/lib64/python2.7/site-packages --disable-libunwind-exceptions --enable-__cxa_atexit --enable-libssp --enable-lto --with-gnu-ld --verbose --enable-java-home --with-java-home=/usr/lib64/jvm/jre --with-jvm-root-dir=/usr/lib64/jvm --with-jvm-jar-dir=/usr/lib64/jvm/jvm-exports --with-arch-directory=amd64 --with-antlr-jar=/tmp/gcc/antlr-runtime-3.4.jar --enable-java-awt=gtk --disable-gtktest --disable-multilib --target=x86_64-slackware-linux --build=x86_64-slackware-linux --host=x86_64-slackware-linux
Thread model: posix
gcc version 4.8.2 (GCC)

Test system was an 8-core Slackware 14.0 system, with /proc/cpuinfo content like this:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping        : 9
microcode       : 0x17
cpu MHz         : 2550.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 6784.34
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
This is not specific to std::vector and not specific to C++. C testcase:

#include <stdio.h>
#include <stdint.h>

int main()
{
  uint32_t a[256] = {};
  uint8_t b[1000] = {};

  for (int i = 0; i != 256; ++i)
    a[i] = i % 5;

  for (int z = 0; z < 16; z++)
    for (int y = 0; y < 4; y++)
      for (int x = 0; x < 4; x++)
        b[y * 64 + z * 4 + x] = a[z * 16 + y * 4 + x];

  printf("%d\n", b[4]);
  return 0;
}

Prints '4' without -mno-sse, prints '1' with -mno-sse.
Moving to middle-end.
It looks to me that the cunrolli pass is messing up the element swizzling code. Bisection-friendly C testcase:

--cut here--
void abort (void);

unsigned int a[256];
unsigned char b[256];

int main()
{
  int i, z, x, y;

  for (i = 0; i < 256; i++)
    a[i] = i % 5;

  for (z = 0; z < 16; z++)
    for (y = 0; y < 4; y++)
      for (x = 0; x < 4; x++)
        b[y*64 + z*4 + x] = a[z*16 + y*4 + x];

  if (b[4] != 1)
    abort ();

  return 0;
}
--cut here--

gcc-5 on x86_64-linux-gnu with the above testcase:

~/gcc-build/gcc/cc1 -O3 pr59354.c
Aborted
~/gcc-build/gcc/cc1 -O3 -fdisable-tree-cunrolli pr59354.c
OK

"Working" code produces the following array:

Breakpoint 1, main () at pr59354.c:18
18        if (b[4] != 1)
(gdb) p b
$1 = "\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\002\003\004\000\003\004\000\001"...
(gdb)

while "non-working" produces:

(gdb) p b
$1 = "\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\003\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001\004\000\001\002\000\001\002\003\001\002\003\004\002\003\004\000\003\004\000\001", '\000' <repeats 63 times>
(gdb)
It is caused by r147829 (the new SLP pass).
(In reply to H.J. Lu from comment #4)
> It is caused by r147829 (the new SLP pass).

It may just expose a latent bug.
This is a tree vectorizer problem, involving VEC_PACK_TRUNC expressions. Initializing the source array with:

for (i = 0; i < 256; i++)
  a[i] = i;

we can see the difference in the destination array:

-O2 (correct):

  0,   1,   2,   3,  16,  17,  18,  19,  32,  33,  34,  35,  48,  49,  50,  51,
 64,  65,  66,  67,  80,  81,  82,  83,  96,  97,  98,  99, 112, 113, 114, 115,
128, 129, 130, 131, 144, 145, 146, 147, 160, 161, 162, 163, 176, 177, 178, 179,
192, 193, 194, 195, 208, 209, 210, 211, 224, 225, 226, 227, 240, 241, 242, 243,
  4,   5,   6,   7,  20,  21,  22,  23,  36,  37,  38,  39,  52,  53,  54,  55,
 68,  69,  70,  71,  84,  85,  86,  87, 100, 101, 102, 103, 116, 117, 118, 119,
132, 133, 134, 135, 148, 149, 150, 151, 164, 165, 166, 167, 180, 181, 182, 183,
196, 197, 198, 199, 212, 213, 214, 215, 228, 229, 230, 231, 244, 245, 246, 247,
  8,   9,  10,  11,  24,  25,  26,  27,  40,  41,  42,  43,  56,  57,  58,  59,
 72,  73,  74,  75,  88,  89,  90,  91, 104, 105, 106, 107, 120, 121, 122, 123,
136, 137, 138, 139, 152, 153, 154, 155, 168, 169, 170, 171, 184, 185, 186, 187,
200, 201, 202, 203, 216, 217, 218, 219, 232, 233, 234, 235, 248, 249, 250, 251,
 12,  13,  14,  15,  28,  29,  30,  31,  44,  45,  46,  47,  60,  61,  62,  63,
 76,  77,  78,  79,  92,  93,  94,  95, 108, 109, 110, 111, 124, 125, 126, 127,
140, 141, 142, 143, 156, 157, 158, 159, 172, 173, 174, 175, 188, 189, 190, 191,
204, 205, 206, 207, 220, 221, 222, 223, 236, 237, 238, 239, 252, 253, 254, 255

-O3 (incorrect):

  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,
 16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,
 32,  33,  34,  35,  36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,
 48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
  4,   5,   6,   7,  20,  21,  22,  23,  36,  37,  38,  39,  52,  53,  54,  55,
 68,  69,  70,  71,  84,  85,  86,  87, 100, 101, 102, 103, 116, 117, 118, 119,
132, 133, 134, 135, 148, 149, 150, 151, 164, 165, 166, 167, 180, 181, 182, 183,
196, 197, 198, 199, 212, 213, 214, 215, 228, 229, 230, 231, 244, 245, 246, 247,
  8,   9,  10,  11,  24,  25,  26,  27,  40,  41,  42,  43,  56,  57,  58,  59,
 72,  73,  74,  75,  88,  89,  90,  91, 104, 105, 106, 107, 120, 121, 122, 123,
136, 137, 138, 139, 152, 153, 154, 155, 168, 169, 170, 171, 184, 185, 186, 187,
200, 201, 202, 203, 216, 217, 218, 219, 232, 233, 234, 235, 248, 249, 250, 251,
 12,  13,  14,  15,  28,  29,  30,  31,  44,  45,  46,  47,  60,  61,  62,  63,
 76,  77,  78,  79,  92,  93,  94,  95, 108, 109, 110, 111, 124, 125, 126, 127,
140, 141, 142, 143, 156, 157, 158, 159, 172, 173, 174, 175, 188, 189, 190, 191,
204, 205, 206, 207, 220, 221, 222, 223, 236, 237, 238, 239, 252, 253, 254, 255

b[4] is already wrong. This is the beginning of the optimized dump:

vect__24.6_185 = MEM[(unsigned int *)&a];
vect__24.7_182 = MEM[(unsigned int *)&a + 16B];
vect__24.8_176 = MEM[(unsigned int *)&a + 32B];
vect__24.9_173 = MEM[(unsigned int *)&a + 48B];
vect__24.10_160 = MEM[(unsigned int *)&a + 64B];
vect__24.11_158 = MEM[(unsigned int *)&a + 80B];
vect__24.12_155 = MEM[(unsigned int *)&a + 96B];
vect__24.13_148 = MEM[(unsigned int *)&a + 112B];
vect__24.14_145 = MEM[(unsigned int *)&a + 128B];
vect__24.15_138 = MEM[(unsigned int *)&a + 144B];
vect__24.16_136 = MEM[(unsigned int *)&a + 160B];
vect__24.17_133 = MEM[(unsigned int *)&a + 176B];
vect__24.18_126 = MEM[(unsigned int *)&a + 192B];
vect__24.19_124 = MEM[(unsigned int *)&a + 208B];
vect__24.20_111 = MEM[(unsigned int *)&a + 224B];
vect__24.21_108 = MEM[(unsigned int *)&a + 240B];
vect__25.23_107 = VEC_PACK_TRUNC_EXPR <vect__24.6_185, vect__24.7_182>;
vect__25.23_101 = VEC_PACK_TRUNC_EXPR <vect__24.8_176, vect__24.9_173>;
vect__25.23_100 = VEC_PACK_TRUNC_EXPR <vect__24.10_160, vect__24.11_158>;
vect__25.23_99 = VEC_PACK_TRUNC_EXPR <vect__24.12_155, vect__24.13_148>;
vect__25.23_87 = VEC_PACK_TRUNC_EXPR <vect__24.14_145, vect__24.15_138>;
vect__25.23_86 = VEC_PACK_TRUNC_EXPR <vect__24.16_136, vect__24.17_133>;
vect__25.23_85 = VEC_PACK_TRUNC_EXPR <vect__24.18_126, vect__24.19_124>;
vect__25.23_81 = VEC_PACK_TRUNC_EXPR <vect__24.20_111, vect__24.21_108>;
vect__25.22_80 = VEC_PACK_TRUNC_EXPR <vect__25.23_107, vect__25.23_101>;
vect__25.22_79 = VEC_PACK_TRUNC_EXPR <vect__25.23_100, vect__25.23_99>;
vect__25.22_78 = VEC_PACK_TRUNC_EXPR <vect__25.23_87, vect__25.23_86>;
vect__25.22_77 = VEC_PACK_TRUNC_EXPR <vect__25.23_85, vect__25.23_81>;
MEM[(unsigned char[256] *)&b] = vect__25.22_80;
MEM[(unsigned char[256] *)&b + 16B] = vect__25.22_79;
MEM[(unsigned char[256] *)&b + 32B] = vect__25.22_78;
MEM[(unsigned char[256] *)&b + 48B] = vect__25.22_77;

As can be seen, the first 64 elements of the destination array are simply the truncated first 64 elements of the source array.
Difference (-O2 vs -O3) of neatly formatted results in a 16x16 array:

--- res-O2.txt	2015-01-14 08:08:39.000000000 +0100
+++ res-O3.txt	2015-01-14 08:09:57.000000000 +0100
@@ -1,7 +1,7 @@
-000, 001, 002, 003, 016, 017, 018, 019, 032, 033, 034, 035, 048, 049, 050, 051,
-064, 065, 066, 067, 080, 081, 082, 083, 096, 097, 098, 099, 112, 113, 114, 115,
-128, 129, 130, 131, 144, 145, 146, 147, 160, 161, 162, 163, 176, 177, 178, 179,
-192, 193, 194, 195, 208, 209, 210, 211, 224, 225, 226, 227, 240, 241, 242, 243,
+000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 012, 013, 014, 015,
+016, 017, 018, 019, 020, 021, 022, 023, 024, 025, 026, 027, 028, 029, 030, 031,
+032, 033, 034, 035, 036, 037, 038, 039, 040, 041, 042, 043, 044, 045, 046, 047,
+048, 049, 050, 051, 052, 053, 054, 055, 056, 057, 058, 059, 060, 061, 062, 063,
 004, 005, 006, 007, 020, 021, 022, 023, 036, 037, 038, 039, 052, 053, 054, 055,
 068, 069, 070, 071, 084, 085, 086, 087, 100, 101, 102, 103, 116, 117, 118, 119,
 132, 133, 134, 135, 148, 149, 150, 151, 164, 165, 166, 167, 180, 181, 182, 183,

The problem is only with the first 64 elements.
(In reply to Uroš Bizjak from comment #3)
> It looks to me that cunrolli pass is messing up element swizzling code.
> 
> bisection-friendly C testcase:
> 
> --cut here--
> void abort (void);
> 
> unsigned int a[256];
> unsigned char b[256];
> 
> int main()
> {
>   int i, z, x, y;
> 
>   for (i = 0; i < 256; i++)
>     a[i] = i % 5;
> 
>   for (z = 0; z < 16; z++)
>     for (y = 0; y < 4; y++)
>       for (x = 0; x < 4; x++)
>         b[y*64 + z*4 + x] = a[z*16 + y*4 + x];
> 
>   if (b[4] != 1)
>     abort ();
> 
>   return 0;
> }
> --cut here--

This testcase works for me on trunk now (maybe one of my recent vectorizer fixes), but it miscompiles on the 4.9 and 4.8 branches (4.7 seems to work). Maybe somebody can bisect what fixed it on trunk? (And confirm the bug is indeed gone on trunk.)
I can reproduce it even with r219580.
(In reply to Jakub Jelinek from comment #9)
> I can reproduce it even with r219580.

Likewise. People, remember to use -O3 when reproducing.
On Wed, 14 Jan 2015, ville.voutilainen at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59354
> 
> --- Comment #10 from Ville Voutilainen <ville.voutilainen at gmail dot com> ---
> (In reply to Jakub Jelinek from comment #9)
> > I can reproduce it even with r219580.
> 
> Likewise. People, remember to use -O3 when reproducing.

Will try in a non-dev tree then (and then identify which of the gazillion local changes I have "fixes" it...)
Ok, we SLP the first block

node
  stmt 0 b[_17] = _24;
  stmt 1 b[_3] = _50;
  stmt 2 b[_93] = _99;
  stmt 3 b[_104] = _110;
node
  stmt 0 _24 = (unsigned char) _23;
  stmt 1 _50 = (unsigned char) _63;
  stmt 2 _99 = (unsigned char) _98;
  stmt 3 _110 = (unsigned char) _109;
node
  stmt 0 _23 = a[_21];
  stmt 1 _63 = a[_64];
  stmt 2 _98 = a[_97];
  stmt 3 _109 = a[_108];

and for all other instances claim the load permutation is not supported. For the stmts visible above the load permutation _is_ 1:1, but as we need to gobble up more loads due to the truncation, the effective SLP group we deal with has gaps (bah, the gaps code...).

Thus the underlying issue is that we have

t.c:13:3: note: Detected interleaving of size 16
t.c:13:3: note: Detected interleaving of size 4
t.c:13:3: note: Detected interleaving of size 4
t.c:13:3: note: Detected interleaving of size 4
t.c:13:3: note: Detected interleaving of size 4

with the group-size of the stores (4) determining the SLP group size, but the analysis code being confused by the non-matching group size of the loads.

In the end we should probably split up the groups if a single group ends up being referred to in different SLP instances (thus also postpone most of the dependence-kind tests until after SLP discovery).

Let's see if a simple workaround doesn't regress anything.

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	(revision 219581)
+++ gcc/tree-vect-slp.c	(working copy)
@@ -729,8 +729,11 @@ vect_build_slp_tree_1 (loop_vec_info loo
 	 ??? We should enhance this to only disallow gaps
 	 inside vectors.  */
       if ((unrolling_factor > 1
-	   && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) == stmt
-	   && GROUP_GAP (vinfo_for_stmt (stmt)) != 0)
+	   && ((GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) == stmt
+		&& GROUP_GAP (vinfo_for_stmt (stmt)) != 0)
+	       /* If the group is split up then GROUP_GAP
+		  isn't correct here, nor is GROUP_FIRST_ELEMENT.  */
+	       || GROUP_SIZE (vinfo_for_stmt (stmt)) > group_size))
 	  || (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) != stmt
 	      && GROUP_GAP (vinfo_for_stmt (stmt)) != 1))
 	{
Author: rguenth
Date: Wed Jan 14 14:06:07 2015
New Revision: 219603

URL: https://gcc.gnu.org/viewcvs?rev=219603&root=gcc&view=rev
Log:
2015-01-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/59354
	* tree-vect-slp.c (vect_build_slp_tree_1): Treat loads from groups
	larger than the slp group size as having gaps.

	* gcc.dg/vect/pr59354.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr59354.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-slp.c
Fixed on trunk so far.
Author: rguenth
Date: Mon Feb 23 11:14:25 2015
New Revision: 220912

URL: https://gcc.gnu.org/viewcvs?rev=220912&root=gcc&view=rev
Log:
2015-02-23  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2014-11-19  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63844
	* omp-low.c (fixup_child_record_type): Use a restrict qualified
	referece type for the receiver parameter.

	2014-11-27  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/61634
	* tree-vect-slp.c: Include gimple-walk.h.
	(vect_detect_hybrid_slp_stmts): Rewrite to propagate hybrid
	down the SLP tree for one scalar statement.
	(vect_detect_hybrid_slp_1): New walker function.
	(vect_detect_hybrid_slp_2): Likewise.
	(vect_detect_hybrid_slp): Properly handle pattern statements
	in a pre-scan over all loop stmts.

	* gcc.dg/vect/pr61634.c: New testcase.

	2015-01-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/59354
	* tree-vect-slp.c (vect_build_slp_tree_1): Treat loads from groups
	larger than the slp group size as having gaps.

	* gcc.dg/vect/pr59354.c: New testcase.

	2015-02-10  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/64909
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Properly
	pass a scalar-stmt count estimate to the cost model.
	* tree-vect-data-refs.c (vect_peeling_hash_get_lowest_cost): Likewise.

	* gcc.dg/vect/costmodel/x86_64/costmodel-pr64909.c: New testcase.

Added:
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr64909.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr59354.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr61634.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/omp-low.c
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_9-branch/gcc/tree-vect-data-refs.c
    branches/gcc-4_9-branch/gcc/tree-vect-loop.c
    branches/gcc-4_9-branch/gcc/tree-vect-slp.c
Fixed for 4.9.3.