There are runfails for the following benchmarks since r263772:

SPEC2017 520/620: (Segmentation fault, minimal optset to reproduce: "-O3 -march=skylake-avx512 -flto")
SPEC2006 445: (SPEC miscompare, minimal optset to reproduce: "-O3 -march=skylake-avx512")

Running 520.omnetpp_r under GDB:
---
...
Program received signal SIGSEGV, Segmentation fault.
0x00000000004a611e in isName (s=<optimized out>, this=<optimized out>) at simulator/ccomponent.cc:143
143         if (paramv[i].isName(parname))
(gdb) backtrace
#0  0x00000000004a611e in isName (s=<optimized out>, this=<optimized out>) at simulator/ccomponent.cc:143
#1  cComponent::findPar (this=0x7ffff6633380, parname=0x7ffff6603548 "bs") at simulator/ccomponent.cc:143
#2  0x00000000004a87b3 in cComponent::par(char const*) () at simulator/ccomponent.cc:133
#3  0x00000000004b676d in cNEDNetworkBuilder::doParam(cComponent*, ParamElement*, bool) () at simulator/cnednetworkbuilder.cc:179
#4  0x00000000004b8610 in doParams (isSubcomponent=false, paramsNode=<optimized out>, component=0x7ffff6633380, this=0x7fffffffaaf0) at simulator/cnednetworkbuilder.cc:139
#5  cNEDNetworkBuilder::addParametersAndGatesTo(cComponent*, cNEDDeclaration*) () at simulator/cnednetworkbuilder.cc:105
#6  0x000000000048843b in addParametersAndGatesTo (module=0x7ffff6633380, this=<optimized out>) at <GCC_PATH>/include/c++/9.0.0/bits/stl_tree.h:211
#7  cModuleType::create(char const*, cModule*, int, int) () at simulator/ccomponenttype.cc:156
#8  0x000000000045916f in setupNetwork (network=<optimized out>, this=0x7ffff653bc40) at simulator/cnamedobject.h:117
#9  Cmdenv::run() () at simulator/cmdenv.cc:253
#10 0x00000000005186ec in EnvirBase::run(int, char**, cConfiguration*) () at simulator/envirbase.cc:230
#11 0x000000000043d60d in setupUserInterface(int, char**, cConfiguration*) [clone .constprop.112] () at simulator/startup.cc:234
#12 0x000000000042446a in main (argc=1, argv=0x7fffffffb1c8) at simulator/main.cc:39
---

403.gcc miscompares: 200.s, g23.s, scilab.s.
For example:
---
$ diff -u g23_ref.s g23.s | head -n 16
--- g23_ref.s
+++ g23.s
@@ -1746,19 +1746,19 @@
        testq   %rbx, %rbx
        jne     .L904
        movq    %r12, %rdx
-       xorl    %r8d, %r8d
+       xorl    %esi, %esi
        negq    %rdx
 .L905:
        addq    %rcx, %rdx
-       leaq    (%rax,%r8), %rax
+       leaq    (%rax,%rsi), %rax
        leaq    1(%rdx), %rcx
-       cmpq    %r8, %rax
+       cmpq    %rsi, %rax
---
Unfortunately I didn't manage to create a reproducer.
Can you still reproduce this? There have been several vectorizer fixes since then.
I can't reproduce it with r266551.
On an Intel machine with AVX512F, r263772 miscompiled 520.omnetpp_r in SPEC CPU 2017 with

-DSPEC -DSPEC_CPU -DNDEBUG -Isimulator/platdep -Isimulator -Imodel -DWITH_NETBUILDER -DSPEC_AUTO_SUPPRESS_OPENMP -fno-unsafe-math-optimizations -mfpmath=sse -g -march=native -Ofast -funroll-loops -flto -DSPEC_LP64

Program received signal SIGSEGV, Segmentation fault.
0x00000000004a8ddb in cObject::isName (s=<optimized out>, this=<optimized out>) at simulator/cobject.h:118
118       bool isName(const char *s) const {return !opp_strcmp(getName(),s);}
(gdb) bt
#0  0x00000000004a8ddb in cObject::isName (s=<optimized out>, this=<optimized out>) at simulator/cobject.h:118
#1  cComponent::findPar (this=0x699040, parname=0x669c58 "bs") at simulator/ccomponent.cc:143
#2  0x00000000004acdb4 in cComponent::par (this=0x699040, parname=0x669c58 "bs") at simulator/ccomponent.cc:133
#3  0x00000000004be27c in cNEDNetworkBuilder::doParam (this=0x7fffffffd500, component=0x699040, paramNode=0x669bd0, isSubcomponent=<optimized out>) at simulator/cnednetworkbuilder.cc:179
#4  0x00000000004c0020 in cNEDNetworkBuilder::doParams (isSubcomponent=false, paramsNode=<optimized out>, component=0x699040, this=0x7fffffffd500) at simulator/cnednetworkbuilder.cc:139
#5  cNEDNetworkBuilder::addParametersAndGatesTo (this=0x7fffffffd500, component=0x699040, decl=0x695e60) at simulator/cnednetworkbuilder.cc:105
#6  0x000000000048a1bd in cDynamicModuleType::addParametersAndGatesTo (module=0x699040, this=<optimized out>) at /export/ssd/git/gcc-test-spec/usr/include/c++/9.0.0/bits/stl_tree.h:211
#7  cModuleType::create (this=<optimized out>, modname=<optimized out>, parentmod=<optimized out>, vectorsize=<optimized out>, index=<optimized out>) at simulator/ccomponenttype.cc:156
#8  0x00000000004643aa in cModuleType::create (parentmod=0x0, modname=<optimized out>, this=<optimized out>) at simulator/ccomponenttype.cc:106
#9  cSimulation::setupNetwork (network=<optimized out>, this=<optimized out>) at simulator/csimulation.cc:369
#10 Cmdenv::run (this=0x624d80) at simulator/cmdenv.cc:253
#11 0x000000000051673c in EnvirBase::run (this=0x624d80, argc=<optimized out>, argv=<optimized out>, configobject=0x61a640) at simulator/envirbase.cc:230
#12 0x00000000004421b2 in setupUserInterface(int, char**, cConfiguration*) [clone .constprop.0] (argc=argc@entry=5, argv=argv@entry=0x7fffffffdc18, cfg=0x0) at simulator/startup.cc:234
#13 0x000000000042f2fd in main (argc=5, argv=0x7fffffffdc18) at simulator/main.cc:39
(gdb)
...
   0x00000000004a8db5 <+325>:   lea    0x1(%r15),%rbx
   0x00000000004a8db9 <+329>:   mov    0x58(%rbp),%r8
   0x00000000004a8dbd <+333>:   lea    (%rbx,%rbx,2),%rdi
   0x00000000004a8dc1 <+337>:   lea    (%r8,%rdi,8),%rdi
   0x00000000004a8dc5 <+341>:   mov    (%rdi),%r9
   0x00000000004a8dc8 <+344>:   mov    %ebx,%r12d
   0x00000000004a8dcb <+347>:   mov    0x30(%r9),%rax
   0x00000000004a8dcf <+351>:   cmp    $0x4a8080,%rax
   0x00000000004a8dd5 <+357>:   je     0x4a8d20 <cComponent::findPar(char const*) const+176>
=> 0x00000000004a8ddb <+363>:   callq  *%rax
   0x00000000004a8ddd <+365>:   mov    %rax,%rdi
   0x00000000004a8de0 <+368>:   test   %rax,%rax
   0x00000000004a8de3 <+371>:   jne    0x4a8d47 <cComponent::findPar(char const*) const+215>
   0x00000000004a8de9 <+377>:   cmpb   $0x0,0x0(%r13)
   0x00000000004a8dee <+382>:   jne    0x4a8d57 <cComponent::findPar(char const*) const+231>
   0x00000000004a8df4 <+388>:   add    $0x8,%rsp
   0x00000000004a8df8 <+392>:   pop    %rbx
   0x00000000004a8df9 <+393>:   pop    %rbp
   0x00000000004a8dfa <+394>:   mov    %r12d,%eax
   0x00000000004a8dfd <+397>:   pop    %r12
   0x00000000004a8dff <+399>:   pop    %r13
   0x00000000004a8e01 <+401>:   pop    %r14
   0x00000000004a8e03 <+403>:   pop    %r15
   0x00000000004a8e05 <+405>:   retq
   0x00000000004a8e06 <+406>:   nopw   %cs:0x0(%rax,%rax,1)
   0x00000000004a8e10 <+416>:   callq  *%rax
   0x00000000004a8e12 <+418>:   mov    %rax,%rdi
(gdb) p/x $rax
$1 = 0x5c8d480000009b85
(gdb) p/x *(long *) $rax
Cannot access memory at address 0x5c8d480000009b85
(gdb)

This address looks odd.
Mine then.
Adding -fno-strict-aliasing fixes the issue.
Well, omnetpp_r has no known portability issues: https://www.spec.org/cpu2017/Docs/benchmarks/520.omnetpp_r.html So I would like to know what violates the aliasing rules. Let me debug that.
On Thu, 24 Jan 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87214
>
> Martin Liška <marxin at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|RESOLVED                    |ASSIGNED
>          Resolution|INVALID                     |---
>            Assignee|rsandifo at gcc dot gnu.org |marxin at gcc dot gnu.org
>
> --- Comment #6 from Martin Liška <marxin at gcc dot gnu.org> ---
> Well, omnetpp_r has no known portability issues:
> https://www.spec.org/cpu2017/Docs/benchmarks/520.omnetpp_r.html
>
> So that I would like to know what violates the aliasing. Let me debug that..

A lot of benchmarks end up using spec_qsort...
(In reply to rguenther@suse.de from comment #7)
> A lot of benchmarks end up using spec_qsort...

Ah, yes, I overlooked that as the file has a different suffix:
./benchspec/CPU/520.omnetpp_r/src/simulator/spec_qsort.cc

So let me test it with a fixed qsort function.
I guess it's not related to qsort (the file looks different and fine):

#include <cstdlib>
#include "spec_qsort.h"

static void spec_swap(void *x, void *y, size_t l) {
   /* Swap elements of an array byte by byte. Note that a version specialized
      to operate on a specific data type (e.g. int) would be faster. */
   char *a = (char *)x, *b = (char *)y, c;
   while(l--) {
      c = *a;
      *a++ = *b;
      *b++ = c;
   }
}

static void spec_sort(char *array, size_t size, int begin, int end,
                      int (*cmp)(const void*,const void*)) {
   /* Generic qsort algorithm */
   if (end > begin) {
      void *pivot = array + begin;
      int l = begin + size;
      int r = end;
      while(l < r) {
         if (cmp(array+l,pivot) <= 0) {
            l += size;
         } else {
            r -= size;
            spec_swap(array+l, array+r, size);
         }
      }
      l -= size;
      spec_swap(array+begin, array+l, size);
      spec_sort(array, size, begin, l, cmp);
      spec_sort(array, size, r, end, cmp);
   }
}

void spec_qsort(void *array, size_t nitems, size_t size,
                int (*cmp)(const void*,const void*)) {
   spec_sort((char *)array, size, 0, (nitems-1)*size, cmp);
}
Only the following two LTO object files are needed to trigger the segfault: simulator/cpar.o and simulator/ccomponent.o (the rest are -fno-lto object files).
Created attachment 45520 [details] optimized dump with -mprefer-vector-width=128
Created attachment 45521 [details] optimized dump with -mprefer-vector-width=256
The 2 problematic functions look like:

void cComponent::reallocParamv(int size)
{
    ((void)0);
    if (size!=(short)size)
        throw cRuntimeError(this, "reallocParamv(%d): at most %d parameters allowed", size, 0x7fff);
    cPar *newparamv = new cPar[size];
    __builtin_printf ("realloc called with new size: paramvsize: %d\n", numparams);
    for (int i=0; i<numparams; i++)
        __builtin_printf ("%d:%s\n", i,paramv[i].getName());
    __builtin_printf ("\n");
    for (int i=0; i<numparams; i++)
        paramv[i].moveto(newparamv[i]);
    for (int i=0; i<numparams; i++)
        __builtin_printf ("%d:%s\n", i,newparamv[i].getName());
    __builtin_printf ("realloc done\n");
    delete [] paramv;
    paramv = newparamv;
    paramvsize = (short)size;
}

void cComponent::addPar(cParImpl *value)
{
    __builtin_printf ("addPar: paramvsize: %d, name: %s\n", paramvsize, value->getName());
    if (parametersFinalized())
        throw cRuntimeError(this, "cannot add parameters at runtime");
    if (findPar(value->getName())>=0)
        throw cRuntimeError(this, "cannot add parameter `%s': already exists", value->getName());
    if (numparams==paramvsize)
        reallocParamv(paramvsize+1);
    paramv[numparams++].init(this, value);
}

where the vectorized version prints:

Preparing for running configuration General, run #0...
Scenario: $repetition=0
Assigned runID=speccpu-runid
Setting up network `largeNet'...
addPar: paramvsize: 0, name: n
findPar: n
realloc called with new size: paramvsize: 0

realloc done
findPar: n
addPar: paramvsize: 1, name: bbs
findPar: bbs
realloc called with new size: paramvsize: 1
0:n

0:n
realloc done
findPar: bbs
addPar: paramvsize: 2, name: bbm
findPar: bbm
realloc called with new size: paramvsize: 2
0:n
1:bbs

0:n
1:bbs
realloc done
findPar: bbm
addPar: paramvsize: 3, name: bbl
findPar: bbl
realloc called with new size: paramvsize: 3
0:n
1:bbs
2:bbm

0:n
1:bbs
2:bbm
realloc done
findPar: bbl
addPar: paramvsize: 4, name: as
findPar: as
realloc called with new size: paramvsize: 4
0:n
1:bbs
2:bbm
3:bbl

0:n
1:bbs
2:bbm
3:bbl
realloc done
findPar: as
addPar: paramvsize: 5, name: am
findPar: am
realloc called with new size: paramvsize: 5
0:n
1:bbs
2:bbm
3:bbl
4:as

0:n
1:bbs
2:bbm
3:bbl
4:as
realloc done
findPar: am
addPar: paramvsize: 6, name: al
findPar: al
realloc called with new size: paramvsize: 6
0:n
1:bbs
2:bbm
3:bbl
4:as
5:am

0:n
1:bbs
2:bbm
3:largeNet
4:as
5:am
realloc done
findPar: al
addPar: paramvsize: 7, name: bs
findPar: bs
realloc called with new size: paramvsize: 7
0:n
1:bbs
2:bbm
3:largeNet
4:as
5:am
6:al

0:n
1:bbs
2:bbm
Segmentation fault (core dumped)

As seen, the moveto loop goes wrong for paramvsize == 6 (6 old elements): element #3 should be 'bbl' after copying, but is 'largeNet'. That corrupted entry is what later leads to the segfault.
and moveto does:

void cPar::moveto(cPar& other)
{
    other.ownercomponent = ownercomponent;
    other.p = p;
    p =
# 62 "simulator/cpar.cc" 3 4
       __null
# 62 "simulator/cpar.cc"
           ;
}
Created attachment 45522 [details] vectorizer dump
Created attachment 45526 [details] Passing testcase I'm still not sure where the problem is coming in. The loop in the vector dump looks functionally correct now that I've had a chance to look at it more (contrary to my initial comment on IRC). It seems to be equivalent to the attached, which passed on an AVX2 box I found I had access to.
(In reply to rsandifo@gcc.gnu.org from comment #16)
> Created attachment 45526 [details]
> Passing testcase
>
> I'm still not sure where the problem is coming in. The loop in the vector
> dump looks functionally correct now that I've had a chance to look at it more
> (contrary to my initial comment on IRC). It seems to be equivalent to the
> attached, which passed on an AVX2 box I found I had access to.

But it fails on a skylake-avx512 machine.

Minimal test-case that fails:

$ cat avx.c
struct s { unsigned long a, b, c; };

void __attribute__ ((noipa))
f (struct s *restrict s1, struct s *restrict s2, int n)
{
  for (int i = 0; i < n; ++i)
    {
      s1[i].b = s2[i].b;
      s1[i].c = s2[i].c;
      s2[i].c = 0;
    }
}

#define N 6

int
main (void)
{
  struct s s1[N], s2[N];
  for (unsigned int j = 0; j < 6; ++j)
    {
      s2[j].a = j * 5;
      s2[j].b = j * 5 + 2;
      s2[j].c = j * 5 + 4;
    }
  f (s1, s2, 6);
  for (unsigned int j = 0; j < 6; ++j)
    if (s1[j].b != j * 5 + 2)
      {
        __builtin_printf ("wrong at: %d: is %d, should be %d\n", j, s1[j].b, j * 5 + 2);
        __builtin_abort ();
      }
  __builtin_printf ("OK\n");
  return 0;
}

$ gcc -march=skylake-avx512 avx.c -g && ./a.out && gcc -march=skylake-avx512 avx.c -g -O3 && ./a.out
OK
wrong at: 3: is 15, should be 17
Aborted (core dumped)
One can reproduce that with the Intel SDE simulator:
https://software.intel.com/protected-download/267266/144917

$ ./sde-external-8.16.0-2018-01-30-lin/sde -skx -- /tmp/a.out
wrong at: 3: is 15, should be 17
Aborted (core dumped)
OK. The .optimized dumps seem to be the same for both -mavx2 and -march=skylake-avx512. Things only diverge during expand. It looks like it might be a bug in:

(define_insn "<mask_codefor>avx512dq_shuf_<shuffletype>64x2_1<mask_name>"
  [(set (match_operand:VI8F_256 0 "register_operand" "=v")
	(vec_select:VI8F_256
	  (vec_concat:<ssedoublemode>
	    (match_operand:VI8F_256 1 "register_operand" "v")
	    (match_operand:VI8F_256 2 "nonimmediate_operand" "vm"))
	  (parallel [(match_operand 3 "const_0_to_3_operand")
		     (match_operand 4 "const_0_to_3_operand")
		     (match_operand 5 "const_4_to_7_operand")
		     (match_operand 6 "const_4_to_7_operand")])))]
  "TARGET_AVX512VL
   && (INTVAL (operands[3]) == (INTVAL (operands[4]) - 1)
       && INTVAL (operands[5]) == (INTVAL (operands[6]) - 1))"
{
  int mask;
  mask = INTVAL (operands[3]) / 2;
  mask |= (INTVAL (operands[5]) - 4) / 2 << 1;
  operands[3] = GEN_INT (mask);
  return "vshuf<shuffletype>64x2\t{%3, %2, %1, %0<mask_operand7>|%0<mask_operand7>, %1, %2, %3}";
}
  [(set_attr "type" "sselog")
   (set_attr "length_immediate" "1")
   (set_attr "prefix" "evex")
   (set_attr "mode" "XI")])

which AFAICT requires, without checking, that operands 3 and 5 are even (0 or 2 and 4 or 6 respectively). In this case we're using it to match:

(insn 40 39 41 6 (set (reg:V4DI 101 [ vect__5.17 ])
        (vec_select:V4DI (vec_concat:V8DI (reg:V4DI 98 [ vect__5.14 ])
                (reg:V4DI 140 [ vect__5.15 ]))
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 5 [0x5])
                    (const_int 6 [0x6])
                ]))) "/tmp/foo.c":8:22 4069 {*avx512dq_shuf_i64x2_1}
     (nil))

and treat the permute mask as {2, 3, 4, 5} instead.
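To make the suspected mismatch concrete, here is a small sketch in plain C. It repeats the immediate arithmetic from the output template quoted above and then expands that immediate back into element indices, following my understanding of the 256-bit form of the instruction (bit 0 picks a 128-bit lane of the first source, bit 1 picks a 128-bit lane of the second). The helper names encode_imm and decode_imm are hypothetical and exist only for this illustration; they are not GCC functions.

```c
/* Sketch only: models the immediate computed by the
   avx512dq_shuf_<shuffletype>64x2_1 output template for 256-bit
   operands.  encode_imm and decode_imm are hypothetical helper names,
   not GCC functions.  */

/* sel[0..3] are the vec_select indices (operands 3..6 in the pattern).
   This repeats the arithmetic from the template: only sel[0] and sel[2]
   are inspected, and both are divided by 2, so an odd start index is
   silently rounded down.  */
static int
encode_imm (const int sel[4])
{
  int mask = sel[0] / 2;
  mask |= (sel[2] - 4) / 2 << 1;
  return mask;
}

/* Expand the 2-bit immediate back into the four 64-bit element indices
   that end up selected from the concatenation of the two sources.  */
static void
decode_imm (int imm, int out[4])
{
  int lo = (imm & 1) * 2;            /* first source: elements 0..3 */
  int hi = 4 + ((imm >> 1) & 1) * 2; /* second source: elements 4..7 */
  out[0] = lo;
  out[1] = lo + 1;
  out[2] = hi;
  out[3] = hi + 1;
}
```

With the selector from insn 40, {2, 3, 5, 6}, encode_imm returns 1 and decode_imm (1, out) fills out with {2, 3, 4, 5}, which is the mis-selection described above.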
I'm not really best placed to fix or test this.
I'll handle this.
Even more reduced testcase:

typedef long long int V __attribute__((vector_size (4 * sizeof (long long int))));

__attribute__((noipa)) void
foo (V *p)
{
  p[0] = __builtin_shuffle (p[1], p[2], (V) { 2, 3, 5, 6 });
}

int
main ()
{
  V a[3] = { { 0, 0, 0, 0 }, { 10, 11, 12, 13 }, { 14, 15, 16, 17 } };
  foo (a);
  if (a[0][0] != 12 || a[0][1] != 13 || a[0][2] != 15 || a[0][3] != 16)
    __builtin_abort ();
  return 0;
}

Works with -O2 -mavx2, aborts with -O2 -mavx512vl.
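For reference, the values the testcase checks for can be derived with a scalar model of a two-operand shuffle: each selector element indexes into the concatenation of the two inputs. The sketch below is plain C without vector extensions; ref_shuffle4 is a hypothetical name used only for this illustration.

```c
/* Scalar model of a two-operand shuffle on 4-element vectors: index k
   selects from the concatenation of a (k = 0..3) and b (k = 4..7).
   ref_shuffle4 is a hypothetical name used only for illustration.  */
static void
ref_shuffle4 (const long long a[4], const long long b[4],
              const int sel[4], long long out[4])
{
  for (int i = 0; i < 4; i++)
    {
      int k = sel[i] & 7;  /* indices are taken modulo 2 * 4 */
      out[i] = k < 4 ? a[k] : b[k - 4];
    }
}
```

With a = {10, 11, 12, 13}, b = {14, 15, 16, 17} and the selector {2, 3, 5, 6} this gives {12, 13, 15, 16}, the values main expects. The selector {2, 3, 4, 5} would give {12, 13, 14, 15} instead.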
Created attachment 45528 [details] gcc9-pr87214-wip.patch Untested fix. Still need to cover all the changes with testcases.
Author: jakub
Date: Sun Jan 27 11:56:44 2019
New Revision: 268310

URL: https://gcc.gnu.org/viewcvs?rev=268310&root=gcc&view=rev
Log:
	PR target/87214
	* config/i386/sse.md
	(<mask_codefor>avx512dq_shuf_<shuffletype>64x2_1<mask_name>,
	avx512f_shuf_<shuffletype>64x2_1<mask_name>): Ensure the
	first constants in pairs are multiples of 2.  Formatting fixes.
	(avx512vl_shuf_<shuffletype>32x4_1<mask_name>,
	avx512vl_shuf_<shuffletype>32x4_1<mask_name>): Ensure the
	first constants in each quadruple are multiples of 4.  Formatting fixes.

	* gcc.target/i386/avx512vl-pr87214-1.c: New test.
	* gcc.target/i386/avx512vl-pr87214-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr87214-1.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr87214-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
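The two conditions named in the log entry can be written out as plain C predicates. This is only my reading of the ChangeLog wording for the 256-bit case, not the condition text that was actually committed to sse.md, and the function names shuf64x2_256_ok and shuf32x4_256_ok are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical predicate, not the actual sse.md condition: accept a
   256-bit 64x2 selector only if each pair starts at an even index and
   is consecutive.  sel[0..1] index the first source (0..3), sel[2..3]
   the second source (4..7).  */
static bool
shuf64x2_256_ok (const int sel[4])
{
  return sel[0] >= 0 && sel[0] <= 3
	 && sel[2] >= 4 && sel[2] <= 7
	 && sel[0] % 2 == 0 && sel[1] == sel[0] + 1
	 && sel[2] % 2 == 0 && sel[3] == sel[2] + 1;
}

/* Same idea for a 256-bit 32x4 selector: two quadruples of 32-bit
   element indices, the first taken from 0..7 and the second from
   8..15.  Each quadruple must start at a multiple of 4 and be
   consecutive.  */
static bool
shuf32x4_256_ok (const int sel[8])
{
  for (int q = 0; q < 2; q++)
    {
      const int *p = sel + 4 * q;
      int base = 8 * q;
      if (p[0] < base || p[0] > base + 7 || p[0] % 4 != 0)
	return false;
      for (int i = 1; i < 4; i++)
	if (p[i] != p[0] + i)
	  return false;
    }
  return true;
}
```

Under this reading the selector {2, 3, 5, 6} from the reduced testcase is rejected, because the second pair starts at an odd index, while {2, 3, 4, 5} is still accepted.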
Fixed. Will backport to release branches eventually though, as it is latent there.
Author: jakub
Date: Thu Feb  7 14:42:54 2019
New Revision: 268633

URL: https://gcc.gnu.org/viewcvs?rev=268633&root=gcc&view=rev
Log:
	Backported from mainline
	2019-01-27  Jakub Jelinek  <jakub@redhat.com>

	PR target/87214
	* config/i386/sse.md
	(<mask_codefor>avx512dq_shuf_<shuffletype>64x2_1<mask_name>,
	avx512f_shuf_<shuffletype>64x2_1<mask_name>): Ensure the
	first constants in pairs are multiples of 2.  Formatting fixes.
	(avx512vl_shuf_<shuffletype>32x4_1<mask_name>,
	avx512vl_shuf_<shuffletype>32x4_1<mask_name>): Ensure the
	first constants in each quadruple are multiples of 4.  Formatting fixes.

	* gcc.target/i386/avx512vl-pr87214-1.c: New test.
	* gcc.target/i386/avx512vl-pr87214-2.c: New test.

Added:
    branches/gcc-8-branch/gcc/testsuite/gcc.target/i386/avx512vl-pr87214-1.c
    branches/gcc-8-branch/gcc/testsuite/gcc.target/i386/avx512vl-pr87214-2.c
Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/config/i386/sse.md
    branches/gcc-8-branch/gcc/testsuite/ChangeLog
Author: jakub
Date: Fri Aug 30 11:32:15 2019
New Revision: 275092

URL: https://gcc.gnu.org/viewcvs?rev=275092&root=gcc&view=rev
Log:
	Backported from mainline
	2019-01-27  Jakub Jelinek  <jakub@redhat.com>

	PR target/87214
	* config/i386/sse.md
	(<mask_codefor>avx512dq_shuf_<shuffletype>64x2_1<mask_name>,
	avx512f_shuf_<shuffletype>64x2_1<mask_name>): Ensure the
	first constants in pairs are multiples of 2.  Formatting fixes.
	(avx512vl_shuf_<shuffletype>32x4_1<mask_name>,
	avx512vl_shuf_<shuffletype>32x4_1<mask_name>): Ensure the
	first constants in each quadruple are multiples of 4.  Formatting fixes.

	* gcc.target/i386/avx512vl-pr87214-1.c: New test.
	* gcc.target/i386/avx512vl-pr87214-2.c: New test.

Added:
    branches/gcc-7-branch/gcc/testsuite/gcc.target/i386/avx512vl-pr87214-1.c
    branches/gcc-7-branch/gcc/testsuite/gcc.target/i386/avx512vl-pr87214-2.c
Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/config/i386/sse.md
    branches/gcc-7-branch/gcc/testsuite/ChangeLog