[Bug rtl-optimization/104034] New: Miscompilation of LLVM on s390x with -march=z13 -mtune=z14 in GCC 8.x

Fri Jan 14 19:16:33 GMT 2022

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104034

            Bug ID: 104034
           Summary: Miscompilation of LLVM on s390x with -march=z13
                    -mtune=z14 in GCC 8.x
           Product: gcc
           Version: 8.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: krebbel at gcc dot gnu.org
  Target Milestone: ---

Created attachment 52194
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52194&action=edit
Testcase

Initial analysis done by Jakub Jelinek as part of:
https://bugzilla.redhat.com/show_bug.cgi?id=2028609

The following testcase is miscompiled on s390x with
g++ -fPIC -fvisibility-inlines-hidden -ffunction-sections -fdata-sections -O2
-fPIC  -fno-exceptions -fno-rtti -std=c++14 -mlong-double-128 -march=z13
-mtune=z14
both with the RHEL gcc 8.x and with upstream 8.5.0.
When miscompiled, it prints something like
__insertion_sort 0x3ffd74fd310 0x3ffd74fd348 0xdeadbeefcafebabe 0xdeadbeefcafebabe
__insertion_sort 0x3ffd74fd348 0x3ffd74fd348 0x10006b8 0xdeadbeefcafebabe
rather than
__insertion_sort 0x3ffd74fd310 0x3ffd74fd348 0x10006b8 0xdeadbeefcafebabe
__insertion_sort 0x3ffd74fd348 0x3ffd74fd348 0x10006b8 0xdeadbeefcafebabe

The interesting part is below, .cfi_* directives removed for brevity.
On entry, this function has 3 pointers in the %r2, %r3 and %r4 registers, and
%r5 is a pointer to the 16-byte function_ref<decltype(foo)> object, a
trivially copyable class containing two 8-byte members.
_ZSt24__merge_sort_with_bufferIPPvS1_N4llvm12function_refIFbS0_S0_EEEEvT_S6_T0_T1_:
        stmg    %r6,%r15,48(%r15)
        lgr     %r14,%r15
        lay     %r15,-248(%r15)
        aghi    %r14,-32
        std     %f8,0(%r14)
        std     %f12,8(%r14)
        std     %f14,16(%r14)
        std     %f9,24(%r14)
        sgrk    %r11,%r3,%r2
        lgr     %r1,%r4
        srag    %r13,%r11,3
        agr     %r1,%r11
        lmg     %r8,%r9,0(%r5)
        stmg    %r8,%r9,160(%r15)
! The above stores the whole 16-byte function_ref correctly to %r15+160
        cgijle  %r11,48,.L13
        vlvgp   %v0,%r8,%r9
        ldgr    %f9,%r1
        ldgr    %f12,%r4
        la      %r1,200(%r15)
        lgr     %r10,%r3
        stg     %r11,176(%r15)
        ldgr    %f8,%r2
        lgr     %r6,%r9
        vlgvg   %r7,%v0,1
        stmg    %r8,%r9,184(%r15)
! So does the above
        lgr     %r8,%r1
.L14:
        la      %r11,56(%r2)
        lgr     %r4,%r8
        lgr     %r3,%r11
        stmg    %r6,%r7,200(%r15)
! But this one actually stores the same value in both 8-byte words at
! %r15+200, and %r15+200 is what gets passed as %r4 to the function
        brasl   %r14,_ZSt16__insertion_sortIPPvN4llvm12function_refIFbS0_S0_EEEEvT_S6_T0_@PLT
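
To make the register flow in the listing above easier to follow, here is a
rough C++ model of just the instructions that feed the stmg at %r15+200. It is
a sketch, not compiler output: the names Words, vlvgp, vlgvg, emitted_sequence
and intended_sequence are made up for illustration, and it assumes that vlvgp
puts its first GPR into doubleword element 0 (the high-order half) and its
second GPR into element 1.

```cpp
#include <cstdint>

// Two 8-byte words, in the order stmg writes them to memory.
struct Words {
  std::uint64_t w0;
  std::uint64_t w1;
};

// Hypothetical model of "vlvgp %v,%ra,%rb": ra -> element 0, rb -> element 1.
static Words vlvgp(std::uint64_t ra, std::uint64_t rb) { return {ra, rb}; }

// Hypothetical model of "vlgvg %r,%v,idx": read doubleword element idx.
static std::uint64_t vlgvg(const Words &v, unsigned idx) {
  return idx == 0 ? v.w0 : v.w1;
}

// What the miscompiled code does before "stmg %r6,%r7,200(%r15)".
Words emitted_sequence(std::uint64_t r8, std::uint64_t r9) {
  Words v0 = vlvgp(r8, r9);         // vlvgp %v0,%r8,%r9
  std::uint64_t r6 = r9;            // lgr   %r6,%r9
  std::uint64_t r7 = vlgvg(v0, 1);  // vlgvg %r7,%v0,1
  return {r6, r7};
}

// What a faithful copy of the 16-byte object would look like instead
// (assumption: %r6 should receive element 0 of %v0).
Words intended_sequence(std::uint64_t r8, std::uint64_t r9) {
  Words v0 = vlvgp(r8, r9);
  std::uint64_t r6 = vlgvg(v0, 0);
  std::uint64_t r7 = vlgvg(v0, 1);
  return {r6, r7};
}
```

Feeding in the two values from the expected output (0x10006b8 and
0xdeadbeefcafebabe), emitted_sequence returns 0xdeadbeefcafebabe in both words,
which matches the first line of the miscompiled output, while
intended_sequence returns the pair unchanged.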

In *.postreload, the RTL is still correct:
(insn 16 12 166 2 (set (reg/v:TI 16 %f0 [orig:69 __comp ] [69])
        (reg:TI 8 %r8)) 1268 {movti}
     (nil))
...
(insn 137 136 140 3 (set (reg/v:TI 6 %r6 [orig:69 __comp ] [69])
        (reg/v:TI 16 %f0 [orig:69 __comp ] [69])) 1268 {movti}
     (nil))
The code spills it to the 128-bit %f0 register and loads it back from there.
Next, the split2 pass splits the latter (but not the former) into:
(insn 167 136 168 3 (set (reg:DI 6 %r6 [ __comp ])
        (reg:DI 16 %f0)) 1269 {*movdi_64}
     (nil))
(insn 168 167 140 3 (set (reg:DI 7 %r7 [orig:69 __comp+8 ] [69])
        (unspec:DI [
                (reg:V2DI 16 %f0)
                (const_int 1 [0x1])
            ] UNSPEC_VEC_EXTRACT)) 402 {*vec_extractv2di}
     (nil))
and finally cprop_hardreg seeing
(insn 187 188 186 3 (set (reg/v:TI 16 %f0 [orig:69 __comp ] [69])
        (reg:TI 8 %r8)) 1268 {movti}
     (nil))
changes insn 167 to:
(insn 167 136 168 3 (set (reg:DI 6 %r6 [ __comp ])
        (reg:DI 9 %r9 [16])) 1269 {*movdi_64}
     (nil))
I'm not sure if this is a bug in the
; Split a VR -> GPR TImode move into 2 vector load GR from VR element.
; For the higher order bits we do simply a DImode move while the
; second part is done via vec extract.  Both will end up as vlgvg.
(define_split
  [(set (match_operand:TI 0 "register_operand" "")
        (match_operand:TI 1 "register_operand" ""))]
  "TARGET_VX && reload_completed
   && GENERAL_REG_P (operands[0])
   && VECTOR_REG_P (operands[1])"
  [(set (match_dup 2) (match_dup 4))
   (set (match_dup 3) (unspec:DI [(match_dup 5) (const_int 1)]
                                 UNSPEC_VEC_EXTRACT))]
{
  operands[2] = operand_subword (operands[0], 0, 0, TImode);
  operands[3] = operand_subword (operands[0], 1, 0, TImode);
  operands[4] = gen_rtx_REG (DImode, REGNO (operands[1]));
  operands[5] = gen_rtx_REG (V2DImode, REGNO (operands[1]));
})
splitter, in cprop_hardreg, or in the s390x representation of those TImodes in
floating point registers.
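
One way to picture the disagreement between the splitter and cprop_hardreg is
the small C++ sketch below. It is only an illustration under assumptions: the
names TI, GprPair, DiView and split_vr_to_gpr are hypothetical, and the
"LowDoubleword" view is inferred from the fact that cprop_hardreg substituted
%r9 (the low-order half of the %r8/%r9 pair) for (reg:DI 16 %f0). The splitter
comment, on the other hand, treats that DImode read as the higher order bits.

```cpp
#include <cstdint>

// A 128-bit value as two doublewords, high-order half first.
struct TI {
  std::uint64_t hi;
  std::uint64_t lo;
};

// Destination GPR pair, e.g. %r6 (even) and %r7 (odd).
struct GprPair {
  std::uint64_t even;
  std::uint64_t odd;
};

// Which half a DImode read of the TImode source register is taken to mean.
enum class DiView { HighDoubleword, LowDoubleword };

// Model of the split: word 0 is a DImode read of the source register,
// word 1 is the vec extract of element 1 (the low-order doubleword).
GprPair split_vr_to_gpr(const TI &src, DiView view) {
  std::uint64_t word0 = (view == DiView::HighDoubleword) ? src.hi : src.lo;
  std::uint64_t word1 = src.lo;
  return {word0, word1};
}
```

Under the HighDoubleword view the pair is copied faithfully; under the
LowDoubleword view both destination registers end up with the low-order
doubleword, which is the pattern seen after cprop_hardreg rewrites insn 167.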

In GCC 9 it got "fixed" with https://gcc.gnu.org/r9-3763-gef976be1a23a517, but
that just means the bug went latent.
I can't reproduce it on the upstream GCC 9 branch even with r9-3763 reverted,
because some RA decisions changed.
That doesn't mean the problem isn't still latent on trunk; the above splitter
is certainly still there.