Bug 88856 - [8/9 Regression] gfortran producing wrong code with -funroll-loops
Summary: [8/9 Regression] gfortran producing wrong code with -funroll-loops
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 8.2.1
: P2 normal
Target Milestone: 8.3
Assignee: Andreas Krebbel
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2019-01-15 11:58 UTC by Matthias Klose
Modified: 2019-02-06 16:34 UTC (History)
5 users (show)

See Also:
Host:
Target: s390x-linux-gnu
Build:
Known to work: 7.4.0
Known to fail: 8.2.1, 9.0
Last reconfirmed: 2019-01-16 00:00:00


Attachments
qrsolv-reduc.f the miscompiled fortran file autoreduced from scipy (634 bytes, text/plain)
2019-02-01 17:16 UTC, Andreas Krebbel
Details
A C wrapper to call the qrsolv function in the fortran snippet (516 bytes, text/x-csrc)
2019-02-01 17:17 UTC, Andreas Krebbel
Details
experimental patch (498 bytes, patch)
2019-02-01 17:50 UTC, Andreas Krebbel
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Matthias Klose 2019-01-15 11:58:06 UTC
seen when build scipy on s390x with the current GCC 8 branch.  According to the Debian report [1] this is tracked down to miscompilation of one Fortran file with -funroll-loops. The Ubuntu report [2] has a "standalone" python test case.

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=915738
[2] https://bugs.launchpad.net/ubuntu/+source/python-scipy/+bug/1811798
Comment 1 Richard Biener 2019-01-15 13:37:23 UTC
I guess we need something better than a python testcase.
Comment 2 Jakub Jelinek 2019-01-16 11:56:13 UTC
Yeah, like that problematic source, all gfortran options used to compile that, any needed modules too + stubbed whatever it calls and whatever is needed in MAIN__ or main to reproduce, ideally with minimal dependencies.
If you know the exact problematic routine, see in the debugger how many times it is called and in which invocation of the routine it misbehaves and try to capture on which arguments it is called.  If you don't know the exact problematic routine, one can e.g. play with assembly bisection between -funroll-loops and no -funroll-loops, rename .L* labels in one of them so that it can be merged by hand more easily.  I think gfortran doesn't have optimize (0) attribute yet.
Comment 3 Andreas Krebbel 2019-01-17 10:17:36 UTC
I've tried building scipy 1.1.0 from github on a Fedora installation. The build already uses -funroll-loops. But I couldn't reproduce the problem with the resulting binary.

gcc version 8.0.1 20180324

Aurelien already tracked it down to a miscompilation of
scipy/optimize/minpack/qrsolv.f

This source file appears to contain just a single function (qrsolv) which is not too big. I think I can work with that after being able to reproduce the problem.

As Jakub mentioned the exact compiler cmdline would be good.
Comment 4 Andreas Krebbel 2019-01-28 10:00:41 UTC
I'm able to reproduce the problem now and will try to have a look.
Comment 5 Andreas Krebbel 2019-02-01 17:16:24 UTC
Created attachment 45586 [details]
qrsolv-reduc.f   the miscompiled fortran file autoreduced from scipy
Comment 6 Andreas Krebbel 2019-02-01 17:17:03 UTC
Created attachment 45587 [details]
A C wrapper to call the qrsolv function in the fortran snippet
Comment 7 Andreas Krebbel 2019-02-01 17:23:47 UTC
gfortran -O3 -march=zEC12 -funroll-loops -fpie qrsolv-reduc.f -c
gcc qrsolv-caller.c -c
gcc qrsolv-caller.o qrsolv-reduc.o -o t

r265191
./t
1.359429

r265193
./t
0.000000
Comment 8 Andreas Krebbel 2019-02-01 17:46:52 UTC
The r265193 patch was found via reghunt. However, it just reveals an underlying issue.

The problem can also be seen with mainline.

The miscompile happens in the following loop:
      do 110 j = 1, n
         if (sdiag(j) .eq. zero .and. nsing .eq. n) nsing = j - 1
         if (nsing .lt. n) wa(j) = zero
  110    continue

The problem appears to be rather related to ifcvt. ifcvt generates a load on condition for the sdiag(j) .eq. zero comparison by inserting insns: 2480, 2481, 2482:

265.ce2

(insn 915 918 916 88 (set (reg:DF 590 [ MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ])
        (mem:DF (reg/v/f:DI 239 [ sdiag ]) [2 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B]+0 S8 A64])) "min.qrsolv.f":51 1289 {*movdf_64dfp}
     (nil))
(insn 916 915 2480 88 (set (reg:CCZ 33 %cc)
        (compare:CCZ (reg:DF 590 [ MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ])
            (const_double:DF 0.0 [0x0.0p+0]))) "min.qrsolv.f":51 1255 {*cmpdf_ccs}
     (expr_list:REG_DEAD (reg:DF 590 [ MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ])
        (nil)))
(insn 2480 916 2481 88 (set (reg:SI 733)
        (const_int 0 [0])) 1274 {*movsi_zarch}
     (nil))
(insn 2481 2480 2482 88 (set (reg:CCZ 33 %cc)
        (compare:CCZ (reg:DF 590 [ MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ])
            (const_double:DF 0.0 [0x0.0p+0]))) 1255 {*cmpdf_ccs}
     (nil))
(insn 2482 2481 927 88 (set (reg/v:SI 109 [ nsing ])
        (if_then_else:SI (ne (reg:CCZ 33 %cc)
                (const_int 0 [0]))
            (reg/v:SI 109 [ nsing ])
            (reg:SI 733))) 1676 {*movsicc}
     (nil))
(note 927 2482 928 88 NOTE_INSN_DELETED)
(jump_insn 928 927 932 88 (parallel [
            (set (pc)
                (if_then_else (le (reg:SI 320 [ _444 ])
                        (reg/v:SI 109 [ nsing ]))
                    (label_ref:DI 943)
                    (pc)))
            (clobber (reg:CC 33 %cc))
        ]) "min.qrsolv.f":52 1260 {*cmp_and_br_signed_si}
     (expr_list:REG_UNUSED (reg:CC 33 %cc)
        (int_list:REG_BR_PROB 536870916 (nil)))
 -> 943)

In the backend we have that interesting splitter which triggers for the old and now obsolete compare in insn 916

(define_split
  [(set (match_operand 0 "cc_reg_operand")
	(compare (match_operand:FP 1 "register_operand")
		 (match_operand:FP 2 "const0_operand")))]
  "TARGET_HARD_FLOAT && REG_P (operands[1]) && dead_or_set_p (insn, operands[1])"
  [(parallel
    [(set (match_dup 0) (match_dup 3))
     (clobber (match_dup 1))])]
 {
   /* s390_match_ccmode requires the compare to have the same CC mode
      as the CC destination register.  */
   operands[3] = gen_rtx_COMPARE (GET_MODE (operands[0]),
				  operands[1], operands[2]);
 })

268.split1   insn 916 -> insn 2677
The REG_DEAD note becomes a clobber due to that

(insn 915 918 2677 105 (set (reg:DF 590 [ MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ])
        (mem:DF (reg/v/f:DI 239 [ sdiag ]) [2 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B]+0 S8 A64])) "min.qrsolv.f":51 1289 {*movdf_64dfp}
     (nil))
(insn 2677 915 2480 105 (parallel [
            (set (reg:CCZ 33 %cc)
                (compare:CCZ (reg:DF 590 [ MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ])
                    (const_double:DF 0.0 [0x0.0p+0])))
            (clobber (reg:DF 590 [ MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ]))
        ]) "min.qrsolv.f":51 -1
     (nil))
(insn 2480 2677 2481 105 (set (reg:SI 733)
        (const_int 0 [0])) 1274 {*movsi_zarch}
     (nil))
(insn 2481 2480 2482 105 (set (reg:CCZ 33 %cc)
        (compare:CCZ (reg:DF 590 [ MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ])
            (const_double:DF 0.0 [0x0.0p+0]))) 1255 {*cmpdf_ccs}
     (nil))

294.cprop_hardreg appears to mess up things: REG_DEAD note in insn 2677 does not appear to fit the used reg but CC is unused now

(insn 3107 927 915 121 (set (reg:DI 3 %r3 [1060])
        (mem/f/c:DI (plus:DI (reg/f:DI 15 %r15)
                (const_int 520 [0x208])) [3 sdiag+0 S8 A64])) "min.qrsolv.f":51 1270 {*movdi_64}
     (nil))
(insn 915 3107 2677 121 (set (reg:DF 19 %f6 [orig:590 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ] [590])
        (mem:DF (reg:DI 3 %r3 [1060]) [2 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B]+0 S8 A64])) "min.qrsolv.f":51 1289 {*movdf_64dfp}
     (expr_list:REG_DEAD (reg:DI 3 %r3 [1060])
        (nil)))
(insn 2677 915 3108 121 (parallel [
            (set (reg:CCZ 33 %cc)
                (compare:CCZ (reg:DF 19 %f6 [orig:590 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ] [590])
                    (const_double:DF 0.0 [0x0.0p+0])))
            (clobber (reg:DF 19 %f6 [orig:590 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ] [590]))
        ]) "min.qrsolv.f":51 1250 {*cmpdf_ccs_0}
     (expr_list:REG_DEAD (reg:DF 16 %f0 [orig:590 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ] [590])
        (expr_list:REG_UNUSED (reg:CCZ 33 %cc)
            (nil))))
(insn 3108 2677 2481 121 (set (reg:DF 20 %f1 [1061])
        (const_double:DF 0.0 [0x0.0p+0])) 1289 {*movdf_64dfp}
     (nil))
(insn 2481 3108 2480 121 (set (reg:CCZ 33 %cc)
        (compare:CCZ (reg:DF 16 %f0 [orig:590 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ] [590])
            (reg:DF 20 %f1 [1061]))) 1255 {*cmpdf_ccs}
     (expr_list:REG_DEAD (reg:DF 16 %f0 [orig:590 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ] [590])
        (nil)))

295.rtl_dce consequently removes insn 3107, 915, 2677

(insn 3108 927 2481 121 (set (reg:DF 20 %f1 [1061])
        (const_double:DF 0.0 [0x0.0p+0])) 1289 {*movdf_64dfp}
     (nil))
(insn 2481 3108 2480 121 (set (reg:CCZ 33 %cc)
        (compare:CCZ (reg:DF 16 %f0 [orig:590 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ] [590])
            (reg:DF 20 %f1 [1061]))) 1255 {*cmpdf_ccs}
     (expr_list:REG_DEAD (reg:DF 16 %f0 [orig:590 MEM[base: sdiag_143(D), index: ivtmp.67_240, offset: 0B] ] [590])
        (nil)))
(insn 2480 2481 2482 121 (set (reg:SI 0 %r0 [733])
        (const_int 0 [0])) 1274 {*movsi_zarch}
     (expr_list:REG_EQUIV (const_int 0 [0])
        (nil)))
(insn 2482 2480 928 121 (set (reg/v:SI 2 %r2 [orig:109 nsing ] [109])
        (if_then_else:SI (ne (reg:CCZ 33 %cc)
                (const_int 0 [0]))
            (reg/v:SI 2 %r2 [orig:109 nsing ] [109])
            (reg:SI 0 %r0 [733]))) 1676 {*movsicc}
     (expr_list:REG_DEAD (reg:CCZ 33 %cc)
        (expr_list:REG_DEAD (reg:SI 3 %r3 [733])
            (nil))))


To my understanding the data flow is already broken after ifcvt inserted the new compare while leaving the REG_DEAD note in the old. The new compare already appears to read a dead register here and this appears to trigger the rest of the unfortunate events.
Comment 9 Andreas Krebbel 2019-02-01 17:50:35 UTC
Created attachment 45588 [details]
experimental patch

That patch appears to fix the problem for me.
Comment 10 Jakub Jelinek 2019-02-01 17:59:46 UTC
I admit I have just a vague recollection of this, but I thought since df has been added, usually if a pass wants REG_DEAD notes, it needs to df_note_add_problem () and df_analyze should rebuild the REG_DEAD/REG_UNUSED notes.
Comment 11 Jakub Jelinek 2019-02-01 18:07:00 UTC
So, to me this looks like a backend bug, using dead_or_set_p in a splitter when the split passes don't really compute the note problem.  Seems s390 is the only backend that does this, other backends use dead_or_set_p either only in peephole2s (which is fine, peephole2 pass starts with
  df_set_flags (DF_LR_RUN_DCE);
  df_note_add_problem ();
  df_analyze ();
and even when many targets don't use df_or_set_p, they do use peep2_dead*), or (cris) in delayed branch scheduling (I believe that doesn't guarantee that either).  Can't what you are doing in the splitters be done in define_peephole2 instead?
Comment 12 Jakub Jelinek 2019-02-01 19:29:07 UTC
Segher on IRC says that removing REG_DEAD notes that aren't valid is the right thing, so paging others what they think.
Comment 13 Eric Botcazou 2019-02-01 19:43:38 UTC
> Segher on IRC says that removing REG_DEAD notes that aren't valid is the
> right thing, so paging others what they think.

Definitely not, passes are not required to maintain REG_DEAD/REG_UNUSED notes, it's the exclusive job of DF and we're better not open this can of worms.
Comment 14 Andreas Krebbel 2019-02-04 12:59:43 UTC
(In reply to Jakub Jelinek from comment #11)
> ... Can't what you are doing in the splitters be done in
> define_peephole2 instead?

Not that easy unfortunately.  peephole2 will run after reload. So the FP constant ok 0.0 will already be reloaded into a register first or pushed into literal pool. The point of doing the transformation is to avoid this.
Comment 15 Jakub Jelinek 2019-02-04 13:07:52 UTC
(In reply to Andreas Krebbel from comment #14)
> (In reply to Jakub Jelinek from comment #11)
> > ... Can't what you are doing in the splitters be done in
> > define_peephole2 instead?
> 
> Not that easy unfortunately.  peephole2 will run after reload. So the FP
> constant ok 0.0 will already be reloaded into a register first or pushed
> into literal pool. The point of doing the transformation is to avoid this.

But it is still invalid.  If those splitters are essentially to you and worth slowing the compiler on s390*, then you could e.g. add a custom target pass right before split1, where you'd just df_note_addr_problem (); df_analyze (); and thus ensured that during the splitting you could use the REG_DEAD/REG_UNUSED notes safely.  In the splitters you'd likely need to use current_pass to verify you are in the pass for which you've done this.
Comment 16 Andreas Krebbel 2019-02-05 15:23:51 UTC
I'll commit a patch which just removes the splitter for now. I'll try to come up with a nicer testcase.
Comment 17 Jakub Jelinek 2019-02-05 15:26:07 UTC
(In reply to Andreas Krebbel from comment #16)
> I'll commit a patch which just removes the splitter for now. I'll try to
> come up with a nicer testcase.

All 3 s390 splitters that do this?
Comment 18 Andreas Krebbel 2019-02-05 17:14:43 UTC
Author: krebbel
Date: Tue Feb  5 17:14:11 2019
New Revision: 268550

URL: https://gcc.gnu.org/viewcvs?rev=268550&root=gcc&view=rev
Log:
S/390: Remove load and test fp splitter

gcc/ChangeLog:

2019-02-05  Andreas Krebbel  <krebbel@linux.ibm.com>

	PR target/88856
	* config/s390/s390.md: Remove load and test FP splitter.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/s390/s390.md
Comment 19 Andreas Krebbel 2019-02-05 17:17:32 UTC
Author: krebbel
Date: Tue Feb  5 17:17:00 2019
New Revision: 268551

URL: https://gcc.gnu.org/viewcvs?rev=268551&root=gcc&view=rev
Log:
S/390: Remove load and test fp splitter

gcc/ChangeLog:

2019-02-05  Andreas Krebbel  <krebbel@linux.ibm.com>

	Backport from mainline
	2019-02-05  Andreas Krebbel  <krebbel@linux.ibm.com>

	PR target/88856
	* config/s390/s390.md: Remove load and test FP splitter.


Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/config/s390/s390.md
Comment 20 Andreas Krebbel 2019-02-05 17:19:58 UTC
Author: krebbel
Date: Tue Feb  5 17:19:26 2019
New Revision: 268552

URL: https://gcc.gnu.org/viewcvs?rev=268552&root=gcc&view=rev
Log:
S/390: Remove load and test fp splitter

gcc/ChangeLog:

2019-02-05  Andreas Krebbel  <krebbel@linux.ibm.com>

	Backport from mainline
	2019-02-05  Andreas Krebbel  <krebbel@linux.ibm.com>

	PR target/88856
	* config/s390/s390.md: Remove load and test FP splitter.


Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/config/s390/s390.md
Comment 21 Andreas Krebbel 2019-02-05 18:23:46 UTC
(In reply to Jakub Jelinek from comment #17)
> (In reply to Andreas Krebbel from comment #16)
> > I'll commit a patch which just removes the splitter for now. I'll try to
> > come up with a nicer testcase.
> 
> All 3 s390 splitters that do this?

I've only removed the load and test splitter for now. The other two are only used for access register setters. There is only that one user in Glibc and we have it that way since the very beginning. I will revisit these for GCC >9 but would rather leave them in for now.
Comment 22 Jakub Jelinek 2019-02-06 16:33:40 UTC
What we could do there is remove the first of those two splitters, remove the
&& !dead_or_set_p (insn, operands[1])
test from the second, and add peephole2 that would transform
   (set (access part 1) (subreg:SI (match_dup 1) low))
   (set (match_dup 1) (rotate:DI (match_dup 1) (const_int 32)))
   (set (access part 2) (subreg:SI (match_dup 1) low))
with a lshiftrt instead of rotate if reg 1 is dead after the third insn (assuming rotate is more expensive as right shift, if it is the same expensive/same size, then having the two splitters makes no sense).  The last rotate should have been removed by DCE already if it was truly dead (though, of course, if it for some reason isn't yet, you could have another peephole2 for that too).
Comment 23 Jakub Jelinek 2019-02-06 16:34:42 UTC
That said, the regression is fixed now.