Created attachment 33248 [details] sha.c The following testcase is miscompiled on s390-linux, starting with r207605 (including latest trunk), with e.g. -O2 -m31 -march=z196 -mtune=z10 options.
In *.sched1 I see no significant insn changes in between r207604 (works) and r207605 (aborts), just r14 is mentioned live on entry of various bbs. The testcase fails at r207605 also with -O2 -m31 -march=z196 -mtune=z10 -mno-lra. Richard or Andreas, can you please have a look?
-fno-shrink-wrap doesn't help.
Note, when I replace the if+abort with a printout of all the 8 c.h[] values and take a poll of the results from: ./cc1.208700 -m31 -O0 sha.c -o sha0.s -quiet -nostdinc ./cc1.207604 -nostdinc -quiet -O2 -m31 -march=z196 -mtune=z10 sha.c -o sha1.s ./cc1.207605 -nostdinc -quiet -O2 -m31 -march=z196 -mtune=z10 sha.c -o sha2.s ./cc1.207604 -nostdinc -quiet -O2 -mno-lra -m31 -march=z196 -mtune=z10 sha.c -o sha3.s ./cc1.207605 -nostdinc -quiet -O2 -m31 -march=z196 -mno-lra -mtune=z10 sha.c -o sha4.s /usr/src/gcc/objz/gcc/cc1 -nostdinc -quiet -O2 -m31 -march=z196 -mtune=z10 sha.c -o sha5.s /usr/src/gcc/objz/gcc/cc1 -nostdinc -quiet -O2 -m31 -march=z196 -mtune=z10 sha.c -o sha6.s -mno-lra where /usr/src/gcc/objz/gcc/cc1 is trunk from around Aug 5th, I get different results from the testcase, and also if I remove the last line or two lines from the loop in the function. But which function have different result from the -O0 results depends on how many lines are removed. With no lines removed (G for results matching -O0, B for different result): GGBGBBB With the last line removed the results are GBGBBGB. With the last two lines removed GBGBBGG, when 3 or more lines from the loop are removed, all results are the same (i.e. all Gs). As I believe the testcase doesn't have undefined behavior with any of the loop lines removed, this shows that perhaps the shrink wrapping case just changed things big enough that some RA or later issue either started appearing, or went away.
I've applied on top of r207604 just the https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/config/s390/s390.h?r1=207605&r2=207604&pathrev=207605 change and the testcase was miscompiled too, so the bug is just triggered by different RA decisions from having reg 14 live on the epilogue. The miscompilation goes away with -fno-schedule-insns2, but also with -fno-schedule-insns -fschedule-insns2, but as I said earlier, appart from different live in/out lr in/out (including r14 there) *.sched1 dump doesn't contain any real insn changes. Note only -mtune=z10 is miscompiled, other tuning options are ok.
While r207605 may be the trigger for the failure for this test. I can twiddle the test so that it fails with 207604 and 207605. I'm starting to wonder if 207605 just tripped something latent. I'm still looking, but we may have been totally barking up the wrong tree focusing on 207605.
It's late and I need to catch some zzzs. But as I hinted in my prior update, I think we may chasing something latent. I would recommend looking very closely at r204497, which my bisecting implicated as the failing commit for a severely hacked up test. Reverting r204497 by hand on the trunk results in the original testcase working properly. I'm too tired to really analyze, but I think it's worth a looksie.
Oh, it's my commit, I will have a look if you are ok with that.
More eyes never hurt :-) This pair is going to bed.
(In reply to Jeffrey A. Law from comment #6) > It's late and I need to catch some zzzs. But as I hinted in my prior update, > I think we may chasing something latent. I would recommend looking very > closely at r204497, which my bisecting implicated as the failing commit for > a severely hacked up test. > > Reverting r204497 by hand on the trunk results in the original testcase > working properly. I'm too tired to really analyze, but I think it's worth a > looksie. I never said r207605 was the buggy change, after all, as only the EPILOGUE_USES definition itself makes the code misbehave, it is clear it only triggers a latent issue. For any case we need to find out what is actually miscompiled in the testcase and from that why. I've tried this morning following hack on top of r207605: --- sched-deps.c.xx 2014-08-11 12:05:39.000000000 +0200 +++ sched-deps.c 2014-08-12 09:51:48.540485357 +0200 @@ -3034,6 +3034,14 @@ sched_analyze_insn (struct deps_desc *de || (NONJUMP_INSN_P (insn) && control_flow_insn_p (insn))) reg_pending_barrier = MOVE_BARRIER; + if (!DEBUG_INSN_P (insn) && reload_completed && getenv ("SKIPB") && getenv ("SKIPE")) + { + static int cntskip; + ++cntskip; + if (cntskip > atoi (getenv ("SKIPB")) && cntskip < atoi (getenv ("SKIPE"))) + reg_pending_barrier = TRUE_BARRIER; + } + if (sched_pressure != SCHED_PRESSURE_NONE) { setup_insn_reg_uses (deps, insn); and it seems that with: ./cc1 -m31 -O2 -march=z196 -mtune=z10 sha.c -o sha1.s -quiet -nostdinc; SKIPB=6261 SKIPE=6263 ./cc1 -m31 -O2 -march=z196 -mtune=z10 sha.c -o sha2.s -quiet -nostdinc sha1.s is expectedly miscompiled, while sha2.s gives the correct result and differs just in a single true barrier during sched2, i.e. no changes in split2 dump, and just small amount of changes in the sched2 dump.
And with: --- sched-deps.c (revision 207605) +++ sched-deps.c (working copy) @@ -3034,6 +3034,21 @@ sched_analyze_insn (struct deps_desc *de || (NONJUMP_INSN_P (insn) && control_flow_insn_p (insn))) reg_pending_barrier = MOVE_BARRIER; + if (!DEBUG_INSN_P (insn) && reload_completed && getenv ("SKIPB") && getenv ("SKIPE")) + { + static int cntskip; + ++cntskip; + if (cntskip > atoi (getenv ("SKIPB")) && cntskip < atoi (getenv ("SKIPE"))) + reg_pending_barrier = TRUE_BARRIER; + } + if (!DEBUG_INSN_P (insn) && reload_completed && getenv ("FLUSHB") && getenv ("FLUSHE")) + { + static int cntskip; + ++cntskip; + if (cntskip > atoi (getenv ("FLUSHB")) && cntskip < atoi (getenv ("FLUSHE"))) + flush_pending_lists (deps, insn, true, true); + } + if (sched_pressure != SCHED_PRESSURE_NONE) { setup_insn_reg_uses (deps, insn); again, on top of r207605, and -O2 -m31 -march=z196 -mtune=z10, I get no miscompilation with FLUSHB=6261 FLUSHE=6263 ./cc1 -m31 -O2 -march=z196 -mtune=z10 sha.c -o sha3.s -quiet -nostdinc diff -up -U8 sha1.s sha3.s --- sha1.s 2014-08-12 10:40:14.813740998 +0200 +++ sha3.s 2014-08-12 10:46:58.425686712 +0200 @@ -6246,38 +6246,38 @@ sha512_block_data_order: lm %r2,%r3,312(%r15) x %r11,152(%r15) or %r5,%r8 l %r8,276(%r15) n %r9,216(%r15) srl %r4,7 n %r11,184(%r15) al %r3,252(%r8) - ahi %r8,128 - alc %r2,120(%r8) + alc %r2,248(%r8) + x %r0,1104(%r15) xr %r11,%r9 st %r11,1072(%r15) - xr %r1,%r5 n %r7,188(%r15) - x %r0,1104(%r15) - st %r8,276(%r15) + xr %r1,%r5 stm %r2,%r3,120(%r15) st %r0,1084(%r15) l %r3,116(%r15) sll %r3,25 lr %r6,%r3 l %r3,224(%r15) xr %r3,%r10 st %r3,1068(%r15) l %r3,156(%r15) l %r2,232(%r15) xr %r2,%r12 st %r2,1088(%r15) - lm %r10,%r11,120(%r15) n %r3,220(%r15) + ahi %r8,128 + lm %r10,%r11,120(%r15) + st %r8,276(%r15) al %r11,164(%r15) alc %r10,160(%r15) xr %r7,%r3 st %r7,1076(%r15) lr %r3,%r11 or %r6,%r4 lm %r4,%r5,304(%r15) al %r3,1068(%r15)
And the bug in that is that in sha1.s (the miscompiled one) sched2 wrongly scheduled ahi %r8,128 which clobbers %cc in between a %cc producer and %cc consumer. If I manually fix that up as follows: --- sha1.s 2014-08-12 08:39:50.694224146 -0400 +++ sha4.s 2014-08-12 09:38:58.914224146 -0400 @@ -6251,8 +6251,8 @@ sha512_block_data_order: srl %r4,7 n %r11,184(%r15) al %r3,252(%r8) + alc %r2,248(%r8) ahi %r8,128 - alc %r2,120(%r8) xr %r11,%r9 st %r11,1072(%r15) xr %r1,%r5 then the testcase succeeds. In *.split2 we have: (insn 11376 7193 11377 5 (parallel [ (set (reg:CCL1 33 %cc) (compare:CCL1 (plus:SI (reg:SI 3 %r3 [orig:2900 D.1817+4 ] [2900]) (mem:SI (plus:SI (reg:SI 8 %r8 [orig:65 ivtmp.30 ] [65]) (const_int 252 [0xfc])) [2 MEM[base: _25, offset: 248B]+4 S4 A32])) (reg:SI 3 %r3 [orig:2900 D.1817+4 ] [2900]))) (set (reg:SI 3 %r3 [orig:2900 D.1817+4 ] [2900]) (plus:SI (reg:SI 3 %r3 [orig:2900 D.1817+4 ] [2900]) (mem:SI (plus:SI (reg:SI 8 %r8 [orig:65 ivtmp.30 ] [65]) (const_int 252 [0xfc])) [2 MEM[base: _25, offset: 248B]+4 S4 A32]))) ]) 329 {*addsi3_carry1_cc} (nil)) (insn 11377 11376 7194 5 (parallel [ (set (reg:SI 2 %r2 [ D.1817 ]) (plus:SI (plus:SI (ltu:SI (reg:CCL1 33 %cc) (const_int 0 [0])) (reg:SI 2 %r2 [ D.1817 ])) (mem:SI (plus:SI (reg:SI 8 %r8 [orig:65 ivtmp.30 ] [65]) (const_int 248 [0xf8])) [2 MEM[base: _25, offset: 248B]+0 S4 A64]))) (clobber (reg:CC 33 %cc)) ]) 407 {*addsi3_alc} (expr_list:REG_DEAD (reg:CCL1 33 %cc) (expr_list:REG_UNUSED (reg:CC 33 %cc) (nil)))) ... (insn 2799 7206 7208 5 (parallel [ (set (reg:SI 8 %r8 [orig:65 ivtmp.30 ] [65]) (plus:SI (reg:SI 8 %r8 [orig:65 ivtmp.30 ] [65]) (const_int 128 [0x80]))) (clobber (reg:CC 33 %cc)) ]) 327 {*addsi3} (expr_list:REG_UNUSED (reg:CC 33 %cc) (nil))) In *.sched2 dump this is (+ is from normal r207605 compilation, - from FLUSHB=6261 FLUSHE=6263 (i.e. - is working one, + is miscompiled one): +(insn:TI 11376 2759 2799 5 (parallel [ (set (reg:CCL1 33 %cc) (compare:CCL1 (plus:SI (reg:SI 3 %r3 [orig:2900 D.1817+4 ] [2900]) (mem:SI (plus:SI (reg:SI 8 %r8 [orig:65 ivtmp.30 ] [65]) (const_int 252 [0xfc])) [2 MEM[base: _25, offset: 248B]+4 S4 A32])) (reg:SI 3 %r3 [orig:2900 D.1817+4 ] [2900]))) (set (reg:SI 3 %r3 [orig:2900 D.1817+4 ] [2900]) (plus:SI (reg:SI 3 %r3 [orig:2900 D.1817+4 ] [2900]) (mem:SI (plus:SI (reg:SI 8 %r8 [orig:65 ivtmp.30 ] [65]) (const_int 252 [0xfc])) [2 MEM[base: _25, offset: 248B]+4 S4 A32]))) ]) 329 {*addsi3_carry1_cc} (nil)) -(insn:TI 11377 11376 2773 5 (parallel [ +(insn:TI 2799 11376 11377 5 (parallel [ + (set (reg:SI 8 %r8 [orig:65 ivtmp.30 ] [65]) + (plus:SI (reg:SI 8 %r8 [orig:65 ivtmp.30 ] [65]) + (const_int 128 [0x80]))) + (clobber (reg:CC 33 %cc)) + ]) 327 {*addsi3} + (expr_list:REG_UNUSED (reg:CC 33 %cc) + (nil))) +(insn:TI 11377 2799 2763 5 (parallel [ (set (reg:SI 2 %r2 [ D.1817 ]) (plus:SI (plus:SI (ltu:SI (reg:CCL1 33 %cc) (const_int 0 [0])) (reg:SI 2 %r2 [ D.1817 ])) (mem:SI (plus:SI (reg:SI 8 %r8 [orig:65 ivtmp.30 ] [65]) - (const_int 248 [0xf8])) [2 MEM[base: _25, offset: 248B]+0 S4 A64]))) + (const_int 120 [0x78])) [2 MEM[base: _25, offset: 248B]+0 S4 A64]))) (clobber (reg:CC 33 %cc)) ]) 407 {*addsi3_alc} (expr_list:REG_DEAD (reg:CCL1 33 %cc) (expr_list:REG_UNUSED (reg:CC 33 %cc) (nil))))
Hi, I can reproduce the exact mis-scheduled instruction issue as in Jakub's comment with/without the ivopt patch (204497). Turns out gcc chooses candidate like {&K512, 128}_loop with the patch while candidate {&K512[16], 128}_loop without the patch. This is caused by candidate cost change. As far as I can see, both candidate are good for the iv use and shouldn't generate wrong code either. The interesting thing is, mis-scheduled instruction is the only code difference (apart from some offset difference as base of candidates are different) caused by ivopt patch. Normally middle-end patch should have larger impact on generated code if there is. I will have a look why different iv candidate can result in one instruction mis-scheduled. Maybe it's useful for investigation. Anyway, I think 204497 is generally fine in this case. Thanks, bin
Just verified the trunk and it is the same thing there, also sched2 generating: al %r3,252(%r8) ahi %r8,128 alc %r2,120(%r8) where the first insn is cc setter, second cc clobber and third cc user and clobber, so the r8=r8+128 insn should not have been scheduled in between those two. It seems scheduler has some code to adjust the address (248 -> 120) based on the r8 adjustment, otherwise it would be wrong to move it even because of r8 register, perhaps if we take that path we don't fully check other side effects (cc clobber in this case). s390 AFAIK doesn't have addition which doesn't clobber flags, unlike many other architectures.
Seems that is the http://gcc.gnu.org/ml/gcc-patches/2012-08/msg00155.html minipass, supposedly when it sees that it could break a reg dependency (on r8 in this case), it just ignores the fact that it can have other side effects. parse_add_or_inc uses single_set which will return the increment SET even if there are extra clobbers. Giving up would make the minipass useless on many architectures, so I think we just need to make sure that we ignore the reg dependency only on the address register (r8 here), but not for the clobbered regs.
So, what about this completely untested fix? Unfortunately the scheduler change has been committed with no testcases. I guess I'll do a bootstrap/regtest with some printout where this patch changes things and where we consider it at all. --- gcc/sched-deps.c.jj 2014-08-06 10:34:13.000000000 +0200 +++ gcc/sched-deps.c 2014-08-12 14:12:06.625193731 +0200 @@ -4751,6 +4751,24 @@ find_inc (struct mem_inc_info *mii, bool "inc conflicts with store failure.\n"); goto next; } + + /* The inc instruction could have clobbers, make sure those + registers are not used in mem insn. */ + FOR_EACH_INSN_DEF (def, mii->inc_insn) + if (!reg_overlap_mentioned_p (DF_REF_REG (def), mii->mem_reg0)) + { + df_ref use; + FOR_EACH_INSN_USE (use, mii->mem_insn) + if (reg_overlap_mentioned_p (DF_REF_REG (def), + DF_REF_REG (use))) + { + if (sched_verbose >= 5) + fprintf (sched_dump, + "inc clobber used in store failure.\n"); + goto next; + } + } + newaddr = mii->inc_input; if (mii->mem_index != NULL_RTX) newaddr = gen_rtx_PLUS (GET_MODE (newaddr), newaddr,
In reference to c#12. I think the ivopts changes are just setting up the situation that is mis-handled later. I'd gotten as far as seeing the +128 increment moving in the scheduler, but hadn't looked to see if it was valid. Anyway, so yes I think the ivopts stuff is fine. I should have realized I was chasing something of that nature when the bisection settled on the ivopts code as the trigger.
Author: jakub Date: Tue Aug 12 21:24:40 2014 New Revision: 213887 URL: https://gcc.gnu.org/viewcvs?rev=213887&root=gcc&view=rev Log: PR target/62025 * sched-deps.c (find_inc): Check if inc_insn doesn't clobber any registers that are used in mem_insn. Modified: trunk/gcc/ChangeLog trunk/gcc/sched-deps.c
Author: jakub Date: Tue Aug 12 22:03:37 2014 New Revision: 213888 URL: https://gcc.gnu.org/viewcvs?rev=213888&root=gcc&view=rev Log: PR target/62025 * sched-deps.c (find_inc): Check if inc_insn doesn't clobber any registers that are used in mem_insn. Modified: branches/gcc-4_9-branch/gcc/ChangeLog branches/gcc-4_9-branch/gcc/sched-deps.c
(In reply to Jeffrey A. Law from comment #16) > In reference to c#12. I think the ivopts changes are just setting up the > situation that is mis-handled later. I'd gotten as far as seeing the +128 > increment moving in the scheduler, but hadn't looked to see if it was valid. > > Anyway, so yes I think the ivopts stuff is fine. > > I should have realized I was chasing something of that nature when the > bisection settled on the ivopts code as the trigger. Yes, The scheduling behavior is triggered by specific offset in this case. It changes below insn sequence: 11405: ...... 11406: {%r2:SI=ltu(%cc:CCL1,0)+%r2:SI+[%r8:SI+0xf8];clobber %cc:CC;} ...... 2803: {%r8:SI=%r8:SI+0x80;clobber %cc:CC;} REG_UNUSED %cc:CC into: 11405: ...... 2803: {%r8:SI=%r8:SI+0x80;clobber %cc:CC;} REG_UNUSED %cc:CC 11406: {%r2:SI=ltu(%cc:CCL1,0)+%r2:SI+[%r8:SI+0x78];clobber %cc:CC;} by changing the offset in insn 11406. The problem is why minipass in http://gcc.gnu.org/ml/gcc-patches/2012-08/msg00155.html would do this transformation. According to the description, it is to change rn++ rm=[rn] into rm=[rn+4] rn++ Here it is exactly the opposite tranformation and introducing more dependency. Thanks.
Created attachment 33314 [details] pr62025.c
Created attachment 33315 [details] pr62025-2.c Possible testcase for the testsuite. Except this one only used to be miscompiled with trunk and not 4.9 branch, and unfortunately also needs -march=z900 at least, so we'd need effective-target for 64-bit s390-*-*. Perhaps not worth adding this into the testsuite.
Author: jakub Date: Mon Sep 1 11:15:41 2014 New Revision: 214786 URL: https://gcc.gnu.org/viewcvs?rev=214786&root=gcc&view=rev Log: PR target/62025 * sched-deps.c (add_or_update_dep_1): If ask_dependency_caches returned DEP_PRESENT, make sure to set DEP_MULTIPLE on present_dep. (find_inc): Revert 2014-08-13 change. Modified: trunk/gcc/ChangeLog trunk/gcc/sched-deps.c
Author: jakub Date: Mon Sep 1 11:49:36 2014 New Revision: 214790 URL: https://gcc.gnu.org/viewcvs?rev=214790&root=gcc&view=rev Log: PR target/62025 * sched-deps.c (add_or_update_dep_1): If ask_dependency_caches returned DEP_PRESENT, make sure to set DEP_MULTIPLE on present_dep. (find_inc): Revert 2014-08-12 change. Modified: branches/gcc-4_9-branch/gcc/ChangeLog branches/gcc-4_9-branch/gcc/sched-deps.c
Should be fixed for 4.9.2+.