Created attachment 49962 [details] bug test file a, compiler option: cc1 -mabi=lp64d -march=rv64gc -O2 -S b, hot code in function t_run_test: j .L30 .L39: mv a4,a3 .L30: ld a2,8(a5) addi a3,a4,1 slli t3,a4,3 ble a2,a1,.L28 ld t5,0(a5) bge a1,t5,.L50 .L28: addi a5,a5,8 bne a3,a0,.L39 : hot code loop to .L39 better code in version 8.4 with same compiler option: ===================================================== .L30: ld t1,8(a4) slli a7,a5,3 ble t1,a3,.L28 ld t4,0(a4) bge a3,t4,.L50 .L28: addi a5,a5,1 addi a4,a4,8 bne a5,t3,.L30 : hot code loop to .L30 v10.2.0 gcc has more one instruction than v8.4.0. analize gcc pass of source code in v10.2.0: =========================================== before pass fr4: ---------------- <bb 8> [local count: 82176881]: engLoad.11_20 = engLoad; loadValue.13_26 = loadValue; _410 = (unsigned long) numXEntries.17_218; _409 = _410 + 18446744073709551615; _408 = (long int) _409; ... ... <bb 12> [local count: 986782143]: i1_174 = i1_6 + 1; if (i1_174 != _408) goto <bb 9>; [94.50%] else goto <bb 13>; [5.50%] <bb 13> [local count: 54273018]: # i1_420 = PHI <i1_174(12)> _433 = (long unsigned int) i1_420; _434 = _433 + 1; _435 = _434 * 8; _436 = i1_420 + 1; _440 = _435 - 8; _442 = engLoad.11_20 + _440; goto <bb 15>; [100.00%] after pass fr4: --------------- <bb 8> [local count: 82176881]: engLoad.11_20 = engLoad; loadValue.13_26 = loadValue; _410 = (unsigned long) numXEntries.17_218; _409 = _410 + 18446744073709551615; ... ... <bb 12> [local count: 986782143]: i1_174 = i1_6 + 1; if (i1_174 != _213) goto <bb 9>; [94.50%] else goto <bb 13>; [5.50%] <bb 13> [local count: 54273018]: _433 = (long unsigned int) i1_174; _434 = _433 + 1; _435 = _434 * 8; _436 = i1_174 + 1; _440 = _435 - 8; _442 = engLoad.11_20 + _440; goto <bb 15>; [100.00%] pass fr4 remove 'Removing dead stmt _408 = (long int) _409;', pass dom3 can't optimize this <bb 13> about '_433 = (long unsigned int) i1_174;' if <bb 13> use i1_174 node same as <bb 12>, so that conflict will be happened in pass expand on processing coalesced ssa/phi nodes, and then will split edge. need help ....:)
The analysis sounds a bit confused. What is the transform that DOM cannot do after the transform that FRE does? There's some older bug about out-of-SSA coalescing issues with loops and liveness of induction variables but it's not clear if this is related (the assembly doesn't show the loop exit block). Can you name the loop in the source that is problematic? See PR86270 and PR70359
(In reply to Richard Biener from comment #1) > The analysis sounds a bit confused. What is the transform that DOM cannot > do after the transform that FRE does? There's some older bug about > out-of-SSA > coalescing issues with loops and liveness of induction variables but it's > not clear if this is related (the assembly doesn't show the loop exit block). > > Can you name the loop in the source that is problematic? > see this loop: for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ ) { if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1] ) ) { break ; } }
(In reply to jojo from comment #2) > (In reply to Richard Biener from comment #1) > > The analysis sounds a bit confused. What is the transform that DOM cannot > > do after the transform that FRE does? There's some older bug about > > out-of-SSA > > coalescing issues with loops and liveness of induction variables but it's > > not clear if this is related (the assembly doesn't show the loop exit block). > > > > Can you name the loop in the source that is problematic? > > > > see this loop: > > for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ ) > { > if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1] > ) ) > { > break ; > } > } Richi: Can you please update Known to work?
Please try simplifying the testcase, I've tried long loadValue; const long *engLoad; float engLoadDelta1; void foo() { long i1, numXEntries = 50; for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ ) { if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1] ) ) { break ; } } if( i1 == ( numXEntries - 1 ) ) { loadValue = engLoad[i1] ; } engLoadDelta1 = (float)( loadValue - engLoad[i1] ) / (float)( engLoad[i1 + 1] - engLoad[i1] ) ; } which on x86 doesn't exhibit the issue (same code with GCC 8 and GCC 10).
Sorry for late :) Please test with following c case: long YTableLookup (long xValue, long xEntries, const long *xAxis, const long *yTable ) { int i ; long xDelta ; long outValue ; for (i=0; i<(xEntries - 1); i++) { if ((xValue < xAxis[i + 1]) && (xValue >= xAxis[i])) break ; } if (i == (xEntries - 1)) xValue = xAxis[i] ; xDelta = (long) ((xValue - xAxis[i]) * 1000) / (xAxis[i + 1] - xAxis[i]); outValue = (long) ((((1000 - xDelta) * (long) yTable[i]) / 1000) + ((xDelta * (long) yTable[i+1]) / 1000)) ; return outValue ; } risc-v cc1 option: -O2 -march=rv32gc -mabi=ilp32d ================================================= YTableLookup: addi a1,a1,-1 ble a1,zero,.L2 li a7,4 li a4,0 sub a7,a7,a2 j .L5 .L6: mv a4,a5 .L5: lw a6,4(a2) addi a5,a4,1 add t1,a7,a2 ble a6,a0,.L3 lw t3,0(a2) slli t4,a4,2 ble t3,a0,.L9 .L3: addi a2,a2,4 bne a1,a5,.L6 addi a4,a4,2 slli t1,a4,2 addi t4,t1,-4 add t4,a3,t4 li a1,0 li a5,1000 .L4: or x86 cc1 option: -O2 -march=i386 ================================== YTableLookup: .LFB0: pushl %ebp .LCFI0: pushl %edi .LCFI1: pushl %esi .LCFI2: pushl %ebx .LCFI3: pushl %ecx .LCFI4: movl 24(%esp), %esi movl 32(%esp), %edi movl 28(%esp), %eax decl %eax testl %eax, %eax jle .L2 movl $4, %ecx xorl %edx, %edx jmp .L5 .align 4 .L6: movl %ebx, %edx .L5: movl (%edi,%ecx), %ebx cmpl %esi, %ebx jle .L3 leal -4(%ecx), %ebp movl %ebp, (%esp) movl (%edi,%edx,4), %ebp cmpl %esi, %ebp jle .L10 .L3: leal 1(%edx), %ebx addl $4, %ecx cmpl %ebx, %eax jne .L6 leal 8(,%edx,4), %ecx movl 36(%esp), %eax leal -4(%eax,%ecx), %esi xorl %eax, %eax movl $1000, %ebx .L4: Please check the redundancy instruction 'mov' at .L6:
So with your testcase on trunk I see for RISCV ble a1,zero,.L2 li a6,4 li a5,0 sub a6,a6,a2 .L5: lw a4,4(a2) slli a7,a5,2 add t1,a6,a2 addi a5,a5,1 ble a4,a0,.L3 lw t3,0(a2) ble t3,a0,.L9 .L3: addi a2,a2,4 bne a1,a5,.L5 which is fine, same for x86. This is usually a SSA coalescing issue where a failed coalesce ends up splitting the backedge and emitting a move there. I can see the issue on the branch where the problematic one is ;; basic block 4, loop depth 1 ;; pred: 3 ;; 7 # i_57 = PHI <0(3), i_41(7)> ... ;; basic block 7, loop depth 1 ;; pred: 4 ;; 5 i_41 = i_57 + 1; ivtmp.14_90 = ivtmp.14_91 + 4; if (_6 != i_41) goto <bb 4>; [94.50%] else goto <bb 8>; [5.50%] ;; succ: 4 ;; 8 ;; basic block 8, loop depth 0 ;; pred: 7 _87 = (sizetype) i_57; _146 = _87 + 2; which is a use of the pre-increment i_57 on the loop exit edge. This inhibits coalescing of i_57 and i_41 causing the copy. That's exactly the issue noted in the cited PRs. There have been patches floating around re-materializing i_41 + 1 at the point of i_57 to make the coalescing possible but I think nobody developed them in full. See the thread starting at https://gcc.gnu.org/pipermail/gcc-patches/2018-March/495843.html