This is based on performance analysis of eembc2.0 suite on Atom processor in comparison with clang compiler. I prepared a simple reproducer that exhibits the issue. Hottest loop (from function "remove") consists of 7 instructions if inlining was applied: .L48: movl %edx, %ebp <<-- it is redundant!! movl %ecx, %edx .L39: cmpl %edx, %eax je .L24 movl 16(%edx), %ecx testl %ecx, %ecx jne .L48 but without inlining it consists of 6 instructions: .L5: cmpl %eax, %ecx je .L4 movl %eax, %edx .L7: movl 16(%edx), %eax testl %eax, %eax jne .L5 In result we can see performance drop on 4% (Atom 32-bit mode). The reproducer will be attached.
Created attachment 30203 [details] testcase Need to compile with "-O2 -m32" options on x86.
I don't believe that this is related to rtl optimizations, but rather to inlining phase. To prove it I did small changes in t.c for remove.c (it now has type void): void remove (node ** head, node* elt) { node* curr; node* prev; if (*head == 0) { return; } prev = 0; curr = *head; while (curr) { if (curr != elt) { prev = curr; curr = curr->next; } else { if (prev == 0) { *head = (*head)->next; break; } else { prev->next = curr->next; break; } } } } For modified test we still have the same issue with redundant move instruction. Now we can do another test modification that performs manual partial inlining for 'remove' (full source will be attached): Move test on zero head out of function to its call, i.e. call of 'remove' is guarded by test on non-null head in 'find' and remove this test out of function, i.e. we have if (head) remove (&head, first) For this modofied test we have optimal 6 instructions for innermost loop. I assume that a problem related to phi-function construction after inlining, namely for original test we have recurrent chain of phi's: <bb 14>: # prev_47 = PHI <prev_37(16), prev_51(13)> # prev_54 = PHI <prev_47(16), prev_25(13)> (before expand phase) but for modified test we have an ordinary phi: <bb 16>: # prev_52 = PHI <prev_38(15), prev_26(13)>
Created attachment 30213 [details] modified testcase This is modified testcase which does not have a problem with redundant move instruction in innermost loop.
I have not seen any comments about my latest note - have you any ideas about this issue? Thanks ahead.
So on the trunk, it looks like the inlined version has only one mov while the out of line one has two. inlined: .L19: cmpq %rax, %rsi je .L18 .L17: movq %rax, %rdx movq 24(%rax), %rax testq %rax, %rax jne .L19 out of line: .L7: movq %rdx, %rax .L4: cmpq %rax, %rsi je .L3 movq 24(%rax), %rdx movq %rax, %rcx testq %rdx, %rdx jne .L7 Note this is 64bit x86. inlined: <bb 10> [local count: 376074471]: if (first_24 != curr_34) goto <bb 11>; [94.50%] else goto <bb 12>; [5.50%] <bb 11> [local count: 397962402]: # curr_55 = PHI <curr_34(10), head.0_9(9)> curr_34 = curr_55->next; if (curr_34 != 0B) goto <bb 10>; [94.50%] else goto <bb 13>; [5.50%] out of line: <bb 3> [local count: 1014686024]: # curr_24 = PHI <curr_15(4), _1(2)> # prev_25 = PHI <curr_24(4), 0B(2)> if (elt_12(D) != curr_24) goto <bb 4>; [94.50%] else goto <bb 5>; [5.50%] <bb 4> [local count: 958878294]: curr_15 = curr_24->next; if (curr_15 != 0B) goto <bb 3>; [93.84%] else goto <bb 8>; [6.16%]