57430 – Redundant move instruction is produced after function inlining

Bug 57430 - Redundant move instruction is produced after function inlining

Summary: Redundant move instruction is produced after function inlining

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.9.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2013-05-27 11:45 UTC by Yuri Rumyantsev
Modified:	2021-07-25 00:33 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
testcase (517 bytes, text/plain) 2013-05-27 11:48 UTC, Yuri Rumyantsev	Details
modified testcase (504 bytes, text/plain) 2013-05-29 12:39 UTC, Yuri Rumyantsev	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Yuri Rumyantsev 2013-05-27 11:45:24 UTC

This is based on performance analysis of eembc2.0 suite on Atom processor in comparison with clang compiler.

I prepared a simple reproducer that exhibits the issue.

Hottest loop (from function "remove") consists of 7 instructions if inlining was applied:
.L48:
	movl	%edx, %ebp   <<--  it is redundant!!
	movl	%ecx, %edx
.L39:
	cmpl	%edx, %eax
	je	.L24
	movl	16(%edx), %ecx
	testl	%ecx, %ecx
	jne	.L48

but without inlining it consists of 6 instructions:

.L5:
	cmpl	%eax, %ecx
	je	.L4
	movl	%eax, %edx
.L7:
	movl	16(%edx), %eax
	testl	%eax, %eax
	jne	.L5

In result we can see performance drop on 4% (Atom 32-bit mode).

The reproducer will be attached.

Comment 1 Yuri Rumyantsev 2013-05-27 11:48:27 UTC

Created attachment 30203 [details]
testcase

Need to compile with "-O2 -m32" options on x86.

Comment 2 Yuri Rumyantsev 2013-05-29 12:37:45 UTC

I don't believe that this is related to rtl optimizations, but rather to inlining phase. To prove it I did small changes in t.c for remove.c (it now has type void):

void remove (node ** head, node* elt)
{
  node* curr;
  node* prev;

  if (*head == 0)
    {
      return;
    }
  prev = 0;
  curr = *head;
  while (curr)
    {
      if (curr != elt)
	{
	  prev = curr;
	  curr = curr->next;
	}
      else
	{
	  if (prev == 0)
	    {
	      *head = (*head)->next;
	      break;
	    }
	  else
	    {
	      prev->next = curr->next;
	      break;
	    }
        }
    }

}

For modified test we still have the same issue with redundant move instruction.

Now we can do another test modification that performs manual partial inlining for 'remove' (full source will be attached):
 Move test on zero head out of function to its call, i.e.  call of 'remove' is guarded by test on non-null head in 'find' and remove this test out of function, i.e. we have

       if (head)
         remove (&head, first)

For this modofied test we have optimal 6 instructions for innermost loop.

I assume that a problem related to phi-function construction after inlining, namely for original test we have recurrent chain of phi's:

  <bb 14>:
  # prev_47 = PHI <prev_37(16), prev_51(13)>
  # prev_54 = PHI <prev_47(16), prev_25(13)>
(before expand phase)

but for modified test we have an ordinary phi:

  <bb 16>:
  # prev_52 = PHI <prev_38(15), prev_26(13)>

Comment 3 Yuri Rumyantsev 2013-05-29 12:39:46 UTC

Created attachment 30213 [details]
modified testcase

This is modified testcase which does not have a problem with redundant move instruction in innermost loop.

Comment 4 Yuri Rumyantsev 2013-06-11 11:02:41 UTC

I have not seen any comments about my latest note - have you any ideas about this issue?

Thanks ahead.

Comment 5 Andrew Pinski 2021-07-25 00:33:22 UTC

So on the trunk, it looks like the inlined version has only one mov while the out of line one has two.

inlined:
.L19:
        cmpq    %rax, %rsi
        je      .L18
.L17:
        movq    %rax, %rdx
        movq    24(%rax), %rax
        testq   %rax, %rax
        jne     .L19

out of line:
.L7:
        movq    %rdx, %rax
.L4:
        cmpq    %rax, %rsi
        je      .L3
        movq    24(%rax), %rdx
        movq    %rax, %rcx
        testq   %rdx, %rdx
        jne     .L7

Note this is 64bit x86.

inlined:
  <bb 10> [local count: 376074471]:
  if (first_24 != curr_34)
    goto <bb 11>; [94.50%]
  else
    goto <bb 12>; [5.50%]

  <bb 11> [local count: 397962402]:
  # curr_55 = PHI <curr_34(10), head.0_9(9)>
  curr_34 = curr_55->next;
  if (curr_34 != 0B)
    goto <bb 10>; [94.50%]
  else
    goto <bb 13>; [5.50%]

out of line:
  <bb 3> [local count: 1014686024]:
  # curr_24 = PHI <curr_15(4), _1(2)>
  # prev_25 = PHI <curr_24(4), 0B(2)>
  if (elt_12(D) != curr_24)
    goto <bb 4>; [94.50%]
  else
    goto <bb 5>; [5.50%]

  <bb 4> [local count: 958878294]:
  curr_15 = curr_24->next;
  if (curr_15 != 0B)
    goto <bb 3>; [93.84%]
  else
    goto <bb 8>; [6.16%]