Bug 63537 - [6 Regression] Missed optimization: Loop unrolling adds extra copy when returning aggregate
Summary: [6 Regression] Missed optimization: Loop unrolling adds extra copy when retur...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.9.1
: P2 normal
Target Milestone: 7.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2014-10-14 19:02 UTC by Tavian Barnes
Modified: 2018-10-26 10:43 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work: 4.7.3, 7.1.0
Known to fail: 4.8.3, 4.9.1, 5.0
Last reconfirmed: 2014-10-15 00:00:00


Attachments
Reproducer (157 bytes, text/plain)
2014-10-14 19:02 UTC, Tavian Barnes
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Tavian Barnes 2014-10-14 19:02:36 UTC
Created attachment 33715 [details]
Reproducer

At -O2 and above on x86_64, this manually unrolled loop generates much better code than the automatically unrolled one:

	struct vec {
		double n[3];
	};

	struct vec mul_unrolled(struct vec lhs, double rhs) {
		struct vec ret;
		ret.n[0] = lhs.n[0]*rhs;
		ret.n[1] = lhs.n[1]*rhs;
		ret.n[2] = lhs.n[2]*rhs;
		return ret;
	}

This generates the beautiful:

	movsd	16(%rsp), %xmm2
	movq	%rdi, %rax
	movsd	24(%rsp), %xmm1
	mulsd	%xmm0, %xmm2
	mulsd	%xmm0, %xmm1
	mulsd	8(%rsp), %xmm0
	movsd	%xmm2, 8(%rdi)
	movsd	%xmm1, 16(%rdi)
	movsd	%xmm0, (%rdi)
	ret

In contrast, at -O2 this:

	struct vec mul_loop(struct vec lhs, double rhs) {
		struct vec ret;
		for (int i = 0; i < 3; ++i) {
			ret.n[i] = lhs.n[i]*rhs;
		}
		return ret;
	}

generates this:

	movsd	8(%rsp), %xmm1
	movq	%rdi, %rax
	mulsd	%xmm0, %xmm1
	movsd	%xmm1, -40(%rsp)
	movq	-40(%rsp), %rdx
	movsd	16(%rsp), %xmm1
	mulsd	%xmm0, %xmm1
	movq	%rdx, (%rdi)
	mulsd	24(%rsp), %xmm0
	movsd	%xmm1, -32(%rsp)
	movq	-32(%rsp), %rdx
	movsd	%xmm0, -24(%rsp)
	movq	%rdx, 8(%rdi)
	movq	-24(%rsp), %rdx
	movq	%rdx, 16(%rdi)
	ret

which puts the result in -40(%rsp) and then copies it to (%rdi).  At -O3 it gets vectorized but the extra copy is still there:

	movapd	%xmm0, %xmm1
	mulsd	24(%rsp), %xmm0
	movupd	8(%rsp), %xmm2
	movq	%rdi, %rax
	unpcklpd	%xmm1, %xmm1
	mulpd	%xmm1, %xmm2
	movsd	%xmm0, -24(%rsp)
	movaps	%xmm2, -40(%rsp)
	movq	-40(%rsp), %rdx
	movq	%rdx, (%rdi)
	movq	-32(%rsp), %rdx
	movq	%rdx, 8(%rdi)
	movq	-24(%rsp), %rdx
	movq	%rdx, 16(%rdi)
Comment 1 Richard Biener 2014-10-15 08:38:45 UTC
This is because the outer loop is unrolled only after SRA gets a chance to
scalarize away the local aggregate.  With GCC 4.7 we unroll the loop during
early unrolling even at -O2.

With 4.9 we conclude:

Estimating sizes for loop 1
 BB: 4, after_exit: 0
  size:   2 if (i_1 <= 2)
   Exit condition will be eliminated in peeled copies.
 BB: 3, after_exit: 1
  size:   1 _4 = lhs.n[i_1];
  size:   1 _6 = _4 * rhs_5(D);
  size:   1 ret.n[i_1] = _6;
  size:   1 i_8 = i_1 + 1;
   Induction variable computation will be folded away.

size: 6-3, last_iteration: 2-0
  Loop size: 6
  Estimated size after unrolling: 7
Not unrolling loop 1: size would grow.

while 4.7 had:

Estimating sizes for loop 1
 BB: 4, after_exit: 0
  size:   2 if (i_1 <= 2)
   Exit condition will be eliminated.
 BB: 3, after_exit: 1
  size:   1 D.1593_3 = lhs.n[i_1];
  size:   1 D.1594_5 = D.1593_3 * rhs_4(D);
  size:   1 ret.n[i_1] = D.1594_5;
  size:   1 i_6 = i_1 + 1;
   Induction variable computation will be folded away.
size: 6-3, last_iteration: 2-2
  Loop size: 6
  Estimated size after unrolling: 6

so the difference is in last_iteration handling.

Honza?

Otherwise this is a optimization pass ordering issue.

Eventually a simple pass could handle

  <retval> = ret;
  ret ={v} {CLOBBER};
  return <retval>;

and back-propagate <retval> into all stores/loads of ret.
Comment 2 Tavian Barnes 2014-10-15 15:20:46 UTC
Is it possible to make SRA work even if the loop isn't unrolled?  If the array size is increased to 4 then -O2 doesn't unroll the loop at all, resulting in:

	movq	%rdi, %rax
	xorl	%edx, %edx
.L3:
	movsd	8(%rsp,%rdx), %xmm1
	mulsd	%xmm0, %xmm1
	movsd	%xmm1, -40(%rsp,%rdx)
	addq	$8, %rdx
	cmpq	$32, %rdx
	jne	.L3
	movq	-40(%rsp), %rdx
	movq	%rdx, (%rax)
	movq	-32(%rsp), %rdx
	movq	%rdx, 8(%rax)
	movq	-24(%rsp), %rdx
	movq	%rdx, 16(%rax)
	movq	-16(%rsp), %rdx
	movq	%rdx, 24(%rax)
	ret

which would be a lot prettier as something like:

	movq	%rdi, %rax
	xorl	%edx, %edx
.L3:
	movsd	8(%rsp,%rdx), %xmm1
	mulsd	%xmm0, %xmm1
	movsd	%xmm1, (%rax,%rdx)
	addl	$8, %edx
	cmpl	$32, %edx
	jne	.L3
	ret
Comment 3 Marc Glisse 2014-10-15 20:37:35 UTC
(In reply to Richard Biener from comment #1)
> Eventually a simple pass could handle
> 
>   <retval> = ret;
>   ret ={v} {CLOBBER};
>   return <retval>;
> 
> and back-propagate <retval> into all stores/loads of ret.

Shouldn't tree-nrv.c already handle this, except that it currently bails out if TREE_ADDRESSABLE (found)? (strangely enough it has (dead) code to handle the addressable case further in the same function)
Comment 4 Jakub Jelinek 2014-12-19 13:29:38 UTC
GCC 4.8.4 has been released.
Comment 5 Richard Biener 2015-06-23 08:18:35 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 6 Jakub Jelinek 2015-06-26 19:54:46 UTC
GCC 4.9.3 has been released.
Comment 7 Richard Biener 2016-08-03 11:43:25 UTC
GCC 4.9 branch is being closed
Comment 8 Jeffrey A. Law 2017-01-13 19:27:04 UTC
This was fixed by Jan's change from June 16:

commit 8c1879bcebc5decbb6c0e7081f7a13c6a740b590
Author: hubicka <hubicka@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Jun 16 12:54:31 2016 +0000

        * g++.dg/vect/pr36648.cc: Disable cunrolli
        * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Fix estimation
        of comparsions in the last iteration.
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@237527 138bc75d-0d04-0410-961f-82ee72b054a4
Comment 9 Jakub Jelinek 2017-10-10 13:28:02 UTC
GCC 5 branch is being closed
Comment 10 Eric Gallager 2018-10-13 18:00:06 UTC
(In reply to Jeffrey A. Law from comment #8)
> This was fixed by Jan's change from June 16:
> 
> commit 8c1879bcebc5decbb6c0e7081f7a13c6a740b590
> Author: hubicka <hubicka@138bc75d-0d04-0410-961f-82ee72b054a4>
> Date:   Thu Jun 16 12:54:31 2016 +0000
> 
>         * g++.dg/vect/pr36648.cc: Disable cunrolli
>         * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Fix estimation
>         of comparsions in the last iteration.
>     
>     
>     git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@237527
> 138bc75d-0d04-0410-961f-82ee72b054a4

...on trunk at least.
Comment 11 Jakub Jelinek 2018-10-26 10:43:49 UTC
GCC 6 branch is being closed, fixed in 7.x.