After the fix for http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49749 (http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=176984) I observed a ~9% degradation on Core for the attached test. Before the fix the RA managed to use registers for the code related to line 30. After the fix the order of operations changed, which apparently altered the live ranges and hence increased register pressure.

Asm snippet for the fast case:

# 4long.c:30
        .loc 1 30 0
        movl    8(%ecx), %esi
        xorl    %edi, %edi
        addl    %eax, %esi
        movl    52(%esp), %eax
        adcl    %edx, %edi
        mull    8(%ebp)
        addl    %eax, %esi
        adcl    %edx, %edi

Asm snippet for the slow case:

# 4long.c:30
        .loc 1 30 0
        movl    52(%esp), %eax
        mull    8(%ebp)
        movl    %eax, (%esp)
        movl    8(%ecx), %eax
        movl    %edx, 4(%esp)
        xorl    %edx, %edx
        addl    %eax, (%esp)
        adcl    %edx, 4(%esp)
        addl    %esi, (%esp)
        adcl    %edi, 4(%esp)

gcc is:
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --disable-bootstrap --enable-languages=c,c++ --prefix=/export/users/izamyati/build/
Thread model: posix
gcc version 4.7.0 20110731 (experimental) (GCC)

Compilation flags: -O2 -mssse3 -mfpmath=sse -ffast-math -m32
Created attachment 25373 [details] testcase
The fix for 49749 is intended to remove dependencies between loop iterations. One possibility would be to condition the changes on the presence of -funroll-loops. Another would be to limit the changes to loops containing fewer blocks, or otherwise exhibiting simpler control flow. To help make a good decision here, can you please try your test case with -funroll-loops both before and after the fix for 49749?
William, thanks for the quick response! With -funroll-loops the regression is still present. Do you want me to attach some dumps?
No, that's OK. I should be able to reproduce this on a pool machine. It may be difficult to come up with a good heuristic here given that reassociation doesn't have a good estimate of register pressure available. The fix solved a couple of other problem reports in addition to 49749, so we need to be careful about constraining it too much. All this is just to say I may not have something for you right away. :)
Reassociation isn't doing anything untoward here that raises register pressure. The problem must be occurring downstream. Likely the scheduler is making a different decision that leads to more pressure.

Block 9 contains the following prior to reassociation:

  D.3497_48 = D.3496_47 + D.3475_117;
  t_50 = D.3497_48 + D.3493_44;

Reassociation changes this to:

  D.3497_48 = D.3493_44 + D.3496_47;
  t_50 = D.3497_48 + D.3475_117;

This extends the lifetime of D.3475_117 but shortens the lifetime of D.3493_44, both of which go dead here. Register pressure is not raised at any point. There are no further changes to this code in the rest of the tree-SSA phases.

Based on this, I don't see any reason to adjust the reassociation algorithm. Someone with some expertise in the RTL phases could look into what happens later on to cause the additional pressure, but I don't see this as a tree-optimization issue.
Indeed, overall register pressure is not increased; even before IRA, the dumps show that register pressure is kept at the same level. It looks like we have hit a tricky case.

First, the loop consists of four identical groups of instructions; the only difference is the index value used by the arrays in each group. Before the reassociation improvement, the group on lines 30-33 of the attached test for some reason (I haven't checked why yet) got a different instruction sequence than the others. After William's reassociation changes, all groups got the same sequence. (Maybe there was some good reason for that group to be different? :) )

Now the tricky part. In the "fast" version (i.e. before William's commit), IRA managed to hold "c" in the eax register for the group on lines 30-33. Moreover, because of the shorter live range of "c", IRA managed to reuse eax inside the operations of line 30. For the other groups, all the work went through memory. Since the reassociation improvement made all groups look the same, we unsurprisingly got memory instead of registers, which led to the performance drop.

That is roughly my view of the whole picture. Any comments or ideas?
I don't have anything too helpful to add. This code as it stands is balanced on a knife's edge for register usage for the particular target, so it's always going to be sensitive to compiler changes (not just this one). One thing I notice is that the loop is hand-unrolled four times. Why not let the compiler intelligently choose the unroll factor? I don't know what the result would be, but presumably the unroller has some heuristics to take target characteristics into account. Seems to me the factor of 4 is a bit aggressive for this target.
I'm unsure what to do about this; it seems to be confirmed at least. People tend to blame the RA, so I will do that for now, too. Considering a WONTFIX eventually ...
There is still the old loop re-rolling pass from the rtlopt-branch. I am not sure if there were any good reasons for not including it in GCC. http://gcc.gnu.org/viewcvs/branches/rtlopt-branch/gcc/loop-reroll.c?view=log
(In reply to comment #9)
> There is still the old loop re-rolling pass from the rtlopt-branch. I am not
> sure if there were any good reasons for not including it in GCC.

Maybe re-rolling could be pushed up to the tree level where it might catch a few more things.
Vlad, I suppose you didn't have a chance to have a look here? Igor, after the "recent" RA changes, is this still an issue? This is most certainly not a P1, leaving at P3 until we get a more detailed analysis.
P2.
GCC 4.7.0 is being released, adjusting target milestone.
GCC 4.7.1 is being released, adjusting target milestone.
GCC 4.7.2 has been released.
Seems that with LRA the code is fast again.
(In reply to comment #16)
> Seems that with LRA the code is fast again.

So marking it only as a 4.7 regression.
GCC 4.7.3 is being released, adjusting target milestone.
Fixed with 4.8.0.