After the fix for http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49749 (http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=176984) I observed a ~9% degradation on Core for the attached test. Before the fix the RA managed to use registers for the code related to line 30. After the fix the order of operations changed, which apparently altered the live ranges and hence increased register pressure.

Asm snippet for the fast case:

# 4long.c:30
        .loc 1 30 0
        movl    8(%ecx), %esi
        xorl    %edi, %edi
        addl    %eax, %esi
        movl    52(%esp), %eax
        adcl    %edx, %edi
        mull    8(%ebp)
        addl    %eax, %esi
        adcl    %edx, %edi

Asm snippet for the slow case:

# 4long.c:30
        .loc 1 30 0
        movl    52(%esp), %eax
        mull    8(%ebp)
        movl    %eax, (%esp)
        movl    8(%ecx), %eax
        movl    %edx, 4(%esp)
        xorl    %edx, %edx
        addl    %eax, (%esp)
        adcl    %edx, 4(%esp)
        addl    %esi, (%esp)
        adcl    %edi, 4(%esp)

gcc is:
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --disable-bootstrap --enable-languages=c,c++ --prefix=/export/users/izamyati/build/
Thread model: posix
gcc version 4.7.0 20110731 (experimental) (GCC)

Compilation flags: -O2 -mssse3 -mfpmath=sse -ffast-math -m32
Created attachment 25373 [details] testcase
The fix for 49749 is intended to remove dependencies between loop iterations. One possibility would be to condition the changes on the presence of -funroll-loops. Another would be to limit the changes to loops containing fewer blocks, or otherwise exhibiting simpler control flow. To help make a good decision here, can you please try your test case with -funroll-loops both before and after the fix for 49749?
William, thanks for the quick response! With -funroll-loops the regression is still present. Do you want me to attach some dumps?
No, that's OK. I should be able to reproduce this on a pool machine. It may be difficult to come up with a good heuristic here given that reassociation doesn't have a good estimate of register pressure available. The fix solved a couple of other problem reports in addition to 49749, so we need to be careful about constraining it too much. All this is just to say I may not have something for you right away. :)
Reassociation isn't doing anything untoward here that raises register pressure. The problem must be occurring downstream. Likely the scheduler is making a different decision that leads to more pressure.

Block 9 contains the following prior to reassociation:

  D.3497_48 = D.3496_47 + D.3475_117;
  t_50 = D.3497_48 + D.3493_44;

Reassociation changes this to:

  D.3497_48 = D.3493_44 + D.3496_47;
  t_50 = D.3497_48 + D.3475_117;

This extends the lifetime of D.3475_117 but shortens the lifetime of D.3493_44, both of which go dead here. Register pressure is not raised at any point. There are no further changes to this code in the rest of the tree-SSA phases.

Based on this, I don't see any reason to adjust the reassociation algorithm. Someone with some expertise in the RTL phases could look into what happens later on to cause the additional pressure, but I don't see this as a tree-optimization issue.
Indeed, overall register pressure is not increased; even before IRA, the dumps show that register pressure is kept at the same level. It looks like we have hit a tricky case.

First, the loop consists of four identical groups of instructions; the only difference is the index value used by the arrays in each group. Before the reassociation improvement, the group on lines 30-33 of the attached test for some reason (I haven't checked why yet) got a different instruction sequence than the others. After William's reassociation changes, all groups got the same sequence. (Maybe there was some good reason for that group to be different? :) )

Now the tricky part. In the "fast" version (i.e. before William's commit), IRA managed to hold "c" in the eax register for the group on lines 30-33. Moreover, because of the shorter live range of "c", IRA managed to reuse eax inside the operations of line 30. For the other groups, all the work went through memory. Since the reassociation improvement made all groups look the same, we unsurprisingly got memory instead of registers, which led to the performance drop.

That is roughly my view of the whole picture. Any comments or ideas?
I don't have anything too helpful to add. This code as it stands is balanced on a knife's edge for register usage for the particular target, so it's always going to be sensitive to compiler changes (not just this one). One thing I notice is that the loop is hand-unrolled four times. Why not let the compiler intelligently choose the unroll factor? I don't know what the result would be, but presumably the unroller has some heuristics to take target characteristics into account. Seems to me the factor of 4 is a bit aggressive for this target.
I'm unsure what to do about this; it seems to be confirmed at least. People tend to blame the RA, so I will do that for now, too. Considering a WONTFIX eventually ...
There is still the old loop re-rolling pass from the rtlopt-branch. I am not sure if there were any good reasons for not including it in GCC. http://gcc.gnu.org/viewcvs/branches/rtlopt-branch/gcc/loop-reroll.c?view=log
(In reply to comment #9)
> There is still the old loop re-rolling pass from the rtlopt-branch. I am not
> sure if there were any good reasons for not including it in GCC.

Maybe re-rolling could be pushed up to the tree level where it might catch a few more things.
Vlad, I suppose you didn't have a chance to have a look here? Igor, after the "recent" RA changes, is this still an issue? This is most certainly not a P1, leaving at P3 until we get a more detailed analysis.
P2.
GCC 4.7.0 is being released, adjusting target milestone.
GCC 4.7.1 is being released, adjusting target milestone.
GCC 4.7.2 has been released.
Seems that with LRA the code is fast again.
(In reply to comment #16)
> Seems that with LRA the code is fast again.

So marking it only as a 4.7 regression.
GCC 4.7.3 is being released, adjusting target milestone.
Fixed with 4.8.0.