53176 – [4.8 Regression] gcc.dg/lower-subreg-1.c FAILs

Bug 53176 - [4.8 Regression] gcc.dg/lower-subreg-1.c FAILs

Summary: [4.8 Regression] gcc.dg/lower-subreg-1.c FAILs

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization (show other bugs)
Version:	4.8.0

Importance:	P3 normal
Target Milestone:	4.8.0
Assignee:	Hans-Peter Nilsson

URL:	http://gcc.gnu.org/ml/gcc-patches/201...
Keywords:

Depends on:
Blocks:	53203
	Show dependency tree / graph

Reported:	2012-05-01 17:49 UTC by H.J. Lu
Modified:	2012-07-13 08:13 UTC (History)
CC List:	10 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2012-05-02 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description H.J. Lu 2012-05-01 17:49:25 UTC

On Linux/ia32, revision 187015:

http://gcc.gnu.org/ml/gcc-cvs/2012-05/msg00011.html

caused:

FAIL: gcc.dg/lower-subreg-1.c scan-rtl-dump subreg1 "Splitting reg"
FAIL: gcc.target/i386/movbe-2.c scan-assembler-times movbe[ \t] 4

Comment 1 Kenneth Zadeck 2012-05-01 18:32:32 UTC

hj

this is exactly the kind of regression that we expect!!!   The difference between this version and the previous versions of lower-subreg, is that this version takes into account the rtl costs of the operations transformed.    If the rtl costs do not indicate that this would be profitable, then the transformations are not preformed.

it is up to the port maintainer to set the rtl costs correctly for their port.    If the port maintainer believes that his/her costs are correct and we are still missing optimizations, then we will look into it.

Lower-subreg is most likely a terrible transformation to do on later x86s but is likely to have been useful on the earlier ones.  So this regression is likely a good thing.

kenny

Comment 2 Richard Biener 2012-05-02 11:56:32 UTC

Confirmed.

Comment 3 Uroš Bizjak 2012-05-02 17:58:17 UTC

(In reply to comment #1)

> Lower-subreg is most likely a terrible transformation to do on later x86s but
> is likely to have been useful on the earlier ones.  So this regression is
> likely a good thing.

Can you please suggest, what should be added to leave decisions as they were? We would like to fine tune the cost model in a controlled way, supported by the data from benchmarks and applications. Forced under-the-hood changes to cost model just don't fit this approach.

Comment 4 Jakub Jelinek 2012-05-02 18:11:57 UTC

Not to mention that it would be nice to avoid the possibly expensive initialization, which increases empty source file compilation time, when it isn't actually desirable to use it.

Comment 5 Kenneth Zadeck 2012-05-02 20:35:47 UTC

For each mode larger than the word size of the machine, a factor is 
computed.   That factor is the number of times that mode is larger than 
a word mode.  A move is split if the cost of moving factor words in 
separate instructions is less than moving it as an aggregate.    I.e. it 
must be profitable to do the splitting into separate instructions.     
Before our patch, wide move were always split.

A similar calculation is made for zero extensions and fixed shifts, 
except that this is only done for modes that are exactly twice the size 
of a word mode.  The decision to only consider modes that are twice the 
size of word mode for the shifting is historical, we did not change that.

I disagree with the comment about "forced under the hood changes to the 
cost model...".  The cost models are there for optimizations to use to 
decide when it is desirable to perform transformations.  Our use of 
those models is consistent with the way that the models are defined.    
Individual platforms may not have properly defined all of the cases, but 
the models clearly allow the platform to define the cost of a move of 
any mode.

The original writers of this pass did not consider machines with vector 
instructions.   A vector of N word sized elements can typically be moved 
more cheaply than N scalar moves.  In the old pass, there was no way to 
suppress this splitting.    Now many platforms can move N words as fast 
as they can move 1 word.  This is the way that gcc evolves so that it is 
relevant.

Kenny

On 05/02/2012 01:58 PM, ubizjak at gmail dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53176
>
> --- Comment #3 from Uros Bizjak<ubizjak at gmail dot com>  2012-05-02 17:58:17 UTC ---
> (In reply to comment #1)
>
>> Lower-subreg is most likely a terrible transformation to do on later x86s but
>> is likely to have been useful on the earlier ones.  So this regression is
>> likely a good thing.
> Can you please suggest, what should be added to leave decisions as they were?
> We would like to fine tune the cost model in a controlled way, supported by the
> data from benchmarks and applications. Forced under-the-hood changes to cost
> model just don't fit this approach.
>

Comment 6 Hans-Peter Nilsson 2012-05-02 21:05:08 UTC

cris-elf too.  The "set the rtl costs correctly" comment assumes there's a single linear cost metric shared by all gcc, not leading to pessimization somewhere else.
We'll see about that.  IMHO, since you expected this to happen, a message with a heads-up to target maintainers would have been nicer than just trapping a test-case to silently fail; I had to search the (as always backlogged) mailing lists to find the discussion (no URL, local mailbox).

Comment 7 Kenneth Zadeck 2012-05-02 21:19:18 UTC

I do apologize for the lack of heads up.    that was a mistake on our part.

I am also a little skeptical about the simple rtl cost model being good 
enough to encompass every machine in every case.    But it is better to 
tie the optimization to a cost model than have it just assume that every 
machine does or does not do something.   There are several machines for 
which this pass only does harm and gcc ought to work well for us also.

But there is a legitimate question as to how you want to control what a 
pass does.   I have a multiple issue machine with asymmetric execution 
units and the rtl cost model is not really good enough to model that.   
However the rtl cost model does appear to be good enough for this pass.

I contacted iant before I started this, and he said that the proper plan 
is to use the rtl cost model. So that is what we did.  The alternative 
is to define a bunch of special target hooks and no one seemed to want 
to go there.

Kenny

On 05/02/2012 05:05 PM, hp at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53176
>
> Hans-Peter Nilsson<hp at gcc dot gnu.org>  changed:
>
>             What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                   CC|                            |hp at gcc dot gnu.org
>
> --- Comment #6 from Hans-Peter Nilsson<hp at gcc dot gnu.org>  2012-05-02 21:05:08 UTC ---
> cris-elf too.  The "set the rtl costs correctly" comment assumes there's a
> single linear cost metric shared by all gcc, not leading to pessimization
> somewhere else.
> We'll see about that.  IMHO, since you expected this to happen, a message with
> a heads-up to target maintainers would have been nicer than just trapping a
> test-case to silently fail; I had to search the (as always backlogged) mailing
> lists to find the discussion (no URL, local mailbox).
>

Comment 8 Hans-Peter Nilsson 2012-05-02 22:22:54 UTC

(In reply to comment #7)
> I contacted iant before I started this, and he said that the proper plan 
> is to use the rtl cost model. So that is what we did.  The alternative 
> is to define a bunch of special target hooks and no one seemed to want 
> to go there.

Agreed.  Hopefully most targets are fixed by multiplying the current target rtx cost implementation with the size of the mode; I'll try that.

Actually, I think that's what rtlanal.c:rtx_cost should be changed to do, for the cases where the targetm.rtx_costs function returns false.

Comment 9 Hans-Peter Nilsson 2012-05-02 22:27:32 UTC

(In reply to comment #8)
> Actually, I think that's what rtlanal.c:rtx_cost should be changed to do, for
> the cases where the targetm.rtx_costs function returns false.

...and not just for non-tieable SUBREGs.  (Why currently just non-tieable SUBREGs?  That doesn't seem generic.)

brgds, H-P

Comment 10 Richard Sandiford 2012-05-02 22:32:24 UTC

(In reply to comment #4)
> Not to mention that it would be nice to avoid the possibly expensive
> initialization, which increases empty source file compilation time, when it
> isn't actually desirable to use it.

If that's a concern though, we should do it in a general way.
The initialisation that this pass does is much cheaper than
other passes that are initialised in the same way.  All we're
doing is following the existing model.

For the record: even with perf -F 100000 on an "empty" (well, "int x;")
file, the new lower_subreg routines didn't show up at all.
The number of calls to rtx_cost went up from 9017 to 9097.
And the rtx_cost-related functions only took ~2% total.
So I think we're talking about much less than 0.1% of
execution time here.  Compiling an empty file is so quick
even on my rather old box that it's difficult to get a
precise figure.

(The top rtl routines in the profile were IRA and init_move_costs,
which I suppose is what one would expect.  So my point is that
if we want to do something about this, it's more important that
we do something generic that will work for those two than something
ad-hoc that works only for this already-quick routine.)

Comment 11 Richard Sandiford 2012-05-02 22:36:28 UTC

(In reply to comment #8)
> Actually, I think that's what rtlanal.c:rtx_cost should be changed to do, for
> the cases where the targetm.rtx_costs function returns false.

Yeah, I'd wondered about that too.  We've tended to make
targets fend for themselves when it comes to doubleword
ANDs, ADDs, etc., but I suppose that doesn't mean we should...

Comment 12 Greta Yorsh 2012-05-03 10:29:17 UTC

Fails on arm-none-eabi as well:
FAIL: gcc.dg/lower-subreg-1.c scan-rtl-dump subreg1 "Splitting reg"

Comment 13 Kenneth Zadeck 2012-05-03 13:14:31 UTC

The arm is one of the architectures for which lower-subreg is harmful 
for some of the implementations.

kenny

On 05/03/2012 06:29 AM, Greta.Yorsh at arm dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53176
>
> Greta Yorsh<Greta.Yorsh at arm dot com>  changed:
>
>             What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                   CC|                            |Greta.Yorsh at arm dot com
>
> --- Comment #12 from Greta Yorsh<Greta.Yorsh at arm dot com>  2012-05-03 10:29:17 UTC ---
> Fails on arm-none-eabi as well:
> FAIL: gcc.dg/lower-subreg-1.c scan-rtl-dump subreg1 "Splitting reg"
>

Comment 14 Greta Yorsh 2012-05-03 13:28:42 UTC

This test fails because splitting doesn't happen any more after r187015 patch. Without splitting, better code is generated in this case. 

The test
long long test (long long a, long long b) { return a | b; }
previously compiled into
orr r2, r2, r0
orr r3, r3, r1
mov r0, r2
mov r1, r3
now compiles into a better sequence, avoiding the moves:
orr r0, r0, r2
orr r1, r1, r3

I haven't yet come across an example where the patch makes things worse on arm. Do you have any such examples?

Thanks,
Greta

Comment 15 Uroš Bizjak 2012-05-03 16:19:04 UTC

32bit x86 regressed (-O2 -mmovbe) with following testcase:

void
foo (long long i)
{
  x = __builtin_bswap64 (i);
}

from:

foo:
        movbe   4(%esp), %eax
        movbe   8(%esp), %edx
        movl    %eax, x+4
        movl    %edx, x
        ret

to:

foo:
        pushl   %ebx
        movl    8(%esp), %eax
        movl    12(%esp), %edx
        movl    %eax, %ebx
        movl    %edx, %ecx
        bswap   %ebx
        bswap   %ecx
        movl    %ebx, x+4
        movl    %ecx, x
        popl    %ebx

Comment 16 Uroš Bizjak 2012-05-04 06:54:38 UTC

The x86 failure:

FAIL: gcc.target/i386/movbe-2.c scan-assembler-times movbe[ \t] 4

is a register allocator/reload problem. I will open new PR for this.

Comment 17 Uroš Bizjak 2012-05-04 08:20:27 UTC

(In reply to comment #16)
> The x86 failure:
> 
> FAIL: gcc.target/i386/movbe-2.c scan-assembler-times movbe[ \t] 4
> 
> is a register allocator/reload problem. I will open new PR for this.

-> PR 53227

Comment 18 Oleg Endo 2012-05-06 18:09:37 UTC

On SH an issue popped up because lower-subreg would not split multi-word regs anymore.  Could somebody please have a look at comment #2 and the proposed patch in PR 53250?

Comment 19 rdsandiford@googlemail.com 2012-05-06 19:17:03 UTC

"olegendo at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> writes:
> --- Comment #18 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-05-06 18:09:37 UTC ---
> On SH an issue popped up because lower-subreg would not split multi-word regs
> anymore.  Could somebody please have a look at comment #2 and the proposed
> patch in PR 53250?

Unfortunately I messed up the choice of cost routines in the original patch.
I just committed the fix for that.  The SH rtx_costs routine should now see
(set (reg) (reg)) and (set (reg) (const_int 0)) rtxes, so you should
be able to set the costs there.  See:

    http://gcc.gnu.org/ml/gcc-patches/2012-05/msg00425.html

for a similar MIPS patch.

Comment 20 Oleg Endo 2012-05-06 19:21:18 UTC

(In reply to comment #19)
> "olegendo at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> writes:
> > --- Comment #18 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-05-06 18:09:37 UTC ---
> > On SH an issue popped up because lower-subreg would not split multi-word regs
> > anymore.  Could somebody please have a look at comment #2 and the proposed
> > patch in PR 53250?
> 
> Unfortunately I messed up the choice of cost routines in the original patch.
> I just committed the fix for that.  The SH rtx_costs routine should now see
> (set (reg) (reg)) and (set (reg) (const_int 0)) rtxes, so you should
> be able to set the costs there.  See:
> 
>     http://gcc.gnu.org/ml/gcc-patches/2012-05/msg00425.html
> 
> for a similar MIPS patch.

Yep, I just saw it on the patches list.  I'm on it.  Thanks.

Comment 21 Hans-Peter Nilsson 2012-05-06 23:25:40 UTC

All this should be taken care of by the default rtx costs; I don't think any target patches should be needed for these targets at least not for this case.  The default should be N*natural_set_cost, where natural_set_cost is the cost of setting a register (possibly the default) and N is the number of UNITS_PER_WORD for the rtx.
And, the target rtx cost hook should only need to handle rtxes that are valid for the target; no phony insns in larger modes for which there is no named pattern (not documented as such but inferring from only valid addresses needing to be handled by TARGET_ADDRESS_COST).

Comment 22 Hans-Peter Nilsson 2012-05-06 23:28:55 UTC

(In reply to comment #21)

Hopefully it was obvious, but:

> ... where natural_set_cost is the cost of
> setting a register (possibly the default)

... for the UNITS_PER_WORD case ...

Comment 23 Uroš Bizjak 2012-05-08 16:10:07 UTC

Author: uros
Date: Tue May  8 16:01:54 2012
New Revision: 187289

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187289
Log:
    PR target/53176
    * config/i386/i386.c (ix86_set_reg_reg_cost): New function.
    (ix86_rtx_costs): Handle SET.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c

Comment 24 Hans-Peter Nilsson 2012-05-09 06:22:41 UTC

Generic patch posted.  I'll assign myself until the patch is shot down, though I haven't even tested whether it fixes the original target reported.

Comment 25 Hans-Peter Nilsson 2012-07-12 21:14:19 UTC

Author: hp
Date: Thu Jul 12 21:14:14 2012
New Revision: 189441

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189441
Log:
	PR rtl-optimization/53176
	* rtlanal.c (rtx_cost): Adjust default cost for X with a
	UNITS_PER_WORD factor for all X according to the size of
	its mode, not just for SUBREGs with untieable modes.
	Handle SET.  Use factor * factor for MULT, DIV, UDIV,
	MOD, UMOD.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/rtlanal.c

Comment 26 Hans-Peter Nilsson 2012-07-13 08:04:39 UTC

fixed in default costs too now.