Bug 50480 - 10% performance regression on Spec2006 410.bwaves
Summary: 10% performance regression on Spec2006 410.bwaves
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.7.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: spec
 
Reported: 2011-09-22 09:59 UTC by Yukhin Kirill
Modified: 2021-11-03 20:21 UTC

See Also:
Host:
Target: i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
ivtops dump from subversion id 183934 (after regression) (16.99 KB, text/plain)
2012-04-20 23:17 UTC, Michael Meissner

Description Yukhin Kirill 2011-09-22 09:59:54 UTC
Hi,
Richard recently fixed http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49957
According to my measurements, the fix for that bug caused the following changes on Spec2006:

For SandyBridge CPU:
* 410.bwaves degradation is -9.54% for peak32
* 410.bwaves degradation is -6.91% for base32
* 410.bwaves improvement is 1.00% for peak64
* 410.bwaves improvement is 0.91% for base64

For Corei7 CPU:
* 410.bwaves degradation is -3.91% for peak32
* 410.bwaves degradation is -3.91% for base32
* 410.bwaves improvement is 1.94% for peak64
* 410.bwaves improvement is 3.23% for base64

For AMD (Phenom(tm) II X3 B75) CPU:
* 410.bwaves degradation is -7.32% for peak32
* 410.bwaves degradation is -6.56% for base32
* 410.bwaves improvement is 2.01% for peak64
* 410.bwaves degradation is -1.34% for base64
Comment 1 Yukhin Kirill 2011-09-22 10:00:34 UTC
Checkin URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=177368
Comment 2 Yukhin Kirill 2011-09-22 10:33:06 UTC
Here are the optset details:
base=-static -O2 -ffast-math ("-m32 -msse2 -mfpmath=sse" if 32 bit mode)
peak=-static -O3 -funroll-loops -ffast-math ("-m32 -msse2 -mfpmath=sse" if 32 bit mode)

For SandyBridge: += "-mavx -march=corei7" 
For Core i7: += "-march=corei7" 
For AMD: += "-march=amdfam10" (not sure this is the best)
Comment 3 Richard Biener 2011-09-25 11:57:42 UTC
For 32bit only it seems.  Supposedly a cost model issue, the register pressure
will be higher and we have only half the number of SSE regs.
Comment 4 Yukhin Kirill 2011-09-27 08:31:35 UTC
(In reply to comment #3)
> For 32bit only it seems.  Supposedly a cost model issue, the register pressure
> will be higher and we have only half the number of SSE regs.

Richard, what might be wrong with the cost model? If you're increasing live ranges and you don't have as many registers (32-bit case), obviously register pressure will increase and degrade performance. But again, how is that connected with the cost model?
Comment 5 rguenther@suse.de 2011-09-27 08:57:33 UTC
On Tue, 27 Sep 2011, kirill.yukhin at intel dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50480
> 
> --- Comment #4 from Yukhin Kirill <kirill.yukhin at intel dot com> 2011-09-27 08:31:35 UTC ---
> (In reply to comment #3)
> > For 32bit only it seems.  Supposedly a cost model issue, the register pressure
> > will be higher and we have only half the number of SSE regs.
> 
> Richard, what might be wrong with the cost model? If you're increasing live
> ranges and you don't have as many registers (32-bit case), obviously register
> pressure will increase and degrade performance. But again, how is that
> connected with the cost model?

It's connected to the cost model not modeling the whole vectorized
loop but only vectorized statements.  So it can't possibly catch
this case.

I thought of moving more of the cost model details to the target by
allowing the target to track the complete loop, like with

void * targetm.vectorizer.cost_model_start_loop (struct loop *);
void targetm.vectorizer.cost_model_stmt (void *, gimple);
unsigned targetm.vectorizer.cost_model_finish_loop (void *);

where the latter would return a cost for the vectorized loop.

We'd need that to model things like PPC having imbalanced resources
for some kinds of vectorization as well (shifts take up a lot of
resources, so you need other stmts to compensate for them).
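
To make the three-hook idea above concrete, here is a minimal standalone sketch in C. It is not GCC code: every name in it (toy_stmt, toy_loop_cost_data, the toy_cost_model_* functions, TOY_SPILL_COST) is hypothetical, and the spill penalty is an arbitrary assumption. It only illustrates how a per-loop accumulator could see register pressure that a per-statement cost cannot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical penalty per vector value that does not fit in a register. */
#define TOY_SPILL_COST 4u

/* Hypothetical stand-in for a vectorized statement. */
struct toy_stmt {
  unsigned cost;   /* cost of the statement taken on its own */
  int live_delta;  /* vector values it makes live (+) or dead (-) */
};

/* Per-loop state carried between the three hooks. */
struct toy_loop_cost_data {
  unsigned stmt_cost;  /* sum of per-statement costs */
  int live;            /* vector values currently live */
  int max_live;        /* peak number of live vector values */
  int vector_regs;     /* vector registers available on the target */
};

/* Start tracking a loop; returns opaque per-loop data (NULL on failure). */
void *toy_cost_model_start_loop(int vector_regs) {
  struct toy_loop_cost_data *d = malloc(sizeof *d);
  if (d == NULL)
    return NULL;
  d->stmt_cost = 0;
  d->live = 0;
  d->max_live = 0;
  d->vector_regs = vector_regs;
  return d;
}

/* Record one statement: add its cost and update the live-value peak. */
void toy_cost_model_stmt(void *data, const struct toy_stmt *stmt) {
  struct toy_loop_cost_data *d = data;
  assert(d != NULL);
  d->stmt_cost += stmt->cost;
  d->live += stmt->live_delta;
  if (d->live > d->max_live)
    d->max_live = d->live;
}

/* Finish the loop: statement costs plus a penalty for excess pressure.
   Frees the per-loop data. */
unsigned toy_cost_model_finish_loop(void *data) {
  struct toy_loop_cost_data *d = data;
  assert(d != NULL);
  unsigned cost = d->stmt_cost;
  int excess = d->max_live - d->vector_regs;
  if (excess > 0)
    cost += (unsigned)excess * TOY_SPILL_COST;
  free(d);
  return cost;
}

/* Convenience driver: run all three hooks over an array of statements. */
unsigned toy_loop_cost(const struct toy_stmt *stmts, size_t n, int vector_regs) {
  void *data = toy_cost_model_start_loop(vector_regs);
  assert(data != NULL);
  for (size_t i = 0; i < n; i++)
    toy_cost_model_stmt(data, &stmts[i]);
  return toy_cost_model_finish_loop(data);
}
```

In this toy model, a loop body of ten statements that each cost 1 and each add one live vector value costs 10 with 16 vector registers but 18 with 8, which is the kind of 32-bit versus 64-bit difference a per-statement model has no way to express.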

Richard.
Comment 6 Michael Meissner 2012-04-20 23:17:44 UTC
Created attachment 27206 [details]
ivtops dump from subversion id 183934 (after regression)
Comment 7 Eric Gallager 2018-09-24 01:52:22 UTC
(In reply to Michael Meissner from comment #6)
> Created attachment 27206 [details]
> ivtops dump from subversion id 183934 (after regression)

Where are we supposed to be looking in this?