Bug 24653 - [4.1 regression] EON regressed seriously on x86-64
Summary: [4.1 regression] EON regressed seriously on x86-64
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.1.0
Importance: P2 normal
Target Milestone: 4.1.0
Assignee: Not yet assigned to anyone
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: missed-optimization, patch
Depends on:
Blocks:
 
Reported: 2005-11-03 11:26 UTC by Jan Hubicka
Modified: 2005-11-22 17:01 UTC

See Also:
Host: x86_64-gnu-linux
Target: x86_64-gnu-linux
Build: x86_64-gnu-linux
Known to work: 4.2.0
Known to fail:
Last reconfirmed: 2005-11-03 14:46:00


Description Jan Hubicka 2005-11-03 11:26:19 UTC
Eon seems to be our largest regression on x86-64 relative to all previous GCC versions back to the 3.3-hammer branch.
The slowdown is about 7% at -O2; with -O3 -ffast-math -march=k8 -funroll-all-loops and profile feedback it is already over 10%.
I've hacked the sources so that the inlining decisions of 4.0 and 4.1 mostly match (the initializer of gritIterator needs to be marked always_inline, but doing so doesn't make the situation any better) and got the following profile out of 4.1:
CPU: AMD64 processors, speed 2400.17 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 500000
samples  %        symbol name
120973   12.0901  mrSurfaceList::viewingHit(ggRay3 const&, double, double, double, mrViewingHitRecord&, ggMaterialRecord&) const
120622   12.0550  mrGrid::viewingHit(ggRay3 const&, double, double, double, mrViewingHitRecord&, ggMaterialRecord&) const
119050   11.8979  ggSpectrum::Set(float)
82415     8.2366  mrGrid::shadowHit(ggRay3 const&, double, double, double, double&, ggVector3&, int&, ggSpectrum&) const
43740     4.3714  ggRayXZRectangleIntersect(ggRay3 const&, float, float, float, float, float, double, double, double&, double&, double&)
32433     3.2414  ggRayBoxIntersect(ggRay3 const&, ggBox3 const&, double, double, ggONB3&, ggPoint3&, double&)
28028     2.8011  mrSurfaceList::shadowHit(ggRay3 const&, double, double, double, double&, ggVector3&, int&, ggSpectrum&) const
25568     2.5553  mrInstance::shadowHit(ggRay3 const&, double, double, double, double&, ggVector3&, int&, ggSpectrum&) const
25541     2.5526  mrXZRectangle::shadowHit(ggRay3 const&, double, double, double, double&, ggVector3&, int&, ggSpectrum&) const

and out of 4.0:
110943   12.0589  ggSpectrum::Set(float)
97609    10.6095  mrGrid::viewingHit(ggRay3 const&, double, double, double, mrViewingHitRecord&, ggMaterialRecord&) const
93316    10.1429  mrSurfaceList::viewingHit(ggRay3 const&, double, double, double, mrViewingHitRecord&, ggMaterialRecord&) const
68093     7.4013  mrGrid::shadowHit(ggRay3 const&, double, double, double, double&, ggVector3&, int&, ggSpectrum&) const
31232     3.3947  ggRayXZRectangleIntersect(ggRay3 const&, float, float, float, float, float, double, double, double&, double&, double&)
30519     3.3172  mrSurfaceList::shadowHit(ggRay3 const&, double, double, double, double&, ggVector3&, int&, ggSpectrum&) const
25570     2.7793  mrInstance::shadowHit(ggRay3 const&, double, double, double, double&, ggVector3&, int&, ggSpectrum&) const
24977     2.7149  mrCookPixelRenderer::directLight(ggRay3 const&, double, ggPoint3 const&, ggONB3 const&, ggPoint2 const&, ggBRD

So all the *hit functions got consistently slower.  Fortunately they all look pretty much the same (they walk data using the grid iterator), so there might be a common cause.
I will attach oprofiled assembly of the viewingHit function. The main difference is in the longest BB of the function, which is quite different because 4.1 SRAs out the iterator.  -fno-tree-sra does not make the regression disappear.  With 4.1 and -fno-tree-sra the BB in question looks like:
  k_462 = iterator.k;
  j_463 = iterator.j;
  i_464 = iterator.i;
  D.46656_476 = p_142->e[0];
  D.46609.e[0] = D.46656_476;
  D.46657_477 = p_142->e[1];
  D.46609.e[1] = D.46657_477;
  D.46658_478 = p_142->e[2];
  D.46609.e[2] = D.46658_478;
  D.46611_481 = D.46609.e[2];
  D.46612_482 = (double) k_462;
  D.46614_484 = D.46467_416 * D.46612_482;
  e2_485 = D.46611_481 + D.46614_484;
  D.46616.e[0] = D.46656_476;
  D.46616.e[1] = D.46657_477;
  D.46616.e[2] = D.46658_478;
  D.46618_497 = D.46616.e[1];
  D.46619_498 = (double) j_463;
  D.46621_500 = D.46458_394 * D.46619_498;
  e1_501 = D.46618_497 + D.46621_500;
  D.46623.e[0] = D.46656_476;
  D.46623.e[1] = D.46657_477;
  D.46623.e[2] = D.46658_478;
  D.46625_513 = D.46623.e[0];
  D.46626_514 = (double) i_464;
  D.46628_516 = D.46449_372 * D.46626_514;
  e0_517 = D.46625_513 + D.46628_516;
  boxMin.e[0] = e0_517;
  boxMin.e[1] = e1_501;
  boxMin.e[2] = e2_485;
  D.46630.e[0] = D.46656_476;
  D.46630.e[1] = D.46657_477;
  D.46630.e[2] = D.46658_478;
  D.46632_536 = D.46630.e[2];
  D.46633_537 = k_462 + 1;
  D.46634_538 = (double) D.46633_537;
  D.46635_540 = D.46467_416 * D.46634_538;
  e2_541 = D.46632_536 + D.46635_540;
  D.46637.e[0] = D.46656_476;
  D.46637.e[1] = D.46657_477;
  D.46637.e[2] = D.46658_478;
  D.46639_553 = D.46637.e[1];
  D.46640_554 = j_463 + 1;
  D.46641_555 = (double) D.46640_554;
  D.46642_557 = D.46458_394 * D.46641_555;
  e1_558 = D.46639_553 + D.46642_557;
  D.46644.e[0] = D.46656_476;
  D.46644.e[1] = D.46657_477;
  D.46644.e[2] = D.46658_478;
  D.46646_570 = D.46644.e[0];
  D.46647_571 = i_464 + 1;
  D.46648_572 = (double) D.46647_571;
  D.46649_574 = D.46449_372 * D.46648_572;
  e0_575 = D.46646_570 + D.46649_574;
  boxMax.e[0] = e0_575;
  boxMax.e[1] = e1_558;
  boxMax.e[2] = e2_541;
  cellBox.pmin.e[2] = 0.0;
  cellBox.pmin.e[1] = 0.0;
  cellBox.pmin.e[0] = 0.0;
  cellBox.pmax.e[2] = 0.0;
  cellBox.pmax.e[1] = 0.0;
  cellBox.pmax.e[0] = 0.0;
  cellBox.pmin = boxMin;
  cellBox.pmax = boxMax;
  D.46719.e[0] = D.46538_345;
  D.46719.e[1] = D.46535_342;
  D.46719.e[2] = D.46532_339;
  o_601 = D.46719.e[0];
  D.46721.e[0] = D.46523_329;
  D.46721.e[1] = D.46521_327;
  D.46721.e[2] = D.46519_325;
  temp_612 = D.46721.e[0];
  if (temp_612 != 0.0) goto <L46>; else goto <L48>;
4.0 version is:
<L32>:;
  t1.187_430 = t1;
  iterator.tCellMax = t1.187_430;
  D.41484_448 = v_60->e[2];
  e2_449 = t1.187_430 * D.41484_448;
  D.41486_450 = v_60->e[1];
  e1_451 = t1.187_430 * D.41486_450;
  D.41488_452 = v_60->e[0];
  e0_453 = t1.187_430 * D.41488_452;
  D.41499_462 = p_74->e[2];
  p$e$2_464 = e2_449 + D.41499_462;
  D.41502_465 = p_74->e[1];
  p$e$1_467 = e1_451 + D.41502_465;
  D.41505_468 = p_74->e[0];
  p$e$0_470 = e0_453 + D.41505_468;
  D.41425_480 = iterator.iGrid;
  this_481 = &D.41425_480->gridBox;
  p_484 = &this_481->pmin;
  SR.507_487 = p_484->e[0];
  D.41429_493 = p$e$0_470 - SR.507_487;
  D.41430_495 = D.41425_480->xDimension;
  D.41431_496 = D.41429_493 / D.41430_495;
  D.41432_497 = (int) D.41431_496;
  iterator.i = D.41432_497;
  D.41425_502 = iterator.iGrid;
  this_503 = &D.41425_502->gridBox;
  p_506 = &this_503->pmin;
  SR.509_510 = p_506->e[1];
  D.41438_515 = p$e$1_467 - SR.509_510;
  D.41439_517 = D.41425_502->yDimension;
  D.41440_518 = D.41438_515 / D.41439_517;
  D.41441_519 = (int) D.41440_518;
  iterator.j = D.41441_519;
  D.41425_524 = iterator.iGrid;
  this_525 = &D.41425_524->gridBox;
  p_528 = &this_525->pmin;
  SR.511_533 = p_528->e[2];
  D.41447_537 = p$e$2_464 - SR.511_533;
  D.41448_539 = D.41425_524->zDimension;
  D.41449_540 = D.41447_537 / D.41448_539;
  D.41450_541 = (int) D.41449_540;
  iterator.k = D.41450_541;
  D.41451_543 = iterator.i;
  D.41425_544 = iterator.iGrid;
  D.41452_545 = D.41425_544->nx;
  if (D.41451_543 >= D.41452_545) goto <L45>; else goto <L46>;

What seems strange is that in the 4.0 version the [2] element is just used in arithmetic, while 4.1 copies it around for some reason I don't follow (yet).
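
For readers unfamiliar with the dumps, the copy-then-read pattern in the 4.1 BB (store all three elements into a temporary such as D.46609, then read back only one) is what one would typically expect from a point returned by value and indexed once. The following is only a sketch of that kind of source; `Point3`, `minPoint` and `cellMinZ` are hypothetical names invented here, not code taken from Eon.

```cpp
// Hypothetical stand-in for a 3-component point type; not Eon's ggPoint3.
struct Point3 {
    double e[3];
};

// Returns the point by value, so each call materializes a temporary Point3.
inline Point3 minPoint(const Point3* p) { return *p; }

// Only e[2] of the temporary is read. The copies of e[0] and e[1] are dead
// and stay in the code unless the optimizer scalarizes or removes the temporary.
double cellMinZ(const Point3* p, double cellSize, int k) {
    return minPoint(p).e[2] + cellSize * static_cast<double>(k);
}
```

When the temporary is scalarized, the expression should reduce to one load, one conversion, a multiply and an add, which is roughly the shape of the arithmetic in the 4.0 dump.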

Honza
Comment 1 Jan Hubicka 2005-11-03 11:40:09 UTC
Actually I cut&pasted the wrong BB, and -fno-tree-sra on 4.0 makes the difference go away, so ignore the huge dumps :)
Let me see if I can work out something better.
Comment 2 Jan Hubicka 2005-11-03 12:58:15 UTC
OK, I have a new, 100% sure theory ;)
For 4.0, -fno-tree-sra makes an important difference; for 4.1 it does not.  One difference is that 4.0 splits startPoint:
Initial instantiation for startPoint
  startPoint.e[2] -> startPoint$e$2
  startPoint.e[1] -> startPoint$e$1
  startPoint.e[0] -> startPoint$e$0
4.1 claims:
Cannot scalarize variable startPoint because it must live in memory

So it looks like 4.1 is missing a transformation here.
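
As an illustration of why a variable might be reported as having to "live in memory", here is a minimal sketch, assuming the usual cause: the aggregate's address is taken when it is passed by reference. The names `Point3`, `setPoint` and `sumComponents` are hypothetical and only loosely modeled on the `startPoint` variable named in the SRA dump.

```cpp
// Hypothetical stand-in for a 3-component point type; not Eon's ggPoint3.
struct Point3 {
    double e[3];
};

// Passing by reference takes the address of the caller's object. Until this
// call is inlined and the leftover address computation is cleaned up, the
// caller's variable looks address-taken.
static inline void setPoint(Point3& p, double x, double y, double z) {
    p.e[0] = x;
    p.e[1] = y;
    p.e[2] = z;
}

double sumComponents(double x, double y, double z) {
    Point3 startPoint;              // candidate for scalarization
    setPoint(startPoint, x, y, z);  // address of startPoint is taken here
    return startPoint.e[0] + startPoint.e[1] + startPoint.e[2];
}
```

After inlining, nothing in `sumComponents` needs the address of `startPoint` any more, so splitting it into three scalars (as in the 4.0 "Initial instantiation" lines) is valid, provided the compiler notices that the address-taking has gone away.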
Comment 3 Andrew Pinski 2005-11-03 14:40:11 UTC
Confirmed (a patch was posted). The issue is that we need to run DCE before may_alias, and may_alias before SRA.
Comment 4 Andrew Pinski 2005-11-03 14:47:00 UTC
Patch here:
http://gcc.gnu.org/ml/gcc-patches/2005-11/msg00195.html

The main reason why DCE is required is that the struct variable is marked as non TREE_ADDRESSABLE in may_alias.
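
The ordering argument can be sketched with a toy model. This is not GCC code: `Stmt`, `runDce`, `computeAddressable` and `canScalarize` are hypothetical names, and the model assumes only what the comments above state, namely that the alias pass decides addressability from the statements still present and that SRA skips anything still addressable.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Toy statement: may take the address of one variable, and may be dead.
struct Stmt {
    std::string takesAddressOf;  // empty if the statement takes no address
    bool dead;                   // true if its result is never used
};

using Function = std::vector<Stmt>;

// Toy DCE: drop every statement whose result is unused.
Function runDce(Function f) {
    f.erase(std::remove_if(f.begin(), f.end(),
                           [](const Stmt& s) { return s.dead; }),
            f.end());
    return f;
}

// Toy alias pass: a variable stays addressable if any remaining statement
// takes its address, even a dead one.
std::set<std::string> computeAddressable(const Function& f) {
    std::set<std::string> addressable;
    for (const Stmt& s : f) {
        if (!s.takesAddressOf.empty()) {
            addressable.insert(s.takesAddressOf);
        }
    }
    return addressable;
}

// Toy SRA eligibility check: only non-addressable variables are scalarized.
bool canScalarize(const std::string& var,
                  const std::set<std::string>& addressable) {
    return addressable.count(var) == 0;
}
```

In this model, a function containing one dead address-taking statement for `startPoint` fails `canScalarize` when `computeAddressable` runs on it directly, and passes when `runDce` is applied first.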
Comment 5 Paolo Bonzini 2005-11-08 07:53:15 UTC
The approved patch is the one at http://gcc.gnu.org/ml/gcc-patches/2005-11/msg00212.html
Comment 6 Mark Mitchell 2005-11-19 01:39:02 UTC
It would be a shame not to apply this patch, since it's been approved.  Let's get it applied, and get this closed.
Comment 7 Jan Hubicka 2005-11-21 13:14:07 UTC
Subject: Bug 24653

Author: hubicka
Date: Mon Nov 21 13:14:02 2005
New Revision: 107304

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=107304
Log:
	PR tree-optimization/24653
	* tree-ssa-ccp.c (ccp_fold): Strip down useless conversions.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-ccp.c

Comment 8 Andrew Pinski 2005-11-21 13:30:51 UTC
Fixed at least on the mainline for 4.2.0.
Comment 9 Jan Hubicka 2005-11-21 14:44:26 UTC
Subject: Re:  [4.1 regression] EON regressed seriously on x86-64

> 
> 
> ------- Comment #8 from pinskia at gcc dot gnu dot org  2005-11-21 13:30 -------
> Fixed at least on the mainline for 4.2.0.

I am going to fix it on the 4.1 branch too once testing converges.  However, I
would still like to see DCE after DOM, or DCE and DOM reordered.  Even if
the CCP patch fixes the EON regression one way, this problem seems pretty
common in C++ code (see the tramp3d results I posted).

Honza
Comment 10 Jan Hubicka 2005-11-22 16:56:52 UTC
Subject: Bug 24653

Author: hubicka
Date: Tue Nov 22 16:56:48 2005
New Revision: 107365

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=107365
Log:
	PR tree-optimization/24653
	* tree-ssa-ccp.c (ccp_fold): Strip down useless conversions.

Modified:
    branches/gcc-4_1-branch/gcc/ChangeLog
    branches/gcc-4_1-branch/gcc/tree-ssa-ccp.c

Comment 11 Andrew Pinski 2005-11-22 17:01:51 UTC
Fixed.