Bug 24123 - [4.1 Regression] Massive performance regression for -ffast-math due to the recip tree pass
Summary: [4.1 Regression] Massive performance regression for -ffast-math due to the re...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.1.0
: P5 normal
Target Milestone: 4.1.0
Assignee: Paolo Bonzini
URL:
Keywords: missed-optimization, patch
Depends on: 23948
Blocks:
  Show dependency treegraph
 
Reported: 2005-09-29 14:05 UTC by Uroš Bizjak
Modified: 2006-01-11 14:30 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2005-09-30 08:16:44


Attachments
gcc -O2 -ffast-math (701 bytes, text/plain)
2005-09-29 14:11 UTC, Uroš Bizjak
Details
Profile of povray runs in comment #9 (71.93 KB, application/octet-stream)
2005-09-30 12:07 UTC, Uroš Bizjak
Details
Differences to .recip in source directory (43.64 KB, application/octet-stream)
2005-09-30 13:36 UTC, Uroš Bizjak
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Uroš Bizjak 2005-09-29 14:05:50 UTC
There is a huge performance regression due to the recip tree pass. A problem was
found in povray-3.6.1, where -ffast-math was 3 times slower (for -mfpmath=sse or
-mfpmath=387).

The problem was traced to POVFPU_RunDefault function that was found to be more
than 20 (twenty!) times slower:

with -ffast-math:

Each sample counts as 0.00195312 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 75.69     41.10    41.10  2622181     0.00     0.00 
pov::POVFPU_RunDefault(unsigned)
  2.68     42.55     1.46  3849666     0.00     0.00 
pov::Intersect_Light_Tree(pov::Ray_Struct*, pov::Project_Tree_Node_Struct*, int,
int, pov::istk_entry*, pov::Object_Struct**, pov::Light_Source_Struct*)
 
without -ffast-math:

Each sample counts as 0.00195312 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 12.04      1.85     1.85  2622153     0.00     0.00 
pov::POVFPU_RunDefault(unsigned)
  9.01      3.23     1.38  3849666     0.00     0.00 
pov::Intersect_Light_Tree(pov::Ray_Struct*, pov::Project_Tree_Node_Struct*, int,
int, pov::istk_entry*, pov::Object_Struct**, pov::Light_Source_Struct*)
  7.25      4.34     1.11   190627     0.00     0.00  pov::sbisect(int, pov::p*,
double, double, int, int, double*)

Attached testcase, that was reduced from actual source
(povray-3.6.1/source/fnpovfpu.cpp, POVFPU_RunDefault()), shows the problem with
recip pass:

gcc -O2 -ffast-math:

<bb 0>:

  # pc_17 = PHI <0(0), pc_33(322)>;
  # r7_15 = PHI <r7_20(0), r7_16(322)>;
  # r6_13 = PHI <r6_21(0), r6_14(322)>;
  # r5_11 = PHI <r5_22(0), r5_12(322)>;
  # r4_9 = PHI <r4_23(0), r4_10(322)>;
  # r3_7 = PHI <r3_24(0), r3_8(322)>;
  # r2_5 = PHI <r2_25(0), r2_6(322)>;
  # r1_3 = PHI <r1_26(0), r1_4(322)>;
  # r0_1 = PHI <r0_27(0), r0_2(322)>;
<L0>:;
  reciptmp.51_38 = 1.0e+0 / r7_15;
  reciptmp.52_35 = 1.0e+0 / r6_13;
  reciptmp.53_34 = 1.0e+0 / r5_11;
  reciptmp.54_43 = 1.0e+0 / r4_9;
  reciptmp.55_40 = 1.0e+0 / r3_7;
  reciptmp.56_39 = 1.0e+0 / r2_5;
  reciptmp.57_48 = 1.0e+0 / r1_3;
  reciptmp.58_45 = 1.0e+0 / r0_1;
  D.1619_28 = pc_17 * 4;
  D.1620_29 = (unsigned int *) D.1619_28;
  D.1622_31 = *D.1620_29;
  D.1623_32 = D.1622_31 & 4095;
  switch (D.1623_32)
    {
      case 0: goto <L1>;
      case 1: goto <L2>;
      case 2: goto <L3>;
      case 3: goto <L4>;
      case 4: goto <L5>;
      case 5: goto <L6>;
      case 6: goto <L7>;
      case 7: goto <L8>;
      case 8: goto <L9>;
      case 9: goto <L10>;
      case 10: goto <L11>;
      ...

As shown, recip pass moves _eight_ divisions in front of switch. Now, for
_every_ entry of switch, all eight divisions are performed. The produced code is
horrible:

-mfpmath=sse:

POVFPU_RunDefault:
	pushl	%ebp
	xorl	%edx, %edx
	movl	%esp, %ebp
	subl	$64, %esp
	fldz
	.p2align 4,,15
.L4:
	movl	0(,%edx,4), %eax
	movsd	.LC1, %xmm7
	movsd	.LC1, %xmm6
	movsd	.LC1, %xmm5
	movsd	.LC1, %xmm4
	movsd	.LC1, %xmm3
	movsd	.LC1, %xmm2
	movsd	.LC1, %xmm1
	movsd	.LC1, %xmm0
	divsd	-8(%ebp), %xmm7
	divsd	-16(%ebp), %xmm6
	divsd	-24(%ebp), %xmm5
	andl	$4095, %eax
	divsd	-32(%ebp), %xmm4
	divsd	-40(%ebp), %xmm3
	cmpl	$319, %eax
	divsd	-48(%ebp), %xmm2
	divsd	-56(%ebp), %xmm1
	divsd	-64(%ebp), %xmm0
	ja	.L5
	jmp	*.L326(,%eax,4)
	.section	.rodata
	.align 4
	.align 4
.L326:
	.long	.L6
	.long	.L7
	.long	.L8
	.long	.L9
...

for -mfpmath=387, the code is even worse, as every clause has a pack of fstp's
that clear results of unneeded divisions at the beginning.

.L9:
    fstp %st(0)
    fstp %st(0)
    fstp %st(0)
    fstp %st(0)
    fstp %st(0)
    fstp %st(0)
    fstp %st(0)
    fstp %st(0)
    fldl -40(%ebp)
    incl %edx
    faddl -64(%ebp)
    fstpl -40(%ebp)
    jmp .L4
Comment 1 Andrew Pinski 2005-09-29 14:08:33 UTC
This is really a RA issue rather than anything else.
Comment 2 Uroš Bizjak 2005-09-29 14:11:16 UTC
Created attachment 9837 [details]
gcc -O2 -ffast-math

Testcase, compile with gcc -O2 -ffast-math. The same problem can be found in
povray-3.6.1/source/fnpovfpu.cpp, POVFPU_RunDefault() when compiled with
-ffast-math.
Comment 3 Uroš Bizjak 2005-09-29 14:25:51 UTC
(In reply to comment #1)
> This is really a RA issue rather than anything else.

Please this patch...

Index: tree-ssa-math-opts.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-math-opts.c,v
retrieving revision 2.5
diff -u -p -r2.5 tree-ssa-math-opts.c
--- tree-ssa-math-opts.c  9 Aug 2005 03:28:32 -0000 2.5
+++ tree-ssa-math-opts.c  29 Sep 2005 14:23:49 -0000
@@ -51,7 +51,7 @@ Software Foundation, 51 Franklin Street,
 static bool
 gate_cse_reciprocals (void)
 {
-  return optimize && !optimize_size && flag_unsafe_math_optimizations;
+  return optimize && !optimize_size && flag_unsafe_math_optimizations && 0;
 }
 
 /* Where to put the statement computing a reciprocal.  */
Comment 4 Uroš Bizjak 2005-09-29 14:31:14 UTC
I forgot to add that with patch from comment #3 POVray starts to fly:

Each sample counts as 0.00195312 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 10.05      1.43     1.43  3849666     0.00     0.00 
pov::Intersect_Light_Tree(pov::Ray_Struct*, pov::Project_Tree_N
ode_Struct*, int, int, pov::istk_entry*, pov::Object_Struct**,
pov::Light_Source_Struct*)
  7.08      2.43     1.01   189783     0.00     0.00  pov::sbisect(int, pov::p*,
double, double, int, int, double*)
  6.83      3.40     0.97  3930175     0.00     0.00 
pov::block_light_source(pov::Light_Source_Struct*, double, pov:
:Ray_Struct*, pov::Ray_Struct*, double*, float*)
  6.16      4.28     0.88  3162522     0.00     0.00 
pov::All_Blob_Intersections(pov::Object_Struct*, pov::Ray_Struc
t*, pov::istack_struct*)
  5.82      5.11     0.83  2622167     0.00     0.00 
pov::POVFPU_RunDefault(unsigned)
  5.40      5.87     0.77   236178     0.00     0.00 
pov::Simulate_Media(pov::Media_Struct**, pov::Ray_Struct*, pov:
:istk_entry*, float*, int)
  3.60      6.39     0.51  4262762     0.00     0.00 
pov::intersect_element(double*, double*, pov::Blob_Element_Stru
ct*, double, double*, double*)
 
Comment 5 Andrew Pinski 2005-09-29 14:32:11 UTC
Subject: Re:  Massive performance regression for -ffast-math due to the recip tree pass

> 
> 
> ------- Additional Comments From uros at kss-loka dot si  2005-09-29 14:25 -------
> (In reply to comment #1)
> > This is really a RA issue rather than anything else.
> 
> Please this patch...

No, the RA is supposed to move the divisions so that things don't spill.

Can you also try the patch at
http://gcc.gnu.org/ml/gcc-patches/2005-09/msg01555.html


-- Pinski
Comment 6 Uroš Bizjak 2005-09-30 07:49:49 UTC
(In reply to comment #5)

> No, the RA is supposed to move the divisions so that things don't spill.
> 
> Can you also try the patch at
> http://gcc.gnu.org/ml/gcc-patches/2005-09/msg01555.html

  Yes, this patch fixes the problem!

  (However, if this really is RA problem with divisions, this patch hides it).

Each sample counts as 0.00195312 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 10.53      1.46     1.46  3849654     0.00     0.00 
pov::Intersect_Light_Tree(pov::Ray_Struct*, pov::Project_Tree_Node_Struct*, int,
int, pov::istk_entry*, pov::Object_Struct**, pov::Light_Source_Struct*)
  9.59      2.79     1.33   242013     0.00     0.00  pov::sbisect(int, pov::p*,
double, double, int, int, double*)
  6.19      3.65     0.86  2622158     0.00     0.00 
pov::POVFPU_RunDefault(unsigned)
  5.86      4.46     0.81  3162520     0.00     0.00 
pov::All_Blob_Intersections(pov::Object_Struct*, pov::Ray_Struct*,
pov::istack_struct*)
  5.79      5.27     0.80  3930163     0.00     0.00 
pov::block_light_source(pov::Light_Source_Struct*, double, pov::Ray_Struct*,
pov::Ray_Struct*, double*, float*)
Comment 7 Paolo Bonzini 2005-09-30 08:14:51 UTC
> This is really a RA issue rather than anything else.

No, this is an issue with the recip pass, that creates an absurd register
pressure -- it is quite expected that the RA cannot deal with that, especially
with the x87's register file.  The rewritten pass inserts the divisions at more
sane places.
Comment 8 Paolo Bonzini 2005-09-30 08:22:08 UTC
Uros, it seems from your comments that POVray is also 7.43% faster with the
rewritten recip pass, than without any recip pass:

time for no recip pass (comment #4): 1.43 * 100 / 10.05 = 14.22
time for new recip pass (comment #6): 1.85 * 100 / 12.04 = 15.36

Did you take the two times on the same machine, with the same scene, and with
similar load conditions?
Comment 9 Uroš Bizjak 2005-09-30 09:39:03 UTC
(In reply to comment #8)

> time for no recip pass (comment #4): 1.43 * 100 / 10.05 = 14.22
  The time above is with -ffast-math, with current recip pass disabled

> time for new recip pass (comment #6): 1.85 * 100 / 12.04 = 15.36
  The time above is without -ffast-math, with current recip pass enabled

I have done some more measusrement, with gcc version 4.1.0 20050930 and your
patch commited. First results are with recip pass disabled ("&& 0" was added to
gate_cse_reciprocals()). Three runs were performed:

Without recip pass:

povray -display=NONE abyss.pov:
user    0m15.770s
user    0m15.824s
user    0m15.781s
=================
	0m15.7917

Next results are with recip pass enabled:

povray_recip -display=NONE abyss.pov:
user    0m15.891s
user    0m15.826s
user    0m15.867s
=================
	0m15.8613  (+ 0.44%)

The povray was compiled with:

CFLAGS =  -pipe -O3 -msse -mfpmath=387 -ffast-math -D__NO_MATH_INLINES -msse2
-march=pentium4 -mtune=pentium4 -malign-double -minline-all-stringops 

CXXFLAGS =  -pipe -Wno-multichar -O3 -msse -mfpmath=387 -ffast-math
-D__NO_MATH_INLINES -msse2 -march=pentium4 -mtune=pentium4 -malign-double
-minline-all-stringops 

In my case, recip pass marginally increases run times for -mfpmath=387.
Comment 10 Paolo Bonzini 2005-09-30 10:15:47 UTC
To calculate the numbers I gave you,  I took the sum of the percentages in the
profiling snippets, and the cumulative time for the last line of the profiling
snippets.  Then pct * 100 / time obviously gives you the total running time.

The original bug report gives the -fno-fast-math timing
  top functions account for 28.3%, seconds = 4.34 --> total time = 15.33

Comment #4 gives the timing with -ffast-math but without recip:
  top functions account for 44.94%, seconds = 6.93 --> total time = 14.21

Comment #6 gives the times with -ffast-math and the new recip:
  top functions account for 37.96%, seconds = 5.97 --> total time = 13.88

So it seems that the patch introduces a speedup.  If not, can you provide
profiling output for the two povray's in comment %9 (either on abyss.pov, or on
the standard POVray benchmark)?

Thanks,

Paolo
Comment 11 Uroš Bizjak 2005-09-30 12:07:03 UTC
Created attachment 9844 [details]
Profile of povray runs in comment #9

This is a profile results of abyss.pov run. Povray-3.6.1 was compiled with
flags as in comment #9. Recip patch was applied in both cases, res_recip.txt is
the result of profiled run with recip optimization and res_norecip.txt is the
result of profiled run with recip optimization disabled.

(It is interesting, that with -D__NO_MATH_INLINES, POVFPU_RunDefault time is
considerably increased)
Comment 12 Paolo Bonzini 2005-09-30 12:28:31 UTC
Here is a list of all the differences between the two profiles.  r is the time
with recip, nr is without, nr-r is the difference.

r       nr      nr-r    func
1.28    1.50    0.22    pov::sbisect(int,
0.84    0.90    0.06    pov::All_Blob_Intersections(pov::Object_Struct*,
0.40    0.46    0.06    pov::sample_media(pov::Light_List_Struct*,
1.41    1.46    0.05    pov::Intersect_Light_Tree(pov::Ray_Struct*,
0.20    0.25    0.05    pov::do_light(pov::Light_Source_Struct*,
0.46    0.50    0.04    pov::Check_And_Enqueue(pov::Priority_Queue_Struct*,
0.36    0.40    0.04    pov::Intersection(pov::istk_entry*,
0.07    0.10    0.03    pov::Reinitialize_VLBuffer_Code()
0.70    0.72    0.02    pov::Simulate_Media(pov::Media_Struct**,
0.10    0.12    0.02    pov::Test_Shadow(pov::Light_Source_Struct*,
0.10    0.12    0.02    pov::All_IsoSurface_Intersections(pov::Object_Struct*,
0.05    0.07    0.02    pov::Intersect_Box(pov::Ray_Struct*,
3.28    3.29    0.01    pov::POVFPU_RunDefault(unsigned)
0.87    0.88    0.01    pov::block_light_source(pov::Light_Source_Struct*,
0.22    0.23    0.01    pov::Intersect_Sphere(pov::Ray_Struct*,
0.15    0.16    0.01    pov::All_Sphere_Intersections(pov::Object_Struct*,
0.06    0.07    0.01    pov::compute_lighted_texture(float*,
0.03    0.04    0.01    pov::open_istack()
0.02    0.03    0.01    pov::project_object(pov::Project_Struct*,
0.02    0.03    0.01    pov::MTransPoint(double*,
0.46    0.45    -0.01   pov::intersect_element(double*,
0.28    0.27    -0.01   pov::MInvTransPoint(double*,
0.26    0.25    -0.01   pov::f_helix1(double*,
0.10    0.09    -0.01   pov::All_Cone_Intersections(pov::Object_Struct*,
0.06    0.05    -0.01   pov::Recompute_BBox(pov::Bounding_Box_Struct*,
0.04    0.03    -0.01   pov::solve_cubic(double*,
0.03    0.02    -0.01   pov::insert_hit(pov::Blob_Element_Struct*,
0.02    0.01    -0.01   pov::project_bounding_slab(int,
0.02    0.01    -0.01   pov::project_bbox(pov::Project_Struct*,
0.02    0.01    -0.01   pov::pov_free(void*,
0.02    0.01    -0.01   pov::close_istack(pov::istack_struct*)
0.02    0.01    -0.01   pov::IsoSurface_Function_Find_Root(pov::IsoSurface_Struct*,
0.02    0.01    -0.01   pov::Initialize_Ray_Containers(pov::Ray_Struct*,
0.02    0.01    -0.01   pov::DNoise(double*,
0.02    0.01    -0.01   pov::Compute_Pigment(float*,
0.41    0.39    -0.02   pov::All_Plane_Intersections(pov::Object_Struct*,
0.35    0.33    -0.02   pov::Inside_Blob(double*,
0.13    0.11    -0.02   pov::Priority_Queue_Delete(pov::Priority_Queue_Struct*,
0.07    0.05    -0.02   pov::Point_In_Clip(double*,
0.07    0.05    -0.02   pov::MTimesB(double
0.06    0.04    -0.02   pov::POVFPU_SetLocal(unsigned,
0.03    0.01    -0.02   pov::pov_memmove(void*,
0.03    0.01    -0.02   pov::incstack(pov::istack_struct*)
0.22    0.19    -0.03   pov::Inside_Plane(double*,
0.17    0.14    -0.03   pov::Create_Rayinfo(pov::Ray_Struct*,
0.16    0.13    -0.03   pov::polysolve(int,
0.13    0.10    -0.03   pov::sort_and_split(pov::BBox_Tree_Struct**,
0.12    0.09    -0.03   pov::Inside_Object(double*,
0.11    0.08    -0.03   pov::MInvTransDirection(double*,
0.44    0.40    -0.04   pov::All_CSG_Intersect_Intersections(pov::Object_Struct*,
0.29    0.25    -0.04   pov::scattering_attenuation(pov::Media_Struct**,
0.20    0.16    -0.04   pov::sample_media_rec(pov::Light_List_Struct*,
0.09    0.05    -0.04   pov::Solve_Polynomial(int,
Comment 13 Uroš Bizjak 2005-09-30 12:44:14 UTC
(In reply to comment #12)
> Here is a list of all the differences between the two profiles.  r is the time
> with recip, nr is without, nr-r is the difference.
> 
> r       nr      nr-r    func
> 1.28    1.50    0.22    pov::sbisect(int,

It looks to me that header is reversed! pov::sbisect is 1.50 _with_ recip.

IMHO header should look:

 nr      r      nr-r    func
 1.28    1.50    0.22    pov::sbisect(int,
 ...
Comment 14 paolo.bonzini@lu.unisi.ch 2005-09-30 12:53:21 UTC
Subject: Re:  [4.1 Regression] Massive performance
 regression for -ffast-math due to the recip tree pass


>It looks to me that header is reversed! pov::sbisect is 1.50 _with_ recip.
>
ehm, right.  indeed i was looking at sbisect now ;-)

Paolo
Comment 15 Paolo Bonzini 2005-09-30 13:16:47 UTC
hum, strange.  i don't get absolutely any difference before and after recip on
sbisect (and the family of functions it invokes, i.e. polyeval, regula_falsa and
numchanges) in polysolv.cpp from povray 3.6.1.  and indeed by browsing at the
source code, i saw no optimization opportunities there.

uros, can you do the same test? (i.e. compile polysolv.cpp with all your options
+ -fdump-tree-all, and then "diff -u polysolv.cpp.*.loopdone polysolv.cpp.*.recip").

tnx,

Paolo
Comment 16 Uroš Bizjak 2005-09-30 13:36:12 UTC
Created attachment 9845 [details]
Differences to .recip in source directory

The differences between .loopdone and .recip in povray-3.6.1/source directory
for all files. Compiled with recip patch enabled. Other compile flags as stated
above.
Comment 17 Uroš Bizjak 2005-09-30 13:58:50 UTC
Looking at the differences, the result of recip looks _really_ good!

Currently, there seems to be some problems, i.e.:

double pov::f_polytubes(double*, unsigned int) (ptr, D.22748)

-  D.22787_46 = -6.28318530717958623199592693708837032318115234375e+0 / D.22783_80;
+  reciptmp.882_72 = 1.0e+0 / D.22783_80;
+  D.22787_46 = -6.28318530717958623199592693708837032318115234375e+0 *
reciptmp.882_72;

Not needed, only one user.


Function double pov::POVFPU_RunDefault(pov::FUNCTION)

 <L193>:;
-  r0_1660 = r0_89 / r0_89;
+  reciptmp.492_84 = 1.0e+0 / r0_89;
+  r0_1660 = r0_89 * reciptmp.492_84;
   goto <bb 1062> (<L1339>);

The result of above confusion is (1.0)! We are in fast-math, so no NaNs, etc..

void pov::Simulate_Media(pov::IMEDIA**, pov::RAY*, pov::INTERSECTION*
-  reciptmp.1152_907 = 1.0e+0 / prephitmp.1124_293;
-  reciptmp.1153_1046 = 1.0e+0 / prephitmp.1124_293;
-  reciptmp.1154_1041 = 1.0e+0 / prephitmp.1124_293;
+  reciptmp.1275_1270 = 1.0e+0 / prephitmp.1124_293;
+  reciptmp.1152_907 = 1.0e+0 * reciptmp.1275_1270;
+  reciptmp.1153_1046 = 1.0e+0 * reciptmp.1275_1270;
+  reciptmp.1154_1041 = 1.0e+0 * reciptmp.1275_1270;

These are all the same? There are already reciptmp variables in loopdone that
are all the same. And there are quite some places that have this problem.
Comment 18 paolo.bonzini@lu.unisi.ch 2005-09-30 14:13:55 UTC
Subject: Re:  [4.1 Regression] Massive performance
 regression for -ffast-math due to the recip tree pass


>Currently, there seems to be some problems, i.e.:
>
>double pov::f_polytubes(double*, unsigned int) (ptr, D.22748)
>
>-  D.22787_46 = -6.28318530717958623199592693708837032318115234375e+0 / D.22783_80;
>+  reciptmp.882_72 = 1.0e+0 / D.22783_80;
>+  D.22787_46 = -6.28318530717958623199592693708837032318115234375e+0 *
>reciptmp.882_72;
>
>Not needed, only one user.
>  
>
no, there is another user in another basic block.

>Function double pov::POVFPU_RunDefault(pov::FUNCTION)
>
> <L193>:;
>-  r0_1660 = r0_89 / r0_89;
>+  reciptmp.492_84 = 1.0e+0 / r0_89;
>+  r0_1660 = r0_89 * reciptmp.492_84;
>   goto <bb 1062> (<L1339>);
>
>The result of above confusion is (1.0)! We are in fast-math, so no NaNs, etc..
>
>void pov::Simulate_Media(pov::IMEDIA**, pov::RAY*, pov::INTERSECTION*
>-  reciptmp.1152_907 = 1.0e+0 / prephitmp.1124_293;
>-  reciptmp.1153_1046 = 1.0e+0 / prephitmp.1124_293;
>-  reciptmp.1154_1041 = 1.0e+0 / prephitmp.1124_293;
>+  reciptmp.1275_1270 = 1.0e+0 / prephitmp.1124_293;
>+  reciptmp.1152_907 = 1.0e+0 * reciptmp.1275_1270;
>+  reciptmp.1153_1046 = 1.0e+0 * reciptmp.1275_1270;
>+  reciptmp.1154_1041 = 1.0e+0 * reciptmp.1275_1270;
>
>These are all the same? There are already reciptmp variables in loopdone that
>are all the same. And there are quite some places that have this problem.
>  
>
I think the final DOM run ought to simplify the other problems.

Paolo
Comment 19 Uroš Bizjak 2005-09-30 14:37:47 UTC
(In reply to comment #18)

> >Currently, there seems to be some problems, i.e.:

> >Function double pov::POVFPU_RunDefault(pov::FUNCTION)
> >
> > <L193>:;
> >-  r0_1660 = r0_89 / r0_89;
> >+  reciptmp.492_84 = 1.0e+0 / r0_89;
> >+  r0_1660 = r0_89 * reciptmp.492_84;
> >   goto <bb 1062> (<L1339>);
> >
> >The result of above confusion is (1.0)! We are in fast-math, so no NaNs, etc..

> I think the final DOM run ought to simplify the other problems.

  Unfortunatelly, it doesn't.

This is from .optimized:

<L193>:;
  r0 = r0 * 1.0e+0 / r0;
  goto <bb 1062> (<L1339>);

And gcc further generates this beauty:

.L236:
	.loc 1 1405 0
	fldl	-112(%ebp)
	fdiv	%st(0), %st
	fstpl	-112(%ebp)
	.loc 1 1626 0
	addl	$1, %edi
	jmp	.L1595
.L237:

  Some optimization pass should figure out that fld1 should do the trick...
Comment 20 paolo.bonzini@lu.unisi.ch 2005-09-30 14:40:19 UTC
Subject: Re:  [4.1 Regression] Massive performance
 regression for -ffast-math due to the recip tree pass


>>>Function double pov::POVFPU_RunDefault(pov::FUNCTION)
>>>
>>><L193>:;
>>>-  r0_1660 = r0_89 / r0_89;
>>>+  reciptmp.492_84 = 1.0e+0 / r0_89;
>>>+  r0_1660 = r0_89 * reciptmp.492_84;
>>>  goto <bb 1062> (<L1339>);
>>>
>>>The result of above confusion is (1.0)! We are in fast-math, so no NaNs, etc..
>>>      
>>>
Can you prepare a patch to fold, that fixes this beauty? :-)
Comment 21 Paolo Bonzini 2005-10-12 09:38:58 UTC
Uros found that tweaking the PR23948 patch to require at least three divisions is more profitable (speed-wise).  I'll make it a param.
Comment 22 David Edelsohn 2006-01-03 15:22:49 UTC
Is it possible for backends to inquire how many reciprocals were discovered?  PowerPC and IA-64 have reciprocal instructions and could benefit by choosing to generate reciprocal earlier with additional information about the number of divides calculated.  For PowerPC, it is effective to use the instruction if there are multiple divides, such as the three divisions mentioned above.  The IBM XLC compiler propagates the reciprocal and numerator pair through its equivalent to RTL.
Comment 23 Paolo Bonzini 2006-01-03 15:30:16 UTC
> For PowerPC, it is effective to use the instruction if
> there are multiple divides, such as the three divisions mentioned above.  The
> IBM XLC compiler propagates the reciprocal and numerator pair through its
> equivalent to RTL.

I am not sure I follow you.  I see two questions, but it could be that you asked neither:

1) You want to use the reciprocal instruction instead of a FP divide.  This could be done in the expander, or with a new RECIP_EXPR tree code.  I'd rather do the former.

2) Multiplying by the result of the reciprocal instruction always provides the exact result of the division, so you want to enable the pass always.  It could be possible to parameterize the recip pass on a separate -fdivide-by-reciprocals flag, turned on by -funsafe-math-optimization, but also always turned on in config/rs6000/rs6000.c
Comment 24 David Edelsohn 2006-01-03 15:50:20 UTC
Comment #21 mentions three divisions for the optimization to be more profitable, but that information is lost by the time GCC reaches the expanders. Minimally, it would be helpful to know that the reciprocal optimization found more than three divides.

If the backend knew that it were calculating a reciprocal instead of 1./x division and that the reciprocal only occurs when more than two divides are present, it could produce better code.

For both PowerPC and IA-64, it would be helpful to have RECIP_EXPR and the pair that completes division because both archtectures could then explicitly use their reciprocal instructions and fully schedule the Newton-Raphson or power-series expansion for the closure instead of generating the 1./x reciprocal as a full divide and the lower accuracy, isolated numerator multiplication separately.

If the backends had more complete information, the reciprocal optimization could be extended beyond -ffast-math for some targets.
Comment 25 paolo.bonzini@lu.unisi.ch 2006-01-04 16:29:54 UTC
Subject: Re:  [4.1/4.2 Regression] Massive performance
 regression for -ffast-math due to the recip tree pass


>For PowerPC, it is effective to use the instruction if
>there are multiple divides, such as the three divisions mentioned above.  The
>IBM XLC compiler propagates the reciprocal and numerator pair through its
>equivalent to RTL.
>  
>
I am not sure I follow you.  I see two questions, but it could be that 
you asked neither:

1) You want to use the reciprocal instruction instead of a FP divide.  
This could be done in the expander, or with a new RECIP_EXPR tree code.  
I'd rather do the former.

2) Multiplying by the result of the reciprocal instruction always 
provides the exact result of the division, so you want to enable the 
pass always.  It could be possible to parameterize the recip pass on a 
separate -fdivide-by-reciprocals flag, turned on by 
-funsafe-math-optimization, but also always turned on in 
config/rs6000/rs6000.c

Paolo
Comment 26 Paolo Bonzini 2006-01-11 13:02:25 UTC
Subject: Bug 24123

Author: bonzini
Date: Wed Jan 11 13:02:18 2006
New Revision: 109578

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=109578
Log:
gcc:
2006-01-11  Paolo Bonzini  <bonzini@gnu.org>

	PR tree-optimization/23109
	PR tree-optimization/23948
	PR tree-optimization/24123

	* Makefile.in (tree-ssa-math-opts.o): Adjust dependencies.
        * tree-cfg.c (single_noncomplex_succ): New.
        * tree-flow.h (single_noncomplex_succ): Declare it.
        * tree-ssa-math-opts.c (enum place_reciprocal): Remove.
        * tree-ssa-math-opts.c (enum place_reciprocal): Remove.
        (struct occurrence, occ_head, occ_pool, is_divide_by, compute_merit,
	insert_bb, register_division_in, insert_reciprocals,
	replace_reciprocal, free_bb): New.
        (execute_cse_reciprocals_1): Rewritten.
        (execute_cse_reciprocals): Adjust calls to execute_cse_reciprocals_1.
        Do not commit any edge insertion.  Always compute dominators and
        create the allocation pool.
        * target-def.h (TARGET_MIN_DIVISIONS_FOR_RECIP_MUL): New.
	* target.h (struct gcc_target): Add min_divistions_for_recip_mul.
	* targhooks.c (default_min_divistions_for_recip_mul): New.
	* targhooks.h (default_min_divistions_for_recip_mul): New prototype.
        * passes.c (init_optimization_passes): Run recip after tree loop
        optimizations.
        * doc/tm.texi (Misc): Document TARGET_MIN_DIVISIONS_FOR_RECIP_MUL.

gcc/testsuite:
2006-01-11  Paolo Bonzini  <bonzini@gnu.org>
        
        PR tree-optimization/23109
        PR tree-optimization/23948
        PR tree-optimization/24123

        * gcc.dg/tree-ssa/recip-3.c, gcc.dg/tree-ssa/recip-4.c,
        gcc.dg/tree-ssa/recip-5.c, gcc.dg/tree-ssa/recip-6.c,
        gcc.dg/tree-ssa/recip-7.c, gcc.dg/tree-ssa/pr23109.c,
        g++.dg/tree-ssa/pr23948.C: New testcases.
        * gcc.dg/tree-ssa/recip-2.c, gcc.dg/tree-ssa/pr23234.c: Provide
	three divisions in order to do the optimization.


Added:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr23948.C
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-4.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-5.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-6.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-7.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/doc/tm.texi
    trunk/gcc/passes.c
    trunk/gcc/target-def.h
    trunk/gcc/target.h
    trunk/gcc/targhooks.c
    trunk/gcc/targhooks.h
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr23234.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-2.c
    trunk/gcc/tree-cfg.c
    trunk/gcc/tree-flow.h
    trunk/gcc/tree-ssa-math-opts.c

Comment 27 Paolo Bonzini 2006-01-11 14:29:41 UTC
Subject: Bug 24123

Author: bonzini
Date: Wed Jan 11 14:29:29 2006
New Revision: 109586

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=109586
Log:
gcc:
2006-01-11  Paolo Bonzini  <bonzini@gnu.org>

	PR tree-optimization/23109
	PR tree-optimization/23948
	PR tree-optimization/24123

	* Makefile.in (tree-ssa-math-opts.o): Adjust dependencies.
        * tree-cfg.c (single_noncomplex_succ): New.
        * tree-flow.h (single_noncomplex_succ): Declare it.
        * tree-ssa-math-opts.c (enum place_reciprocal): Remove.
        * tree-ssa-math-opts.c (enum place_reciprocal): Remove.
        (struct occurrence, occ_head, occ_pool, is_divide_by, compute_merit,
	insert_bb, register_division_in, insert_reciprocals,
	replace_reciprocal, free_bb): New.
        (execute_cse_reciprocals_1): Rewritten.
        (execute_cse_reciprocals): Adjust calls to execute_cse_reciprocals_1.
        Do not commit any edge insertion.  Always compute dominators and
        create the allocation pool.
        * target-def.h (TARGET_MIN_DIVISIONS_FOR_RECIP_MUL): New.
	* target.h (struct gcc_target): Add min_divistions_for_recip_mul.
	* targhooks.c (default_min_divistions_for_recip_mul): New.
	* targhooks.h (default_min_divistions_for_recip_mul): New prototype.
        * passes.c (init_optimization_passes): Run recip after tree loop
        optimizations.
        * doc/tm.texi (Misc): Document TARGET_MIN_DIVISIONS_FOR_RECIP_MUL.

gcc/testsuite:
2006-01-11  Paolo Bonzini  <bonzini@gnu.org>
        
        PR tree-optimization/23109
        PR tree-optimization/23948
        PR tree-optimization/24123

        * gcc.dg/tree-ssa/recip-3.c, gcc.dg/tree-ssa/recip-4.c,
        gcc.dg/tree-ssa/recip-5.c, gcc.dg/tree-ssa/recip-6.c,
        gcc.dg/tree-ssa/recip-7.c, gcc.dg/tree-ssa/pr23109.c,
        g++.dg/tree-ssa/pr23948.C: New testcases.
        * gcc.dg/tree-ssa/recip-2.c, gcc.dg/tree-ssa/pr23234.c: Provide
	three divisions in order to do the optimization.


Added:
    branches/gcc-4_1-branch/gcc/testsuite/g++.dg/tree-ssa/pr23948.C
      - copied unchanged from r109578, trunk/gcc/testsuite/g++.dg/tree-ssa/pr23948.C
    branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c
      - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c
    branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
      - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
    branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-4.c
      - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-4.c
    branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-5.c
      - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-5.c
    branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-6.c
      - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-6.c
    branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-7.c
      - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-7.c
Modified:
    branches/gcc-4_1-branch/gcc/ChangeLog
    branches/gcc-4_1-branch/gcc/Makefile.in
    branches/gcc-4_1-branch/gcc/doc/tm.texi
    branches/gcc-4_1-branch/gcc/passes.c
    branches/gcc-4_1-branch/gcc/target-def.h
    branches/gcc-4_1-branch/gcc/target.h
    branches/gcc-4_1-branch/gcc/targhooks.c
    branches/gcc-4_1-branch/gcc/targhooks.h
    branches/gcc-4_1-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/pr23234.c
    branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-2.c
    branches/gcc-4_1-branch/gcc/tree-cfg.c
    branches/gcc-4_1-branch/gcc/tree-flow.h
    branches/gcc-4_1-branch/gcc/tree-ssa-math-opts.c

Comment 28 Paolo Bonzini 2006-01-11 14:30:54 UTC
patch committed to both branches