There is a huge performance regression due to the recip tree pass. A problem was found in povray-3.6.1, where -ffast-math was 3 times slower (for -mfpmath=sse or -mfpmath=387). The problem was traced to POVFPU_RunDefault function that was found to be more than 20 (twenty!) times slower: with -ffast-math: Each sample counts as 0.00195312 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 75.69 41.10 41.10 2622181 0.00 0.00 pov::POVFPU_RunDefault(unsigned) 2.68 42.55 1.46 3849666 0.00 0.00 pov::Intersect_Light_Tree(pov::Ray_Struct*, pov::Project_Tree_Node_Struct*, int, int, pov::istk_entry*, pov::Object_Struct**, pov::Light_Source_Struct*) without -ffast-math: Each sample counts as 0.00195312 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 12.04 1.85 1.85 2622153 0.00 0.00 pov::POVFPU_RunDefault(unsigned) 9.01 3.23 1.38 3849666 0.00 0.00 pov::Intersect_Light_Tree(pov::Ray_Struct*, pov::Project_Tree_Node_Struct*, int, int, pov::istk_entry*, pov::Object_Struct**, pov::Light_Source_Struct*) 7.25 4.34 1.11 190627 0.00 0.00 pov::sbisect(int, pov::p*, double, double, int, int, double*) Attached testcase, that was reduced from actual source (povray-3.6.1/source/fnpovfpu.cpp, POVFPU_RunDefault()), shows the problem with recip pass: gcc -O2 -ffast-math: <bb 0>: # pc_17 = PHI <0(0), pc_33(322)>; # r7_15 = PHI <r7_20(0), r7_16(322)>; # r6_13 = PHI <r6_21(0), r6_14(322)>; # r5_11 = PHI <r5_22(0), r5_12(322)>; # r4_9 = PHI <r4_23(0), r4_10(322)>; # r3_7 = PHI <r3_24(0), r3_8(322)>; # r2_5 = PHI <r2_25(0), r2_6(322)>; # r1_3 = PHI <r1_26(0), r1_4(322)>; # r0_1 = PHI <r0_27(0), r0_2(322)>; <L0>:; reciptmp.51_38 = 1.0e+0 / r7_15; reciptmp.52_35 = 1.0e+0 / r6_13; reciptmp.53_34 = 1.0e+0 / r5_11; reciptmp.54_43 = 1.0e+0 / r4_9; reciptmp.55_40 = 1.0e+0 / r3_7; reciptmp.56_39 = 1.0e+0 / r2_5; reciptmp.57_48 = 1.0e+0 / r1_3; reciptmp.58_45 = 1.0e+0 / r0_1; D.1619_28 = pc_17 * 4; D.1620_29 = (unsigned int *) D.1619_28; D.1622_31 = *D.1620_29; D.1623_32 = D.1622_31 & 4095; switch (D.1623_32) { case 0: goto <L1>; case 1: goto <L2>; case 2: goto <L3>; case 3: goto <L4>; case 4: goto <L5>; case 5: goto <L6>; case 6: goto <L7>; case 7: goto <L8>; case 8: goto <L9>; case 9: goto <L10>; case 10: goto <L11>; ... As shown, recip pass moves _eight_ divisions in front of switch. Now, for _every_ entry of switch, all eight divisions are performed. The produced code is horrible: -mfpmath=sse: POVFPU_RunDefault: pushl %ebp xorl %edx, %edx movl %esp, %ebp subl $64, %esp fldz .p2align 4,,15 .L4: movl 0(,%edx,4), %eax movsd .LC1, %xmm7 movsd .LC1, %xmm6 movsd .LC1, %xmm5 movsd .LC1, %xmm4 movsd .LC1, %xmm3 movsd .LC1, %xmm2 movsd .LC1, %xmm1 movsd .LC1, %xmm0 divsd -8(%ebp), %xmm7 divsd -16(%ebp), %xmm6 divsd -24(%ebp), %xmm5 andl $4095, %eax divsd -32(%ebp), %xmm4 divsd -40(%ebp), %xmm3 cmpl $319, %eax divsd -48(%ebp), %xmm2 divsd -56(%ebp), %xmm1 divsd -64(%ebp), %xmm0 ja .L5 jmp *.L326(,%eax,4) .section .rodata .align 4 .align 4 .L326: .long .L6 .long .L7 .long .L8 .long .L9 ... for -mfpmath=387, the code is even worse, as every clause has a pack of fstp's that clear results of unneeded divisions at the beginning. .L9: fstp %st(0) fstp %st(0) fstp %st(0) fstp %st(0) fstp %st(0) fstp %st(0) fstp %st(0) fstp %st(0) fldl -40(%ebp) incl %edx faddl -64(%ebp) fstpl -40(%ebp) jmp .L4
This is really a RA issue rather than anything else.
Created attachment 9837 [details] gcc -O2 -ffast-math Testcase, compile with gcc -O2 -ffast-math. The same problem can be found in povray-3.6.1/source/fnpovfpu.cpp, POVFPU_RunDefault() when compiled with -ffast-math.
(In reply to comment #1) > This is really a RA issue rather than anything else. Please this patch... Index: tree-ssa-math-opts.c =================================================================== RCS file: /cvs/gcc/gcc/gcc/tree-ssa-math-opts.c,v retrieving revision 2.5 diff -u -p -r2.5 tree-ssa-math-opts.c --- tree-ssa-math-opts.c 9 Aug 2005 03:28:32 -0000 2.5 +++ tree-ssa-math-opts.c 29 Sep 2005 14:23:49 -0000 @@ -51,7 +51,7 @@ Software Foundation, 51 Franklin Street, static bool gate_cse_reciprocals (void) { - return optimize && !optimize_size && flag_unsafe_math_optimizations; + return optimize && !optimize_size && flag_unsafe_math_optimizations && 0; } /* Where to put the statement computing a reciprocal. */
I forgot to add that with patch from comment #3 POVray starts to fly: Each sample counts as 0.00195312 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 10.05 1.43 1.43 3849666 0.00 0.00 pov::Intersect_Light_Tree(pov::Ray_Struct*, pov::Project_Tree_N ode_Struct*, int, int, pov::istk_entry*, pov::Object_Struct**, pov::Light_Source_Struct*) 7.08 2.43 1.01 189783 0.00 0.00 pov::sbisect(int, pov::p*, double, double, int, int, double*) 6.83 3.40 0.97 3930175 0.00 0.00 pov::block_light_source(pov::Light_Source_Struct*, double, pov: :Ray_Struct*, pov::Ray_Struct*, double*, float*) 6.16 4.28 0.88 3162522 0.00 0.00 pov::All_Blob_Intersections(pov::Object_Struct*, pov::Ray_Struc t*, pov::istack_struct*) 5.82 5.11 0.83 2622167 0.00 0.00 pov::POVFPU_RunDefault(unsigned) 5.40 5.87 0.77 236178 0.00 0.00 pov::Simulate_Media(pov::Media_Struct**, pov::Ray_Struct*, pov: :istk_entry*, float*, int) 3.60 6.39 0.51 4262762 0.00 0.00 pov::intersect_element(double*, double*, pov::Blob_Element_Stru ct*, double, double*, double*)
Subject: Re: Massive performance regression for -ffast-math due to the recip tree pass > > > ------- Additional Comments From uros at kss-loka dot si 2005-09-29 14:25 ------- > (In reply to comment #1) > > This is really a RA issue rather than anything else. > > Please this patch... No, the RA is supposed to move the divisions so that things don't spill. Can you also try the patch at http://gcc.gnu.org/ml/gcc-patches/2005-09/msg01555.html -- Pinski
(In reply to comment #5) > No, the RA is supposed to move the divisions so that things don't spill. > > Can you also try the patch at > http://gcc.gnu.org/ml/gcc-patches/2005-09/msg01555.html Yes, this patch fixes the problem! (However, if this really is RA problem with divisions, this patch hides it). Each sample counts as 0.00195312 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 10.53 1.46 1.46 3849654 0.00 0.00 pov::Intersect_Light_Tree(pov::Ray_Struct*, pov::Project_Tree_Node_Struct*, int, int, pov::istk_entry*, pov::Object_Struct**, pov::Light_Source_Struct*) 9.59 2.79 1.33 242013 0.00 0.00 pov::sbisect(int, pov::p*, double, double, int, int, double*) 6.19 3.65 0.86 2622158 0.00 0.00 pov::POVFPU_RunDefault(unsigned) 5.86 4.46 0.81 3162520 0.00 0.00 pov::All_Blob_Intersections(pov::Object_Struct*, pov::Ray_Struct*, pov::istack_struct*) 5.79 5.27 0.80 3930163 0.00 0.00 pov::block_light_source(pov::Light_Source_Struct*, double, pov::Ray_Struct*, pov::Ray_Struct*, double*, float*)
> This is really a RA issue rather than anything else. No, this is an issue with the recip pass, that creates an absurd register pressure -- it is quite expected that the RA cannot deal with that, especially with the x87's register file. The rewritten pass inserts the divisions at more sane places.
Uros, it seems from your comments that POVray is also 7.43% faster with the rewritten recip pass, than without any recip pass: time for no recip pass (comment #4): 1.43 * 100 / 10.05 = 14.22 time for new recip pass (comment #6): 1.85 * 100 / 12.04 = 15.36 Did you take the two times on the same machine, with the same scene, and with similar load conditions?
(In reply to comment #8) > time for no recip pass (comment #4): 1.43 * 100 / 10.05 = 14.22 The time above is with -ffast-math, with current recip pass disabled > time for new recip pass (comment #6): 1.85 * 100 / 12.04 = 15.36 The time above is without -ffast-math, with current recip pass enabled I have done some more measusrement, with gcc version 4.1.0 20050930 and your patch commited. First results are with recip pass disabled ("&& 0" was added to gate_cse_reciprocals()). Three runs were performed: Without recip pass: povray -display=NONE abyss.pov: user 0m15.770s user 0m15.824s user 0m15.781s ================= 0m15.7917 Next results are with recip pass enabled: povray_recip -display=NONE abyss.pov: user 0m15.891s user 0m15.826s user 0m15.867s ================= 0m15.8613 (+ 0.44%) The povray was compiled with: CFLAGS = -pipe -O3 -msse -mfpmath=387 -ffast-math -D__NO_MATH_INLINES -msse2 -march=pentium4 -mtune=pentium4 -malign-double -minline-all-stringops CXXFLAGS = -pipe -Wno-multichar -O3 -msse -mfpmath=387 -ffast-math -D__NO_MATH_INLINES -msse2 -march=pentium4 -mtune=pentium4 -malign-double -minline-all-stringops In my case, recip pass marginally increases run times for -mfpmath=387.
To calculate the numbers I gave you, I took the sum of the percentages in the profiling snippets, and the cumulative time for the last line of the profiling snippets. Then pct * 100 / time obviously gives you the total running time. The original bug report gives the -fno-fast-math timing top functions account for 28.3%, seconds = 4.34 --> total time = 15.33 Comment #4 gives the timing with -ffast-math but without recip: top functions account for 44.94%, seconds = 6.93 --> total time = 14.21 Comment #6 gives the times with -ffast-math and the new recip: top functions account for 37.96%, seconds = 5.97 --> total time = 13.88 So it seems that the patch introduces a speedup. If not, can you provide profiling output for the two povray's in comment %9 (either on abyss.pov, or on the standard POVray benchmark)? Thanks, Paolo
Created attachment 9844 [details] Profile of povray runs in comment #9 This is a profile results of abyss.pov run. Povray-3.6.1 was compiled with flags as in comment #9. Recip patch was applied in both cases, res_recip.txt is the result of profiled run with recip optimization and res_norecip.txt is the result of profiled run with recip optimization disabled. (It is interesting, that with -D__NO_MATH_INLINES, POVFPU_RunDefault time is considerably increased)
Here is a list of all the differences between the two profiles. r is the time with recip, nr is without, nr-r is the difference. r nr nr-r func 1.28 1.50 0.22 pov::sbisect(int, 0.84 0.90 0.06 pov::All_Blob_Intersections(pov::Object_Struct*, 0.40 0.46 0.06 pov::sample_media(pov::Light_List_Struct*, 1.41 1.46 0.05 pov::Intersect_Light_Tree(pov::Ray_Struct*, 0.20 0.25 0.05 pov::do_light(pov::Light_Source_Struct*, 0.46 0.50 0.04 pov::Check_And_Enqueue(pov::Priority_Queue_Struct*, 0.36 0.40 0.04 pov::Intersection(pov::istk_entry*, 0.07 0.10 0.03 pov::Reinitialize_VLBuffer_Code() 0.70 0.72 0.02 pov::Simulate_Media(pov::Media_Struct**, 0.10 0.12 0.02 pov::Test_Shadow(pov::Light_Source_Struct*, 0.10 0.12 0.02 pov::All_IsoSurface_Intersections(pov::Object_Struct*, 0.05 0.07 0.02 pov::Intersect_Box(pov::Ray_Struct*, 3.28 3.29 0.01 pov::POVFPU_RunDefault(unsigned) 0.87 0.88 0.01 pov::block_light_source(pov::Light_Source_Struct*, 0.22 0.23 0.01 pov::Intersect_Sphere(pov::Ray_Struct*, 0.15 0.16 0.01 pov::All_Sphere_Intersections(pov::Object_Struct*, 0.06 0.07 0.01 pov::compute_lighted_texture(float*, 0.03 0.04 0.01 pov::open_istack() 0.02 0.03 0.01 pov::project_object(pov::Project_Struct*, 0.02 0.03 0.01 pov::MTransPoint(double*, 0.46 0.45 -0.01 pov::intersect_element(double*, 0.28 0.27 -0.01 pov::MInvTransPoint(double*, 0.26 0.25 -0.01 pov::f_helix1(double*, 0.10 0.09 -0.01 pov::All_Cone_Intersections(pov::Object_Struct*, 0.06 0.05 -0.01 pov::Recompute_BBox(pov::Bounding_Box_Struct*, 0.04 0.03 -0.01 pov::solve_cubic(double*, 0.03 0.02 -0.01 pov::insert_hit(pov::Blob_Element_Struct*, 0.02 0.01 -0.01 pov::project_bounding_slab(int, 0.02 0.01 -0.01 pov::project_bbox(pov::Project_Struct*, 0.02 0.01 -0.01 pov::pov_free(void*, 0.02 0.01 -0.01 pov::close_istack(pov::istack_struct*) 0.02 0.01 -0.01 pov::IsoSurface_Function_Find_Root(pov::IsoSurface_Struct*, 0.02 0.01 -0.01 pov::Initialize_Ray_Containers(pov::Ray_Struct*, 0.02 0.01 -0.01 pov::DNoise(double*, 0.02 0.01 -0.01 pov::Compute_Pigment(float*, 0.41 0.39 -0.02 pov::All_Plane_Intersections(pov::Object_Struct*, 0.35 0.33 -0.02 pov::Inside_Blob(double*, 0.13 0.11 -0.02 pov::Priority_Queue_Delete(pov::Priority_Queue_Struct*, 0.07 0.05 -0.02 pov::Point_In_Clip(double*, 0.07 0.05 -0.02 pov::MTimesB(double 0.06 0.04 -0.02 pov::POVFPU_SetLocal(unsigned, 0.03 0.01 -0.02 pov::pov_memmove(void*, 0.03 0.01 -0.02 pov::incstack(pov::istack_struct*) 0.22 0.19 -0.03 pov::Inside_Plane(double*, 0.17 0.14 -0.03 pov::Create_Rayinfo(pov::Ray_Struct*, 0.16 0.13 -0.03 pov::polysolve(int, 0.13 0.10 -0.03 pov::sort_and_split(pov::BBox_Tree_Struct**, 0.12 0.09 -0.03 pov::Inside_Object(double*, 0.11 0.08 -0.03 pov::MInvTransDirection(double*, 0.44 0.40 -0.04 pov::All_CSG_Intersect_Intersections(pov::Object_Struct*, 0.29 0.25 -0.04 pov::scattering_attenuation(pov::Media_Struct**, 0.20 0.16 -0.04 pov::sample_media_rec(pov::Light_List_Struct*, 0.09 0.05 -0.04 pov::Solve_Polynomial(int,
(In reply to comment #12) > Here is a list of all the differences between the two profiles. r is the time > with recip, nr is without, nr-r is the difference. > > r nr nr-r func > 1.28 1.50 0.22 pov::sbisect(int, It looks to me that header is reversed! pov::sbisect is 1.50 _with_ recip. IMHO header should look: nr r nr-r func 1.28 1.50 0.22 pov::sbisect(int, ...
Subject: Re: [4.1 Regression] Massive performance regression for -ffast-math due to the recip tree pass >It looks to me that header is reversed! pov::sbisect is 1.50 _with_ recip. > ehm, right. indeed i was looking at sbisect now ;-) Paolo
hum, strange. i don't get absolutely any difference before and after recip on sbisect (and the family of functions it invokes, i.e. polyeval, regula_falsa and numchanges) in polysolv.cpp from povray 3.6.1. and indeed by browsing at the source code, i saw no optimization opportunities there. uros, can you do the same test? (i.e. compile polysolv.cpp with all your options + -fdump-tree-all, and then "diff -u polysolv.cpp.*.loopdone polysolv.cpp.*.recip"). tnx, Paolo
Created attachment 9845 [details] Differences to .recip in source directory The differences between .loopdone and .recip in povray-3.6.1/source directory for all files. Compiled with recip patch enabled. Other compile flags as stated above.
Looking at the differences, the result of recip looks _really_ good! Currently, there seems to be some problems, i.e.: double pov::f_polytubes(double*, unsigned int) (ptr, D.22748) - D.22787_46 = -6.28318530717958623199592693708837032318115234375e+0 / D.22783_80; + reciptmp.882_72 = 1.0e+0 / D.22783_80; + D.22787_46 = -6.28318530717958623199592693708837032318115234375e+0 * reciptmp.882_72; Not needed, only one user. Function double pov::POVFPU_RunDefault(pov::FUNCTION) <L193>:; - r0_1660 = r0_89 / r0_89; + reciptmp.492_84 = 1.0e+0 / r0_89; + r0_1660 = r0_89 * reciptmp.492_84; goto <bb 1062> (<L1339>); The result of above confusion is (1.0)! We are in fast-math, so no NaNs, etc.. void pov::Simulate_Media(pov::IMEDIA**, pov::RAY*, pov::INTERSECTION* - reciptmp.1152_907 = 1.0e+0 / prephitmp.1124_293; - reciptmp.1153_1046 = 1.0e+0 / prephitmp.1124_293; - reciptmp.1154_1041 = 1.0e+0 / prephitmp.1124_293; + reciptmp.1275_1270 = 1.0e+0 / prephitmp.1124_293; + reciptmp.1152_907 = 1.0e+0 * reciptmp.1275_1270; + reciptmp.1153_1046 = 1.0e+0 * reciptmp.1275_1270; + reciptmp.1154_1041 = 1.0e+0 * reciptmp.1275_1270; These are all the same? There are already reciptmp variables in loopdone that are all the same. And there are quite some places that have this problem.
Subject: Re: [4.1 Regression] Massive performance regression for -ffast-math due to the recip tree pass >Currently, there seems to be some problems, i.e.: > >double pov::f_polytubes(double*, unsigned int) (ptr, D.22748) > >- D.22787_46 = -6.28318530717958623199592693708837032318115234375e+0 / D.22783_80; >+ reciptmp.882_72 = 1.0e+0 / D.22783_80; >+ D.22787_46 = -6.28318530717958623199592693708837032318115234375e+0 * >reciptmp.882_72; > >Not needed, only one user. > > no, there is another user in another basic block. >Function double pov::POVFPU_RunDefault(pov::FUNCTION) > > <L193>:; >- r0_1660 = r0_89 / r0_89; >+ reciptmp.492_84 = 1.0e+0 / r0_89; >+ r0_1660 = r0_89 * reciptmp.492_84; > goto <bb 1062> (<L1339>); > >The result of above confusion is (1.0)! We are in fast-math, so no NaNs, etc.. > >void pov::Simulate_Media(pov::IMEDIA**, pov::RAY*, pov::INTERSECTION* >- reciptmp.1152_907 = 1.0e+0 / prephitmp.1124_293; >- reciptmp.1153_1046 = 1.0e+0 / prephitmp.1124_293; >- reciptmp.1154_1041 = 1.0e+0 / prephitmp.1124_293; >+ reciptmp.1275_1270 = 1.0e+0 / prephitmp.1124_293; >+ reciptmp.1152_907 = 1.0e+0 * reciptmp.1275_1270; >+ reciptmp.1153_1046 = 1.0e+0 * reciptmp.1275_1270; >+ reciptmp.1154_1041 = 1.0e+0 * reciptmp.1275_1270; > >These are all the same? There are already reciptmp variables in loopdone that >are all the same. And there are quite some places that have this problem. > > I think the final DOM run ought to simplify the other problems. Paolo
(In reply to comment #18) > >Currently, there seems to be some problems, i.e.: > >Function double pov::POVFPU_RunDefault(pov::FUNCTION) > > > > <L193>:; > >- r0_1660 = r0_89 / r0_89; > >+ reciptmp.492_84 = 1.0e+0 / r0_89; > >+ r0_1660 = r0_89 * reciptmp.492_84; > > goto <bb 1062> (<L1339>); > > > >The result of above confusion is (1.0)! We are in fast-math, so no NaNs, etc.. > I think the final DOM run ought to simplify the other problems. Unfortunatelly, it doesn't. This is from .optimized: <L193>:; r0 = r0 * 1.0e+0 / r0; goto <bb 1062> (<L1339>); And gcc further generates this beauty: .L236: .loc 1 1405 0 fldl -112(%ebp) fdiv %st(0), %st fstpl -112(%ebp) .loc 1 1626 0 addl $1, %edi jmp .L1595 .L237: Some optimization pass should figure out that fld1 should do the trick...
Subject: Re: [4.1 Regression] Massive performance regression for -ffast-math due to the recip tree pass >>>Function double pov::POVFPU_RunDefault(pov::FUNCTION) >>> >>><L193>:; >>>- r0_1660 = r0_89 / r0_89; >>>+ reciptmp.492_84 = 1.0e+0 / r0_89; >>>+ r0_1660 = r0_89 * reciptmp.492_84; >>> goto <bb 1062> (<L1339>); >>> >>>The result of above confusion is (1.0)! We are in fast-math, so no NaNs, etc.. >>> >>> Can you prepare a patch to fold, that fixes this beauty? :-)
Uros found that tweaking the PR23948 patch to require at least three divisions is more profitable (speed-wise). I'll make it a param.
Is it possible for backends to inquire how many reciprocals were discovered? PowerPC and IA-64 have reciprocal instructions and could benefit by choosing to generate reciprocal earlier with additional information about the number of divides calculated. For PowerPC, it is effective to use the instruction if there are multiple divides, such as the three divisions mentioned above. The IBM XLC compiler propagates the reciprocal and numerator pair through its equivalent to RTL.
> For PowerPC, it is effective to use the instruction if > there are multiple divides, such as the three divisions mentioned above. The > IBM XLC compiler propagates the reciprocal and numerator pair through its > equivalent to RTL. I am not sure I follow you. I see two questions, but it could be that you asked neither: 1) You want to use the reciprocal instruction instead of a FP divide. This could be done in the expander, or with a new RECIP_EXPR tree code. I'd rather do the former. 2) Multiplying by the result of the reciprocal instruction always provides the exact result of the division, so you want to enable the pass always. It could be possible to parameterize the recip pass on a separate -fdivide-by-reciprocals flag, turned on by -funsafe-math-optimization, but also always turned on in config/rs6000/rs6000.c
Comment #21 mentions three divisions for the optimization to be more profitable, but that information is lost by the time GCC reaches the expanders. Minimally, it would be helpful to know that the reciprocal optimization found more than three divides. If the backend knew that it were calculating a reciprocal instead of 1./x division and that the reciprocal only occurs when more than two divides are present, it could produce better code. For both PowerPC and IA-64, it would be helpful to have RECIP_EXPR and the pair that completes division because both archtectures could then explicitly use their reciprocal instructions and fully schedule the Newton-Raphson or power-series expansion for the closure instead of generating the 1./x reciprocal as a full divide and the lower accuracy, isolated numerator multiplication separately. If the backends had more complete information, the reciprocal optimization could be extended beyond -ffast-math for some targets.
Subject: Re: [4.1/4.2 Regression] Massive performance regression for -ffast-math due to the recip tree pass >For PowerPC, it is effective to use the instruction if >there are multiple divides, such as the three divisions mentioned above. The >IBM XLC compiler propagates the reciprocal and numerator pair through its >equivalent to RTL. > > I am not sure I follow you. I see two questions, but it could be that you asked neither: 1) You want to use the reciprocal instruction instead of a FP divide. This could be done in the expander, or with a new RECIP_EXPR tree code. I'd rather do the former. 2) Multiplying by the result of the reciprocal instruction always provides the exact result of the division, so you want to enable the pass always. It could be possible to parameterize the recip pass on a separate -fdivide-by-reciprocals flag, turned on by -funsafe-math-optimization, but also always turned on in config/rs6000/rs6000.c Paolo
Subject: Bug 24123 Author: bonzini Date: Wed Jan 11 13:02:18 2006 New Revision: 109578 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=109578 Log: gcc: 2006-01-11 Paolo Bonzini <bonzini@gnu.org> PR tree-optimization/23109 PR tree-optimization/23948 PR tree-optimization/24123 * Makefile.in (tree-ssa-math-opts.o): Adjust dependencies. * tree-cfg.c (single_noncomplex_succ): New. * tree-flow.h (single_noncomplex_succ): Declare it. * tree-ssa-math-opts.c (enum place_reciprocal): Remove. * tree-ssa-math-opts.c (enum place_reciprocal): Remove. (struct occurrence, occ_head, occ_pool, is_divide_by, compute_merit, insert_bb, register_division_in, insert_reciprocals, replace_reciprocal, free_bb): New. (execute_cse_reciprocals_1): Rewritten. (execute_cse_reciprocals): Adjust calls to execute_cse_reciprocals_1. Do not commit any edge insertion. Always compute dominators and create the allocation pool. * target-def.h (TARGET_MIN_DIVISIONS_FOR_RECIP_MUL): New. * target.h (struct gcc_target): Add min_divistions_for_recip_mul. * targhooks.c (default_min_divistions_for_recip_mul): New. * targhooks.h (default_min_divistions_for_recip_mul): New prototype. * passes.c (init_optimization_passes): Run recip after tree loop optimizations. * doc/tm.texi (Misc): Document TARGET_MIN_DIVISIONS_FOR_RECIP_MUL. gcc/testsuite: 2006-01-11 Paolo Bonzini <bonzini@gnu.org> PR tree-optimization/23109 PR tree-optimization/23948 PR tree-optimization/24123 * gcc.dg/tree-ssa/recip-3.c, gcc.dg/tree-ssa/recip-4.c, gcc.dg/tree-ssa/recip-5.c, gcc.dg/tree-ssa/recip-6.c, gcc.dg/tree-ssa/recip-7.c, gcc.dg/tree-ssa/pr23109.c, g++.dg/tree-ssa/pr23948.C: New testcases. * gcc.dg/tree-ssa/recip-2.c, gcc.dg/tree-ssa/pr23234.c: Provide three divisions in order to do the optimization. Added: trunk/gcc/testsuite/g++.dg/tree-ssa/pr23948.C trunk/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-4.c trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-5.c trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-6.c trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-7.c Modified: trunk/gcc/ChangeLog trunk/gcc/Makefile.in trunk/gcc/doc/tm.texi trunk/gcc/passes.c trunk/gcc/target-def.h trunk/gcc/target.h trunk/gcc/targhooks.c trunk/gcc/targhooks.h trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/tree-ssa/pr23234.c trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-2.c trunk/gcc/tree-cfg.c trunk/gcc/tree-flow.h trunk/gcc/tree-ssa-math-opts.c
Subject: Bug 24123 Author: bonzini Date: Wed Jan 11 14:29:29 2006 New Revision: 109586 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=109586 Log: gcc: 2006-01-11 Paolo Bonzini <bonzini@gnu.org> PR tree-optimization/23109 PR tree-optimization/23948 PR tree-optimization/24123 * Makefile.in (tree-ssa-math-opts.o): Adjust dependencies. * tree-cfg.c (single_noncomplex_succ): New. * tree-flow.h (single_noncomplex_succ): Declare it. * tree-ssa-math-opts.c (enum place_reciprocal): Remove. * tree-ssa-math-opts.c (enum place_reciprocal): Remove. (struct occurrence, occ_head, occ_pool, is_divide_by, compute_merit, insert_bb, register_division_in, insert_reciprocals, replace_reciprocal, free_bb): New. (execute_cse_reciprocals_1): Rewritten. (execute_cse_reciprocals): Adjust calls to execute_cse_reciprocals_1. Do not commit any edge insertion. Always compute dominators and create the allocation pool. * target-def.h (TARGET_MIN_DIVISIONS_FOR_RECIP_MUL): New. * target.h (struct gcc_target): Add min_divistions_for_recip_mul. * targhooks.c (default_min_divistions_for_recip_mul): New. * targhooks.h (default_min_divistions_for_recip_mul): New prototype. * passes.c (init_optimization_passes): Run recip after tree loop optimizations. * doc/tm.texi (Misc): Document TARGET_MIN_DIVISIONS_FOR_RECIP_MUL. gcc/testsuite: 2006-01-11 Paolo Bonzini <bonzini@gnu.org> PR tree-optimization/23109 PR tree-optimization/23948 PR tree-optimization/24123 * gcc.dg/tree-ssa/recip-3.c, gcc.dg/tree-ssa/recip-4.c, gcc.dg/tree-ssa/recip-5.c, gcc.dg/tree-ssa/recip-6.c, gcc.dg/tree-ssa/recip-7.c, gcc.dg/tree-ssa/pr23109.c, g++.dg/tree-ssa/pr23948.C: New testcases. * gcc.dg/tree-ssa/recip-2.c, gcc.dg/tree-ssa/pr23234.c: Provide three divisions in order to do the optimization. Added: branches/gcc-4_1-branch/gcc/testsuite/g++.dg/tree-ssa/pr23948.C - copied unchanged from r109578, trunk/gcc/testsuite/g++.dg/tree-ssa/pr23948.C branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/pr23109.c branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-4.c - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-4.c branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-5.c - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-5.c branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-6.c - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-6.c branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-7.c - copied unchanged from r109578, trunk/gcc/testsuite/gcc.dg/tree-ssa/recip-7.c Modified: branches/gcc-4_1-branch/gcc/ChangeLog branches/gcc-4_1-branch/gcc/Makefile.in branches/gcc-4_1-branch/gcc/doc/tm.texi branches/gcc-4_1-branch/gcc/passes.c branches/gcc-4_1-branch/gcc/target-def.h branches/gcc-4_1-branch/gcc/target.h branches/gcc-4_1-branch/gcc/targhooks.c branches/gcc-4_1-branch/gcc/targhooks.h branches/gcc-4_1-branch/gcc/testsuite/ChangeLog branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/pr23234.c branches/gcc-4_1-branch/gcc/testsuite/gcc.dg/tree-ssa/recip-2.c branches/gcc-4_1-branch/gcc/tree-cfg.c branches/gcc-4_1-branch/gcc/tree-flow.h branches/gcc-4_1-branch/gcc/tree-ssa-math-opts.c
patch committed to both branches