[forwarded from http://bugs.debian.org/268115] Matthias The bug submitter writes: compiling this function: double baz(double foo, double bar) { return foo*foo*foo*foo*bar*bar*bar*bar; } on amd64 with -O6 -ffast-math, gcc emits this code: foo.o: file format elf64-x86-64 Disassembly of section .text: ... (some similar functions that I was messing around with) ... 0000000000000050 <ddbar>: 50: f2 0f 59 c0 mulsd %xmm0,%xmm0 54: f2 0f 59 c0 mulsd %xmm0,%xmm0 58: f2 0f 59 c1 mulsd %xmm1,%xmm0 5c: f2 0f 59 c1 mulsd %xmm1,%xmm0 60: f2 0f 59 c1 mulsd %xmm1,%xmm0 64: f2 0f 59 c1 mulsd %xmm1,%xmm0 68: c3 retq So, it notices that it can do foo*foo*foo*foo with two mulsd instructions, but it misses the same optimization for bar*bar*bar*bar. It would save one FP multiply overall to do: mulsd %xmm0, %xmm0 mulsd %xmm1, %xmm1 mulsd %xmm0, %xmm0 mulsd %xmm1, %xmm1 mulsd %xmm1, %xmm0 retq Also, the two non-dependent muls could run in parallel. Without -ffast-math, of course, gcc can't take advantage of the laws of arithmetic like that and has to do all the multiplies the straightforward way.
Confirmed. Actually the most optimial code would be: mulsd %xmm1, %xmm0 mulsd %xmm0, %xmm0 mulsd %xmm0, %xmm0 aka (foo*bar)^4
This is actually not a target issue, it can be shown on ppc also and other targets including x86. doing (f1*f2)^2^2 will be the best every where as it is only three instructions and it would take the same time as what is proposed if there are two FPU units but what I said is the smallest and fastest version.
I think PR 22312 mentions what the current problem with reassoc is (well once I submit the patch to introduce reassociation for fp).
In 4.2.0 and above we get: baz: .LFB2: mulsd %xmm1, %xmm0 mulsd %xmm0, %xmm0 mulsd %xmm0, %xmm0 ret Which is what I recommend but we don't get that on the tree level: return bar * bar * __builtin_pow (foo, 4.0e+0) * bar * bar; So this is just a tree level missed optimization.
The tree level on the trunk we get: powmult.2_10 = foo_1(D) * foo_1(D); D.1709_4 = bar_3(D) * bar_3(D); D.1710_5 = D.1709_4 * bar_3(D); D.1711_6 = D.1710_5 * bar_3(D); powmult.2_9 = D.1711_6 * powmult.2_10; D.1707_7 = powmult.2_9 * powmult.2_10; --- CUT --- which is just as bad.
Yeah, reassoc does not canonicalize to pow () so the tree level optimal expansion does not trigger [in reality reassoc should probably do both on-the-fly - linearly expand existing pow()s to expose them to multiply chains and sorting, combine them back and then emit them in optimal form]. Bill, something you want to tackle?
Sure, I'll at least have a look at it when I get some time.
I've started to look at this -- I'll plan to get a patch in place for 4.8.
Author: wschmidt Date: Thu Apr 12 16:15:13 2012 New Revision: 186384 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=186384 Log: gcc: 2012-04-12 Bill Schmidt <wschmidt@linux.vnet.ibm.com> PR tree-optimization/18589 * tree-ssa-reassoc.c (reassociate_stats): Add two fields. (operand_entry): Add count field. (add_repeat_to_ops_vec): New function. (completely_remove_stmt): Likewise. (remove_def_if_absorbed_call): Likewise. (remove_visited_stmt_chain): Remove feeding builtin pow/powi calls. (acceptable_pow_call): New function. (linearize_expr_tree): Look for builtin pow/powi calls and add operand entries with repeat counts when found. (repeat_factor_d): New struct and associated typedefs. (repeat_factor_vec): New static vector variable. (compare_repeat_factors): New function. (get_reassoc_pow_ssa_name): Likewise. (attempt_builtin_powi): Likewise. (reassociate_bb): Call attempt_builtin_powi. (fini_reassoc): Two new calls to statistics_counter_event. gcc/testsuite: 2012-04-12 Bill Schmidt <wschmidt@linux.vnet.ibm.com> PR tree-optimization/18589 * gcc.dg/tree-ssa/pr18589-1.c: New test. * gcc.dg/tree-ssa/pr18589-2.c: Likewise. * gcc.dg/tree-ssa/pr18589-3.c: Likewise. * gcc.dg/tree-ssa/pr18589-4.c: Likewise. * gcc.dg/tree-ssa/pr18589-5.c: Likewise. * gcc.dg/tree-ssa/pr18589-6.c: Likewise. * gcc.dg/tree-ssa/pr18589-7.c: Likewise. * gcc.dg/tree-ssa/pr18589-8.c: Likewise. * gcc.dg/tree-ssa/pr18589-9.c: Likewise. * gcc.dg/tree-ssa/pr18589-10.c: Likewise. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/pr18589-1.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr18589-10.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr18589-2.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr18589-3.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr18589-4.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr18589-5.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr18589-6.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr18589-7.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr18589-8.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr18589-9.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-ssa-reassoc.c
Fixed.