I would like to propose turning on -minline-float-divide-max-throughput
by default on IA64. I have tested it on HP-UX and Linux with no
regressions, and I also ran the four C floating-point SPEC CPU2000 programs
on HP-UX in 64-bit mode to measure the performance difference.
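
For anyone who wants to look at the generated code, here is a minimal sketch of the kind of loop the option affects. The function name and file name are hypothetical and are not taken from the SPEC sources, and the option spelling in the comment is the one in GCC's IA-64 documentation.

```c
#include <stddef.h>

/*
 * Hypothetical kernel with one double-precision divide per element.
 * On an IA-64 target the assembly can be compared with and without
 * the option, for example:
 *
 *   gcc -O2 -S div.c
 *   gcc -O2 -minline-float-divide-max-throughput -S div.c
 *
 * With the option, the divide should appear as an inline instruction
 * sequence rather than as a call.
 */
void divide_all(double *out, const double *num, const double *den, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = num[i] / den[i];
}
```

A loop like this is where a throughput-oriented sequence would be expected to help most, since the divides for different elements are independent of one another.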
The largest performance improvement was in 179.art, which improved by 35%
at -O2 and 20% at -O3. The other tests improved by 0% to 4%, with the
exception of 188.ammp, which slowed down by 1.2% at -O2 (but improved by
0.5% at -O3). I also tried the min-latency version of inlining; it was
not as good as the max-throughput version except on 188.ammp, where it
slowed things down a little less at -O2 and sped things up a little more
at -O3.
The size increase from using inline division ranged from 1.4% to 15%
(1.4% for 177.mesa, 7.6% for 179.art, 6.4% for 183.equake, 15% for 188.ammp).
As a side note, the HP IA64 compiler always generates inline code for
floating point division.
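
For background, IA-64 has no floating-point divide instruction. Division is built from a reciprocal approximation (the frcpa instruction) refined with multiply-add steps. The sketch below illustrates that general idea in portable C. It is not the sequence GCC or the HP compiler emits: approx_recip is a hypothetical stand-in for frcpa, plain multiplies and adds replace fused multiply-add, and the result is close to a/b but not guaranteed to be correctly rounded.

```c
/*
 * Hypothetical stand-in for a hardware reciprocal approximation.
 * Assumes b is finite and nonzero. Scales |b| into [0.5, 1) and applies
 * a linear estimate of 1/m there (relative error at most about 1/17).
 */
static double approx_recip(double b)
{
    double m = b < 0.0 ? -b : b;
    double scale = 1.0;

    while (m >= 1.0) { m *= 0.5; scale *= 0.5; }
    while (m < 0.5)  { m *= 2.0; scale *= 2.0; }

    double r = (48.0 / 17.0 - (32.0 / 17.0) * m) * scale;
    return b < 0.0 ? -r : r;
}

/* Illustrative divide: Newton-Raphson refinement of the reciprocal. */
double sketch_divide(double a, double b)
{
    double y = approx_recip(b);

    /* Each step roughly squares the relative error of y. */
    for (int i = 0; i < 4; i++) {
        double e = 1.0 - b * y;   /* current error term */
        y = y + y * e;            /* refined reciprocal */
    }

    double q = a * y;             /* first quotient estimate */
    double r = a - b * q;         /* remainder of that estimate */
    return q + r * y;             /* one correction step */
}
```

Because the whole computation is a chain of multiplies and adds, the compiler can schedule it either to finish a single divide as soon as possible or to overlap many independent divides, which is roughly the trade-off between the min-latency and max-throughput variants.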