This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: Optimizing floating point *(2^c) and /(2^c)


On 3/29/2010 10:51 AM, Geert Bosch wrote:
On Mar 29, 2010, at 13:19, Jeroen Van Der Bossche wrote:

I've recently written a program where taking the average of two floating-point
numbers was a real bottleneck. I've looked into the assembly generated by
gcc -O3, and apparently gcc treats multiplication and division by a
hard-coded 2 like any other multiplication by a constant. I think, however,
that *(2^c) and /(2^c) for floating-point numbers, where c is known at
compile time, could be optimized with the following pseudo-code:

e = exponent bits of the number
if (e > c && e < (0b111...11) - c) {
    e += c (multiply) or e -= c (divide)
} else {
    do regular multiplication
}
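
[Editor's note: a minimal C sketch of the idea above, not from the original thread. The helper name scale_by_pow2 is hypothetical, and an IEEE-754 binary64 representation of double is assumed.]

#include <math.h>
#include <stdint.h>
#include <string.h>

/* Multiply x by 2^c by adding c to the biased exponent field of an
   IEEE-754 binary64.  Take the fast path only when the operand is a
   normal number and the adjusted exponent still denotes a normal,
   finite number; everything else (zero, subnormal, Inf, NaN,
   over/underflow) falls back to ldexp, which handles all edge cases. */
static double scale_by_pow2(double x, int c)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);         /* well-defined type-punning */

    int e = (int)((bits >> 52) & 0x7FF);    /* biased exponent, 11 bits */

    if (e != 0 && e != 0x7FF && c > -e && c < 0x7FF - e) {
        bits += (uint64_t)(int64_t)c << 52; /* e += c, sign bit untouched */
        memcpy(&x, &bits, sizeof x);
        return x;
    }
    return ldexp(x, c);                     /* slow but fully correct */
}

Even this sketch keeps a compare, a branch, and a round trip between the integer and floating-point domains on the critical path, which is what the replies below are getting at.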

Even further optimizations may be possible, such as bit-shifting the
significand when e=0 (i.e. for subnormal numbers). However, that would
require checking for so many special cases and add so many conditional
jumps that it's most likely not going to be any faster.

I'm not skilled enough with assembly to write this myself and test
whether it actually performs faster than the current implementation. Its
performance will most likely also depend on the processor architecture,
and I could only test the code on one machine. Therefore I ask those who
are familiar with gcc's optimization routines to give this two seconds of
thought, as it is probably rather easy to implement and many programs
could benefit from it.

For any optimization suggestion, you should start by showing some real, compilable code with a performance problem that you think the compiler could address. Please include details about compilation options, GCC versions and target hardware, as well as observed performance numbers. How did you determine that averaging two floating-point numbers is a bottleneck? It should take only a single addition and multiplication, and will execute in a nanosecond or so on a moderately modern system.
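
[Editor's note: for reference, the straightforward form of the computation Geert describes.]

/* Average of two doubles: one addition and one multiplication.
   Multiplying by 0.5 is exact, so this equals (a + b) / 2.0. */
double average(double a, double b)
{
    return (a + b) * 0.5;
}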

Your particular suggestion is flawed. Floating-point multiplication is very fast on most targets. It is hard to see how, on any target with floating-point hardware, manual mucking with the representation could be a win. In particular, your sketch doesn't address underflow and overflow at all. A complete implementation would likely be many times slower than a floating-point multiply.

-Geert
gcc used to have the ability to replace division by a power of 2 with an fscale instruction on appropriate targets (and maybe still does), but such targets have nearly disappeared from everyday usage. What remains is the possibility of replacing division by a constant power of 2 with multiplication, but it's generally considered that the programmer should have done that in the first place. icc has such a facility, but it's subject to -fp-model=fast (equivalent to gcc -ffast-math -fno-cx-limited-range), even though it's a totally safe conversion.
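
[Editor's note: a small example of why that particular conversion is safe. Both the power-of-two divisor and its reciprocal are exactly representable in binary floating point, so both forms are the correctly rounded result of the same real value and agree bit-for-bit.]

/* x / 8.0 and x * 0.125 yield bit-identical results for every x:
   8.0 and 0.125 are both exact binary floating-point values. */
double div_by_8(double x)   { return x / 8.0;   }
double mul_by_8th(double x) { return x * 0.125; } /* same value, cheaper */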
As Geert indicated, it's almost inconceivable that a correct implementation which takes care of exceptions could match the floating-point hardware performance, even for a case that starts with operands in memory (though you mention the case following an addition).


--
Tim Prince

