/* I found an interesting xlc strength reduction optimization recently,
that had xlc producing fp code that ran over twice as fast as gcc
code on a powerpc benchmark. Some improvement on the benchmark code
was due to xlc using floating multiply-add more aggressively, but the
main improvement was converting code like f1 into code like f2. */
float bar;

void f1 (void)
{
  long i;
  for (i = 0; i < 500; i++)
    __asm__ __volatile__ ("# %0" : : "f" (i * bar));
}

void f2 (void)
{
  register long i;
  register float f, bar2 = bar;
  for (i = 500, f = 0.0; --i >= 0;)
    {
      __asm__ __volatile__ ("# %0" : : "f" (f));
      f += bar2;
    }
}
On ppc32, the f1 loop generates:
  [assembly listing omitted]
and the f2 loop is:
  [assembly listing omitted]
Confirmed. The main reason f2 is faster than f1 is that the loop no longer has to
convert the integer induction variable to floating point, which on ppc32 means a
round trip through the stack (store the integer, then reload it into an FP register)
on every iteration.
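For background on that stack traffic: 32-bit PowerPC has no direct move between integer and FP registers, so an int-to-float conversion is typically done by storing the integer into the low word of a stack slot whose high word holds the exponent pattern 0x43300000 (2^52), loading the slot as a double, and subtracting a bias. A portable sketch of the same bit trick (my emulation for illustration, not GCC's actual output):

```c
#include <stdint.h>
#include <string.h>

/* Emulates the classic ppc32 int-to-double sequence in portable C.
   The sign bit is flipped so signed values work, the 32-bit result is
   glued under the exponent pattern of 2^52, and the matching bias
   2^52 + 2^31 is subtracted.  On real hardware the memcpy step is a
   store to the stack followed by an FP load -- the round trip through
   memory described above.  */
double int_to_double_via_bits (int32_t i)
{
  uint64_t bits = 0x4330000000000000ULL | ((uint32_t) i ^ 0x80000000u);
  double d;
  memcpy (&d, &bits, sizeof d);   /* reinterpret: the store/load pair */
  return d - 4503601774854144.0;  /* 2^52 + 2^31 */
}
```

f2 sidesteps this entirely by never converting an integer in the loop.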
Retested on 3.5 CVS (20040703) and the problem is still there:
f1 compiled as 64-bit is worse: no use of the count register (bug 16356), redundant
sign extension, etc.
FYI this is still present in 4.0.0 20050313
What is the specific testcase compiled by XLC? What version of XLC? And what options were used?
I cannot reproduce the strength reduction of a floating-point multiply into floating-point adds with a testcase that uses a function call instead of a volatile asm. In general, FP strength reduction is unsafe: repeated addition accumulates rounding error, so the transformed loop does not compute the same values as the multiply. It could be implemented in GCC when -ffast-math is enabled, but I would like to understand exactly when XLC considers it safe, if it indeed still performs the transformation.