Bug 14886 - strength reduction on floating point
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.0.0
Importance: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 52316
Reported: 2004-04-08 03:52 UTC by Alan Modra
Modified: 2021-12-28 04:01 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-12-27 00:00:00


Description Alan Modra 2004-04-08 03:52:56 UTC
/* I found an interesting xlc strength reduction optimization recently,
   that had xlc producing fp code that ran over twice as fast as gcc
   code on a powerpc benchmark.  Some improvement on the benchmark code
   was due to xlc using floating multiply-add more aggressively, but the
   main improvement was converting code as in f1 to as in f2.  */

float bar;

void f1 (void)
{
  int i;
  for (i = 0; i < 500; i++)
    __asm__ __volatile__ ("# %0" : : "f" (i * bar));
}

void f2 (void)
{
  register long i;
  register float f, bar2 = bar;
  for (i = 500, f = 0.0; --i >= 0;)
    {
      __asm__ __volatile__ ("# %0" : : "f" (f));
      f += bar2;
    }
}

/* On ppc32, the f1 loop generates
.L9:
        xoris 0,9,0x8000
        stw 11,8(1)
        stw 0,12(1)
        lfd 0,8(1)
        fsub 0,0,13
        frsp 0,0
        fmuls 0,0,12
#APP
        # 0
#NO_APP
        addi 9,9,1
        bdnz .L9

the f2 loop is
.L19:
#APP
        # 0
#NO_APP
        fadds 0,0,13
        bdnz .L19
*/
Comment 1 Andrew Pinski 2004-04-08 04:05:23 UTC
Confirmed. The main reason f2 is faster than f1 is that it no longer has to go through the stack
(store and reload) on every iteration.
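
The stack round trip in the f1 loop (stw, stw, lfd, then fsub) looks like the usual 32-bit PowerPC int-to-double conversion idiom. The following is only a sketch of that idiom in C, under the assumption (not visible in the dump above) that r11 holds 0x43300000 and f13 holds the double with bit pattern 0x4330000080000000. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the bug report: converts a 32-bit int to
   double by building the bit pattern of 2^52 + (i + 2^31) in memory and
   subtracting 2^52 + 2^31.  Assumes IEEE 754 doubles stored with the same
   byte order as uint64_t.  */
static double int_to_double_via_memory (int32_t i)
{
  /* xoris 0,9,0x8000: flip the sign bit, i.e. bias i by 2^31.  */
  uint32_t lo = (uint32_t) i ^ 0x80000000u;
  /* stw 11,8(1); stw 0,12(1): the high word 0x43300000 encodes 2^52.  */
  uint64_t bits = ((uint64_t) 0x43300000u << 32) | lo;
  /* Assumed content of f13: 2^52 + 2^31.  */
  uint64_t magic_bits = ((uint64_t) 0x43300000u << 32) | 0x80000000u;
  double d, magic;

  memcpy (&d, &bits, sizeof d);               /* lfd 0,8(1) */
  memcpy (&magic, &magic_bits, sizeof magic);
  return d - magic;                           /* fsub 0,0,13 */
}
```

The two stores and the dependent load are what f2 avoids by keeping the running value in a floating point register.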
Comment 2 Anton Blanchard 2004-07-04 06:00:41 UTC
Retested on 3.5 cvs (20040703) and the problem is still there:

.L2:
        xoris 0,9,0x8000
        stw 11,8(1)
        stw 0,12(1)
        lfd 0,8(1)
        fsub 0,0,13
        frsp 0,0
        fmuls 0,0,12
#APP
        # 0
#NO_APP
        addi 9,9,1
        bdnz .L2

vs:

.L8:
#APP
        # 0
#NO_APP
        fadds 0,0,13
        bdnz .L8
Comment 3 Anton Blanchard 2004-07-04 06:21:27 UTC
f1 is worse when compiled 64-bit: no use of the count register (bug 16356), redundant
sign extension, etc.:

.L2:
        rldicl. 0,11,0,53
        sradi 9,11,53
        addi 9,9,1
        cmpldi 7,9,2
        beq- 0,.L3
        xor 0,11,0
        blt- 7,.L3
        ori 11,0,2048
.L3:
        lfs 13,0(10)
        std 11,-16(1)
        lfd 12,-16(1)
        fcfid 12,12
        frsp 0,12
        fmuls 0,0,13
#APP
        # 0
#NO_APP
        addi 0,11,1
        extsw 11,0
        cmpwi 7,11,499
        ble+ 7,.L2
Comment 4 Anton Blanchard 2005-03-16 20:54:24 UTC
FYI this is still present in 4.0.0 20050313
Comment 5 David Edelsohn 2006-01-11 02:20:31 UTC
What is the specific testcase compiled by XLC?  What version of XLC?  And what options were used?

I cannot reproduce strength reduction of a floating point multiply to floating point adds with a testcase that uses a function call instead of a volatile asm.  In general FP strength reduction is unsafe.  This could be implemented for GCC when -ffast-math is enabled, but I would like to understand exactly when XLC thinks it is safe to do this, if it indeed still performs the transformation.
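
To illustrate why the transformation is unsafe in general, here is a minimal sketch that compares i * step with a running sum of step, the way f1 and f2 compute their values. The helper name is hypothetical and not part of the testcase. The volatile qualifiers are there only to force each intermediate result to be rounded to float.

```c
/* Hypothetical helper, not from the bug report: counts the iterations in
   [0, n) where the f1-style product i * step differs from the f2-style
   running sum of step.  */
static int count_mismatches (float step, int n)
{
  volatile float sum = 0.0f;   /* f2-style accumulator */
  int i, mismatches = 0;

  for (i = 0; i < n; i++)
    {
      volatile float product = (float) i * step;   /* f1-style value */
      if (product != sum)
        mismatches++;
      sum = sum + step;        /* rounds once per iteration */
    }
  return mismatches;
}
```

With a step such as 0.5f, every product and partial sum up to 500 iterations is exactly representable, so the two forms agree. With 0.1f, the rounding error of the repeated additions accumulates and some iterations produce a different value than the single multiply, which is the kind of change -ffast-math would have to permit.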