Bug 98673 - pass fre4 inhibit pass dom3 to create much more optimized code
Summary: pass fre4 inhibit pass dom3 to create much more optimized code
Status: RESOLVED WORKSFORME
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 10.2.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2021-01-14 09:32 UTC by jojo
Modified: 2023-12-06 10:39 UTC (History)
1 user (show)

See Also:
Host:
Target: riscv
Build:
Known to work: 11.0, 8.4.0
Known to fail: 10.2.1
Last reconfirmed:


Attachments
bug test file (4.81 KB, text/plain)
2021-01-14 09:32 UTC, jojo
Details

Note You need to log in before you can comment on or make changes to this bug.
Description jojo 2021-01-14 09:32:24 UTC
Created attachment 49962 [details]
bug test file

a, compiler option:

cc1 -mabi=lp64d -march=rv64gc -O2 -S

b, hot code in function t_run_test:

        j       .L30
.L39:
        mv      a4,a3
.L30:
        ld      a2,8(a5)
        addi    a3,a4,1
        slli    t3,a4,3
        ble     a2,a1,.L28
        ld      t5,0(a5)
        bge     a1,t5,.L50
.L28:
        addi    a5,a5,8
        bne     a3,a0,.L39    : hot code loop to .L39

better code in version 8.4 with same compiler option:
=====================================================

.L30:
        ld      t1,8(a4)
        slli    a7,a5,3
        ble     t1,a3,.L28
        ld      t4,0(a4)
        bge     a3,t4,.L50
.L28:
        addi    a5,a5,1
        addi    a4,a4,8
        bne     a5,t3,.L30    : hot code loop to .L30

v10.2.0 gcc has more one instruction than v8.4.0.

analize gcc pass of source code in v10.2.0:
===========================================

before pass fr4:
----------------

<bb 8> [local count: 82176881]:
  engLoad.11_20 = engLoad;
  loadValue.13_26 = loadValue;
  _410 = (unsigned long) numXEntries.17_218;
  _409 = _410 + 18446744073709551615;
  _408 = (long int) _409;

... ...

<bb 12> [local count: 986782143]:
  i1_174 = i1_6 + 1;
  if (i1_174 != _408)
    goto <bb 9>; [94.50%]
  else
    goto <bb 13>; [5.50%]

  <bb 13> [local count: 54273018]:
  # i1_420 = PHI <i1_174(12)>
  _433 = (long unsigned int) i1_420;
  _434 = _433 + 1;
  _435 = _434 * 8;
  _436 = i1_420 + 1;
  _440 = _435 - 8;
  _442 = engLoad.11_20 + _440;
  goto <bb 15>; [100.00%]

after pass fr4:
---------------

<bb 8> [local count: 82176881]:
engLoad.11_20 = engLoad;
loadValue.13_26 = loadValue;
_410 = (unsigned long) numXEntries.17_218;
_409 = _410 + 18446744073709551615;

... ...

<bb 12> [local count: 986782143]:
  i1_174 = i1_6 + 1;
  if (i1_174 != _213)
    goto <bb 9>; [94.50%]
  else
    goto <bb 13>; [5.50%]

<bb 13> [local count: 54273018]:
_433 = (long unsigned int) i1_174;
_434 = _433 + 1;
_435 = _434 * 8;
_436 = i1_174 + 1;
_440 = _435 - 8;
_442 = engLoad.11_20 + _440;
goto <bb 15>; [100.00%]

pass fr4 remove 'Removing dead stmt _408 = (long int) _409;',
pass dom3 can't optimize this <bb 13> about '_433 = (long unsigned int) i1_174;'

if <bb 13> use i1_174 node same as <bb 12>, so that conflict will be happened in pass expand on processing coalesced ssa/phi nodes, and then will split edge.


need help ....:)
Comment 1 Richard Biener 2021-01-14 10:45:36 UTC
The analysis sounds a bit confused.  What is the transform that DOM cannot do after the transform that FRE does?  There's some older bug about out-of-SSA
coalescing issues with loops and liveness of induction variables but it's
not clear if this is related (the assembly doesn't show the loop exit block).

Can you name the loop in the source that is problematic?

See PR86270 and PR70359
Comment 2 jojo 2021-01-15 01:35:29 UTC
(In reply to Richard Biener from comment #1)
> The analysis sounds a bit confused.  What is the transform that DOM cannot
> do after the transform that FRE does?  There's some older bug about
> out-of-SSA
> coalescing issues with loops and liveness of induction variables but it's
> not clear if this is related (the assembly doesn't show the loop exit block).
> 
> Can you name the loop in the source that is problematic?
> 

see this loop:

      for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ )
        {
            if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1] ) )
            {
                break ;
            }
        }
Comment 3 jojo 2021-01-18 01:31:38 UTC
(In reply to jojo from comment #2)
> (In reply to Richard Biener from comment #1)
> > The analysis sounds a bit confused.  What is the transform that DOM cannot
> > do after the transform that FRE does?  There's some older bug about
> > out-of-SSA
> > coalescing issues with loops and liveness of induction variables but it's
> > not clear if this is related (the assembly doesn't show the loop exit block).
> > 
> > Can you name the loop in the source that is problematic?
> > 
> 
> see this loop:
> 
>       for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ )
>         {
>             if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1]
> ) )
>             {
>                 break ;
>             }
>         }

Richi: Can you please update Known to work?
Comment 4 Richard Biener 2021-01-18 10:05:05 UTC
Please try simplifying the testcase, I've tried

long loadValue;
const long *engLoad;
float engLoadDelta1;
void foo()
{
  long i1, numXEntries = 50;
  for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ )
    {
      if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1] ) )
        {
          break ;
        }
    }

  if( i1 == ( numXEntries - 1 ) )
    {
      loadValue = engLoad[i1] ;
    }

  engLoadDelta1 = (float)( loadValue - engLoad[i1] ) /
      (float)( engLoad[i1 + 1] - engLoad[i1] ) ;
}

which on x86 doesn't exhibit the issue (same code with GCC 8 and GCC 10).
Comment 5 jojo 2021-02-03 09:58:31 UTC
Sorry for late :)

Please test with following c case:

long
YTableLookup (long xValue, long xEntries, const long *xAxis,
              const long *yTable )
{
  int i ;
  long xDelta ;
  long outValue ;
  for (i=0; i<(xEntries - 1); i++)
    {
      if ((xValue < xAxis[i + 1]) && (xValue >= xAxis[i]))
        break ;
    }

  if (i == (xEntries - 1))
      xValue = xAxis[i] ;

  xDelta = (long) ((xValue - xAxis[i]) * 1000) / (xAxis[i + 1] - xAxis[i]);
  outValue = (long) ((((1000 - xDelta) * (long) yTable[i]) / 1000) +
             ((xDelta * (long) yTable[i+1]) / 1000)) ;
  return outValue ;
}

risc-v cc1 option: -O2 -march=rv32gc -mabi=ilp32d
=================================================

YTableLookup:
        addi    a1,a1,-1
        ble     a1,zero,.L2
        li      a7,4
        li      a4,0
        sub     a7,a7,a2
        j       .L5
.L6:
        mv      a4,a5
.L5:
        lw      a6,4(a2)
        addi    a5,a4,1
        add     t1,a7,a2
        ble     a6,a0,.L3
        lw      t3,0(a2)
        slli    t4,a4,2
        ble     t3,a0,.L9
.L3:
        addi    a2,a2,4
        bne     a1,a5,.L6
        addi    a4,a4,2
        slli    t1,a4,2
        addi    t4,t1,-4
        add     t4,a3,t4
        li      a1,0
        li      a5,1000
.L4:


or x86 cc1 option: -O2 -march=i386
==================================


YTableLookup:
.LFB0:
        pushl   %ebp
.LCFI0:
        pushl   %edi
.LCFI1:
        pushl   %esi
.LCFI2:
        pushl   %ebx
.LCFI3:
        pushl   %ecx
.LCFI4:
        movl    24(%esp), %esi
        movl    32(%esp), %edi
        movl    28(%esp), %eax
        decl    %eax
        testl   %eax, %eax
        jle     .L2
        movl    $4, %ecx
        xorl    %edx, %edx
        jmp     .L5
        .align 4
.L6:
        movl    %ebx, %edx
.L5:
        movl    (%edi,%ecx), %ebx
        cmpl    %esi, %ebx
        jle     .L3
        leal    -4(%ecx), %ebp
        movl    %ebp, (%esp)
        movl    (%edi,%edx,4), %ebp
        cmpl    %esi, %ebp
        jle     .L10
.L3:
        leal    1(%edx), %ebx
        addl    $4, %ecx
        cmpl    %ebx, %eax
        jne     .L6
        leal    8(,%edx,4), %ecx
        movl    36(%esp), %eax
        leal    -4(%eax,%ecx), %esi
        xorl    %eax, %eax
        movl    $1000, %ebx
.L4:


Please check the redundancy instruction 'mov' at .L6:
Comment 6 Richard Biener 2021-02-03 10:19:10 UTC
So with your testcase on trunk I see for RISCV

        ble     a1,zero,.L2
        li      a6,4
        li      a5,0
        sub     a6,a6,a2
.L5:
        lw      a4,4(a2)
        slli    a7,a5,2
        add     t1,a6,a2
        addi    a5,a5,1
        ble     a4,a0,.L3
        lw      t3,0(a2)
        ble     t3,a0,.L9
.L3:
        addi    a2,a2,4
        bne     a1,a5,.L5

which is fine, same for x86.  This is usually a SSA coalescing issue where
a failed coalesce ends up splitting the backedge and emitting a move there.

I can see the issue on the branch where the problematic one is

;;   basic block 4, loop depth 1
;;    pred:       3
;;                7
  # i_57 = PHI <0(3), i_41(7)>
...

;;   basic block 7, loop depth 1
;;    pred:       4
;;                5
  i_41 = i_57 + 1;
  ivtmp.14_90 = ivtmp.14_91 + 4;
  if (_6 != i_41)
    goto <bb 4>; [94.50%]
  else
    goto <bb 8>; [5.50%]
;;    succ:       4
;;                8

;;   basic block 8, loop depth 0
;;    pred:       7
  _87 = (sizetype) i_57;
  _146 = _87 + 2;

which is a use of the pre-increment i_57 on the loop exit edge.  This
inhibits coalescing of i_57 and i_41 causing the copy.

That's exactly the issue noted in the cited PRs.  There have been patches
floating around re-materializing i_41 + 1 at the point of i_57 to make
the coalescing possible but I think nobody developed them in full.
See the thread starting at https://gcc.gnu.org/pipermail/gcc-patches/2018-March/495843.html
Comment 7 Richard Biener 2023-12-06 10:39:44 UTC
No updates, verified the code is still the same, assuming not an issue.