Bug 41171 - register allocator undoing optimal schedule
Summary: register allocator undoing optimal schedule
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.4.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2009-08-25 21:48 UTC by Charles J. Tabony
Modified: 2021-12-19 00:03 UTC (History)
9 users (show)

See Also:
Host: x86_64-unknown-linux-gnu
Target: ia64-elf
Build: x86_64-unknown-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2009-08-25 23:05:53


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Charles J. Tabony 2009-08-25 21:48:47 UTC
When I compile the following code

void f(int *x, int *y){
  *x = 7;
  *y = 4;
}

at -O2 for Itanium, I get the following assembly:

f:
        .prologue
        .body
        .mmi
        addl r14 = 7, r0
        ;;
        st4 [r32] = r14
        addl r14 = 4, r0
        ;;
        .mib
        st4 [r33] = r14
        nop 0
        br.ret.sptk.many b0
        .endp f#

The expected output is

f:
        .prologue
        .body
        .mii
        addl r14 = 7, r0
        addl r15 = 4, r0
        ;;
        nop 0
        .mmb
        st4 [r32] = r14
        st4 [r33] = r15
        br.ret.sptk.many b0
        .endp f#

In the .sched1 dump, I see the expected schedule:

;;        0-->     7 r341=0x7                          :2_A
;;        0-->     9 r342=0x4                          :2_A
;;        1-->     8 [in0]=r341                        :2_M_only_um23
;;        1-->    10 [in1]=r342                        :2_M_only_um23
;;   total time = 1

but in the .ira dump, the RTL has reverted back to the serial code.  Because of the anti-dependency introduced by register allocation, the .mach dump shows an inferior schedule:

;;        0-->     7 r14=0x7                           :2_A
;;        1-->     8 [r32]=r14                         :2_M_only_um23
;;        1-->    21 r14=0x4                           :2_A
;;        2-->    10 [r33]=r14                         :2_M_only_um23
;;        2-->    25 {return;use b0;}                  :2_B
;;   total time = 2

In GCC 4.3.2 and 3.4.6, I see the lreg pass likewise creating an inferior schedule.  However, for PowerPC, MIPS, ARM, and FR-V, GCC 4.3.2 leaves the initial schedule intact, whereas GCC 4.4.0 changes the order of insns in the IRA pass for all targets.

For targets other than Itanium, I'm not sure this transformation in IRA degrades performance, and it reduces register pressure, so it seems like a positive change.  For Itanium, this degrades performance.  What's odd is that the Itanium port had this behavior prior to GCC 4.4.0, while other ports did not.  Is there some set of machine-specific parameters that the Itanium port could tune to prevent this transformation in IRA (hopefully without degrading performance elsewhere)?
Comment 1 Adam Nemet 2009-08-25 23:05:52 UTC
It's also a regression if the load-immediates can dual issue.  I.e. see below how the code gets worse between sched1 and sched2 on octeon.

I am cc'ing Vlad in case he has some ideas.



;;   ======================================================
;;   -- basic block 2 from 7 to 10 -- before reload
;;   ======================================================

;;        0-->     7 r195=0x7                          :octeon_pipe0|octeon_pipe1
;;        0-->     9 r196=0x4                          :octeon_pipe0|octeon_pipe1
;;        1-->     8 [$4]=r195                         :octeon_pipe0
;;        2-->    10 [$5]=r196                         :octeon_pipe0
;;      Ready list (final):  
;;   total time = 2
;;   new head = 7
;;   new tail = 10


;; Procedure interblock/speculative motions == 0/0 


;;   ======================================================
;;   -- basic block 2 from 7 to 18 -- after reload
;;   ======================================================

;;        0-->     7 $2=0x7                            :octeon_pipe0|octeon_pipe1
;;        1-->     8 [$4]=$2                           :octeon_pipe0
;;        1-->    14 $2=0x4                            :octeon_pipe0|octeon_pipe1
;;        2-->    10 [$5]=$2                           :octeon_pipe0
;;        3-->    18 return                            :octeon_pipe0
;;      Ready list (final):  
;;   total time = 3
;;   new head = 7
;;   new tail = 18
Comment 2 Peter Bergner 2009-08-26 13:44:02 UTC
The problem here, is that for some reason, IRA is spilling the two pseudos in the test case, even though it seems it should be trivial.  Looking deeper.
Comment 3 Peter Bergner 2009-08-26 15:14:19 UTC
Actually, they're already reordered by the time we call ira_color and the ira dumps shows that:

;; Function f (f)

starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
scanning new insn with uid = 14.
verify found no changes in insn with uid = 14.
deleting insn with uid = 9.
Building IRA IR
...

Giving us:

(insn 7 4 8 2 pr41171.c:4 (set (reg:SI 121)
        (const_int 7 [0x7])) 344 {*movsi_internal1} (expr_list:REG_EQUIV (const_int 7 [0x7])
        (nil)))

(insn 8 7 14 2 pr41171.c:4 (set (mem:SI (reg:SI 3 3 [ x ]) [2 S4 A32])
        (reg:SI 121)) 344 {*movsi_internal1} (expr_list:REG_DEAD (reg:SI 121)
        (expr_list:REG_DEAD (reg:SI 3 3 [ x ])
            (expr_list:REG_EQUAL (const_int 7 [0x7])
                (nil)))))

(insn 14 8 10 2 pr41171.c:5 (set (reg:SI 122)
        (const_int 4 [0x4])) 344 {*movsi_internal1} (expr_list:REG_EQUIV (const_int 4 [0x4])
        (nil)))

(insn 10 14 13 2 pr41171.c:5 (set (mem:SI (reg:SI 4 4 [ y ]) [2 S4 A32])
        (reg:SI 122)) 344 {*movsi_internal1} (expr_list:REG_DEAD (reg:SI 122)
        (expr_list:REG_DEAD (reg:SI 4 4 [ y ])
            (expr_list:REG_EQUAL (const_int 4 [0x4])
                (nil)))))

Comment 4 Peter Bergner 2009-08-26 15:22:21 UTC
It's update_equiv_regs() that is causing this.
Comment 5 Peter Bergner 2009-08-26 20:57:12 UTC
From my bug analysis and request for comment on the mailinglist:

  http://gcc.gnu.org/ml/gcc/2009-08/msg00485.html


This is caused by update_equiv_regs() which IRA inherited from local-alloc.c.
Although with gcc 4.3 and earlier, you don't see the problem, it is still there,
because if you look at the 4.3 dumps, you will see that update_equiv_regs()
unordered them for us.  What is saving us is that sched2 reschedules them
again for us in the order we want.  With 4.4, IRA happens to reuse the same
register for both pseudos, so sched2 is hand tied and cannot schedule them
back again for us.

Looking at update_equiv_regs(), if I disable the replacement for regs
that are local to one basic block (patch below) like it existed before
John Wehle's patch way back in Oct 2000:

  http://gcc.gnu.org/ml/gcc-patches/2000-09/msg00782.html

then we get the ordering we want.  Does anyone know why John removed
that part of the test in his patch?  Thoughts anyone?


Peter


Index: ira.c
===================================================================
--- ira.c	(revision 151111)
+++ ira.c	(working copy)
@@ -2510,6 +2510,7 @@ update_equiv_regs (void)
 		     calls.  */
 
 		  if (REG_N_REFS (regno) == 2
+		      && REG_BASIC_BLOCK (regno) < NUM_FIXED_BLOCKS
 		      && (rtx_equal_p (x, src)
 			  || ! equiv_init_varies_p (src))
 		      && NONJUMP_INSN_P (insn)

Comment 6 Peter Bergner 2009-09-02 21:47:59 UTC
My patch solved the problem, but was very very neutral wrt SPEC2000 scores.  Vlad's idea of moving update_equiv_regs() into its own pass before sched1 makes sense to me and seems to produce better performing code too, so his patch wins. :)  Thanks Vlad.

Comment 7 Steve Ellcey 2009-10-30 21:00:44 UTC
Has a patch to move update_equiv_regs into its own pass been submitted?
I don't see one and IA64 is still producing the 'wrong' scheduling.
Comment 8 Vladimir Makarov 2009-10-30 21:57:23 UTC
Unfortunately, not yet because I had some failures after applying the patch. I postponed work on this but now I have time to continue the work.
Comment 9 Steven Bosscher 2014-01-01 20:55:21 UTC
(In reply to Peter Bergner from comment #5)
> Looking at update_equiv_regs(), if I disable the replacement for regs
> that are local to one basic block (patch below) like it existed before
> John Wehle's patch way back in Oct 2000:
> 
>   http://gcc.gnu.org/ml/gcc-patches/2000-09/msg00782.html
> 
> then we get the ordering we want.  Does anyone know why John removed
> that part of the test in his patch?  Thoughts anyone?

To allow things to be moved around in, or out of loops.