Add a new combine pass

Richard Sandiford <richard.sandiford@arm.com>
Tue Nov 19 11:36:00 GMT 2019


Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Sun, Nov 17, 2019 at 11:35:26PM +0000, Richard Sandiford wrote:
>> While working on SVE, I've noticed several cases in which we fail
>> to combine instructions because the combined form would need to be
>> placed earlier in the instruction stream than the last of the
>> instructions being combined.  This includes one very important
>> case in the handling of the first fault register (FFR).
>
> Do you have an example of that?

It's difficult to share realistic examples at this stage since this
isn't really the right forum for making them public for the first time.
But in rtl terms we have:

(set (reg/v:VNx16BI 102 [ ok ])
     (reg:VNx16BI 85 ffrt))
(set (reg:VNx16BI 85 ffrt)
     (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
(set (reg:CC_NZC 66 cc)
     (unspec:CC_NZC
       [(reg:VNx16BI 106) repeated x2
        (const_int 1 [0x1])
        (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))

and want to combine the first and third instruction at the site of the
first instruction.  Current combine gives:

Trying 18 -> 24:
   18: r102:VNx16BI=ffrt:VNx16BI
   24: cc:CC_NZC=unspec[r106:VNx16BI,r106:VNx16BI,0x1,r102:VNx16BI] 104
Can't combine i2 into i3

because of:

      /* Make sure that the value that is to be substituted for the register
	 does not use any registers whose values alter in between.  However,
	 If the insns are adjacent, a use can't cross a set even though we
	 think it might (this can happen for a sequence of insns each setting
	 the same destination; last_set of that register might point to
	 a NOTE).  If INSN has a REG_EQUIV note, the register is always
	 equivalent to the memory so the substitution is valid even if there
	 are intervening stores.  Also, don't move a volatile asm or
	 UNSPEC_VOLATILE across any other insns.  */
      || (! all_adjacent
	  && (((!MEM_P (src)
		|| ! find_reg_note (insn, REG_EQUIV, src))
	       && modified_between_p (src, insn, i3))
	      || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
	      || GET_CODE (src) == UNSPEC_VOLATILE))
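
What we want instead is to substitute ffrt for r102 in insn 24 and
place the combined instruction at insn 18's position, before the
UNSPEC_UPDATE_FFRT, i.e. (roughly):

(set (reg:CC_NZC 66 cc)
     (unspec:CC_NZC
       [(reg:VNx16BI 106) repeated x2
        (const_int 1 [0x1])
        (reg:VNx16BI 85 ffrt)] UNSPEC_PTEST))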

>> Combine currently requires the combined instruction to live at the same
>> location as i3.
>
> Or i2 and i3.
>
>> I thought about trying to relax that restriction, but it
>> would be difficult to do with the current pass structure while keeping
>> everything linear-ish time.
>
> s/difficult/impossible/, yes.
>
> A long time ago we had to only move insns forward for correctness even,
> but that should no longer be required, combine always is finite by other
> means now.
>
>> So this patch instead goes for an option that has been talked about
>> several times over the years: writing a new combine pass that just
>> does instruction combination, and not all the other optimisations
>> that have been bolted onto combine over time.  E.g. it deliberately
>> doesn't do things like nonzero-bits tracking, since that really ought
>> to be a separate, more global, optimisation.
>
> In my dreams tracking nonzero bits would be a dataflow problem.
>
>> This is still far from being a realistic replacement for even the
>> combine parts of the current combine pass.  E.g.:
>> 
>> - it only handles combinations that can be built up from individual
>>   two-instruction combinations.
>
> And combine does any of {2,3,4}->{1,2} combinations, and it also can
> modify a third insn ("other_insn").  For the bigger ->1 combos, if it
> *can* be decomposed in a bunch of 2->1, then those result in insns that
> are greater cost than those we started with (or else those combinations
> *would* be done).  For the ->2 combinations, there are many ways those
> two insns can be formed: it can be the two arms of a parallel, or
> combine can break a non-matching insn into two at what looks like a good
> spot for that, or it can use a define_split for it.
>
> All those things lead to many more successful combinations :-)

Right.  I definitely want to support multi-insn combos too.  It's one of
the TODOs in the head comment, along with the other points in this list.
Like I say, it's not yet a realistic replacement for even the combine parts
of the current pass.

>> On a more positive note, the pass handles things that the current
>> combine pass doesn't:
>> 
>> - the main motivating feature mentioned above: it works out where
>>   the combined instruction could validly live and moves it there
>>   if necessary.  If there are a range of valid places, it tries
>>   to pick the best one based on register pressure (although only
>>   with a simple heuristic for now).
>
> How are dependencies represented in your new pass?  If it just does
> walks over the insn stream for everything, you get quadratic complexity
> if you move insns backwards.  We have that in combine already, mostly
> from modified_between_p, but that is limited because of how LOG_LINKS
> work, and we have been doing this for so long and there are no problems
> found with it, so it must work in practice.  But I am worried about it
> when moving insns back an unlimited distance.

It builds def-use chains, but with a constant limit on the number of
explicitly-recorded uses per definition.  All other uses are summarised
by a numerical live range, which they are (conservatively) assumed
never to escape.  The def-use chains represent memory as a single
entity, a bit like in gimple.

I avoided the rtlanal.c dependency routines for exactly this reason. :-)
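
To give a rough flavour, the bookkeeping is shaped something like this
(hypothetical names and constants, not the actual patch code):

  /* Sketch only: each definition records a small, fixed number of
     uses explicitly, so walking them is O(1) per definition.  */
  const unsigned int MAX_RECORDED_USES = 4;

  struct use_info;

  struct def_info
  {
    /* The first few uses of this definition, in program order.  */
    use_info *recorded_uses[MAX_RECORDED_USES];
    unsigned int num_recorded_uses;

    /* A range of instruction positions covering all other uses;
       such uses are (conservatively) assumed never to escape it.  */
    int live_range_start;
    int live_range_end;
  };

Dependence queries then become comparisons against the recorded uses
or the range, rather than walks over the insn stream, which is what
keeps the bookkeeping updates constant-time.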

> If combine results in two insns it puts them at i2 and i3, and it can
> actually move a SET to i2 that was at i3 before the combination.
>
>> - once it has combined two instructions, it can try combining the
>>   result with both later and earlier code, i.e. it can combine
>>   in both directions.
>
> That is what combine does, too.

Yeah, that part was bogus, sorry.

>> - it tries using REG_EQUAL notes for the final instruction.
>
> And that.

I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
with i3's REG_EQUAL note and combining into that.  Does combine do that?
I couldn't see it, and in:

   https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html

you seemed to reject the idea of allowing it.

>> - it can parallelise two independent instructions that both read from
>>   the same register or both read from memory.
>
> That only if somehow there is a link between the two (so essentially
> never).  The only combinations tried by combine are those via LOG_LINKs,
> which are between a SET and the first corresponding use.  This is a key
> factor that makes it kind of linear (instead of exponential) complexity.

Tracking limited def-use chains is what makes this last bit easy.
We can just try parallelising two instructions from the (bounded) list
of uses.  And for this case there isn't any garbage rtl involved, since
we reuse the same PARALLEL rtx between attempts.  The cost is basically
all in the recog call (which would obviously mount up if we went
overboard).
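
Concretely, the reuse looks something like this (hypothetical function
and variable names; gen_rtx_PARALLEL and recog are the real routines):

  /* Build each candidate pattern in a single cached PARALLEL so
     that failed attempts allocate no new rtl.  */
  static rtx cached_parallel;

  static bool
  try_parallel_combination (rtx_insn *i1, rtx_insn *i2)
  {
    if (!cached_parallel)
      cached_parallel = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (2));
    XVECEXP (cached_parallel, 0, 0) = PATTERN (i1);
    XVECEXP (cached_parallel, 0, 1) = PATTERN (i2);
    /* recog is where essentially all of the cost goes.  */
    return recog (cached_parallel, i2, NULL) >= 0;
  }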

The new pass also tries combining definitions with uses later than the
first, but of course in that case we need to keep the original set in
parallel.

>> The pass is supposed to be linear time without debug insns.
>> It only tries a constant number C of combinations per instruction
>> and its bookkeeping updates are constant-time.
>
> But how many other insns does it look at, say by modified_between_p or
> the like?

Hope the above answers this.

>> The patch adds two instances of the new pass: one before combine and
>> one after it.
>
> One thing I want to do is some mini-combine after every split, probably
> only with the insns new from the split.  But we have no cfglayout mode
> anymore then, and only hard regs (except in the first split pass, which
> is just a little later than your new pass).

Yeah, sounds like it could be useful.  I guess there'd need to be
an extra condition on the combination that the new insn can't be
immediately split.

>> As far as compile-time goes, I tried compiling optabs.ii at -O2
>> with an --enable-checking=release compiler:
>> 
>> run-combine=2 (normal combine):  100.0% (baseline)
>> run-combine=4 (new pass only):    98.0%
>> run-combine=6 (both passes):     100.3%
>> 
>> where the results are easily outside the noise.  So the pass on
>> its own is quicker than combine, but that's not a fair comparison
>> when it doesn't do everything combine does.  Running both passes
>> only has a slight overhead.
>
> And amount of garbage produced?

If -ftime-report stats are accurate, then the total amount of
memory allocated is:

run-combine=2 (normal combine): 1793 kB
run-combine=4 (new pass only):    98 kB
run-combine=6 (both passes):    1871 kB (new pass accounts for 78 kB)

But again that's not a fair comparison when the main combine pass does more.

I did try hard to keep the amount of garbage rtl down though.  This is
why I added validate_simplify_replace_rtx rather than trying to make
do with existing routines.  It should only create new rtl if the
simplification routines did something useful.  (Of course, that's mostly
true of combine as well, but things like the make_compound_operation/
expand_compound_operation wrangler can create expressions that are never
actually useful.)
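
The shape of the idea is roughly this (a hand-waved sketch built on
the generic simplify_replace_rtx and validate_change routines; the
real validate_simplify_replace_rtx is more careful than this):

  /* Substitute NEW_RTX for OLD_RTX in INSN's pattern and simplify
     the result, bailing out cheaply when OLD_RTX does not occur.  */
  static bool
  replace_and_simplify (rtx_insn *insn, rtx old_rtx, rtx new_rtx)
  {
    rtx *loc = &PATTERN (insn);
    /* simplify_replace_rtx returns its first argument unchanged
       when there is nothing to replace.  */
    rtx new_pat = simplify_replace_rtx (*loc, old_rtx, new_rtx);
    if (new_pat == *loc)
      return false;
    /* validate_change re-recognises the insn and rolls the change
       back if the result isn't a valid instruction.  */
    return validate_change (insn, loc, new_pat, false);
  }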

>> To get a feel for the effect on multiple targets, I did my usual
>> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
>> and g++.dg, this time comparing run-combine=2 and run-combine=6
>> using -O2 -ftree-vectorize:
>
> One problem with this is that these are very short functions on average.

There are some long ones too :-)

> What is the kind of changes you see for other targets?

On powerpc64le-linux-gnu it mostly comes from eliminating comparisons
in favour of other flag-setting instructions and making more use of
post-increments.  Not sure the last one is actually a win, but the
target costs say it's OK :-).  E.g. from gcc.c-torture/execute/pr78675.c:

@@ -48,9 +48,8 @@
        blr
        .align 4
 .L19:
-       cmpdi 0,10,0
+       mr. 9,10
        mr 3,8
-       mr 9,10
        bne 0,.L9
        b .L3
        .align 4

and a slightly more interesting example in gcc.c-torture/execute/loop-6.c:

@@ -16,24 +16,22 @@
        mflr 0
        li 9,50
        mtctr 9
-       li 8,1
+       li 10,1
        li 7,1
        std 0,16(1)
        stdu 1,-32(1)
 .LCFI0:
 .L2:
-       addi 10,8,1
-       extsw 8,10
+       addi 9,10,1
+       extsw 10,9
        bdz .L11
-       slw 9,7,10
-       rlwinm 9,9,0,0xff
-       cmpwi 0,9,0
+       slw 8,7,9
+       andi. 9,8,0xff
        beq 0,.L3
-       addi 10,8,1
-       slw 9,7,10
-       extsw 8,10
-       rlwinm 9,9,0,0xff
-       cmpwi 0,9,0
+       addi 9,10,1
+       slw 8,7,9
+       extsw 10,9
+       andi. 9,8,0xff
        bne 0,.L2
 .L3:
        li 3,0

gcc.c-torture/execute/20081218-1.c is an example where we make more use
of post-increment:

 .L9:
-       lbz 10,1(9)
-       addi 9,9,1
+       lbzu 10,1(9)
        cmpwi 0,10,38
        bne 0,.L8
-       lbz 10,1(9)
-       addi 9,9,1
+       lbzu 10,1(9)
        cmpwi 0,10,38
        bne 0,.L8
        bdnz .L9

The changes for s390x-linux-gnu are also often flag-related.  E.g.
gcc.c-torture/execute/pr68624.c:

@@ -27,9 +27,8 @@
 .L9:
        larl    %r2,d
        larl    %r3,.LANCHOR0
-       l       %r2,0(%r2)
+       icm     %r2,15,0(%r2)
        st      %r2,0(%r3)
-       ltr     %r2,%r2
        jne     .L11
        lhi     %r2,-4
        st      %r2,0(%r1)

where we move the flag-setting up and gcc.c-torture/execute/20050826-2.c:

@@ -62,8 +62,7 @@
        lgr     %r3,%r9
        lghi    %r2,0
        brasl   %r14,inet_check_attr
-       ltr     %r2,%r2
-       lr      %r12,%r2
+       ltr     %r12,%r2
        jne     .L16
        lgr     %r1,%r9
        lhi     %r3,-7

where we eliminate a separate move, like in the first powerpc64le
example above.

> Wait, does this combine sets with a hard reg source as well?  It
> shouldn't do that, that is RA's job; doing this in a greedy way is a
> bad idea.  (I haven't yet verified if you do this, fwiw).

No:

  /* Mimic combine's behavior by not combining moves from allocatable hard
     registers (e.g. when copying parameters or function return values).  */
  if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
    return false;

Although if that could have accounted for the difference, it sounds like
we're leaving a lot on the table by doing this :-)

>> Inevitably there was some scan-assembler fallout for other tests.
>> E.g. in gcc.target/aarch64/vmov_n_1.c:
>> 
>> #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory")
>>   ...
>>   INHIB_OPTIMIZATION;							\
>>   (a) = TEST (test, data_len);						\
>>   INHIB_OPTIMIZATION;							\
>>   (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a));	\
>> 
>> is no longer effective for preventing move (a) from being merged
>> into (b), because the pass can merge at the point of (a).
>
> It never was effective for that.  Unless (b) lives in memory, in which
> case your new pass has a bug here.

The target of the vmov is a register.

>> I think
>> this is a valid thing to do -- the asm semantics are still satisfied,
>> and asm volatile ("" : : : "memory") never acted as a register barrier.
>> But perhaps we should deal with this as a special case?
>
> I don't think we should, no.  What does "register barrier" even mean,
> *exactly*?

Yeah, agree with you and Andrew that we shouldn't, was just checking
that there was agreement.

Thanks,
Richard


