Add a new combine pass

Thu Nov 21 21:12:00 GMT 2019

Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Wed, Nov 20, 2019 at 06:20:34PM +0000, Richard Sandiford wrote:
>> > Why don't you use DF for the DU chains?
>> 
>> The problem with DF_DU_CHAIN is that it's quadratic in the worst case.
>
> Oh, wow.
>
>> fwprop.c gets around that by using the MD problem and having its own
>> dominator walker to calculate limited def-use chains:
>> 
>>   /* We use the multiple definitions problem to compute our restricted
>>      use-def chains.  */
>
> It's not great if every pass invents its own version of some common
> infrastructure thing because that common one is not suitable.
>
> I.e., can this be fixed somehow?  Maybe just by having a restricted DU
> chains df problem?

Well, it'd probably make sense to make fwprop.c's approach available
as a "proper" df interface at some point.  Hopefully if anyone wants the
same thing as fwprop.c, they'd do that rather than copy the code. :-)

>> So taking that approach here would still require some amount of
>> roll-your-own.  Other reasons are:
>> 
>> * Even what fwprop does is more elaborate than we need for now.
>> 
>> * We need to handle memory too, and it's nice to be able to handle
>>   it in the same way as registers.
>> 
>> * Updating a full, ordered def-use chain after a move is a linear-time
>>   operation, so whatever happens, we'd need to apply some kind of limit
>>   on the number of uses we maintain, with something like that integer
>>   point range for the rest.
>> 
>> * Once we've analysed the insn and built its def-use chains, we don't
>>   look at the df_refs again until we update the chains after a successful
>>   combination.  So it should be more efficient to maintain a small array
>>   of insn_info_rec pointers alongside the numerical range, rather than
>>   walk and pollute chains of df_refs and then link back the insn uids
>>   to the pass-local info.
>
> So you need something like combine's LOG_LINKS?  Not that handling those
> is not quadratic in the worst case, but in practice it works well.  And
> it *could* be made linear.

Not sure why what I've used isn't what I need though :-)  If it's an
array vs. linked-list thing, then for the multi-use case, we need
two sets of link pointers, one for "next use of the same resource"
and one for "next use in this instruction".  Then we need the payload of
the list node itself.  For the small number of entries we're talking about,
using null-terminated arrays of "things that this instruction uses"
and "instructions that use this resource" should be more efficient than
pointer-chasing, and occupies the same space as the link pointers
(i.e. saves the extra payload).

We also need to be able to walk in both directions, to answer the
questions:

- which insns can I combine with this definition?
- where is this value of a resource defined?
- where are the uses of this resource?
- where was the previous definition of this resource, and where
  was its last use?

So if we're comparing it to existing linked-list GCC structures,
it's more similar to df_ref (see above for why that seemed like
a bad idea) or -- more light-weight -- dep_link_t in the scheduler.

And both the array and linked-list approaches still need to fall back to
the simple live range once a certain threshold is hit.

>> The second set is for:
>> 
>> (B) --param run-combine=6 (both passes), use-use combinations only
>> (C) --param run-combine=6 (both passes), no restrictions
>> 
>> Target                 Tests   Delta    Best   Worst  Median
>> ======                 =====   =====    ====   =====  ======
>> aarch64-linux-gnu        272   -3844    -585      18      -1
>> aarch64_be-linux-gnu     190   -3336    -370      18      -1
>> alpha-linux-gnu          401   -2735    -370      22      -2
>> amdgcn-amdhsa            188    1867    -484    1259      -1
>> arc-elf                  257   -1498    -650      54      -1
>> arm-linux-gnueabi        168   -1117    -612     680      -1
>> arm-linux-gnueabihf      168   -1117    -612     680      -1
>> avr-elf                 1341 -111401  -13824     680     -10
>
> Things like this are kind of suspicious :-)

Yeah.  This mostly seems to come from mopping up the extra moves created
by make_more_copies.  So we have combinations like:

   58: r70:SF=r94:SF
      REG_DEAD r94:SF
   60: r22:SF=r70:SF
      REG_DEAD r70:SF

(r22 is a hard reg, the others are pseudos) which produces:

        std Y+1,r22
        std Y+2,r23
        std Y+3,r24
        std Y+4,r25
-       ldd r22,Y+1
-       ldd r23,Y+2
-       ldd r24,Y+3
-       ldd r25,Y+4

On the REG_EQUAL thing: you're right that it doesn't make much difference
for run-combine=6:

Target             Tests   Delta    Best   Worst  Median
======             =====   =====    ====   =====  ======
arc-elf                1      -1      -1      -1      -1
avr-elf                1      -1      -1      -1      -1
bfin-elf               1      -1      -1      -1      -1
bpf-elf                2      -6      -5      -1      -5
c6x-elf                1      -2      -2      -2      -2
cr16-elf               1       7       7       7       7
epiphany-elf           5     -15      -4      -1      -4
fr30-elf               2     -16     -11      -5     -11
frv-linux-gnu          2     -20     -16      -4     -16
h8300-elf              2      -2      -1      -1      -1
i686-apple-darwin      1      -3      -3      -3      -3
ia64-linux-gnu         3     -39     -26      -6      -7
m32r-elf               3     -17     -10      -2      -5
mcore-elf              4      -7      -3      -1      -2
mn10300-elf            1      -2      -2      -2      -2
moxie-rtems            4     -15      -5      -2      -4
nds32le-elf            1      -1      -1      -1      -1
nios2-linux-gnu        1      -1      -1      -1      -1
or1k-elf               3     -18     -12      -2      -4
s390-linux-gnu         6     -28      -9      -1      -7
s390x-linux-gnu        1      -1      -1      -1      -1
sh-linux-gnu           1      -1      -1      -1      -1
sparc-linux-gnu        4     -24     -14      -2      -5
xstormy16-elf          9     -27     -10      -1      -2

So there's only one case in which it isn't a win, but the number of
tests is tiny.  So I agree there's no justification for trying this in
combine proper as things stand (and I wasn't arguing otherwise FWIW).
I'd still like to keep it in the new pass because it does help
*sometimes* and there's no sign yet that it has a noticeable
compile-time cost.

It might also be interesting to see how much difference it makes for
run-combine=4 (e.g. to see how much it makes up for the current 2-insn
limit)...

Thanks,
Richard