[RFC] enable flags-unchanging asms, add_overflow/expand/combine woes
Segher Boessenkool
segher@kernel.crashing.org
Wed Sep 2 09:47:22 GMT 2020
Hi!
On Tue, Sep 01, 2020 at 07:22:57PM -0300, Alexandre Oliva wrote:
> This WIP patchlet introduces a means for machines that implicitly clobber
> cc flags in asm statements, but that have machinery to output flags
> (namely x86, non-thumb arm and aarch64), to state that the asm statement
> does NOT clobber cc flags. That's accomplished by using "=@ccC" in the
> output constraints. It disables the implicit clobber, but it doesn't
> set up an actual asm output to the flags, so they are left alone.
>
> It's ugly, I know.
Yeah, it's bloody disgusting :-) But it is very local, and it works
with the generic code without any changes there; that is good. OTOH
this patch is for x86 only (and aarch64, but not the other targets
with default clobbers).
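For reference, my reading of the patch is that the usage would be
something like this (a sketch only; "cc_unused" is a dummy name of my
own, and the flag "output" is never read, it merely suppresses the
implicit clobber):

  int cc_unused;
  /* Per the WIP patch, "=@ccC" here means "this asm leaves the flags
     alone"; no actual flag output is wired up.  */
  asm ("" : "+m" (*p), "=@ccC" (cc_unused));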
> I've considered "!cc" or "nocc" in the clobber
> list as a machine-independent way to signal cc is not modified, or
> even special-casing empty asm patterns (at a slight risk of breaking
> code that expects implicit clobbers even for empty asm patterns, but
> empty asm patterns won't clobber flags, so how could it break
> anything?).
People write empty asm statements not because they would like no insns
emitted from them, but *because* they want the other effects an asm has
(for example, an empty asm usually has no outputs, so it is volatile,
and that makes sure it is executed in the real machine exactly as often
as in the abstract machine). So your expectation might be wrong:
someone might want an empty asm to clobber cc on x86 (like any asm is
documented as doing).
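For example, a sketch of a loop that relies on exactly that property:

  void spin (int n)
  {
    /* The asm has no outputs, so it is implicitly volatile, and the
       otherwise-empty loop is executed for real, once per abstract-
       machine iteration.  */
    for (int i = 0; i < n; i++)
      asm ("");
  }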
But how about a "none" clobber? That would be generic, and would just
remove all preceding clobbers (incl. the implicit clobbers). Maybe
disallow any explicit clobbers before it; not sure which is nicer.
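In use that might look like this (hypothetical syntax; no such clobber
exists today):

  void f (int x)
  {
    /* Hypothetical "none" clobber: cancel all implicit clobbers, so
       on x86 GCC would know the flags survive this asm.  */
    asm ("" : "+r" (x) : : "none");
  }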
> I take it this might be useful for do-nothing asm
> statements, often used to stop certain optimizations, e.g.:
>
> __typeof (*p) c = __builtin_add_overflow (*p, 1, p);
> asm ("" : "+m" (*p)); // Make sure we write to memory.
> *p += c; // This should compile into an add with carry.
Wow, nasty. That asm cannot be optimised away even if the rest is
(unless GCC can somehow figure out nothing ever uses *p). Is there no
better way to do this?
> Is there interest in, and a preferred form for (portably?), conveying
> a no-cc-clobbering asm?
Well, that whole cc clobbering is an x86 thing, but some other targets
clobber other registers by default. Yes, I think this might be useful;
and see my suggestion above ("none").
> Without the asm, we issue load;add;adc;store, which is not the ideal
> sequence with add and adc to the same memory address (or two different
> addresses, if the last statement uses say *q instead of *p).
Is doing two RMWs on memory faster? Huh.
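For reference, the two sequences being compared would be something
like this (x86, register choice illustrative):

  /* Increment *p and fold the carry back in, as in the example.  */
  void inc_with_carry (unsigned int *p)
  {
    __typeof (*p) c = __builtin_add_overflow (*p, 1, p);
    *p += c;
    /* Issued today:              The hoped-for RMW form:
         movl  (%rdi), %eax         addl  $1, (%rdi)
         addl  $1, %eax             adcl  $0, (%rdi)
         adcl  $0, %eax
         movl  %eax, (%rdi)                                */
  }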
> Alas, getting the first add to go straight to memory is more
> complicated. Even with the asm that forces the output to memory, the
> output flag makes it harder to get it optimized to an add-to-memory
> form. When the output flag is unused, we optimize it enough in gimple
> that TER does its job and we issue a single add, but that's not possible
> when the two outputs of ADD_OVERFLOW are used: the flag setting gets
> optimized away, but only after stopping combine from turning the
> load/add/store into an add-to-memory.
>
> If we retried the 3-insn substitution after substituting the flag store
> into the add for the adc,
combine should retry every combination if any of the input insns to it
have changed (put another way, if any insn is changed all combinations
with it are tried anew). If this doesn't work, please file a bug.
But. Dependencies through memory are never used by combine (it uses
dependencies through registers only); maybe that is what you are seeing?
This makes many "RMW" optimisations need 4-insn combinations, which are
not normally done.
> we might succeed, but only if we had a pattern
> that matched add<mode>3_cc_overflow_1's parallel with the flag-setter as
> the second element of the parallel, because that's where combine adds it
> to the new i3 pattern, after splitting it out of i2.
That sounds like the backend pattern has it wrong then? There is a
canonical order for this?
> I suppose adding such patterns manually isn't the way to go. I wonder
> if getting recog_for_combine to recognize and reorder PARALLELs
> appearing out of order would get too expensive, even if genrecog were to
> generate optimized code to try alternate orders in parallels.
Very big parallels are used, and trying all orderings would take just a
little too much time ;-)
We could do some limited permutations of course. There are some cases
where you *unavoidably* have this problem (say, when adding three
things together), so this could be useful sometimes. Maybe just try
permuting the first three arms of the parallel, for example?
> The issue doesn't seem that important in the grand scheme of things, but
> there is some embarrassment from the missed combines and from the
> apparent impossibility (AFAICT) of getting GCC to issue the most
> compact (and possibly
> fastest) insn sequence on x86* for a 'memory += value;' spelled as
> __builtin_add_overflow, when the result of the overflow checking is
> used.
GCC does not handle RMW to memory very well (partially because it
*cannot* really be handled well). There are some PRs about this I
think (at least wrt combine).
Segher