[RFC] Aarch64: Replace nested FP min/max with conditionals for TX2

Fri Sep 11 06:42:52 GMT 2020

On Fri, Sep 11, 2020 at 8:27 AM Anton Youdkevitch
<anton.youdkevitch@bell-sw.com> wrote:
>
> Richard,
>
> On Thu, Sep 10, 2020 at 12:03 PM Richard Biener <richard.guenther@gmail.com> wrote:
>>
>> On Wed, Sep 9, 2020 at 5:51 PM Anton Youdkevitch
>> <anton.youdkevitch@bell-sw.com> wrote:
>> >
>> > The ThunderX2 chip has an odd property: nested scalar FP min and max are
>> > slower than the logically equivalent sequence of compares and branches.
>>
>> Always for any input data?
>
> If you mean data that exercises all the combinations of
> taken/not-taken branches, then yes: the results for the synthetic tests are
> always the same (+60%). I didn't check Inf/NaN inputs, though, as
> performance is not a concern in such cases.

I was specifically suggesting measuring the effect of branch mispredicts.
You'll have the case of the first branch being mispredicted, the second
branch being mispredicted, and both branches being mispredicted.
So how does the worst case behave in comparison to the back-to-back
FP min/max case?
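
To make those cases concrete, here is a rough sketch of how the benchmark
inputs could be arranged. Everything in it is a hypothetical illustration
(the names `pattern`, `lcg_next` and `fill_inputs`, the clamp range
[0.0, 1.0], and the assumption that the branchy code tests `x < lo` first
and `x > hi` second); none of it comes from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical input patterns for a clamp to [0.0, 1.0] written as
   "if (x < lo) ...; if (x > hi) ...;".  */
enum pattern {
    ALL_IN_RANGE,   /* neither branch ever taken: fully predictable */
    RANDOM_LOW,     /* first branch taken at random, second never taken */
    RANDOM_HIGH,    /* first never taken, second taken at random */
    RANDOM_BOTH     /* below / inside / above chosen at random */
};

/* Small linear congruential generator.  Returns the upper 16 bits of the
   state, since the low bits of an LCG are of poor quality.  */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Fill buf[0..n-1] with values that drive the two branches as described
   by the chosen pattern.  */
static void fill_inputs(double *buf, size_t n, enum pattern p, uint32_t seed)
{
    const double below = -1.0, inside = 0.5, above = 2.0;
    uint32_t state = seed;

    for (size_t i = 0; i < n; i++) {
        uint32_t r = lcg_next(&state);

        switch (p) {
        case ALL_IN_RANGE:
            buf[i] = inside;
            break;
        case RANDOM_LOW:
            buf[i] = (r & 1) ? below : inside;
            break;
        case RANDOM_HIGH:
            buf[i] = (r & 1) ? above : inside;
            break;
        case RANDOM_BOTH:
            buf[i] = (r % 3 == 0) ? below : (r % 3 == 1) ? inside : above;
            break;
        }
    }
}
```

Timing both code shapes over each pattern separately would show whether the
branchy sequence still wins when the predictor has nothing to learn from.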

Btw, did you try to use conditional moves / conditional compares (IIRC
arm has some weird ccmp that might or might not come in handy)?
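
For reference, at the C level the shapes in question might be sketched as
below. The function names are made up for illustration, and the patch
itself works on RTL rather than on source. My assumption about typical
code generation (not something established in this thread) is that,
depending on flags, the compiler either recognises the ternaries as
min/max, for example under -ffast-math, or emits a compare plus a
conditional select, while the second function stays as compares and
branches.

```c
/* Select-style helpers: each is a single compare plus a choice between
   the two operands, with no control flow in the source.  */
static double min_d(double a, double b) { return a < b ? a : b; }
static double max_d(double a, double b) { return a > b ? a : b; }

/* Nested form: the result of the max feeds straight into the min.  */
static double clamp_nested(double x, double lo, double hi)
{
    return min_d(max_d(x, lo), hi);
}

/* Branchy form: gives the same result as clamp_nested for non-NaN
   inputs with lo <= hi.  For a NaN x it returns NaN, whereas
   clamp_nested returns lo, so the two are not interchangeable
   without finite-math assumptions.  */
static double clamp_branches(double x, double lo, double hi)
{
    if (x < lo)
        return lo;
    if (x > hi)
        return hi;
    return x;
}
```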

>> > Here is the patch where I'm trying to implement that transformation.
>> > Please advise if the "combine" pass (actually after the pass itself) is the
>> > appropriate place to do this.
>> >
>> > I was considering the possibility to implement this in aarch64.md
>> > (which would be much cleaner) but didn't manage to figure out how
>> > to make fmin/fmax survive until later passes and replace them only
>> > then.
>>
>> +             || !SCALAR_FLOAT_MODE_P (GET_MODE (SET_SRC (PATTERN (insn)))))
>> +           continue;
>> ...
>> +         if (code1 != SMIN && code1 != UMIN &&
>> +             code1 != SMAX && code1 != UMAX)
>> +           continue;
>>
>> you shouldn't see U{MIN,MAX} for float data.
>
> OK, thanks. Will fix that.
>
>>
>>
>> May I suggest doing this in a peephole2 or in another late
>> machine-specific pass instead?
>
> Yes, sure, I'm basically asking for any suggestions. My idea is to move
> it as late as possible, since messing with control flow is generally a bad
> idea. The current implementation is just a proof of concept. Do you
> think it's worth postponing it until, let's say, shorten, or would
> peephole2 be enough?

I think doing it as late as possible, possibly after sched2, is best,
since presumably the slowness really depends on back-to-back
min(max(..)) (what about min(min(..))?), so if there are enough other
instructions in between, they behave reasonably again.

Did you check whether scheduling some insns in between the min/max
operations improves things?  If so, might it be reasonable to adjust the
machine description to artificially constrain min/max latency?

>>
>>
>> Are nested vector FP min/max fast?
>
> The vector min/max are as fast as the scalar ones. Ironically, it is the vector
> compare and branch that would be much slower: not only does the ASIMD compare not
> set the CC register, so additional processing is required, but there is also the
> number of branches needed to handle the individual elements of the vector in the
> mixed case. It seemed pretty much a dead end, so I didn't bother to touch it.

OK, I wasn't thinking of applying the same transform to vector code, but of
using vector instructions in place of the scalar ones instead of branchy code.
But if that doesn't make a difference ...

Richard.

> --
>   Thanks,
>   Anton
>
>
>
>>
>> Richard.
>>
>>
>> >
>> > --
>> >   Thanks,
>> >   Anton