[PATCH] tree-ssa-threadbackward.c (profitable_jump_thread_path): Do not allow __builtin_constant_p.

Richard Biener <richard.guenther@gmail.com>
Thu Jul 2 08:29:48 GMT 2020


On Wed, Jul 1, 2020 at 10:02 PM Jeff Law <law@redhat.com> wrote:
>
> On Mon, 2020-06-29 at 08:58 +0200, Richard Biener wrote:
> > On Sat, Jun 27, 2020 at 4:52 PM Marc Glisse <marc.glisse@inria.fr> wrote:
> > > On Fri, 26 Jun 2020, Jeff Law via Gcc-patches wrote:
> > >
> > > > In theory yes, but there are cases where paths converge (like you've shown) where
> > > > you may have evaluated to a constant on the paths, but it's not a constant at the
> > > > convergence point.  You have to be very careful using b_c_p like this and it's
> > > > been a regular source of kernel bugs.
> > > >
> > > >
> > > > I'd recommend looking at the .ssa dump and walking forward from there if the .ssa
> > > > dump looks correct.
> > >
> > > Here is the last dump before thread1 (105t.mergephi2). I don't see
> > > anything incorrect in it.
> > >
> > > ledtrig_cpu (_Bool is_active)
> > > {
> > >    int old;
> > >    int iftmp.0_1;
> > >    int _5;
> > >
> > >    <bb 2> [local count: 1073741824]:
> > >    if (is_active_2(D) != 0)
> > >      goto <bb 4>; [50.00%]
> > >    else
> > >      goto <bb 3>; [50.00%]
> > >
> > >    <bb 3> [local count: 536870913]:
> > >
> > >    <bb 4> [local count: 1073741824]:
> > >    # iftmp.0_1 = PHI <1(2), -1(3)>
> > >    _5 = __builtin_constant_p (iftmp.0_1);
> > >    if (_5 != 0)
> > >      goto <bb 5>; [50.00%]
> > >    else
> > >      goto <bb 8>; [50.00%]
> > >
> > >    <bb 5> [local count: 536870913]:
> > >    if (iftmp.0_1 >= -128)
> > >      goto <bb 6>; [50.00%]
> > >    else
> > >      goto <bb 8>; [50.00%]
> > >
> > >    <bb 6> [local count: 268435456]:
> > >    if (iftmp.0_1 <= 127)
> > >      goto <bb 7>; [34.00%]
> > >    else
> > >      goto <bb 8>; [66.00%]
> > >
> > >    <bb 7> [local count: 91268056]:
> > >    __asm__ __volatile__("asi %0,%1
> > > " : "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "i" iftmp.0_1, "Q" MEM[(int *)&num_active_cpus] : "memory", "cc");
> > >    goto <bb 9>; [100.00%]
> > >
> > >    <bb 8> [local count: 982473769]:
> > >    __asm__ __volatile__("laa %0,%2,%1
> > > " : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "d" iftmp.0_1, "Q" MEM[(int *)&num_active_cpus] : "memory", "cc");
> > >
> > >    <bb 9> [local count: 1073741824]:
> > >    return;
> > >
> > > }
> > >
> > > There is a single _b_c_p, the immediate asm argument is exactly the
> > > argument of _b_c_p, and it is in the branch protected by _b_c_p.
> > >
> > > Now the thread1 dump, for comparison
> > >
> > > ledtrig_cpu (_Bool is_active)
> > > {
> > >    int old;
> > >    int iftmp.0_4;
> > >    int iftmp.0_6;
> > >    int _7;
> > >    int _12;
> > >    int iftmp.0_13;
> > >    int iftmp.0_14;
> > >
> > >    <bb 2> [local count: 1073741824]:
> > >    if (is_active_2(D) != 0)
> > >      goto <bb 3>; [50.00%]
> > >    else
> > >      goto <bb 4>; [50.00%]
> > >
> > >    <bb 3> [local count: 536870912]:
> > >    # iftmp.0_6 = PHI <1(2)>
> > >    _7 = __builtin_constant_p (iftmp.0_6);
> > >    if (_7 != 0)
> > >      goto <bb 6>; [50.00%]
> > >    else
> > >      goto <bb 8>; [50.00%]
> > >
> > >    <bb 4> [local count: 536870912]:
> > >    # iftmp.0_4 = PHI <-1(2)>
> > >    _12 = __builtin_constant_p (iftmp.0_4);
> > >    if (_12 != 0)
> > >      goto <bb 5>; [50.00%]
> > >    else
> > >      goto <bb 8>; [50.00%]
> > >
> > >    <bb 5> [local count: 268435456]:
> > >    if (iftmp.0_4 >= -128)
> > >      goto <bb 7>; [20.00%]
> > >    else
> > >      goto <bb 8>; [80.00%]
> > >
> > >    <bb 6> [local count: 214748364]:
> > >    if (iftmp.0_6 <= 127)
> > >      goto <bb 7>; [12.00%]
> > >    else
> > >      goto <bb 8>; [88.00%]
> > >
> > >    <bb 7> [local count: 91268056]:
> > >    # iftmp.0_13 = PHI <iftmp.0_6(6), iftmp.0_4(5)>
> > >    __asm__ __volatile__("asi %0,%1
> > > " : "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "i" iftmp.0_13, "Q" MEM[(int *)&num_active_cpus] : "memory", "cc");
> > >    goto <bb 9>; [100.00%]
> > >
> > >    <bb 8> [local count: 982473769]:
> > >    # iftmp.0_14 = PHI <iftmp.0_4(5), iftmp.0_4(4), iftmp.0_6(6), iftmp.0_6(3)>
> > >    __asm__ __volatile__("laa %0,%2,%1
> > > " : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "d" iftmp.0_14, "Q" MEM[(int *)&num_active_cpus] : "memory", "cc");
> > >
> > >    <bb 9> [local count: 1073741824]:
> > >    return;
> > >
> > > }
> > >
> > > Thread1 decides to separate the paths is_active and !is_active
> > > (surprisingly, for one it optimizes out the comparison <= 127 and for the
> > > other the comparison >= -128, while it could optimize both in both cases).
> > > And it decides to converge after the comparisons, but before the asm.
> > >
> > > What the pass did does seem to hurt. It looks like if we duplicate _b_c_p,
> > > we may need to duplicate far enough to include all the blocks dominated by
> > > _b_c_p==true (including the asm, here). Otherwise, any _b_c_p can be
> > > optimized to true, because for a boolean
> > >
> > > b is the same as b ? true : false
> > > __builtin_constant_p(b ? true : false) would be the same as b ?
> > > __builtin_constant_p(true) : __builtin_constant_p(false), i.e. true.
> > >
> > > It is too bad we don't have any optimization pass using ranges between IPA
> > > and thread1, that would have gotten rid of the comparisons, and hence the
> > > temptation to thread. Adding always_inline on atomic_add (or flatten on
> > > the caller) does help: EVRP removes the comparisons.
> > >
> > > Do you see a way forward without changing what thread1 does or declaring
> > > the testcase as unsupported?
> >
> > Most of the cases I've seen involve transforms that make _b_c_p constant
> > on one path and then introduce a new PHI merging two _b_c_p values
> > to be then tested in a condition that is not simplified.  I'm not sure how to
> > fend off jump threading (yeah, it's nearly always jump threading doing this...),
> > but certainly the easiest way would be to simply disallow
> > jump threading from duplicating _b_c_p calls.
> >
> > Or fold _b_c_p even earlier (though I definitely saw early backwards
> > jump threading mess up such a case).
> I'm not 100% sure it's duplication of the b_c_p call that's the problem.  Or
> perhaps better stated, I think anything which results in an incremental update
> with an SSA_NAME that's used in a b_c_p call is going to potentially be
> problematical.  I suspect those are strongly correlated though.

The issue is indeed not duplicating b_c_p in itself but that we stop threading
before reaching what is ultimately guarded in the original source.  Usually
the obvious cases look like the one in this thread:

  if (__builtin_constant_p (i) && i > -128 && i < 127)
    ...
  else
    ...

and if we isolate parts of the condition and can then resolve _b_c_p to 1,
we get inconsistencies.  So yes, resolving _b_c_p earlier - if only during
threading itself, as in the patch - will likely hide the problem, because it's
not always as easy as in the example above to see what the "whole" condition
guarding what was intended to be guarded actually was...

Thus the patch indeed looks appealing to me since we likely cannot
resolve _b_c_p during early optimizations in all cases.  In the end
we lose optimization for this example by doing that, since we should be
able to resolve it fully: both -1 and 1 are in the supported
range for atomic_add_const ...
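
For reference, the construct behind those dumps probably looks roughly like
the following sketch (reconstructed from the dumps above; the helper name
and the exact s390 asm and constraints are assumptions, not the actual
kernel source):

  static int num_active_cpus;

  /* "asi" takes an 8-bit signed immediate, so the constant arm is only
     valid if "val" is still a compile-time constant in [-128, 127] at
     the asm.  */
  static inline void
  atomic_add_const (int *ptr, int val)
  {
    if (__builtin_constant_p (val) && val >= -128 && val <= 127)
      /* Constant path: the "i" constraint forces an immediate operand.  */
      __asm__ __volatile__ ("asi %0,%1"
                            : "+Q" (*ptr) : "i" (val) : "memory", "cc");
    else
      {
        int old;
        /* Fallback path: "laa" accepts a register operand.  */
        __asm__ __volatile__ ("laa %0,%2,%1"
                              : "=d" (old), "+Q" (*ptr) : "d" (val)
                              : "memory", "cc");
      }
  }

  void
  ledtrig_cpu (_Bool is_active)
  {
    /* thread1 duplicates the __builtin_constant_p test for the 1 and -1
       PHI arguments but merges the paths again before the asm, so the
       "i" constraint can be reached with a value that is no longer a
       compile-time constant.  */
    atomic_add_const (&num_active_cpus, is_active ? 1 : -1);
  }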

One thing we might encourage people to do is re-formulate those
conditions as

  if (i > -128 && i < 127 && __builtin_constant_p (i))

which helps (but I've seen cases where doing that would be awkward).
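
In terms of the atomic_add_const sketch above, that reordering would amount
to something like this (again only illustrative, not the actual kernel code):

  /* Range test first, __builtin_constant_p last (the reordering suggested
     above); otherwise identical to the earlier sketch.  */
  static inline void
  atomic_add_const_reordered (int *ptr, int val)
  {
    if (val >= -128 && val <= 127 && __builtin_constant_p (val))
      __asm__ __volatile__ ("asi %0,%1"
                            : "+Q" (*ptr) : "i" (val) : "memory", "cc");
    else
      {
        int old;
        __asm__ __volatile__ ("laa %0,%2,%1"
                              : "=d" (old), "+Q" (*ptr) : "d" (val)
                              : "memory", "cc");
      }
  }
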
Maybe

  if (__builtin_constant_p (__builtin_constant_p (i) && i > -128 && i < 127))

would be even better.

Richard.

> Jeff
>
>
> >
> > Richard.
> >
> > > --
> > > Marc Glisse
>

