[PATCH] improve ifcvt optimization (PR rtl-optimization/89430)

Thu Jun 20 23:13:00 GMT 2019

On 6/20/19 3:53 AM, JiangNing OS wrote:
> Hi Jeff,
> 
> Appreciate your effort to review my patch! I've updated my patch as attached. See my answers below.
> 
>> in current function, so the store speculation can be avoided.
>> So at a high level should we be doing this in gimple rather than RTL?
>> We're going to have a lot more information about types, better
>> infrastructure for looking at uses/defs, access to the alias oracle, we should
>> be able to accurately distinguish between potentially shared objects vs those
>> which are local to the thread, etc.  We lose the low level costing information
>> though.
>>
>> I'm still going to go through the patch and do some level of review, but I do
>> think we need to answer the higher level question though.
>>
> I have the following reasons,
> 
> 1) Following the clue Richard B gave me before about parameter --param allow-store-data-races,
> I did check the middle-end pass tree-if-conv, but I think this pass at the moment doesn't work
> for the issue I'm trying to solve. Tree-if-conv is to do if conversion for loop, and its final goal is to
> help loop vectorization, while my case doesn't have a loop at all. 
I think the fact that it's focused so much on loops is a historical
accident.  We certainly have a variety of places in the gimple
optimizers that do if-conversion, and they're not all in tree-if-conv :(
 For example, some are done in tree-ssa-phiopt.

In the gimple optimizers the testcase from 89430 is going to look
something like this:

> ;   basic block 2, loop depth 0, count 1073741824 (estimated locally), maybe hot
> ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       ENTRY [always]  count:1073741824 (estimated locally) (FALLTHRU,EXECUTABLE)
>   a.0_1 = a;
>   _2 = (long unsigned int) k_8(D);
>   _3 = _2 * 4;
>   _4 = a.0_1 + _3;
>   _5 = *_4;
>   if (_5 > b_9(D))
>     goto <bb 3>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
> ;;    succ:       3 [50.0% (guessed)]  count:536870912 (estimated locally) (TRUE_VALUE,EXECUTABLE)
> ;;                4 [50.0% (guessed)]  count:536870912 (estimated locally) (FALSE_VALUE,EXECUTABLE)
> 
> ;;   basic block 3, loop depth 0, count 536870913 (estimated locally), maybe hot
> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       2 [50.0% (guessed)]  count:536870912 (estimated locally) (TRUE_VALUE,EXECUTABLE)
>   *_4 = b_9(D);
> ;;    succ:       4 [always]  count:536870913 (estimated locally) (FALLTHRU,EXECUTABLE)
> 
> ;;   basic block 4, loop depth 0, count 1073741824 (estimated locally), maybe hot
> ;;    prev block 3, next block 1, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       3 [always]  count:536870913 (estimated locally) (FALLTHRU,EXECUTABLE)
> ;;                2 [50.0% (guessed)]  count:536870912 (estimated locally) (FALSE_VALUE,EXECUTABLE)
>   return;

That looks like a pretty easy form to analyze.  I'd suggest looking
through tree-ssa-phiopt.c closely.  There's several transformations in
there that share similarities with yours.

> 2) My current solution fits into current back-end if-conversion pass very well. I don't want to invent
> a new framework to solve this relatively small issue. Besides, this back-end patch doesn't only
> enhance store speculation detection, but also fix a bug in the original code. 
Understood, but I still wonder if we're better off addressing this in
gimple.

>> Just from a design standpoint, what are the consequences if this function
>> returns true for something that isn't actually in the stack or false for
>> something that is on the stack?
>>
> If noce_mem_is_on_stack returns true for something that isn't actually in the stack, 
> it could potentially introduce store speculation, then the if-conversion optimization
> will be incorrect. If this function returns false for something that is on stack, it doesn't
> matter, because the optimization will not be triggered. 
OK.  That's what I expected.
> 
> 
> 
> 
>>
>>> +
>>> +/* Always return true, if there is a dominating write.
>>> +
>>> +   When there is a dominating read from memory on stack,
>>> +   1) if x = a is a memory read, return true.
>>> +   2) if x = a is a memory write, return true if the memory is on stack.
>>> +      This is the guarantee the memory is *not* readonly. */
>>> +
>>> +static bool
>>> +noce_valid_for_dominating (basic_block bb, rtx_insn *a_insn,
>>> +                           const_rtx x, bool is_store) {
>>> +  rtx_insn *insn;
>>> +  rtx set;
>>> +
>>> +  gcc_assert (MEM_P (x));
>>> +
>>> +  FOR_BB_INSNS (bb, insn)
>>> +    {
>>> +      set = single_set (insn);
>>> +      if (!set)
>>> +        continue;
>>> +
>>> +      /* Dominating store */
>>> +      if (rtx_equal_p (x, SET_DEST (set)))
>>> +        return true;
>>> +
>>> +      /* Dominating load */
>>> +      if (rtx_equal_p (x, SET_SRC (set)))
>>> +        if (is_store && noce_mem_is_on_stack (a_insn, x))
>>> +          return true;
>>> +    }
>>> +
>>> +  return false;
>>> +}
>> So what would be the consequences here of returning false when in fact
>> there was a dominating read or write?  That could easily happen if the
>> dominating read or write was not a single_set.
> If return false when in fact there is a dominating read or write, the optimization will not
> be triggered, because noce_mem_maybe_invalid_p will be true, which is playing the same
> role as may_trap_or_fault_p.
> 
>>
>> I'm guessing that from a design standpoint you're trying to find cases where
>> you know an object was written to within the block and does not escape.  So
>> returning false when there was a dominating write is safe.
>> Returning true when there was not would be disastrous.  Right?
>>
> YES! You've completely understood my code! ðŸ˜Š
So in this context, only handling single_sets is safe.  Perfect.

> 
>>
>>
>>
>>> +
>>> +  /* Iterate all insns until finding all stack pointers, and store
>>> +     them into bba_sets_must_be_sfp. */  bba_sets_must_be_sfp =
>>> + BITMAP_ALLOC (&reg_obstack);  do
>>> +    {
>>> +      sfp_found = false;
>>> +      FOR_ALL_BB_FN (bb, cfun)
>>> +        FOR_BB_INSNS (bb, a_insn)
>>> +          {
>>> +            rtx sset_a = single_set (a_insn);
>>> +            if (!sset_a)
>>> +              continue;
>>> +            rtx src = SET_SRC (sset_a);
>>> +            rtx dest = SET_DEST (sset_a);
>> So again, we can get things that aren't a single_set here and they'll be
>> ignored.  But I believe that's safe as you're trying to identify insns which store
>> stack addresses into a REG.  Missing a more complicated case where a stack
>> address is stored into a REG is safe.  Right?
>>
> Yes. Missing some pointers to stack is safe here, because we will use it to check if a memory
> in if-conversion candidate doesn't have store speculation. If we miss it, the optimization will
> not be triggered, so there will not be risky.
Thanks for the confirmation.

>>
>>> +
>>> +            /* Handle case like "x2 = x1 + offset", in which x1 is already
>>> +               identified as a stack pointer. */
>>> +            if ((GET_CODE (src) == PLUS || GET_CODE (src) == MINUS)
>>> +                && CONST_INT_P (XEXP (src, 1)))
>>> +              {
>>> +                rtx x1 = XEXP (src, 0);
>>> +                if (!REG_P (x1))
>>> +                  continue;
>>> +
>>> +                FOR_EACH_INSN_USE (use, a_insn)
>>> +                  {
>>> +                    if (!rtx_equal_p (x1, DF_REF_REG (use)))
>>> +                      continue;
>>> +
>>> +                    if (bitmap_bit_p (bba_sets_must_be_sfp, DF_REF_REGNO
>> (use)))
>>> +                      {
>>> +                        bitmap_set_bit (bba_sets_must_be_sfp, dest_regno);
>>> +                        sfp_found = true;
>>> +                        break;
>>> +                      }
>>> +                  }
>> So how much is there to be gained from going from this kind of localized
>> computation to global?  It shouldn't be hard to turn this into a truly global
>> algorithm, but I'm not sure it's worth the effort.
>>
>> I _do_ think it's worth visiting blocks in dominator order.  It's better than
>> what you're doing, but nowhere near as expensive as a global propagation
>> engine.
>>
> Fixed in attached new patch using dom_walk class. Please review again.
> 
>> Is it worth handling simple copies in the code above?
> I thought it less likely to have a copy for pointer to stack before register allocation. Right?
We certainly try to eliminate as many copies as possible, but experience
would tell us that they often sneak through.  I guess some quick
instrumentation would tell us if it's worth the effort.

> 
>>
>>> +/* Return true if current function doesn't pass stack address out of
>>> +frame. */
>>> +
>>> +static bool
>>> +no_stack_address_taken (void)
>>> +{
>>> +  basic_block bb;
>>> +  rtx_insn *a_insn;
>>> +  df_ref use;
>>> +
>>> +  FOR_ALL_BB_FN (bb, cfun)
>>> +    FOR_BB_INSNS (bb, a_insn)
>>> +      {
>>> +        if (!INSN_P (a_insn))
>>> +          continue;
>>> +
>>> +        rtx sset_a = single_set (a_insn);
>>> +        if (!sset_a)
>>> +          continue;
>>> +        rtx src = SET_SRC (sset_a);
>>> +        rtx dest = SET_DEST (sset_a);
>>> +
>>> +        /* Skip insn that is already identified as frame pointers. */
>>> +        if (bitmap_bit_p (bba_sets_may_be_sfp, REGNO (dest)))
>>> +          continue;
>>> +
>>> +        FOR_EACH_INSN_USE (use, a_insn)
>>> +          {
>>> +            /* Check if current insn is using any stack pointer. Ignore
>>> +               if it doesn't use frame pointers at all. */
>>> +            if (!bitmap_bit_p (bba_sets_may_be_sfp, DF_REF_REGNO (use)))
>>> +              continue;
>>> +
>>> +            /* Load cannot introduce address taken. */
>>> +            if (MEM_P (src))
>>> +              continue;
>>> +
>>> +            /* Store is safe if only the src doesn't contain stack pointer. */
>>> +            if (MEM_P (dest)
>>> +                && !reg_mentioned_p (DF_REF_REG (use), src))
>>> +              continue;
>>> +
>>> +            /* All other cases potentially introduce address taken. */
>>> +            return false;
>>> +          }
>> So what if you have a PARALLEL where one entry causes an escape of a
>> pointer into the stack and another entry does some other operation?  If so,
>> don't you miss the fact that you've got an escaping pointer into the stack?  I
>> don't think you can restrict yourself to single_sets here.
> Yes. You are right. I have added code to handle PARALLEL case. I handle it in a rather conservative way though. Review again, please!
I'll take another look at the patch.   I'll also do a little
experimentation to see if we're seeing enough copies to make handling
them worth the effort.

While I'm doing that, if you could walk through tree-ssa-phiopt.c and
ponder if your transformation could fit into that framework it'd be
appreciated (note the comment at the start of the file only covers one
class of transformations, there are others later).

jeff