dom1 prevents vectorization via partial loop peeling?

Tue Apr 28 14:36:00 GMT 2015

On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law <law@redhat.com> wrote:
> On 04/27/2015 10:12 AM, Alan Lawrence wrote:
>>
>>
>> After copyrename3, immediately prior to dom1, the loop body looks like:
>>
>>    <bb 2>:
>>
>>    <bb 3>:
>>    # i_11 = PHI <i_9(7), 0(2)>
>>    _5 = a[i_11];
>>    _6 = i_11 & _5;
>>    if (_6 != 0)
>>      goto <bb 4>;
>>    else
>>      goto <bb 5>;
>>
>>    <bb 4>:
>>
>>    <bb 5>:
>>    # m_2 = PHI <5(4), 4(3)>
>>    _7 = m_2 * _5;
>>    b[i_11] = _7;
>>    i_9 = i_11 + 1;
>>    if (i_9 != 32)
>>      goto <bb 7>;
>>    else
>>      goto <bb 6>;
>>
>>    <bb 7>:
>>    goto <bb 3>;
>>
>>    <bb 6>:
>>    return;
>>
>> dom1 then peeled part of the first loop iteration, producing:
>
> Yup.  The jump threading code realized that if we traverse the edge 2->3,
> then we'll always traverse 3->5.  The net effect is like peeling the first
> iteration because we copy bb3.  The original will be reached via 7->3 (it,
> loop iterating), the copy via 2->3' and 3' will have its conditional removed
> and will unconditionally transfer control to bb5.
>
>
>
> This is a known problem, but we really don't have any good heuristics for
> when to do this vs when not to do it.
>
>
>>
>> In contrast, a slightly-different testcase:
>>
>> #define N 32
>>
>> int a[N];
>> int b[N];
>>
>> int foo ()
>> {
>>    for (int i = 0; i < N ; i++)
>>      {
>>        int cond = (a[i] & i) ? -1 : 0; // extra variable here
>>        int m = (cond) ? 5 : 4;
>>        b[i] = a[i] * m;
>>      }
>> }
>>
>> after copyrename3, just before dom1, is only slightly different:
>>
>>    <bb 2>:
>>
>>    <bb 3>:
>>    # i_15 = PHI <i_10(7), 0(2)>
>>    _6 = a[i_15];
>>    _7 = i_15 & _6;
>>    if (_7 != 0)
>>      goto <bb 4>;
>>    else
>>      goto <bb 6>;
>>
>>    <bb 4>:
>>    # m_3 = PHI <4(6), 5(3)>
>>    _8 = m_3 * _6;
>>    b[i_15] = _8;
>>    i_10 = i_15 + 1;
>>    if (i_10 != 32)
>>      goto <bb 7>;
>>    else
>>      goto <bb 5>;
>>
>>    <bb 7>:
>>    goto <bb 3>;
>>
>>    <bb 5>:
>>    return;
>>
>>    <bb 6>:
>>    goto <bb 4>;
>>
>> with bb6 being out-of-line at the end of the function, rather than bb4
>> falling through from just above bb5. However, this prevents dom1 from
>> doing the partial peeling, and dom1 only changes the "goto bb7" into a
>> "goto bb3":
>
> I would have still expected it to thread 2->3, 3->6->4
>
>
>>
>> (1) dom1 should really, in the second case, perform the same partial
>> peeling that it does in the first testcase, if that is what it thinks is
>> desirable. (Of course, we might want to fix that only later, as ATM
>> that'd take us backwards).
>
> Please a file a BZ.  It could be something simple, or we might be hitting
> one of Zdenek's heuristics around keeping overall loop structure.
>
>>
>> Alternatively, maybe we don't want dom1 doing that sort of thing (?),
>> but I'm inclined to think that if it's doing such optimizations, it's
>> for a good reason ;) I guess there'll be other times where we *cannot*
>> do partially peeling of later iterations...
>
> It's an open question -- we never reached any kind of conclusion when it was
> last discussed with Zdenek.  I think the fundamental issue is we can't
> really predict when threading the loop is going to interfere with later
> optimizations or not.  The heuristics we have are marginal at best.
>
> The one thought we've never explored was re-rolling that first iteration
> back into the loop in the vectorizer.

Well.  In this case we hit

  /* If one of the loop header's edge is an exit edge then do not
     apply if-conversion.  */
  FOR_EACH_EDGE (e, ei, loop->header->succs)
    if (loop_exit_edge_p (loop, e))
      return false;

which is simply because even after if-conversion we'll at least end up
with a non-empty latch block which is what the vectorizer doesn't support.
DOM rotated the loop into this non-canonical form.  Running loop header
copying again would probably undo this.

Richard.

>
> Jeff