dom1 prevents vectorization via partial loop peeling?
Richard Biener
richard.guenther@gmail.com
Tue Apr 28 14:36:00 GMT 2015
On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law <law@redhat.com> wrote:
> On 04/27/2015 10:12 AM, Alan Lawrence wrote:
>>
>>
>> After copyrename3, immediately prior to dom1, the loop body looks like:
>>
>> <bb 2>:
>>
>> <bb 3>:
>> # i_11 = PHI <i_9(7), 0(2)>
>> _5 = a[i_11];
>> _6 = i_11 & _5;
>> if (_6 != 0)
>> goto <bb 4>;
>> else
>> goto <bb 5>;
>>
>> <bb 4>:
>>
>> <bb 5>:
>> # m_2 = PHI <5(4), 4(3)>
>> _7 = m_2 * _5;
>> b[i_11] = _7;
>> i_9 = i_11 + 1;
>> if (i_9 != 32)
>> goto <bb 7>;
>> else
>> goto <bb 6>;
>>
>> <bb 7>:
>> goto <bb 3>;
>>
>> <bb 6>:
>> return;
>>
>> dom1 then peeled part of the first loop iteration, producing:
>
> Yup. The jump threading code realized that if we traverse the edge 2->3,
> then we'll always traverse 3->5. The net effect is like peeling the first
> iteration because we copy bb3. The original will be reached via 7->3 (it,
> loop iterating), the copy via 2->3' and 3' will have its conditional removed
> and will unconditionally transfer control to bb5.
>
>
>
> This is a known problem, but we really don't have any good heuristics for
> when to do this vs when not to do it.
>
>
>>
>> In contrast, a slightly-different testcase:
>>
>> #define N 32
>>
>> int a[N];
>> int b[N];
>>
>> int foo ()
>> {
>> for (int i = 0; i < N ; i++)
>> {
>> int cond = (a[i] & i) ? -1 : 0; // extra variable here
>> int m = (cond) ? 5 : 4;
>> b[i] = a[i] * m;
>> }
>> }
>>
>> after copyrename3, just before dom1, is only slightly different:
>>
>> <bb 2>:
>>
>> <bb 3>:
>> # i_15 = PHI <i_10(7), 0(2)>
>> _6 = a[i_15];
>> _7 = i_15 & _6;
>> if (_7 != 0)
>> goto <bb 4>;
>> else
>> goto <bb 6>;
>>
>> <bb 4>:
>> # m_3 = PHI <4(6), 5(3)>
>> _8 = m_3 * _6;
>> b[i_15] = _8;
>> i_10 = i_15 + 1;
>> if (i_10 != 32)
>> goto <bb 7>;
>> else
>> goto <bb 5>;
>>
>> <bb 7>:
>> goto <bb 3>;
>>
>> <bb 5>:
>> return;
>>
>> <bb 6>:
>> goto <bb 4>;
>>
>> with bb6 being out-of-line at the end of the function, rather than bb4
>> falling through from just above bb5. However, this prevents dom1 from
>> doing the partial peeling, and dom1 only changes the "goto bb7" into a
>> "goto bb3":
>
> I would have still expected it to thread 2->3, 3->6->4
>
>
>>
>> (1) dom1 should really, in the second case, perform the same partial
>> peeling that it does in the first testcase, if that is what it thinks is
>> desirable. (Of course, we might want to fix that only later, as ATM
>> that'd take us backwards).
>
> Please a file a BZ. It could be something simple, or we might be hitting
> one of Zdenek's heuristics around keeping overall loop structure.
>
>>
>> Alternatively, maybe we don't want dom1 doing that sort of thing (?),
>> but I'm inclined to think that if it's doing such optimizations, it's
>> for a good reason ;) I guess there'll be other times where we *cannot*
>> do partially peeling of later iterations...
>
> It's an open question -- we never reached any kind of conclusion when it was
> last discussed with Zdenek. I think the fundamental issue is we can't
> really predict when threading the loop is going to interfere with later
> optimizations or not. The heuristics we have are marginal at best.
>
> The one thought we've never explored was re-rolling that first iteration
> back into the loop in the vectorizer.
Well. In this case we hit
/* If one of the loop header's edge is an exit edge then do not
apply if-conversion. */
FOR_EACH_EDGE (e, ei, loop->header->succs)
if (loop_exit_edge_p (loop, e))
return false;
which is simply because even after if-conversion we'll at least end up
with a non-empty latch block which is what the vectorizer doesn't support.
DOM rotated the loop into this non-canonical form. Running loop header
copying again would probably undo this.
Richard.
>
> Jeff
More information about the Gcc
mailing list