This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: dom1 prevents vectorization via partial loop peeling?
- From: Alan Lawrence <alan dot lawrence at arm dot com>
- To: Ajit Kumar Agarwal <ajit dot kumar dot agarwal at xilinx dot com>
- Cc: Richard Biener <richard dot guenther at gmail dot com>, Jeff Law <law at redhat dot com>, "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Date: Tue, 28 Apr 2015 15:36:09 +0100
- Subject: Re: dom1 prevents vectorization via partial loop peeling?
- Authentication-results: sourceware.org; auth=none
- References: <553E5FEF dot 9070905 at arm dot com> <553E6C8A dot 4070202 at redhat dot com> <CAFiYyc2VTuRtqDuQVvwfrYdkefVtZJdmwNCK038xugfZKvndNQ at mail dot gmail dot com> <5443dde8-a1fa-4645-8ea4-8d3fe3e6d128 at BL2FFO11FD020 dot protection dot gbl>
Ajit Kumar Agarwal wrote:
From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of Richard Biener
Sent: Tuesday, April 28, 2015 4:12 PM
To: Jeff Law
Cc: Alan Lawrence; firstname.lastname@example.org
Subject: Re: dom1 prevents vectorization via partial loop peeling?
On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law <email@example.com> wrote:
On 04/27/2015 10:12 AM, Alan Lawrence wrote:
Yup. The jump threading code realized that if we traverse the edge
2->3, then we'll always traverse 3->5. The net effect is like peeling
the first iteration because we copy bb3. The original will be reached
via 7->3 (i.e., the loop iterating), the copy via 2->3', and 3' will have its
conditional removed and will unconditionally transfer control to bb5.
After copyrename3, immediately prior to dom1, the loop body looks like:
<bb 3>:
  # i_11 = PHI <i_9(7), 0(2)>
  _5 = a[i_11];
  _6 = i_11 & _5;
  if (_6 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:

<bb 5>:
  # m_2 = PHI <5(4), 4(3)>
  _7 = m_2 * _5;
  b[i_11] = _7;
  i_9 = i_11 + 1;
  if (i_9 != 32)
    goto <bb 7>;
  else
    goto <bb 6>;

<bb 7>:
  goto <bb 3>;
dom1 then peeled part of the first loop iteration, producing:
This is a known problem, but we really don't have any good heuristics
for when to do this vs when not to do it.
Ah, yes, I'd not realized this was connected to the jump-threading issue, but I
see that now. As you say, the best heuristics are unclear, and I'm not keen on
trying *too hard* to predict what later phases will/won't do or do/don't
want... Maybe if there are simple heuristics that work, but I would aim more at
making later phases work with whatever they might get.
One (horrible) possibility that I will just throw out (and then duck), is to do
something akin to tree-if-conversion's "gimple_build_call_internal
(IFN_LOOP_VECTORIZED, " ...
In contrast, a slightly-different testcase:
I would have still expected it to thread 2->3, 3->6->4
Ok, I'll look into that.
(1) dom1 should really, in the second case, perform the same partial
peeling that it does in the first testcase, if that is what it thinks
is desirable. (Of course, we might want to fix that only later, as
ATM that'd take us backwards).
Please file a BZ. It could be something simple, or we might be
hitting one of Zdenek's heuristics around keeping overall loop structure.
Alternatively, maybe we don't want dom1 doing that sort of thing (?),
but I'm inclined to think that if it's doing such optimizations, it's
for a good reason ;) I guess there'll be other times where we
*cannot* do partial peeling of later iterations...
It's an open question -- we never reached any kind of conclusion when
it was last discussed with Zdenek. I think the fundamental issue is
we can't really predict when threading the loop is going to interfere
with later optimizations or not. The heuristics we have are marginal at best.
The one thought we've never explored was re-rolling that first
iteration back into the loop in the vectorizer.
Yeah, there is that ;).
So besides trying to partially-peel the next N iterations, the other approach -
that strikes me as sanest - is to finish (fully-)peeling off the first
iteration, and then to vectorize from then on. In fact the ideal (I confess I
have no knowledge of the GCC representation/infrastructure here) would probably
be for the vectorizer (in vect_analyze_scalar_cycles) to look for a point in the
loop, or rather a 'cut' across the loop, that avoids breaking any non-cyclic
use-def chains, and to use that as the loop header. That analysis could be quite
complex though ;)... and I can see that, having peeled the first 1/2 iteration, we
may then end up having to peel the next (vectorization factor - 1/2) iterations
too to restore alignment!
Whereas with rerolling ;)... is there perhaps some reasonable way to keep markers
around to make the rerolling approach more feasible?
Well. In this case we hit
    /* If one of the loop header's edge is an exit edge then do not
       apply if-conversion.  */
    FOR_EACH_EDGE (e, ei, loop->header->succs)
      if (loop_exit_edge_p (loop, e))
        return false;
which is simply because, even after if-conversion, we'll at least end up with a non-empty latch block, which is what the vectorizer doesn't support.
DOM rotated the loop into this non-canonical form. Running loop header copying again would probably undo this.
So I've just posted https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01745.html
which fixes this limitation of if-conversion. As I first wrote though, the
vectorizer still fails, because the PHI nodes incoming to the loop header are
neither reductions nor inductions.
I'll see if I can run loop header copying again, as you suggest...
With the path-splitting approach I have proposed, the back-edge node can be copied into its predecessors, creating an empty latch. This will enable the above scenario of vectorization.
Yes I can see that's possible, however, I think you'll need something like my
patch (https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01745.html) in addition, as
tree_if_conversion currently has another restriction, that the exit block
(source of the loop's exit edge) dominates the latch, and moreover, that there
is no code after the exit block...