# Duplicating loops and virtual phis

Bin.Cheng amker.cheng@gmail.com
Mon May 22 08:31:00 GMT 2017

```On Mon, May 22, 2017 at 1:42 AM, Kugan Vivekanandarajah
<kugan.vivekanandarajah@linaro.org> wrote:
> Hi Bin and Steve,
>
> On 17 May 2017 at 19:41, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Mon, May 15, 2017 at 7:32 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey <sellcey@cavium.com> wrote:
>>>>On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
>>>>> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey <sellcey@cavium.c
>>>>> om> wrote:
>>>>> >
>>>>> > (Short version of this email, is there a way to recalculate/rebuild
>>>>> > virtual
>>>>> > phi nodes after modifying the CFG.)
>>>>> >
>>>>> > I have a question about duplicating loops and virtual phi nodes.
>>>>> > I am trying to implement the following optimization as a pass:
>>>>> >
>>>>> > Transform:
>>>>> >
>>>>> >   for (i = 0; i < n; i++) {
>>>>> >    A[i] = A[i] + B[i];
>>>>> >    C[i] = C[i-1] + D[i];
>>>>> >   }
>>>>> >
>>>>> > Into:
>>>>> >
>>>>> >   if (noalias between A&B, A&C, A&D)
>>>>> >    for (i = 0; i < 100; i++)
>>>>> >            A[i] = A[i] + B[i];
>>>>> >    for (i = 0; i < 100; i++)
>>>>> >            C[i] = C[i-1] + D[i];
>>>>> >   else
>>>>> >    for (i = 0; i < 100; i++) {
>>>>> >            A[i] = A[i] + B[i];
>>>>> >            C[i] = C[i-1] + D[i];
>>>>> >    }
>>>>> >
>>>>> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot
>>>>be
>>>>> > vectorized so it gives up and does not vectorize the loop.  If we
>>>>split
>>>>> > up the loop into two loops then the vector add with A[i] could be
>>>>> > vectorized
>>>>> > even if the one with C[i] could not.
>>>>> Loop distribution does this transform but it doesn't know about
>>>>> versioning for unknown dependences.
>>>>>
>>>>
>>>>Yes, I looked at loop distribution.  But it only works with global
>>>>arrays and not with pointer arguments where it doesn't know the size of
>>>>the array being pointed at.  I would like to be able to have it work
>>>>with pointer arguments.  If I call a function with 2 or
>>>>more integer pointers, and I have a loop that accesses them with
>>>>offsets between 0 and N where N is loop invariant then I should have
>>>>enough information (at runtime) to determine if there are overlapping
>>>>memory accesses through the pointers and determine whether or not I can
>>>>distribute the loop.
>>>
>>> Not sure where you got that from. Loop distribution works with our data reference / dependence analysis.  The cost model might be more restricted but that can be fixed.
>>>
>>>>The loop splitting code seemed like a better template since it already
>>>>knows how to split a loop based on a runtime determined condition. That
>>>>part seems to be working for me, it is when I try to
>>>>distribute/duplicate one of those loops (under the unaliased condition)
>>>>that I am running into the problem with virtual PHIs.
>>>
>>> There's mark_virtual*for_renaming (sp?).
>>>
>>> But as said you are performing loop distribution so please enhance the existing pass rather than writing a new one.
>> I happen to be working on loop distribution now (If guess correctly,
>> to get hmmer fixed).  So far my idea is to fuse the finest distributed
>> loop in two passes, in the first pass, we merge all SCCs due to "true"
>> data dependence; in the second one we identify all SCCs and breaks
>> them on dependent edges due to possible alias.  Breaking SCCs with
>> minimal edge set can be modeled as Feedback arc set problem which is
>> NP-hard. Fortunately the problem is small in our case and there are
>> approximation algorithms.  OTOH, we should also improve loop
>> distribution/fusion to maximize parallelism / minimize
>> synchronization, as well as maximize data locality, but I think this
>> is not needed to get hmmer vectorized.
>
> I am also looking into vectoring homer loop. Glad to know you are also
> looking at this problem and looking forward to seeing the patches.
>
> I have some experimental patches where I added the data reference that
> needs runtime checking to a list
>
> static int
> pg_add_dependence_edges (struct graph *rdg, vec<loop_p> loops, int dir,
>               vec<data_reference_p> drs1,
> -             vec<data_reference_p> drs2)
> +             vec<data_reference_p> drs2,
> +             vec<ddr_p> &ddrs,
> +             bool runtime_alias_check)
>
> Then I am vesioning the main loop based on the condition generated
> from the runtime check. I have borrowed the logic from vectorizer
> (like pruning the list and generating the condition). I have neither
Yes, I am doing experiments factor runtime alias check out from
vectorizer as general interfaces.  Also I need to fix negative step
bug in current runtime alias check.

> verified nor benchmarked it enough yet.
>
> As I understand, we also should  have some form of cost model where we
> should be able too see the data access patterns and decide if the
> distributed loops can be vectorized?
As the start, the idea is to version loop as gcc does in ifcvt, we can
revert the versioning later if distributed loop can't be vectorized
(or optimized in other significant way).  In case of hmmer, it's also
important to eliminate number of dead stores under condition of
no-alias.

>
> Cost model in  similar_memory_accesses also need to be relaxd based on
> the ability to vectorize distributed loops.
Yes, better cost model for fusion is needed in general.

Thanks,
bin
>
> Thanks,
> Kugan
>
>
>> Thanks,
>> bin
>>>
>>> Richard.
>>>
>>>>Steve Ellcey
>>>>sellcey@cavium.com
>>>

```