# Duplicating loops and virtual phis

Kugan Vivekanandarajah kugan.vivekanandarajah@linaro.org
Mon May 22 00:42:00 GMT 2017

```
Hi Bin and Steve,

On 17 May 2017 at 19:41, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Mon, May 15, 2017 at 7:32 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey <sellcey@cavium.com> wrote:
>>>On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
>>>> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey <sellcey@cavium.com> wrote:
>>>> >
>>>> > (Short version of this email, is there a way to recalculate/rebuild
>>>> > virtual
>>>> > phi nodes after modifying the CFG.)
>>>> >
>>>> > I have a question about duplicating loops and virtual phi nodes.
>>>> > I am trying to implement the following optimization as a pass:
>>>> >
>>>> > Transform:
>>>> >
>>>> >   for (i = 0; i < n; i++) {
>>>> >    A[i] = A[i] + B[i];
>>>> >    C[i] = C[i-1] + D[i];
>>>> >   }
>>>> >
>>>> > Into:
>>>> >
>>>> >   if (noalias between A&B, A&C, A&D)
>>>> >    for (i = 0; i < n; i++)
>>>> >            A[i] = A[i] + B[i];
>>>> >    for (i = 0; i < n; i++)
>>>> >            C[i] = C[i-1] + D[i];
>>>> >   else
>>>> >    for (i = 0; i < n; i++) {
>>>> >            A[i] = A[i] + B[i];
>>>> >            C[i] = C[i-1] + D[i];
>>>> >    }
>>>> >
>>>> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot be
>>>> > vectorized so it gives up and does not vectorize the loop.  If we split
>>>> > up the loop into two loops then the vector add with A[i] could be
>>>> > vectorized even if the one with C[i] could not.
>>>> Loop distribution does this transform but it doesn't know about
>>>> versioning for unknown dependences.
>>>>
>>>
>>>Yes, I looked at loop distribution.  But it only works with global
>>>arrays and not with pointer arguments where it doesn't know the size of
>>>the array being pointed at.  I would like to be able to have it work
>>>with pointer arguments.  If I call a function with 2 or
>>>more integer pointers, and I have a loop that accesses them with
>>>offsets between 0 and N where N is loop invariant then I should have
>>>enough information (at runtime) to determine if there are overlapping
>>>memory accesses through the pointers and determine whether or not I can
>>>distribute the loop.
>>
>> Not sure where you got that from. Loop distribution works with our data reference / dependence analysis.  The cost model might be more restricted but that can be fixed.
>>
>>>The loop splitting code seemed like a better template since it already
>>>knows how to split a loop based on a runtime determined condition. That
>>>part seems to be working for me, it is when I try to
>>>distribute/duplicate one of those loops (under the unaliased condition)
>>>that I am running into the problem with virtual PHIs.
>>
>> There's mark_virtual*for_renaming (sp?).
>>
>> But as said you are performing loop distribution so please enhance the existing pass rather than writing a new one.
> I happen to be working on loop distribution now (if I guess correctly,
> to get hmmer fixed).  So far my idea is to fuse the finest distributed
> loops in two passes: in the first pass, we merge all SCCs due to "true"
> data dependence; in the second, we identify all SCCs and break
> them on dependence edges due to possible aliasing.  Breaking SCCs with
> a minimal edge set can be modeled as the feedback arc set problem, which
> is NP-hard.  Fortunately the problem is small in our case and there are
> approximation algorithms.  OTOH, we should also improve loop
> distribution/fusion to maximize parallelism / minimize
> synchronization, as well as maximize data locality, but I think this
> is not needed to get hmmer vectorized.

I am also looking into vectorizing the hmmer loop. Glad to know you are
also looking at this problem; I look forward to seeing the patches.

I have some experimental patches where I add the data references that
need runtime checking to a list:

static int
pg_add_dependence_edges (struct graph *rdg, vec<loop_p> loops, int dir,
vec<data_reference_p> drs1,
-             vec<data_reference_p> drs2)
+             vec<data_reference_p> drs2,
+             vec<ddr_p> &ddrs,
+             bool runtime_alias_check)

Then I version the main loop based on the condition generated from the
runtime check. I have borrowed the logic from the vectorizer (pruning
the list and generating the condition). I have not yet verified or
benchmarked it thoroughly.

As I understand it, we should also have some form of cost model that can
look at the data access patterns and decide whether the distributed
loops can be vectorized.

The cost model in similar_memory_accesses also needs to be relaxed based
on the ability to vectorize the distributed loops.

Thanks,
Kugan

> Thanks,
> bin
>>
>> Richard.
>>
>>>Steve Ellcey
>>>sellcey@cavium.com
>>

```