[PATCH][1/2] Fix PR68553
Ramana Radhakrishnan
ramana.radhakrishnan@foss.arm.com
Fri Dec 4 17:46:00 GMT 2015
On 04/12/15 16:04, Richard Biener wrote:
> On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> On 27/11/15 08:30, Richard Biener wrote:
>>>
>>> This is part 1 of a fix for PR68553, which shows that some targets
>>> fail can_vec_perm_p on an identity permutation. I chose to fix
>>> this in the vectorizer by detecting the identity itself but with
>>> the current structure of vect_transform_slp_perm_load this is
>>> somewhat awkward. Thus the following no-op patch simplifies it
>>> greatly (from the times it was restricted to do interleaving-kind
>>> of permutes). It turned out not to be a 100% no-op, as we can now
>>> handle non-adjacent source operands, so I split it out from the
>>> actual fix.
>>>
>>> The two adjusted testcases no longer fail to vectorize because
>>> of "need three vectors" but unadjusted would fail because there
>>> are simply not enough scalar iterations in the loop. I adjusted
>>> that and now we vectorize it just fine (running into PR68559
>>> which I filed).
>>>
>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
>>>
>>> Richard.
>>>
>>> 2015-11-27 Richard Biener <rguenther@suse.de>
>>>
>>> PR tree-optimization/68553
>>> * tree-vect-slp.c (vect_get_mask_element): Remove.
>>> (vect_transform_slp_perm_load): Implement in a simpler way.
>>>
>>> * gcc.dg/vect/pr45752.c: Adjust.
>>> * gcc.dg/vect/slp-perm-4.c: Likewise.
>>
>> On aarch64 and ARM targets, this causes
>>
>> PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect "vectorizing stmts using SLP" 0
>>
>> That is, we now vectorize using SLP, when previously we did not.
>>
>> On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES
>> without unrolling, but now we unroll * 4, and vectorize using 3 loads
>> and permutes:
>
> Happens on x86_64 as well with at least SSE4.1. Unfortunately we'll have to start introducing much more fine-grained target-supports for vect_perm to reliably guard all targets.
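[For reference, a guard of the kind Richard mentions might take a shape like the following in the testsuite — a sketch using the existing vect_perm effective-target keyword, not a committed change; on targets with generic permute support the SLP vectorization now happens, so the zero-count scan would only apply elsewhere:]

```c
/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"
                { target { ! vect_perm } } } } */
```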
I don't know enough about SSE4.1 to know whether it's a problem there or not. This is an actual regression on AArch64 and ARM, and not just a testism; you now get:
.L5:
ldr q0, [x5, 16]
add x4, x4, 48
ldr q1, [x5, 32]
add w6, w6, 1
ldr q4, [x5, 48]
cmp w3, w6
ldr q2, [x5], 64
orr v3.16b, v0.16b, v0.16b
orr v5.16b, v4.16b, v4.16b
orr v4.16b, v1.16b, v1.16b
tbl v0.16b, {v0.16b - v1.16b}, v6.16b
tbl v2.16b, {v2.16b - v3.16b}, v7.16b
tbl v4.16b, {v4.16b - v5.16b}, v16.16b
str q0, [x4, -32]
str q2, [x4, -48]
str q4, [x4, -16]
bhi .L5
instead of
.L5:
ld4 {v4.4s - v7.4s}, [x7], 64
add w4, w4, 1
cmp w3, w4
orr v1.16b, v4.16b, v4.16b
orr v2.16b, v5.16b, v5.16b
orr v3.16b, v6.16b, v6.16b
st3 {v1.4s - v3.4s}, [x6], 48
bhi .L5
LD4 and ST3 do all the permutes without needing separate permute instructions; a strategy that favours generic permutes over the load_lanes path is likely to be more expensive on most implementations. I think this is worth a PR at least.
regards
Ramana
>
> Richard.
>
>> ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>>   vect__31.15_94 = VEC_PERM_EXPR <vect__31.11_87, vect__31.12_89, { 0, 1, 2, 4 }>;
>> ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>>   vect__31.16_95 = VEC_PERM_EXPR <vect__31.12_89, vect__31.13_91, { 1, 2, 4, 5 }>;
>> ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>>   vect__31.17_96 = VEC_PERM_EXPR <vect__31.13_91, vect__31.14_93, { 2, 4, 5, 6 }>;
>>
>> which *is* a valid vectorization strategy...
>>
>>
>> --Alan
>
>