[PATCH] Add emulated gather capability to the vectorizer

Kewen.Lin linkw@linux.ibm.com
Mon Aug 2 09:20:18 GMT 2021


on 2021/8/2 下午5:11, Richard Biener wrote:
> On Mon, 2 Aug 2021, Kewen.Lin wrote:
> 
>> on 2021/8/2 下午3:09, Richard Biener wrote:
>>> On Mon, 2 Aug 2021, Kewen.Lin wrote:
>>>
>>>> on 2021/7/30 下午10:04, Kewen.Lin via Gcc-patches wrote:
>>>>> Hi Richi,
>>>>>
>>>>> on 2021/7/30 下午7:34, Richard Biener wrote:
>>>>>> This adds a gather vectorization capability to the vectorizer
>>>>>> without target support by decomposing the offset vector, doing
>>>>>> sclar loads and then building a vector from the result.  This
>>>>>> is aimed mainly at cases where vectorizing the rest of the loop
>>>>>> offsets the cost of vectorizing the gather.
>>>>>>
>>>>>> Note it's difficult to avoid vectorizing the offset load, but in
>>>>>> some cases later passes can turn the vector load + extract into
>>>>>> scalar loads, see the followup patch.
>>>>>>
>>>>>> On SPEC CPU 2017 510.parest_r this improves runtime from 250s
>>>>>> to 219s on a Zen2 CPU which has its native gather instructions
>>>>>> disabled (using those the runtime instead increases to 254s)
>>>>>> using -Ofast -march=znver2 [-flto].  It turns out the critical
>>>>>> loops in this benchmark all perform gather operations.
>>>>>>
>>>>>
>>>>> Wow, it sounds promising!
>>>>>
>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>>>>>>
>>>>>> Any comments?  I still plan to run this over full SPEC and
>>>>>> I have to apply TLC to the followup patch before I can post it.
>>>>>>
>>>>>> I think neither power nor z has gather so I'm curious if the
>>>>>> patch helps 510.parest_r there, I'm unsure about neon/advsimd.
>>>>>
>>>>> Yes, Power (latest Power10) doesn't support gather load.
>>>>> I just measured 510.parest_r with this patch on Power9 at option
>>>>> -Ofast -mcpu=power9 {,-funroll-loops}, both are neutral.
>>>>>
>>>>> It fails to vectorize the loop in vect-gather-1.c:
>>>>>
>>>>> vect-gather.c:12:28: missed:  failed: evolution of base is not affine.
>>>>> vect-gather.c:11:46: missed:   not vectorized: data ref analysis failed _6 = *_5;
>>>>> vect-gather.c:12:28: missed:   not vectorized: data ref analysis failed: _6 = *_5;
>>>>> vect-gather.c:11:46: missed:  bad data references.
>>>>> vect-gather.c:11:46: missed: couldn't vectorize loop
>>>>>
>>>>
>>>> By further investigation, it's due to that rs6000 fails to make
>>>> maybe_gather true in:
>>>>
>>>> 	  bool maybe_gather
>>>> 	    = DR_IS_READ (dr)
>>>> 	      && !TREE_THIS_VOLATILE (DR_REF (dr))
>>>> 	      && (targetm.vectorize.builtin_gather != NULL
>>>> 		  || supports_vec_gather_load_p ());
>>>>
>>>> With the hacking defining TARGET_VECTORIZE_BUILTIN_GATHER (as
>>>> well as TARGET_VECTORIZE_BUILTIN_SCATTER) for rs6000, the
>>>> case gets vectorized as expected.
>>>
>>> Ah, yeah - this check needs to be elided.  Checking with a cross
>>> that's indeed what is missing.
>>>
>>>> But re-evaluated 510.parest_r with this extra hacking, the
>>>> runtime performance doesn't have any changes.
>>>
>>> OK, it likely needs the followup as well then.  For the added
>>> testcase I end up with
>>>
>>> .L4:
>>>         lwz %r10,8(%r9)
>>>         lwz %r31,12(%r9)
>>>         addi %r8,%r8,32
>>>         addi %r9,%r9,16
>>>         lwz %r7,-16(%r9)
>>>         lwz %r0,-12(%r9)
>>>         lxv %vs10,-16(%r8)
>>>         lxv %vs12,-32(%r8)
>>>         extswsli %r10,%r10,3
>>>         extswsli %r31,%r31,3
>>>         extswsli %r7,%r7,3
>>>         extswsli %r0,%r0,3
>>>         ldx %r10,%r6,%r10
>>>         ldx %r31,%r6,%r31
>>>         ldx %r7,%r6,%r7
>>>         ldx %r12,%r6,%r0
>>>         mtvsrdd %vs0,%r31,%r10
>>>         mtvsrdd %vs11,%r12,%r7
>>>         xvmuldp %vs0,%vs0,%vs10
>>>         xvmaddadp %vs0,%vs12,%vs11
>>>         xvadddp %vs32,%vs32,%vs0
>>>         bdnz .L4
>>>
>>> as the inner vectorized loop.  Looks like power doesn't have
>>> extending SI->DImode loads?  Otherwise the code looks
>>
>> You meant one instruction for both left shifting and loads?
> 
> No, I meant doing
> 
>          lwz %r10,8(%r9)
>          extswsli %r10,%r10,3
> 
> in one step.  I think extswsli should be a SImode->DImode sign-extend.
> Ah - it does the multiplication by 8 and the sign-extend.  OK, that's
> nice as well.  

Ah, OK.  Yeah, extswsli is short for "Extend Sign Word and Shift Left
Immediate", which is introduced since Power9.  It does do both the
sign-extend and the multiplication by 8 as you noted.  :)

> On x86 the multiplication by 8 is done via complex
> addressing mode on the dependent load and the scalar load does
> the sign-extension part.
> 

OK, thanks for the clarification!

BR,
Kewen

>> It doesn't if so.  btw, the above code sequence looks nice!
>>> straight-forward (I've used -mcpu=power10), but eventually
>>> the whole gather, which I think boils down to
>>>
>>>         lwz %r10,8(%r9)
>>>         lwz %r31,12(%r9)
>>>         addi %r9,%r9,16
>>>         lwz %r7,-16(%r9)
>>>         lwz %r0,-12(%r9)
>>>         extswsli %r10,%r10,3
>>>         extswsli %r31,%r31,3
>>>         extswsli %r7,%r7,3
>>>         extswsli %r0,%r0,3
>>>         ldx %r10,%r6,%r10
>>>         ldx %r31,%r6,%r31
>>>         ldx %r7,%r6,%r7
>>>         ldx %r12,%r6,%r0
>>>         mtvsrdd %vs0,%r31,%r10
>>>         mtvsrdd %vs11,%r12,%r7
>>>
>>> is too difficult for the integer/load side of the pipeline.  It's
>>> of course the same in the scalar loop but there the FP ops only
>>> need to wait for one lane to complete.  Note the above is with
>>> the followup patch.
>>>
>>
>> Good point!  Will check the actual effect after this and its follow-up
>> are landed, may need some cost tweaking on it.
> 
> Yes - I hope the vectorizer costing part is conservative enough.
> 
> Richard.
> 



More information about the Gcc-patches mailing list