[PATCH] VECT: Support loop len control on EXTRACT_LAST vectorization
钟居哲
juzhe.zhong@rivai.ai
Wed Aug 9 14:35:50 GMT 2023
Hi, Richard.
>> Yes, I think VEC_EXTRACT is suitable for this.
Thanks for the suggestion.
Actually, I tried to use VEC_EXTRACT yesterday, but it ICEs because GCC fails to create the LCSSA PHI for VEC_EXTRACT.
For example, with ARM SVE:
https://godbolt.org/z/rfbb4rfKv
The vect dump IR:
;; Created LCSSA PHI: loop_mask_36 = PHI <loop_mask_22(3)>
<bb 8> [local count: 105119324]:
# vect_last_12.8_24 = PHI <vect_last_12.8_23(3)>
# loop_mask_36 = PHI <loop_mask_22(3)>
_25 = .EXTRACT_LAST (loop_mask_36, vect_last_12.8_24);
Then it works.
I was trying to do a similar thing, but with VEC_EXTRACT, for RVV:
...
loop_len_36 = SELECT_VL
....
# vect_last_12.8_24 = PHI <vect_last_12.8_23(3)>
// Missing the LCSSA PHI here.
_25 = .VEC_EXTRACT (loop_len_36, vect_last_12.8_24);
This GIMPLE IR causes an ICE.
When I use EXTRACT_LAST instead of VEC_EXTRACT, then:
...
loop_len_36 = SELECT_VL
....
# vect_last_12.8_24 = PHI <vect_last_12.8_23(3)>
# loop_len_22 = PHI <loop_len_36(3)>
_25 = .EXTRACT_LAST (loop_len_22, vect_last_12.8_24);
Then it works.
I haven't figured out where to make GCC recognize VEC_EXTRACT so that it generates the LCSSA PHI for it.
Could you give me some guidance on this?
Thanks.
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-08-09 22:17
To: 钟居哲
CC: richard.sandiford; gcc-patches
Subject: Re: Re: [PATCH] VECT: Support loop len control on EXTRACT_LAST vectorization
On Wed, 9 Aug 2023, 钟居哲 wrote:
> Hi, Richard.
>
> >> I'm a bit behind on email, but why isn't BIT_FIELD_REF enough for
> >> the case that the patch is handling? It seems that:
>
> >> .EXTRACT_LAST (len, vec)
>
> >> is equivalent to:
>
> >> vec[len - 1]
>
> >> I think eventually there'll be the temptation to lower/fold it like that.
>
> The current BIT_FIELD_REF flow doesn't make use of LOOP_LEN.
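> As a minimal scalar sketch of what I mean (illustration only, not vectorizer
> code): with a fixed vector length the last element sits at a compile-time
> constant index, which a constant-position extract such as BIT_FIELD_REF can
> express, but with SELECT_VL the number of active elements is only known at
> run time, so the extract needs a variable index.
>
> #include <stdint.h>
>
> /* Fixed vector length: constant index, expressible as BIT_FIELD_REF.  */
> uint8_t
> last_lane_fixed (uint8_t vec[16])
> {
>   return vec[16 - 1];
> }
>
> /* Length-controlled loop: len comes from SELECT_VL at run time, so the
>    live-out element needs a variable-index extract (VEC_EXTRACT).  */
> uint8_t
> last_lane_variable (uint8_t vec[16], long len)
> {
>   return vec[len - 1];
> }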
Yes, BIT_FIELD_REF doesn't support variable offset.
> Consider the following case:
>
> #include <stdint.h>
> #define EXTRACT_LAST(TYPE)                     \
>   TYPE __attribute__ ((noinline, noclone))     \
>   test_##TYPE (TYPE *x, int n, TYPE value)     \
>   {                                            \
>     TYPE last;                                 \
>     for (int j = 0; j < n; ++j)                \
>       {                                        \
>         last = x[j];                           \
>         x[j] = last * value;                   \
>       }                                        \
>     return last;                               \
>   }
>
> #define TEST_ALL(T)                            \
>   T (uint8_t)
>
> TEST_ALL (EXTRACT_LAST)
>
> The assembly:
> https://godbolt.org/z/z1PPT948b
>
> test_uint8_t:
> mv a3,a0
> ble a1,zero,.L10
> addiw a5,a1,-1
> li a4,14
> sext.w a0,a1
> bleu a5,a4,.L11
> srliw a4,a0,4
> slli a4,a4,4
> mv a5,a3
> add a4,a4,a3
> vsetivli zero,16,e8,m1,ta,ma
> vmv.v.x v3,a2
> .L4:
> vl1re8.v v1,0(a5)
> vmul.vv v2,v1,v3
> vs1r.v v2,0(a5)
> addi a5,a5,16
> bne a4,a5,.L4
> andi a4,a1,-16
> mv a5,a4
> vsetivli zero,8,e8,mf2,ta,ma
> beq a0,a4,.L16
> .L3:
> subw a0,a0,a4
> addiw a7,a0,-1
> li a6,6
> bleu a7,a6,.L7
> slli a4,a4,32
> srli a4,a4,32
> add a4,a3,a4
> andi a6,a0,-8
> vle8.v v2,0(a4)
> vmv.v.x v1,a2
> andi a0,a0,7
> vmul.vv v1,v1,v2
> vse8.v v1,0(a4)
> addw a5,a6,a5
> beq a0,zero,.L8
> .L7:
> add a6,a3,a5
> lbu a0,0(a6)
> addiw a4,a5,1
> mulw a7,a0,a2
> sb a7,0(a6)
> bge a4,a1,.L14
> add a4,a3,a4
> lbu a0,0(a4)
> addiw a6,a5,2
> mulw a7,a2,a0
> sb a7,0(a4)
> ble a1,a6,.L14
> add a6,a3,a6
> lbu a0,0(a6)
> addiw a4,a5,3
> mulw a7,a2,a0
> sb a7,0(a6)
> ble a1,a4,.L14
> add a4,a3,a4
> lbu a0,0(a4)
> addiw a6,a5,4
> mulw a7,a2,a0
> sb a7,0(a4)
> ble a1,a6,.L14
> add a6,a3,a6
> lbu a0,0(a6)
> addiw a4,a5,5
> mulw a7,a2,a0
> sb a7,0(a6)
> ble a1,a4,.L14
> add a4,a3,a4
> lbu a0,0(a4)
> addiw a5,a5,6
> mulw a6,a2,a0
> sb a6,0(a4)
> ble a1,a5,.L14
> add a3,a3,a5
> lbu a0,0(a3)
> mulw a2,a2,a0
> sb a2,0(a3)
> ret
> .L10:
> li a0,0
> .L14:
> ret
> .L8:
> vslidedown.vi v2,v2,7
> vmv.x.s a0,v2
> andi a0,a0,0xff
> ret
> .L11:
> li a4,0
> li a5,0
> vsetivli zero,8,e8,mf2,ta,ma
> j .L3
> .L16:
> vsetivli zero,16,e8,m1,ta,ma
> vslidedown.vi v1,v1,15
> vmv.x.s a0,v1
> andi a0,a0,0xff
> ret
>
>
> This patch tries to optimize the RVV codegen for length control.
> After this patch:
>
> Gimple IR:
>
> <bb 4> [local count: 955630224]:
> # vectp_x.6_22 = PHI <vectp_x.6_23(4), x_11(D)(3)>
> # vectp_x.10_30 = PHI <vectp_x.10_31(4), x_11(D)(3)>
> # ivtmp_34 = PHI <ivtmp_35(4), _33(3)>
> _36 = .SELECT_VL (ivtmp_34, 16);
> vect_last_12.8_24 = .MASK_LEN_LOAD (vectp_x.6_22, 8B, { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 }, _36, 0);
> vect__4.9_28 = vect_last_12.8_24 * vect_cst__27;
> .MASK_LEN_STORE (vectp_x.10_30, 8B, { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 }, _36, 0, vect__4.9_28);
> vectp_x.6_23 = vectp_x.6_22 + _36;
> vectp_x.10_31 = vectp_x.10_30 + _36;
> ivtmp_35 = ivtmp_34 - _36;
> if (ivtmp_35 != 0)
> goto <bb 4>; [89.00%]
> else
> goto <bb 5>; [11.00%]
>
> <bb 5> [local count: 105119324]:
> _26 = .EXTRACT_LAST (_36, vect_last_12.8_24); [tail call]
>
> ASM:
> test_uint8_t:
> ble a1,zero,.L4
> mv a4,a0
> vsetivli zero,16,e8,m1,ta,ma
> vmv.v.x v3,a2
> .L3:
> vsetvli a5,a1,e8,m1,ta,ma
> vle8.v v1,0(a0)
> vsetivli zero,16,e8,m1,ta,ma
> sub a1,a1,a5
> vmul.vv v2,v1,v3
> vsetvli zero,a5,e8,m1,ta,ma
> vse8.v v2,0(a4)
> add a0,a0,a5
> add a4,a4,a5
> bne a1,zero,.L3
> addi a5,a5,-1
> vsetivli zero,16,e8,m1,ta,ma
> vslidedown.vx v1,v1,a5
> vmv.x.s a0,v1
> andi a0,a0,0xff
> ret
> .L4:
> li a0,0
> ret
>
> I think this codegen is much better with this patch.
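>
> As a scalar model of what the length-controlled loop above computes
> (illustration only, not the vectorizer's output; select_vl here merely
> stands in for .SELECT_VL and may pick any legal length up to 16):
>
> #include <stdint.h>
> #include <string.h>
>
> static long
> select_vl (long remaining, long vf)
> {
>   /* Stand-in for .SELECT_VL: here simply min (remaining, vf).  */
>   return remaining < vf ? remaining : vf;
> }
>
> uint8_t
> test_uint8_t_model (uint8_t *x, long n, uint8_t value)
> {
>   uint8_t chunk[16];
>   long len = 0;
>   for (long i = 0; i < n; )
>     {
>       len = select_vl (n - i, 16);
>       memcpy (chunk, x + i, len);                 /* .MASK_LEN_LOAD  */
>       for (long j = 0; j < len; ++j)              /* vector multiply */
>         x[i + j] = (uint8_t) (chunk[j] * value);  /* .MASK_LEN_STORE */
>       i += len;
>     }
>   return n > 0 ? chunk[len - 1] : 0;              /* .EXTRACT_LAST (len, chunk) */
> }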
>
> >> FWIW, I agree an IFN_LEN_EXTRACT_LAST/IFN_EXTRACT_LAST_LEN would be OK,
> >> with a mask, vector, length and bias. But even then, I think there'll
> >> be a temptation to lower calls with all-1 masks to val[len - 1 - bias].
> >> So I think the function only makes sense if we have a use case where
> >> the mask might not be all-1s.
>
> I am wondering whether, instead of adding IFN_LEN_EXTRACT_LAST/IFN_EXTRACT_LAST_LEN,
> it makes sense to use VEC_EXTRACT as follows:
>
> Loop:
> loop_len_22 = SELECT_VL.
> ....
>
> Epilogue:
> new_loop_len_22 = PHI <loop_len_22>
> scalar_res = VEC_EXTRACT (v, new_loop_len_22 - 1 - BIAS)
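>
> A minimal scalar sketch of the semantics I have in mind (the explicit bias
> parameter and the names here are just for illustration):
>
> #include <stdint.h>
>
> /* len is the final iteration's SELECT_VL result and bias is the target's
>    len_load/len_store bias; the live-out scalar is the last active
>    element, at index len - 1 - bias.  */
> uint8_t
> live_out_last (const uint8_t *last_vec, long len, long bias)
> {
>   return last_vec[len - 1 - bias];
> }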
Yes, I think VEC_EXTRACT is suitable for this.
Richard.
> Feel free to correct me if I am wrong.
>
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Sandiford
> Date: 2023-08-09 21:30
> To: juzhe.zhong@rivai.ai
> CC: rguenther; gcc-patches
> Subject: Re: [PATCH] VECT: Support loop len control on EXTRACT_LAST vectorization
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > Hi, Richi.
> >
> >>> that should be
> >
> >>> || (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> >>> && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> >
> >>> I think. It seems to imply that SLP isn't supported with
> >>> masking/lengthing.
> >
> > Oh, yes. At first glance, the original code is quite suspicious and your comments make sense to me.
> >
> >>> Hum, how does CFN_EXTRACT_LAST handle both mask and length transparently?
> >>> Don't you need some CFN_LEN_EXTRACT_LAST instead?
> >
> > I think CFN_EXTRACT_LAST always has either a loop mask or a loop len.
> >
> > When neither mask nor length is needed, IMHO the current BIT_FIELD_REF flow is good enough:
> > https://godbolt.org/z/Yr5M9hcc6
>
> I'm a bit behind on email, but why isn't BIT_FIELD_REF enough for
> the case that the patch is handling? It seems that:
>
> .EXTRACT_LAST (len, vec)
>
> is equivalent to:
>
> vec[len - 1]
>
> I think eventually there'll be the temptation to lower/fold it like that.
>
> FWIW, I agree an IFN_LEN_EXTRACT_LAST/IFN_EXTRACT_LAST_LEN would be OK,
> with a mask, vector, length and bias. But even then, I think there'll
> be a temptation to lower calls with all-1 masks to val[len - 1 - bias].
> So I think the function only makes sense if we have a use case where
> the mask might not be all-1s.
>
> Thanks,
> Richard
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)