Question about dynamic choosing vectorization factor for RVV

Thu Aug 31 11:29:46 GMT 2023

On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:

> Hi. Thanks Richard and Richi.
> 
> I have now figured out how to choose a smaller LMUL.
> 
> void
> costs::finish_cost (const vector_costs *scalar_costs)
> {
>   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>   if (loop_vinfo)
>     {
>       if (loop_vinfo->vector_mode == RVVM8SImode
>           || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode))
>         {
>           m_costs[vect_prologue] = 8;
>           m_costs[vect_body] = 8;
>           m_costs[vect_epilogue] = 8;
>         }
>       else
>         {
>           m_costs[vect_prologue] = 1;
>           m_costs[vect_body] = 1;
>           m_costs[vect_epilogue] = 1;
>         }
>     }
>   // m_suggested_unroll_factor = 2;
>   vector_costs::finish_cost (scalar_costs);
> }

I don't think that's a "good" use of the API.
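
As a rough illustration of the alternative mentioned earlier in the
thread (comparing candidates per loop in better_main_loop_than_p
instead of overwriting m_costs), here is a standalone model.  It does
not use GCC's types: Candidate, fits_p, better_candidate_p and
pick_lmul_for_loop are hypothetical names, and the per-loop count of
live vector values is assumed to come from some analysis.

```cpp
#include <cassert>
#include <initializer_list>
#include <vector>

// Hypothetical stand-in for one vectorization attempt of a single loop.
// This is a model for discussion, not GCC code.
struct Candidate
{
  int lmul;      // register group size tried for this attempt
  int max_live;  // vector values live at once in this loop (assumed known)
};

constexpr int kNumVectorRegs = 32;  // RVV has 32 vector registers

// True if the candidate needs no more than the available registers.
bool fits_p (const Candidate &c)
{
  return c.max_live * c.lmul <= kNumVectorRegs;
}

// True if A should replace B as the choice for the main loop.
bool better_candidate_p (const Candidate &a, const Candidate &b)
{
  // A candidate that fits always beats one that would spill.
  if (fits_p (a) != fits_p (b))
    return fits_p (a);
  // Both fit: prefer the larger LMUL.  Neither fits: prefer the smaller.
  return fits_p (a) ? a.lmul > b.lmul : a.lmul < b.lmul;
}

// Try every LMUL for one loop and keep the best candidate.
int pick_lmul_for_loop (int max_live)
{
  assert (max_live > 0);
  std::vector<Candidate> candidates;
  for (int lmul : {8, 4, 2, 1})
    candidates.push_back ({lmul, max_live});

  Candidate best = candidates.front ();
  for (const Candidate &c : candidates)
    if (better_candidate_p (c, best))
      best = c;
  return best.lmul;
}
```

Because the decision is made per loop, two loops in the same function
can end up with different LMULs, which a whole-function analysis
cannot express.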

> The previous odd code was caused by the VLS modes.
> 
> Now I can get LMUL = 4 by adjusting the cost:
> vsetvli a5,a2,e32,m4,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a6,zero,e32,m4,ta,ma
> slli a3,a5,2
> vadd.vv v4,v4,v8
> sub a2,a2,a5
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> 
> Fantastic architecture of GCC Vector Cost model!
> 
> Thanks a lot.
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-08-31 19:20
> To: juzhe.zhong@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
>  
> > Thanks Richi.
> > 
> > I am trying to figure out how to adjust finish_cost to lower the LMUL
> > 
> > For example:
> > 
> > void
> > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > {
> >   for (int i = 0; i < n; i++)
> >     a[i] = a[i] + b[i];
> > }
> > 
> > preferred_simd_mode picks LMUL = 8 (RVVM8SImode).
> > 
> > Is it possible to adjust the cost in finish_cost so that the loop
> > vectorizer picks LMUL = 4?
>  
> I see you have an autovectorize_vector_modes hook and you use
> VECT_COMPARE_COSTS.  So the appropriate place would be to
> amend your vector_costs::better_main_loop_than_p.
>  
> > I am experimenting with the following cost:
> > 
> >   if (loop_vinfo)
> >     {
> >       if (loop_vinfo->vector_mode == RVVM8SImode)
> >         {
> >           m_costs[vect_prologue] = 2;
> >           m_costs[vect_body] = 20;
> >           m_costs[vect_epilogue] = 2;
> >         }
> >       else
> >         {
> >           m_costs[vect_prologue] = 1;
> >           m_costs[vect_body] = 1;
> >           m_costs[vect_epilogue] = 1;
> >         }
> >     }
> > 
> > I increased the cost for LMUL = 8. The codegen is odd:
> > 
> > foo:
> > ble a2,zero,.L12
> > addiw a5,a2,-1
> > li a4,30
> > sext.w t1,a2
> > bleu a5,a4,.L7
> > srliw a7,t1,5
> > slli a7,a7,7
> > li a4,32
> > add a7,a7,a0
> > mv a5,a0
> > mv a3,a1
> > vsetvli zero,a4,e32,m8,ta,ma
> > .L4:
> > vle32.v v8,0(a5)
> > vle32.v v16,0(a3)
> > vadd.vv v8,v8,v16
> > vse32.v v8,0(a5)
> > addi a5,a5,128
> > addi a3,a3,128
> > bne a5,a7,.L4
> > andi a2,a2,-32
> > beq t1,a2,.L14
> > .L3:
> > slli a4,a2,32
> > subw a5,t1,a2
> > srli a4,a4,32
> > slli a5,a5,32
> > slli a4,a4,2
> > srli a5,a5,32
> > add a0,a0,a4
> > add a1,a1,a4
> > vsetvli a4,a5,e8,m1,ta,ma
> > vle32.v v8,0(a0)
> > vle32.v v4,0(a1)
> > vsetvli a2,zero,e32,m4,ta,ma
> > vadd.vv v4,v4,v8
> > vsetvli zero,a5,e32,m4,ta,ma
> > vse32.v v4,0(a0)
> > sub a3,a5,a4
> > beq a5,a4,.L12
> > slli a4,a4,2
> > vsetvli zero,a3,e8,m1,ta,ma
> > add a0,a0,a4
> > add a1,a1,a4
> > vle32.v v4,0(a0)
> > vle32.v v8,0(a1)
> > vsetvli a2,zero,e32,m4,ta,ma
> > vadd.vv v4,v4,v8
> > vsetvli zero,a3,e32,m4,ta,ma
> > vse32.v v4,0(a0)
> > .L12:
> > ret
> > .L7:
> > li a2,0
> > j .L3
> > .L14:
> > ret
> > 
> > I hoped it would generate code like this:
> > 
> > foo:
> > ble a2,zero,.L5
> > mv a4,a0
> > .L3:
> > vsetvli a5,a2,e32,m4,ta,ma
> > vle32.v v8,0(a0)
> > vle32.v v4,0(a1)
> > vsetvli a6,zero,e32,m4,ta,ma
> > slli a3,a5,2
> > vadd.vv v4,v4,v8
> > sub a2,a2,a5
> > vsetvli zero,a5,e32,m4,ta,ma
> > vse32.v v4,0(a4)
> > add a0,a0,a3
> > add a1,a1,a3
> > add a4,a4,a3
> > bne a2,zero,.L3
> > .L5:
> > ret
> > 
> > I am experimenting with whether we can adjust the cost statically to
> > make the loop vectorizer use LMUL = 4 even though preferred_simd_mode
> > returns LMUL = 8. If we can do that, I think we can run an analysis
> > and then adjust the cost according to its result.
> >
> > Thanks.
> > 
> > 
> > juzhe.zhong@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-08-31 15:38
> > To: juzhe.zhong@rivai.ai
> > CC: gcc; richard.sandiford
> > Subject: Re: Question about dynamic choosing vectorization factor for RVV
> > On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
> >  
> > > Hi, Richard and Richi.
> > > 
> > > Currently, we statically return the vectorization factor in 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> > > according to a compile option.
> > > 
> > > For example:
> > > void
> > > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > > {
> > >   for (int i = 0; i < n; i++)
> > >     a[i] = a[i] + b[i];
> > > }
> > > 
> > > with --param=riscv-autovec-lmul=m1:
> > > 
> > > vsetvli a5,a2,e32,m1,ta,ma
> > > vle32.v v2,0(a0)
> > > vle32.v v1,0(a1)
> > > vsetvli a6,zero,e32,m1,ta,ma
> > > slli a3,a5,2
> > > vadd.vv v1,v1,v2
> > > sub a2,a2,a5
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vse32.v v1,0(a4)
> > > add a0,a0,a3
> > > add a1,a1,a3
> > > add a4,a4,a3
> > > bne a2,zero,.L3
> > > 
> > > The 'vadd.vv' is only performing operations on a single register.
> > > 
> > > with --param=riscv-autovec-lmul=m8:
> > > 
> > >   vsetvli a5,a2,e8,m2,ta,ma
> > >   vle32.v v16,0(a0)
> > >   vle32.v v8,0(a1)
> > >   vsetvli a6,zero,e32,m8,ta,ma
> > >   slli a3,a5,2
> > >   vadd.vv v8,v8,v16
> > >   vsetvli zero,a2,e32,m8,ta,ma
> > >   sub a2,a2,a5
> > >   vse32.v v8,0(a4)
> > >   add a0,a0,a3
> > >   add a1,a1,a3
> > >   add a4,a4,a3
> > >   bne a2,zero,.L3
> > > 
> > > The 'vadd.vv' here is performing operations on 8 consecutive registers:
> > > 
> > > vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> > > 
> > > Having users statically set the vectorization factor is not ideal.
> > > 
> > > We want GCC to dynamically choose the vectorization factor for auto-vectorization according to loop analysis.
> > > 
> > > Currently, I have implemented a simplistic loop analysis that analyzes the live range of each local decl of the current function.
> > > 
> > > The analysis is as follows. RVV has 32 vector registers, so we
> > > compute the live ranges of the current function's local decls and require:
> > > 
> > > (number of decls live at the same time) * LMUL <= 32
> > > 
> > > Based on this analysis, I set the vectorization factor in TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
> > > 
> > > This simplistic algorithm (implemented in the RISC-V backend) works well for the testcases I produced.
> > > 
> > > However, I can only choose the optimal vectorization factor for a whole function, not for a specific loop.
> > > 
> > > Here is the example:
> > > 
> > > void foo2 (int32_t *__restrict a,
> > >           int32_t *__restrict b,
> > >           int32_t *__restrict c,
> > >           int32_t *__restrict a2,
> > >           int32_t *__restrict b2,
> > >           int32_t *__restrict c2,
> > >           int32_t *__restrict a3,
> > >           int32_t *__restrict b3,
> > >           int32_t *__restrict c3,
> > >           int32_t *__restrict a4,
> > >           int32_t *__restrict b4,
> > >           int32_t *__restrict c4,
> > >           int32_t *__restrict a5,
> > >           int32_t *__restrict b5,
> > >           int32_t *__restrict c5,
> > >           int n)
> > > {
> > > // Loop 1
> > >     for (int i = 0; i < n; i++)
> > >        a[i] = a[i] + b[i];
> > > // Loop 2
> > >     for (int i = 0; i < n; i++){
> > >       a[i] = b[i] + c[i];
> > >       a2[i] = b2[i] + c2[i];
> > >       a3[i] = b3[i] + c3[i];
> > >       a4[i] = b4[i] + c4[i];
> > >       a5[i] = a[i] + a4[i];
> > >       a[i] = a3[i] + a2[i]+ a5[i];
> > >     }
> > > }
> > > 
> > > For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should use LMUL = 4 (since LMUL = 8 would cause vector register spills).
> > > 
> > > If I split loop 1 and loop 2 into 2 separate functions, my algorithm works well.
> > > 
> > > However, if we put these 2 loops in the same function, I end up picking LMUL = 4 for both loop 1 and loop 2 since, as I said above, I do the analysis
> > > per function, not per loop.
> > > 
> > > I am struggling to find a good solution for this issue. Could we pass loop_vec_info
> > > through to the 'preferred_simd_mode' target hook?
> >  
> > That's not how it's currently designed to work - there's
> > the autovectorize_vector_modes hook where you should provide a vector
> > of modes the vectorizer iterates over and return VECT_COMPARE_COSTS
> > if you want to evaluate costs between choices.  Your analysis should
> > then happen in the finish_cost method.
> >  
> > That's how it's currently designed.  It might not be optimal for
> > compile-time reasons when there are many modes; giving the target
> > more control (and context) might be possible.
> >  
> > Richard.
> >  
> > > Thanks.
> > > 
> > > 
> > > juzhe.zhong@rivai.ai
> > > 
> >  
> > 
>  
> 
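
The register-pressure rule quoted above, (decls live at the same time)
* LMUL <= 32, can be sketched as a small standalone function.  This is
only a sketch under stated assumptions: choose_lmul is a hypothetical
name, the live count is assumed to be known already, and nothing here
is existing GCC or RISC-V backend code.

```cpp
#include <cassert>
#include <initializer_list>

// RVV provides 32 vector registers.
constexpr int kNumVectorRegs = 32;

// Hypothetical helper: return the largest LMUL in {8, 4, 2, 1} for which
// max_live register groups still fit in the register file.
int choose_lmul (int max_live)
{
  assert (max_live >= 0);
  for (int lmul : {8, 4, 2, 1})
    if (max_live * lmul <= kNumVectorRegs)
      return lmul;
  // Even LMUL = 1 does not fit, so spilling is unavoidable; keep the
  // smallest group size.
  return 1;
}
```

With a hypothetical count of 3 live values the rule allows LMUL = 8
(3 * 8 = 24), while 6 live values limit it to LMUL = 4 (6 * 8 = 48
exceeds 32, 6 * 4 = 24 does not).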

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)