111926 – RISC-V: Use vsetvl insn replace csrr vlenb insn

Summary: RISC-V: Use vsetvl insn replace csrr vlenb insn

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	14.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2023-10-23 03:36 UTC by Lehua Ding
Modified:	2023-11-13 00:11 UTC
CC List:	4 users

See Also:
Host:
Target:	RISC-V
Build:
Known to work:
Known to fail:
Last reconfirmed:

Description Lehua Ding 2023-10-23 03:36:03 UTC

We can use:
        vsetvli a5,zero,e8,mf8,ta,ma
to replace:
        csrr    a4,vlenb
        srli    a4,a4,3

The reason is that hardware implementations tend to optimise the vsetvl family of instructions better than the csrr instruction.
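As a rough check of why the two sequences produce the same value, here is a minimal sketch in plain C that models the arithmetic only. It assumes the standard definitions vlenb = VLEN/8 and VLMAX = (VLEN/SEW) * LMUL, with SEW = 8 and LMUL = 1/8. The helper names are hypothetical and exist only for illustration. Note that vsetvli also updates vl and vtype, which the csrr sequence does not, so the replacement is only valid where that side effect is acceptable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: value produced by
       csrr a4,vlenb
       srli a4,a4,3
   for a given VLEN in bits. */
static uint64_t csrr_sequence(uint64_t vlen_bits)
{
    uint64_t vlenb = vlen_bits / 8; /* vlenb holds VLEN in bytes */
    return vlenb >> 3;              /* srli by 3 divides by 8 */
}

/* Hypothetical helper: VLMAX for SEW=8, LMUL=1/8, i.e. the value
   written to rd by a vsetvli with rs1=zero and e8,mf8. */
static uint64_t vsetvli_e8_mf8(uint64_t vlen_bits)
{
    const uint64_t sew = 8;      /* element width in bits */
    const uint64_t lmul_den = 8; /* LMUL = 1/8 */
    return vlen_bits / sew / lmul_den;
}

/* Compare both models for every power-of-two VLEN from 128 to 65536
   and return how many values were checked. */
static int check_equivalence(void)
{
    int checked = 0;
    for (uint64_t vlen = 128; vlen <= 65536; vlen *= 2) {
        assert(csrr_sequence(vlen) == vsetvli_e8_mf8(vlen));
        checked++;
    }
    return checked;
}
```

Both helpers reduce to VLEN/64, which is why a single vsetvli can stand in for the two-instruction csrr/srli pair in this case.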

#include <riscv_vector.h>

#define exhaust_vector_regs()                                                  \
  asm volatile("#" ::                                                          \
		 : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", \
		   "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",     \
		   "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25",     \
		   "v26", "v27", "v28", "v29", "v30", "v31");
           
void
spill_1 (int8_t *in, int8_t *out)
{
  vint8mf8_t v1 = *(vint8mf8_t*)in;
  exhaust_vector_regs ();
  *(vint8mf8_t*)out = v1;
}

spill_1(signed char*, signed char*):
        csrr    a4,vlenb
        srli    a4,a4,3
        csrr    t0,vlenb
        slli    a3,a4,3
        sub     sp,sp,t0
        sub     a3,a3,a4
        add     a3,a3,sp
        vsetvli a5,zero,e8,mf8,ta,ma
        vle8.v  v1,0(a0)
        vse8.v  v1,0(a3)
        csrr    a4,vlenb
        srli    a4,a4,3
        slli    a3,a4,3
        sub     a3,a3,a4
        add     a3,a3,sp
        vle8.v  v1,0(a3)
        csrr    t0,vlenb
        vse8.v  v1,0(a1)
        add     sp,sp,t0
        jr      ra


https://godbolt.org/z/TcKxbjnoh

Comment 1 Kito Cheng 2023-10-23 04:24:03 UTC

Please leave an option so the user has a choice. With performance it is hard to say which is absolutely better for all uarchs; my thought is to leave an option and let -mtune and a command-line option control that.

Comment 2 Kito Cheng 2023-10-23 04:25:05 UTC

Forgot to mention: personally I love the idea of simplifying code gen, and I can imagine it's definitely an optimization for specific uarchs :)

Comment 3 Lehua Ding 2023-10-24 03:00:29 UTC

(In reply to Kito Cheng from comment #1)
> Please leave an option so the user has a choice. With performance it is
> hard to say which is absolutely better for all uarchs; my thought is to
> leave an option and let -mtune and a command-line option control that.

Understood. I'll record the possible optimisation points I come across here. How to do it is, as you said, still under consideration.