So the performance && correctness can be well trusted.
Here is the example:
void f (void * in, void *out, int32_t x, int n, int m)
{
for (int i = 0; i < n; i++) {
vint32m1_t v = __riscv_vle32_v_i32m1 (in + i, 4);
vint32m1_t v2 = __riscv_vle32_v_i32m1_tu (v, in + 100 + i, 4);
vint32m1_t v3 = __riscv_vaadd_vx_i32m1 (v2, 0, VXRM_RDN, 4);
v3 = __riscv_vaadd_vx_i32m1 (v3, 3, VXRM_RDN, 4);
__riscv_vse32_v_i32m1 (out + 100 + i, v3, 4);
}
for (int i = 0; i < n; i++) {
vint32m1_t v = __riscv_vle32_v_i32m1 (in + i + 1000, 4);
vint32m1_t v2 = __riscv_vle32_v_i32m1_tu (v, in + 100 + i + 1000, 4);
vint32m1_t v3 = __riscv_vaadd_vx_i32m1 (v2, 0, VXRM_RDN, 4);
v3 = __riscv_vaadd_vx_i32m1 (v3, 3, VXRM_RDN, 4);
__riscv_vse32_v_i32m1 (out + 100 + i + 1000, v3, 4);
}
}
mode switching can global recognize both Loop 1 and Loop 2 are using RDN
rounding mode and hoist such single "csrwi vxrm,2" to dominate both Loop 1
and Loop 2.
Besides, I have add correctness check sanity tests in this patch too.
* gcc.target/riscv/rvv/base/vxrm-10.c: New test.
* gcc.target/riscv/rvv/base/vxrm-6.c: New test.
* gcc.target/riscv/rvv/base/vxrm-7.c: New test.
* gcc.target/riscv/rvv/base/vxrm-8.c: New test.
* gcc.target/riscv/rvv/base/vxrm-9.c: New test.