As reported in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111311#c8 we have following execute failures on trunk. === gcc: Unexpected fails for rv64gcv lp64d medlow === FAIL: gcc.c-torture/execute/memset-3.c -O3 -g execution test The issue is an extraneous VSETVLI instruction (with wrong SEW) being generated which creates wrong fill pattern for memset. ``` main: [...] .L36: ; 2. loop start for @off 0 vse8.v v1,0(t3) vse8.v v1,0(t6) vse8.v v1,0(s1) vse8.v v3,0(a5) ... ; loop epilogue li a7,15 beq a4,a7,.L171 vsetvli zero,zero,e32,m2,ta,ma <--- wrong j .L36 ``` vsetvli pass dumps: ``` Phase 3: Reduce global vsetvl infos. Compute LCM insert and delete data: Expr[2]: VALID (insn 2847, bb 3) Demand fields: demand_sew_lmul demand_avl SEW=8, VLMUL=mf2, RATIO=16, MAX_SEW=64 TAIL_POLICY=agnostic, MASK_POLICY=agnostic AVL=(const_int 8 [0x8]) VL=(nil) VSETVL infos after phase 3 bb 3: probability: always (guessed) Header vsetvl info:VALID (insn 2847, bb 3) (deleted) <--- Demand fields: demand_sew_lmul demand_avl SEW=8, VLMUL=mf2, RATIO=16, MAX_SEW=64 TAIL_POLICY=agnostic, MASK_POLICY=agnostic AVL=(const_int 8 [0x8]) VL=(nil) ``` So it seems LCM is deleting the valid VSETVLI insn which later causes Phase 4 to insert a different/incorrect one. I revert the following commit and the issue goes away. 2023-10-18 f0e28d8c1371 RISC-V: Fix failed hoist in LICM of vmv.v.x instruction This at least tells us the cause of issue, next step is to fix the issue.
Could you share more assembly ?
Created attachment 56539 [details] manually reduced src
Created attachment 56540 [details] asm output ok
Created attachment 56541 [details] asm output nok
I don't think VSETVL is wrong. vsetivli zero,8,e8,mf2,ta,ma sd ra,120(sp) vmv.x.s a1,v1 ... .L36: vse8.v ... vsetivli zero,8,e32,m2,ta,ma j .L36 Both e8mf2 and e32m2 are valid for vse8.v since they have same ratio = 16
I have debugged this by single stepped in qemu when the test fails (first loop for offset 0, iteration 8) The last VSETVLI is this one, 0x10a3e 0d107057 vsetvli zero,zero,e32,m2,ta,ma 0x10a42 j 0x10666 We eventually hit a VMV.v.x. which creates invalid pattern due to e32. (gdb) info reg vtype vtype 0xd1 209 # SEW = 010’b / e32, LMUL = 001’b / m2 (gdb) info reg vl vl 0x8 8 (gdb) info reg a0 a0 0x41 65 vmv.v.x v2,a0 (gdb) info reg v2 v2 {q = {0x41000000410000004100000041} (gdb) info reg v3 v2 {q = {0x41000000410000004100000041}
Oh. I missed it: vmv.v.x v2,s0 vse8.v v2,0(a5) Leave it to me today. It should be simple fix. Thanks for report it.
Could you continue debug more cases ? FAIL: gcc.c-torture/execute/pr89369.c -O2 execution test FAIL: gcc.c-torture/execute/pr89369.c -O2 -flto -fno-use-linker-plugin -flto-partition=none execution test FAIL: gcc.c-torture/execute/pr89369.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects execution test FAIL: gcc.c-torture/execute/pr89369.c -O3 -g execution test FAIL: gcc.dg/pr96239.c execution test FAIL: gcc.dg/vshift-5.c execution test FAIL: gcc.dg/torture/pr61346.c -O2 execution test FAIL: gcc.dg/torture/pr61346.c -O2 -flto -fno-use-linker-plugin -flto-partition=none execution test FAIL: gcc.dg/torture/pr61346.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test FAIL: gcc.dg/torture/pr61346.c -O3 -g execution test They are RV32 system. memset issue I will fix it soon today.
(In reply to JuzheZhong from comment #7) > Oh. I missed it: > > vmv.v.x v2,s0 > vse8.v v2,0(a5) > > Leave it to me today. It should be simple fix. > > Thanks for report it. Can I request you to let me continue to debug and fix it. I want to familiarize myself with the vsetv pass and this seems like a good opportunity to do so considering you think the fix is not hard.
(In reply to Vineet Gupta from comment #9) > (In reply to JuzheZhong from comment #7) > > Oh. I missed it: > > > > vmv.v.x v2,s0 > > vse8.v v2,0(a5) > > > > Leave it to me today. It should be simple fix. > > > > Thanks for report it. > > Can I request you to let me continue to debug and fix it. I want to > familiarize myself with the vsetv pass and this seems like a good > opportunity to do so considering you think the fix is not hard. OK. Thanks.
As a hack I commented out set_delete() to see what the extraneous vsetvli would have been. ``` .L36: # bb 3: start of outer loop: off 0 vsetvli zero,zero,e8,mf2,ta,ma # insn 2915 vse8.v v1,0(t3) # insn 2874, bb 3 vse8.v v1,0(t6) vse8.v v1,0(s0) ... # bb 181 addi a4,a4,1 li a7,15 bne a4,a7,.L36 # insn 1082 # bb 182: start of outer loop: off 1 vsetvli zero,zero,e32,mf2,ta,ma # insn 2919 vmv.x.s a3,v1 # insn 1858 vsetvli zero,zero,e16,mf2,ta,ma sw a3,8(sp) vmv.x.s a3,v1 ``` Essentially phase 2 is fusing vsetvl info for insn 2874 and insn 1858 But the fused info doesn't seem right. vsetvli zero,zero,e32,m2,ta,ma j .L36 Manually modifying it to orig value fixes the test. vsetvli zero,zero,e8,mf2,ta,ma j .L36 Phase 2 logs ``` Try lift up 1. earliest: Edge(bb 0 -> bb 2): n_bits = 13, set = {0 } Edge(bb 62 -> bb 63): n_bits = 13, set = {4 } Edge(bb 180 -> bb 181): n_bits = 13, set = {8 } Edge(bb 181 -> bb 3): n_bits = 13, set = {2 } Fuse curr info since prev info compatible with it: prev_info: VALID (insn 1858, bb 181) <-- due to Edge(bb 181 -> bb 3) Demand fields: demand_sew_only SEW=32, VLMUL=mf2, RATIO=64, MAX_SEW=64 TAIL_POLICY=agnostic, MASK_POLICY=agnostic AVL=(nil) VL=(nil) curr_info: VALID (insn 2874, bb 3) Demand fields: demand_ratio_only demand_avl SEW=8, VLMUL=mf2, RATIO=16, MAX_SEW=64 TAIL_POLICY=agnostic, MASK_POLICY=agnostic AVL=(const_int 8 [0x8]) VL=(nil) prev_info after fused: VALID (insn 1858, bb 181) Demand fields: demand_sew_lmul demand_avl SEW=32, VLMUL=m2, RATIO=16, MAX_SEW=64 TAIL_POLICY=agnostic, MASK_POLICY=agnostic AVL=(const_int 8 [0x8]) VL=(nil) ``` This fuse in turn is happening from DEF_SEW_LMUL_RULE (sew_only, ratio_only, sew_lmul, next_ratio_valid_for_prev_sew_p, always_false, modify_lmul_with_next_ratio) I'm not really sure if the merge callback here is correct.
The merge is correct here. vmv.x.s demand SEW = 32 wheras vle/vse demand RATIO = 16 (e8mf2) So, to make vsetvl valid for both of them, the vtype should be sew = 32 and lmul = M2 (32 / 16 = 2).
Then I don't know where the problem actually is ?
Let me give you some guide which helps you to dig into the problem. First, reduce the case as follows: #include <string.h> void abort (void); void exit (int); #ifndef MAX_OFFSET #define MAX_OFFSET (sizeof (long long)) #endif #ifndef MAX_COPY #define MAX_COPY 15 #endif #ifndef MAX_EXTRA #define MAX_EXTRA (sizeof (long long)) #endif #define MAX_LENGTH (MAX_OFFSET + MAX_COPY + MAX_EXTRA) static union { char buf[MAX_LENGTH]; long long align_int; long double align_fp; } u; char A = 'A'; void reset () { int i; for (i = 0; i < MAX_LENGTH; i++) u.buf[i] = 'a'; } void check (int off, int len, int ch) { char *q; int i; q = u.buf; for (i = 0; i < off; i++, q++) if (*q != 'a') abort (); for (i = 0; i < len; i++, q++) if (*q != ch) abort (); for (i = 0; i < MAX_EXTRA; i++, q++) if (*q != 'a') abort (); } int main () { int len; char *p; /* off == 0 */ for (len = 0; len < MAX_COPY; len++) { reset (); p = memset (u.buf, '\0', len); if (p != u.buf) abort (); check (0, len, '\0'); p = memset (u.buf, A, len); if (p != u.buf) abort (); check (0, len, 'A'); p = memset (u.buf, 'B', len); if (p != u.buf) abort (); check (0, len, 'B'); } /* off == 1 */ for (len = 0; len < MAX_COPY; len++) { reset (); p = memset (u.buf+1, '\0', len); if (p != u.buf+1) abort (); check (1, len, '\0'); p = memset (u.buf+1, A, len); if (p != u.buf+1) abort (); check (1, len, 'A'); p = memset (u.buf+1, 'B', len); if (p != u.buf+1) abort (); check (1, len, 'B'); } exit (0); }
(In reply to JuzheZhong from comment #14) > Let me give you some guide which helps you to dig into the problem. > > First, reduce the case as follows: Did your msg get truncated or pressed send too soon ? Because the reduced test you pasted is exactly what I uploaded to the bug and I can't reduce it any further.
The victim should be these 2 pieces of codes: .L20: lbu a1,0(a3) li t1,97 bne a1,t1,.L21 lbu t1,1(a3) bne t1,a1,.L21 lbu a1,2(a3) bne a1,t1,.L21 lbu t1,3(a3) bne t1,a1,.L21 lbu a1,4(a3) bne a1,t1,.L21 lbu t1,5(a3) bne t1,a1,.L21 lbu a1,6(a3) bne a1,t1,.L21 lbu a3,7(a3) bne a3,a1,.L21 lui a3,%hi(A) lbu a3,%lo(A)(a3) mv t1,a5 mv a1,a4 bltu a4,t3,.L24 mv t1,t4 addi a1,a4,-8 vmv.v.x v2,a3 vse8.v v2,0(a2) .L24: .L29: lbu t1,0(a1) li t6,97 bne t1,t6,.L21 lbu t6,1(a1) bne t6,t1,.L21 lbu t1,2(a1) bne t1,t6,.L21 lbu t6,3(a1) bne t6,t1,.L21 lbu t1,4(a1) bne t1,t6,.L21 lbu t6,5(a1) bne t6,t1,.L21 lbu t1,6(a1) bne t1,t6,.L21 lbu a1,7(a1) bne a1,t1,.L21 mv t1,a5 mv a1,a4 bltu a4,t3,.L31 li t6,66 mv t1,t4 addi a1,a4,-8 vmv.v.x v2,t6 vse8.v v2,0(a2) .L31: They are located on BB 54 and BB 113. Their VSETVLs should not be eliminiated.
The incorrect elimination happens on pre_global_vsetvl_info You can try simple hack like this: diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc index 8466b5d019e..65dcf931808 100644 --- a/gcc/config/riscv/riscv-vsetvl.cc +++ b/gcc/config/riscv/riscv-vsetvl.cc @@ -3135,6 +3135,8 @@ pre_vsetvl::pre_global_vsetvl_info () for (const bb_info *bb : crtl->ssa->bbs ()) { sbitmap d = m_del[bb->index ()]; + if (bb->index () == 113 || bb->index () == 54) + continue; if (bitmap_count_bits (d) == 0) continue; FAIL will be fixed. So, the idea is that we should investigate why LCM calculation return m_del to be true on BB 54 and BB 113. The calculation is done by pre_edge_lcm_avs
I know how to fix it. Testing a candidate patch.
Created attachment 56589 [details] bug fix patch Hi,Vineet. The attachment is the bug fix patch. I have tested on both RV32 and RV64 C/C++. No regression with reducing memset-3.c FAIL. You can verify and send the patch. Thanks.
This is the reason of this patch: The whole analysis is correct.But we made a mistake on inserting vsetvls. This is the story. We have 2 types of global vsetvls insertion. One is earliest fusion of each end of the block. The other is LCM suggested edge vsetvls. So before this patch, insertion as follows: (insn 2817 2820 2818 361 (set (reg:SI 67 vtype) (unspec:SI [ (const_int 8 [0x8]) (const_int 7 [0x7]) (const_int 1 [0x1]) repeated x2 ] UNSPEC_VSETVL)) 1708 {vsetvl_vtype_change_only} (nil)) (insn 2818 2817 999 361 (set (reg:SI 67 vtype) (unspec:SI [ (const_int 32 [0x20]) (const_int 1 [0x1]) repeated x3 ] UNSPEC_VSETVL)) 1708 {vsetvl_vtype_change_only} (nil)) After this patch: (insn 2817 2820 2819 361 (set (reg:SI 67 vtype) (unspec:SI [ (const_int 32 [0x20]) (const_int 1 [0x1]) repeated x3 ] UNSPEC_VSETVL)) 1708 {vsetvl_vtype_change_only} (nil)) (insn 2819 2817 999 361 (set (reg:SI 67 vtype) (unspec:SI [ (const_int 8 [0x8]) (const_int 7 [0x7]) (const_int 1 [0x1]) repeated x2 ] UNSPEC_VSETVL)) 1708 {vsetvl_vtype_change_only} (nil)) The original insertion order is incorrect. We should first insert earliest fusion since it is the vsetvls information already there which was seen by later LCM. We just delay the insertion. So it should be come before the LCM suggested insertion. Feel free to investigate it with comparing before and after the patch. Then feel free to send patch with this fix after you understand the reasons. Thanks.
The master branch has been updated by Vineet Gupta <vineetg@gcc.gnu.org>: https://gcc.gnu.org/g:d1189ceefc1da1515e039c9cfd2f5a67b5820207 commit r14-5507-gd1189ceefc1da1515e039c9cfd2f5a67b5820207 Author: Juzhe-Zhong <juzhe.zhong@rivai.ai> Date: Tue Nov 14 19:23:19 2023 -0800 RISC-V: fix vsetvli pass testsuite failure [PR/112447] Fixes: f0e28d8c1371 ("RISC-V: Fix failed hoist in LICM of vmv.v.x instruction") Since above commit, we have following failure: FAIL: gcc.c-torture/execute/memset-3.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test FAIL: gcc.c-torture/execute/memset-3.c -O3 -g execution test The issue was not the commit but rather it unravelled an issue in the vsetvli pass. Here's Juzhe's analysis: We have 2 types of global vsetvls insertion. One is earliest fusion of each end of the block. The other is LCM suggested edge vsetvls. So before this patch, insertion as follows: | (insn 2817 2820 2818 361 (set (reg:SI 67 vtype) | (unspec:SI [ | (const_int 8 [0x8]) | (const_int 7 [0x7]) | (const_int 1 [0x1]) repeated x2 | ] UNSPEC_VSETVL)) 1708 {vsetvl_vtype_change_only} | (nil)) | (insn 2818 2817 999 361 (set (reg:SI 67 vtype) | (unspec:SI [ | (const_int 32 [0x20]) | (const_int 1 [0x1]) repeated x3 | ] UNSPEC_VSETVL)) 1708 {vsetvl_vtype_change_only} | (nil)) After this patch: | (insn 2817 2820 2819 361 (set (reg:SI 67 vtype) | (unspec:SI [ | (const_int 32 [0x20]) | (const_int 1 [0x1]) repeated x3 | ] UNSPEC_VSETVL)) 1708 {vsetvl_vtype_change_only} | (nil)) | (insn 2819 2817 999 361 (set (reg:SI 67 vtype) | (unspec:SI [ | (const_int 8 [0x8]) | (const_int 7 [0x7]) | (const_int 1 [0x1]) repeated x2 | ] UNSPEC_VSETVL)) 1708 {vsetvl_vtype_change_only} | (nil)) The original insertion order is incorrect. We should first insert earliest fusion since it is the vsetvls information already there which was seen by later LCM. We just delay the insertion. So it should be come before the LCM suggested insertion. PR target/112447 gcc/ChangeLog: * config/riscv/riscv-vsetvl.cc (pre_vsetvl::emit_vsetvl): Insert local vsetvl info before LCM suggested one. Tested-by: Patrick O'Neill <patrick@rivosinc.com> # pre-commit-CI #679 Co-developed-by: Vineet Gupta <vineetg@rivosinc.com>
Fixed for gcc-14.