This test case checks whether the values stored to memory are correct when the
source and destination regions overlap.

compile options: -march=rv64gcv_zvl128b -mabi=lp64d -fno-inline -O3

testcase:

#include <stdlib.h>
#include <string.h>

struct st {
  unsigned int num : 8;
};

void
mem_overlap (struct st *a, struct st *b)
{
  for (int i = 0; i < 9; i++)
    {
      a[i].num = b[i].num + 1;
    }
}

int
main (void)
{
  struct st *a;
  a = (struct st *) malloc (9 * sizeof (struct st));
  memset (a, 0, 9 * sizeof (struct st));

  // input  a = 0, 0, 0, 0, 0, 0, 0, 0, 0
  mem_overlap (&a[1], a);
  // output a = 0, 1, 2, 3, 4, 5, 6, 7, 8

  if (a[2].num == 2)
    exit (0);
  else
    abort ();
}

The following assembly shows that the first few instructions check for
overlapping memory regions, but some checks appear to be missing:

mem_overlap:
        csrr    a5,vlenb
        sub     a4,a0,a1
        slli    a5,a5,1
        addi    a4,a4,-7
        addi    a5,a5,-14
        bleu    a4,a5,.L2
        vsetivli        zero,8,e8,mf2,ta,ma
        vlseg4e8.v      v4,(a1)
        vid.v   v1
        vsll.vi v1,v1,2
        vadd.vi v4,v4,1
        vsuxei8.v       v4,(a0),v1
        lbu     a5,32(a1)
        addiw   a5,a5,1
        andi    a5,a5,0xff
        sb      a5,32(a0)
        ret
.L2:
        lbu     a5,0(a1)
        addiw   a5,a5,1
        sb      a5,0(a0)
        lbu     a5,4(a1)
        addiw   a5,a5,1
        sb      a5,4(a0)
        lbu     a5,8(a1)
        addiw   a5,a5,1
        sb      a5,8(a0)
        lbu     a5,12(a1)
        addiw   a5,a5,1
        sb      a5,12(a0)
        lbu     a5,16(a1)
        addiw   a5,a5,1
        ....

This error can also be reproduced on the master branch; just revert the
following patch:

commit e0b9c8ad7098fb08a25a61fe17d4274dd73e5145
Refs: basepoints/gcc-15-639-ge0b9c8ad709
Author:     Robin Dapp <rdapp@ventanamicro.com>
AuthorDate: Mon Feb 26 13:09:15 2024 +0100
Commit:     Robin Dapp <rdapp@ventanamicro.com>
CommitDate: Fri May 17 22:31:43 2024 +0200

    RISC-V: Add initial cost handling for segment loads/stores.

    This patch makes segment loads and stores more expensive.  It adds
    segment_permute_2 as well as 3 to 8 cost fields to the common vector
    costs and adds handling to adjust_stmt_cost.
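For reference, here is my reading of the runtime overlap check the vectorised
prologue performs (a sketch of the branch condition, not GCC output; the
function name, addresses and the vlenb value of 16 for zvl128b are my own
assumptions).  It shows why the overlapping call mem_overlap (&a[1], a) still
takes the vector path:

/* Sketch (not GCC output) of the branch condition in the prologue above,
   assuming VLENB == 16 for zvl128b:
     bleu a4,a5,.L2  falls back to the scalar loop only when
     (uintptr_t)(a - b - 7) <= 2*VLENB - 14.  */
#include <stdint.h>
#include <stdio.h>

static int
takes_vector_path (uintptr_t a, uintptr_t b, unsigned vlenb)
{
  uintptr_t diff = a - b - 7;                   /* unsigned wrap-around intended */
  return diff > (uintptr_t) (2 * vlenb - 14);
}

int
main (void)
{
  /* mem_overlap (&a[1], a): the pointers differ by sizeof (struct st) == 4,
     so the write to a[i] aliases the read of b[i + 1].  */
  uintptr_t b = 0x1000, a = b + 4;
  /* a - b - 7 wraps around to a huge unsigned value, so the check decides
     "no overlap" and the vector path is taken anyway.  */
  printf ("%d\n", takes_vector_path (a, b, 16));  /* prints 1 */
  return 0;
}

By my reading, a correct fallback window would have to include a pointer
distance of one element (4 bytes), which is exactly the distance that
mem_overlap (&a[1], a) creates.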
Confirmed failing on trunk.  Instead of reverting the patch we can just use
-fno-vect-cost-model, so the full command line would be
-O3 -march=rv64gcv -fno-inline -fno-vect-cost-model.
Confirmed, this also fails on x86-64.  It works when using unsigned char or
unsigned int data and only fails with the bitfield.  The bitfield has the
effect of aligning num and we get padding.

  create runtime check for data references _3->D.3529 and _5->D.3529
  using an address-based WAR/WAW test
t.c:10:21: note: created 1 versioning for alias checks.

  _39 = b_10(D) + 4;
  _40 = a_11(D) - _39;
  _41 = (sizetype) _40;
  _42 = _41 + 18446744073709551613;  // _41 - 3
  _18 = 1;
  if (_42 > 18)
    goto <bb 17>; [80.00%]

$2 = (const dr_with_seg_len_pair_t &) @0x4774088: {first = {dr = 0x48b1610,
  seg_len = <integer_cst 0x7ffff686b7c8>, access_size = {coeffs = {1}},
  align = 4}, second = {dr = 0x48b31a0, seg_len = <integer_cst 0x7ffff686b7c8>,
  access_size = {coeffs = {1}}, align = 4}, flags = 2}
(gdb) p debug_generic_expr (0x7ffff686b7c8)
28

The vectorization factor is 8.

Richard, you did create_waw_or_war_checks - can you check?

The testcase needs bitfield handling in if-conversion, but I suspect we can do
without that as well.  The following fails with GCC 12 already (you might or
might not need -fno-vect-cost-model):

typedef struct __attribute__((aligned(4))) { unsigned char num; } st;

void __attribute__((noipa))
mem_overlap (st *a, st *b)
{
  for (int i = 0; i < 9; i++)
    a[i].num = b[i].num + 1;
}

int main ()
{
  st * a;
  a = (st*) __builtin_malloc (9 * sizeof(st));
  __builtin_memset (a, 0, 9 * sizeof(st));

  // input  a = 0, 0, 0, 0, 0, 0, 0, 0, 0
  mem_overlap(&a[1], a);
  // output a = 0, 1, 2, 3, 4, 5, 6, 7, 8

  if (a[2].num != 2)
    __builtin_abort();
  return 0;
}
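As an aside, a small sketch (my illustration, not part of the comment) of the
size/padding effect the bitfield introduces, which is what makes the access
size (1 byte) smaller than the known alignment (4 bytes):

/* Illustration (not from the PR) of why only the bitfield variant fails:
   with "unsigned int num : 8" the struct is padded to 4 bytes and 4-byte
   aligned while the access to num is only 1 byte wide; with plain
   unsigned char both are 1 and the mismatch cannot arise.  */
#include <stdio.h>

struct st_bitfield { unsigned int num : 8; };
struct st_uchar    { unsigned char num; };

int
main (void)
{
  printf ("bitfield: size %zu align %zu\n",
          sizeof (struct st_bitfield), _Alignof (struct st_bitfield));
  printf ("uchar:    size %zu align %zu\n",
          sizeof (struct st_uchar), _Alignof (struct st_uchar));
  return 0;
}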
We document

class dr_with_seg_len
{
...
  /* The minimum common alignment of DR's start address, SEG_LEN and
     ACCESS_SIZE.  */
  unsigned int align;

but here we have access_size == 1 and align == 4.  It's also said

  /* All addresses involved are known to have a common alignment ALIGN.
     We can therefore subtract ALIGN from an exclusive endpoint to get
     an inclusive endpoint.  In the best (and common) case, ALIGN is the
     same as the access sizes of both DRs, and so subtracting ALIGN
     cancels out the addition of an access size.  */
  unsigned int align = MIN (dr_a.align, dr_b.align);
  poly_uint64 last_chunk_a = dr_a.access_size - align;
  poly_uint64 last_chunk_b = dr_b.access_size - align;

and

  We also know
  that last_chunk_b <= |step|; this is checked elsewhere if it isn't
  guaranteed at compile time.

step == 4, but last_chunk_a/b are -3U.  I couldn't find the "elsewhere" to
check what we validate there.

I think the case of align > access_size can easily happen with grouped
accesses with a gap at the end (see vect_vfa_access_size), so simply failing
the address-based check for this case is too pessimistic.

Richard, I'd really like you to handle this.
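For concreteness, a minimal sketch (my own illustration, not the GCC code) of
the underflow with the values seen in this PR:

/* Illustration (not the GCC code) of the last_chunk computation quoted
   above, with access_size == 1 and align == 4.  The documented precondition
   is that ALIGN divides ACCESS_SIZE, so the subtraction is expected to be
   non-negative; here it wraps instead.  */
#include <stdio.h>

int
main (void)
{
  unsigned int access_size_a = 1, access_size_b = 1;
  unsigned int align_a = 4, align_b = 4;

  unsigned int align = align_a < align_b ? align_a : align_b;  /* MIN */
  unsigned int last_chunk_a = access_size_a - align;  /* wraps to -3U */
  unsigned int last_chunk_b = access_size_b - align;  /* wraps to -3U */

  /* Prints 0xfffffffd 0xfffffffd - plausibly the source of the "- 3"
     visible in the emitted check (_42 = _41 + 18446744073709551613).  */
  printf ("%#x %#x\n", last_chunk_a, last_chunk_b);
  return 0;
}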
Will have a look (but might be a few days).
(In reply to Richard Biener from comment #3)
> We document
>
> class dr_with_seg_len
> {
> ...
>   /* The minimum common alignment of DR's start address, SEG_LEN and
>      ACCESS_SIZE.  */
>   unsigned int align;
>
> but here we have access_size == 1 and align == 4.  It's also said
>
>   /* All addresses involved are known to have a common alignment ALIGN.
>      We can therefore subtract ALIGN from an exclusive endpoint to get
>      an inclusive endpoint.  In the best (and common) case, ALIGN is the
>      same as the access sizes of both DRs, and so subtracting ALIGN
>      cancels out the addition of an access size.  */
>   unsigned int align = MIN (dr_a.align, dr_b.align);
>   poly_uint64 last_chunk_a = dr_a.access_size - align;
>   poly_uint64 last_chunk_b = dr_b.access_size - align;
>
> and
>
>   We also know
>   that last_chunk_b <= |step|; this is checked elsewhere if it isn't
>   guaranteed at compile time.
>
> step == 4, but last_chunk_a/b are -3U.  I couldn't find the "elsewhere"
> to check what we validate there.

The assumption that access_size is a multiple of align is crucial, so like
you say, it all falls apart if that doesn't hold.  In this case, that means
that last_chunk_* should never have been negative.

But I agree that the “elsewhere” doesn't seem to exist after all.  That is,
the step can be arbitrarily smaller than the access size.  Somewhat
relatedly, we seem to vectorise:

struct s { int x; } __attribute__((packed));

void f (char *xc, char *yc, int z)
{
  for (int i = 0; i < 100; ++i)
    {
      struct s *x = (struct s *) xc;
      struct s *y = (struct s *) yc;
      x->x += y->x;
      xc += z;
      yc += z;
    }
}

on aarch64 even with -mstrict-align -fno-vect-cost-model, generating
elementwise accesses that assume that the ints are aligned.  E.g.:

  _71 = (char *) ivtmp.19_21;
  _30 = ivtmp.29_94 - _26;
  _60 = (char *) _30;
  _52 = __MEM <int> ((int *)_71);
  _53 = (char *) ivtmp.25_18;
  _54 = __MEM <int> ((int *)_53);
  _55 = (char *) ivtmp.26_16;
  _56 = __MEM <int> ((int *)_55);
  _57 = (char *) ivtmp.27_88;
  _58 = __MEM <int> ((int *)_57);
  _59 = _Literal (int [[gnu::vector_size(16)]]) {_52, _54, _56, _58};

But the vector loop is executed even for a step of 1 (byte), provided that
x and y don't overlap.

> I think the case of align > access_size can easily happen with grouped
> accesses with a gap at the end (see vect_vfa_access_size), so simply
> failing the address-based check for this case is too pessimistic.

Yeah.  Something similar happened in PR115192, and I think this PR shows
that the fix for PR115192 was misplaced.  We should fix the overly large
alignment at source, to meet the dr_with_seg_len precondition you quoted
above.
(In reply to Richard Sandiford from comment #5)
> [...]
> Somewhat relatedly, we seem to vectorise:
>
> struct s { int x; } __attribute__((packed));
>
> void f (char *xc, char *yc, int z)
> {
>   for (int i = 0; i < 100; ++i)
>     {
>       struct s *x = (struct s *) xc;
>       struct s *y = (struct s *) yc;
>       x->x += y->x;
>       xc += z;
>       yc += z;
>     }
> }
>
> on aarch64 even with -mstrict-align -fno-vect-cost-model, generating
> elementwise accesses that assume that the ints are aligned.
> [...]
> But the vector loop is executed even for a step of 1 (byte), provided that
> x and y don't overlap.

I think this is due to a similar issue to what you noticed wrt dr_aligned and
how we then emit aligned loads; instead of checking the byte alignment against
the access size we emit, I think we don't consider misaligned elements at all
when code-generating element accesses.  We do see

t.c:5:21: note:   vect_compute_data_ref_alignment:
t.c:5:21: missed:   step doesn't divide the vector alignment.
t.c:5:21: missed:   Unknown alignment for access: MEM[(struct s *)xc_21].x
t.c:5:21: note:   vect_compute_data_ref_alignment:
t.c:5:21: missed:   step doesn't divide the vector alignment.
t.c:5:21: missed:   Unknown alignment for access: MEM[(struct s *)yc_22].x
t.c:5:21: note:   vect_compute_data_ref_alignment:
t.c:5:21: missed:   step doesn't divide the vector alignment.
t.c:5:21: missed:   Unknown alignment for access: MEM[(struct s *)xc_21].x

I'm splitting this out to another PR.
PR119155
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:e8651b80aeb86da935035e218747a6b41b611497

commit r15-7881-ge8651b80aeb86da935035e218747a6b41b611497
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Fri Mar 7 10:18:35 2025 +0000

    vect: Enforce dr_with_seg_len::align precondition [PR116125]

    tree-data-refs.cc uses alignment information to try to optimise the
    code generated for alias checks.  The assumption for "normal"
    non-grouped, full-width scalar accesses was that the access size
    would be a multiple of the alignment.  As Richi notes in the PR,
    this is a documented precondition of dr_with_seg_len:

      /* The minimum common alignment of DR's start address, SEG_LEN and
         ACCESS_SIZE.  */
      unsigned int align;

    PR115192 was a case in which this assumption didn't hold.  The access
    was part of an aligned 4-element group, but only the first 2 elements
    of the group were accessed.  The alignment was therefore double the
    access size.

    In r15-820-ga0fe4fb1c8d78045 I'd "fixed" that by capping the alignment
    in one of the output routines.  But I think that was misconceived.
    The precondition means that we should cap the alignment at source
    instead.

    Failure to do that caused a similar wrong code bug in this PR, where
    the alignment comes from a short bitfield access rather than from a
    group access.

    gcc/
            PR tree-optimization/116125
            * tree-vect-data-refs.cc (vect_prune_runtime_alias_test_list):
            Make the dr_with_seg_len alignment fields describe the access
            sizes as well as the pointer alignment.
            * tree-data-ref.cc (create_intersect_range_checks): Don't
            compensate for invalid alignment fields here.

    gcc/testsuite/
            PR tree-optimization/116125
            * gcc.dg/vect/pr116125.c: New test.
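For intuition, a conceptual sketch (not the actual patch, and capped_align is
a hypothetical helper of my own) of the precondition the fix enforces when the
alias-check records are built: the recorded alignment must also divide the
access size and the segment length, so an alignment derived from the pointer
alone gets reduced.

/* Conceptual sketch only - not the code in vect_prune_runtime_alias_test_list.
   The recorded alignment must divide the access size and the segment length,
   so a pointer-derived alignment has to be reduced until it does.  */
#include <stdio.h>

static unsigned int
capped_align (unsigned int ptr_align, unsigned long long access_size,
              unsigned long long seg_len)
{
  unsigned int align = ptr_align;
  /* Alignments are powers of two, so halving always terminates at 1.  */
  while (access_size % align != 0 || seg_len % align != 0)
    align /= 2;
  return align;
}

int
main (void)
{
  /* Values from this PR: a 4-byte aligned pointer, a 1-byte bitfield access,
     a 28-byte segment.  The recorded alignment drops to 1, so
     access_size - align can no longer go negative.  */
  printf ("%u\n", capped_align (4, 1, 28));  /* prints 1 */
  return 0;
}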
Fixed on trunk, will backport after a grace period.