| Summary: | [4.9 regression] ARM: NEON: memcpy compiles to vst1 with incorrect alignment | ||
|---|---|---|---|
| Product: | gcc | Reporter: | Mark H Weaver <mhw> |
| Component: | tree-optimization | Assignee: | Richard Biener <rguenth> |
| Status: | RESOLVED FIXED | ||
| Severity: | major | CC: | mikpelinux, ramana, rguenth |
| Priority: | P3 | Keywords: | wrong-code |
| Version: | 4.9.3 | ||
| Target Milestone: | 5.3 | ||
| Host: | arm-unknown-linux-gnueabihf | Target: | arm-unknown-linux-gnueabihf |
| Build: | arm-unknown-linux-gnueabihf | Known to work: | 5.3.0, 6.0 |
| Known to fail: | 4.8.5, 4.9.3, 5.1.0, 5.2.0 | Last reconfirmed: | 2015-07-29 00:00:00 |
| Attachments: |
Minimal example code that is miscompiled
patch |
||
|
Description
Mark H Weaver
2015-07-17 20:06:43 UTC
I can reproduce the alignment faults on armv7l-linux-gnueabi with -O3 -mfpu=neon and gcc-4.8.5/4.9.3/5.2/6.0, but not with gcc-4.7.4. This test case changed behaviour twice in the 4.7->4.8 development cycle. First r185807 broke it by replacing code for unaligned memory accesses with code requiring more alignment than present in the source: --- pr66917.s-r185806 2015-07-19 17:16:23.536116155 +0200 +++ pr66917.s-r185807 2015-07-19 17:13:23.016388416 +0200 @@ -17,42 +17,13 @@ .global test_neon_load_store_alignment .type test_neon_load_store_alignment, %function test_neon_load_store_alignment: - @ args = 0, pretend = 0, frame = 32 + @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. - mov r3, r0 - stmfd sp!, {r4, r5, r6, r7} - mov r7, r1 - ldr r0, [r0, #0] @ unaligned - mov r6, r2 - sub sp, sp, #32 - ldr r1, [r3, #4] @ unaligned - mov r5, sp - ldr r2, [r3, #8] @ unaligned - add r4, sp, #16 - ldr r3, [r3, #12] @ unaligned - mov ip, sp - stmia r5!, {r0, r1, r2, r3} - ldr r0, [r7, #0] @ unaligned - ldr r1, [r7, #4] @ unaligned - ldr r2, [r7, #8] @ unaligned - ldr r3, [r7, #12] @ unaligned - fldd d16, [sp, #0] @ int - fldd d19, [sp, #8] @ int - stmia r4!, {r0, r1, r2, r3} - fldd d18, [sp, #16] @ int - veor d17, d16, d18 - fldd d18, [sp, #24] @ int - fstd d17, [sp, #0] @ int - veor d16, d19, d18 - fstd d16, [sp, #8] @ int - ldmia ip!, {r0, r1, r2, r3} - str r0, [r6, #0] @ unaligned - str r1, [r6, #4] @ unaligned - str r2, [r6, #8] @ unaligned - str r3, [r6, #12] @ unaligned - add sp, sp, #32 - ldmfd sp!, {r4, r5, r6, r7} + vldmia r0, {d18-d19} + vldmia r1, {d16-d17} + veor q8, q9, q8 + vstmia r2, {d16-d17} bx lr .size test_neon_load_store_alignment, .-test_neon_load_store_alignment .section .text.startup,"ax",%progbits On Linux, this code SIGBUSes because the kernel can't fix up the first misaligned access: [292105.326391] Alignment trap: not handling instruction ecd02b04 at [<00008e84>] [292105.396370] Unhandled fault: alignment exception (0x001) at 0x0008b109 Then r191399 changed it again by replacing the vldm/vstm instructions with vld1.64/vst1.64 instructions: --- pr66917.s-r191398 2015-07-19 19:12:15.815583139 +0200 +++ pr66917.s-r191399 2015-07-19 19:08:36.416037498 +0200 @@ -20,10 +20,10 @@ @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. - vldmia r0, {d18-d19} - vldmia r1, {d16-d17} + vld1.64 {d18-d19}, [r0:64] + vld1.64 {d16-d17}, [r1:64] veor q8, q9, q8 - vstmia r2, {d16-d17} + vst1.64 {d16-d17}, [r2:64] bx lr .size test_neon_load_store_alignment, .-test_neon_load_store_alignment .section .text.startup,"ax",%progbits These instructions still fault, but the kernel recognizes them and fixes up the alignment faults (if suitably configured). If I compile the test case for x86_64 w/ -O3 -mavx, the compiler generates vmovdqu instructions which permit unaligned addresses. So I suspect a target bug. for STRICT_ALIGNMENT targets properly and (In reply to Mikael Pettersson from comment #2) > This test case changed behaviour twice in the 4.7->4.8 development cycle. > First r185807 broke it by replacing code for unaligned memory accesses with > code requiring more alignment than present in the source: > > --- pr66917.s-r185806 2015-07-19 17:16:23.536116155 +0200 > +++ pr66917.s-r185807 2015-07-19 17:13:23.016388416 +0200 > @@ -17,42 +17,13 @@ > .global test_neon_load_store_alignment > .type test_neon_load_store_alignment, %function > test_neon_load_store_alignment: > - @ args = 0, pretend = 0, frame = 32 > + @ args = 0, pretend = 0, frame = 0 > @ frame_needed = 0, uses_anonymous_args = 0 > @ link register save eliminated. > - mov r3, r0 > - stmfd sp!, {r4, r5, r6, r7} > - mov r7, r1 > - ldr r0, [r0, #0] @ unaligned > - mov r6, r2 > - sub sp, sp, #32 > - ldr r1, [r3, #4] @ unaligned > - mov r5, sp > - ldr r2, [r3, #8] @ unaligned > - add r4, sp, #16 > - ldr r3, [r3, #12] @ unaligned > - mov ip, sp > - stmia r5!, {r0, r1, r2, r3} > - ldr r0, [r7, #0] @ unaligned > - ldr r1, [r7, #4] @ unaligned > - ldr r2, [r7, #8] @ unaligned > - ldr r3, [r7, #12] @ unaligned > - fldd d16, [sp, #0] @ int > - fldd d19, [sp, #8] @ int > - stmia r4!, {r0, r1, r2, r3} > - fldd d18, [sp, #16] @ int > - veor d17, d16, d18 > - fldd d18, [sp, #24] @ int > - fstd d17, [sp, #0] @ int > - veor d16, d19, d18 > - fstd d16, [sp, #8] @ int > - ldmia ip!, {r0, r1, r2, r3} > - str r0, [r6, #0] @ unaligned > - str r1, [r6, #4] @ unaligned > - str r2, [r6, #8] @ unaligned > - str r3, [r6, #12] @ unaligned > - add sp, sp, #32 > - ldmfd sp!, {r4, r5, r6, r7} > + vldmia r0, {d18-d19} > + vldmia r1, {d16-d17} > + veor q8, q9, q8 > + vstmia r2, {d16-d17} > bx lr > .size test_neon_load_store_alignment, > .-test_neon_load_store_alignment > .section .text.startup,"ax",%progbits > > On Linux, this code SIGBUSes because the kernel can't fix up the first > misaligned access: > > [292105.326391] Alignment trap: not handling instruction ecd02b04 at > [<00008e84>] > [292105.396370] Unhandled fault: alignment exception (0x001) at 0x0008b109 > > Then r191399 changed it again by replacing the vldm/vstm instructions with > vld1.64/vst1.64 instructions: > > --- pr66917.s-r191398 2015-07-19 19:12:15.815583139 +0200 > +++ pr66917.s-r191399 2015-07-19 19:08:36.416037498 +0200 > @@ -20,10 +20,10 @@ > @ args = 0, pretend = 0, frame = 0 > @ frame_needed = 0, uses_anonymous_args = 0 > @ link register save eliminated. > - vldmia r0, {d18-d19} > - vldmia r1, {d16-d17} > + vld1.64 {d18-d19}, [r0:64] > + vld1.64 {d16-d17}, [r1:64] > veor q8, q9, q8 > - vstmia r2, {d16-d17} > + vst1.64 {d16-d17}, [r2:64] > bx lr > .size test_neon_load_store_alignment, > .-test_neon_load_store_alignment > .section .text.startup,"ax",%progbits > > These instructions still fault, but the kernel recognizes them and fixes up > the alignment faults (if suitably configured). > > If I compile the test case for x86_64 w/ -O3 -mavx, the compiler generates > vmovdqu instructions which permit unaligned addresses. So I suspect a > target bug. No, sorry that doesn't suggest a target bug to me at all. The ARM backend is a STRICT_ALIGNMENT target, this implies the optimizers cannot introduce unaligned accesses unless sent through the movmisalign interface AFAIUI. In this testcase somehow at expand time the compiler thinks that the pointer is aligned to 8 bytes and can put out a normal mov insn rather than use the misaligned interface. From the expand dump: (insn 8 5 9 2 (set (reg:V2DI 117) (mem:V2DI (reg/v/f:SI 113 [ ap ]) [0 MEM[(char * {ref-all})ap_2(D)]+0 S16 A64])) /tmp/misalign.c:17 -1 (nil)) (insn 9 8 10 2 (set (reg:V2DI 118) (mem:V2DI (reg/v/f:SI 114 [ bp ]) [0 MEM[(char * {ref-all})bp_3(D)]+0 S16 A64])) /tmp/misalign.c:17 -1 (nil)) Look at the A64.. That's a smoking gun to me and needs further investigation. Wrong code. The slp2 dump at tree level says about the ap pointer:
base_address: ap_2(D)
offset from base address: 0
constant offset from base address: 0
step: 0
aligned to: 64
base_object: MEM[(char * {ref-all})ap_2(D)]
That looks dodgy?
Probably because you access a.u/b.u which is uint64_t and thus the union is laid out as having 8 byte alignment? How do the original GENERIC trees produced look like? (In reply to Richard Biener from comment #6) > Probably because you access a.u/b.u which is uint64_t and thus the union > is laid out as having 8 byte alignment? > > How do the original GENERIC trees produced look like? The 003t.original shows: typedef union { uint64_t u[2]; uint8_t c[16]; } unionunion { uint64_t u[2]; uint8_t c[16]; }; union { uint64_t u[2]; uint8_t c[16]; } a; union { uint64_t u[2]; uint8_t c[16]; } b; union { uint64_t u[2]; uint8_t c[16]; } a; union { uint64_t u[2]; uint8_t c[16]; } b; memcpy ((void * restrict) &a.c, (const void * restrict) ap, 16); memcpy ((void * restrict) &b.c, (const void * restrict) bp, 16); a.u[0] = a.u[0] ^ b.u[0]; a.u[1] = a.u[1] ^ b.u[1]; memcpy ((void * restrict) outp, (const void * restrict) &a.c, 16); The 007t.lower one looks like:
try
{
MEM[(char * {ref-all})&a] = MEM[(char * {ref-all})ap];
MEM[(char * {ref-all})&b] = MEM[(char * {ref-all})bp];
D.5587 = a.u[0];
D.5588 = b.u[0];
D.5589 = D.5587 ^ D.5588;
a.u[0] = D.5589;
D.5590 = a.u[1];
D.5591 = b.u[1];
D.5592 = D.5590 ^ D.5591;
a.u[1] = D.5592;
MEM[(char * {ref-all})outp] = MEM[(char * {ref-all})&a];
}
finally
{
a = {CLOBBER};
b = {CLOBBER};
}
return;
I see the following, 'a' having an alignment of 8 bytes:
<array_ref 0x7ffff68201c0
type <integer_type 0x7ffff67f5d20 uint64_t unsigned DI
size <integer_cst 0x7ffff68ccf18 constant 64>
unit size <integer_cst 0x7ffff68ccf30 constant 8>
align 64 symtab 0 alias set -1 canonical type 0x7ffff68d19d8 precision 64 min <integer_cst 0x7ffff68de210 0> max <integer_cst 0x7ffff68d7500 18446744073709551615> context <translation_unit_decl 0x7ffff7ff10f0 D.5218>
pointer_to_this <pointer_type 0x7ffff68215e8>>
arg 0 <component_ref 0x7ffff681f570
type <array_type 0x7ffff6821540 type <integer_type 0x7ffff67f5d20 uint64_t>
TI
size <integer_cst 0x7ffff68de228 constant 128>
unit size <integer_cst 0x7ffff68de240 constant 16>
align 64 symtab 0 alias set -1 canonical type 0x7ffff6821690 domain <integer_type 0x7ffff6821498>>
arg 0 <var_decl 0x7ffff68d6900 a type <union_type 0x7ffff68213f0>
addressable used BLK file t.c line 12 col 5 size <integer_cst 0x7ffff68de228 128> unit size <integer_cst 0x7ffff68de240 16>
align 64 context <function_decl 0x7ffff67f6e00 test_neon_load_store_alignment> chain <var_decl 0x7ffff68d6990 b>>
arg 1 <field_decl 0x7ffff69b3558 u type <array_type 0x7ffff6821540>
TI file t.c line 10 col 16 size <integer_cst 0x7ffff68de228 128> unit size <integer_cst 0x7ffff68de240 16>
align 64 offset_align 64
offset <integer_cst 0x7ffff68ccee8 constant 0>
bit offset <integer_cst 0x7ffff68ccf48 constant 0> context <union_type 0x7ffff68213f0> chain <field_decl 0x7ffff69b35f0 c>>>
arg 1 <integer_cst 0x7ffff68de2e8 type <integer_type 0x7ffff68d1690 int> constant 1>>
so the smoking dump isn't the reads but the write:
(insn 11 10 0 (set (mem:V2DI (reg/v/f:SI 115 [ outp ]) [0 MEM[(char * {ref-all})outp_7(D)]+0 S16 A64])
(reg:V2DI 116 [ vect__4.13 ])) t.c:18 -1
(nil))
which for some reason also got 8 byte alignment. I suppose this is
SRA at work as -fno-tree-sra fixes this.
Thus confirmed as SRA problem.
Actually SRA is fine. vect_compute_data_ref_alignment is, too, leaving
DR_MISALIGNMENT at -1. The target then tells us via
bool is_packed = false;
tree type = (TREE_TYPE (DR_REF (dr)));
if (!known_alignment_for_access_p (dr))
is_packed = not_size_aligned (DR_REF (dr));
if ((TYPE_USER_ALIGN (type) && !is_packed)
|| targetm.vectorize.
support_vector_misalignment (mode, type,
DR_MISALIGNMENT (dr), is_packed))
return dr_unaligned_supported;
rm_builtin_support_vector_misalignment (mode=V2DImode, type=0x7ffff68280a8,
misalignment=-1, is_packed=true)
at /space/rguenther/tramp3d/trunk/gcc/config/arm/arm.c:27218
27218 if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
27217 {
27218 if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
27219 {
27220 HOST_WIDE_INT align = TYPE_ALIGN_UNIT (type);
27221
27222 if (is_packed)
(gdb) p unaligned_access
No symbol "unaligned_access" in current context.
(gdb) n
27220 HOST_WIDE_INT align = TYPE_ALIGN_UNIT (type);
(gdb)
27222 if (is_packed)
(gdb)
27223 return align == 1;
oops. That's obviously buggy. Not sure what the ARM hook expects in 'type',
but appearantly it's not what the vectorizer passes in here.
Docs say
This hook should return true if the target supports misaligned vector\n\
store/load of a specific factor denoted in the @var{misalignment}\n\
parameter. The vector store/load should be of machine mode @var{mode} and\n\
the elements in the vectors should be of type @var{type}. @var{is_packed}\n\
parameter is true if the memory access is defined in a packed struct.
but the arm hook seems to say "yes, I can vectorize with V2DImode
for a vector with element type unsigned long long __attribute__((aligned(1)))"
which it obviously cannot.
IMHO the hook should compare required mode alignment with type alignment.
This seems to be the arm hook trying to be too clever.
-> target bug.
Suggested fix:
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c (revision 226348)
+++ gcc/config/arm/arm.c (working copy)
@@ -27217,7 +27217,8 @@ arm_builtin_support_vector_misalignment
{
if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
{
- HOST_WIDE_INT align = TYPE_ALIGN_UNIT (type);
+ HOST_WIDE_INT align
+ = GET_MODE_ALIGNMENT (TYPE_MODE (type)) / BITS_PER_UNIT;
if (is_packed)
return align == 1;
which produces
push {r4}
add r3, r0, #8
add r4, r1, #8
vld1.64 {d17}, [r1]
vld1.64 {d16}, [r4]
ldr r4, [sp], #4
vld1.64 {d19}, [r3]
veor d16, d16, d19
vld1.64 {d18}, [r0]
veor d17, d17, d18
vst1.64 {d17}, [r2]!
vst1.64 {d16}, [r2]
bx lr
> but the arm hook seems to say "yes, I can vectorize with V2DImode
> for a vector with element type unsigned long long __attribute__((aligned(1)))"
> which it obviously cannot.
It can vectorize for this case, but the loads must not be marked as aligned loads. That is, the mid-end must use the misaligned code hooks.
So I don't think this is a target bug.
Mine. On Thu, 30 Jul 2015, rearnsha at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66917
>
> Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> Component|target |tree-optimization
>
> --- Comment #12 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > but the arm hook seems to say "yes, I can vectorize with V2DImode
> > for a vector with element type unsigned long long __attribute__((aligned(1)))"
> > which it obviously cannot.
>
> It can vectorize for this case, but the loads must not be marked as aligned
> loads. That is, the mid-end must use the misaligned code hooks.
>
> So I don't think this is a target bug.
Hmm, indeed. It's a vectorizer bug that always communicates at least
element alignment (thus doesn't properly code-gen for the packed_p
case)
For both loads and stores we do
else if (DR_MISALIGNMENT (first_dr) == -1)
{
TREE_TYPE (data_ref)
= build_aligned_type (TREE_TYPE (data_ref),
TYPE_ALIGN (elem_type));
align = TYPE_ALIGN_UNIT (elem_type);
misalign = 0;
but TYPE_ALIGN (elem_type) doesn't always hold.
Created attachment 36099 [details]
patch
I am testing the attached (on x86_64-linux), it fixes the testcase with a cross to arm. Full testing on arm appreciated.
(In reply to Richard Biener from comment #16) > Created attachment 36099 [details] > patch > > I am testing the attached (on x86_64-linux), it fixes the testcase with a > cross to arm. Full testing on arm appreciated. I'll give it a test run (In reply to ktkachov from comment #17) > (In reply to Richard Biener from comment #16) > > Created attachment 36099 [details] > > patch > > > > I am testing the attached (on x86_64-linux), it fixes the testcase with a > > cross to arm. Full testing on arm appreciated. > > I'll give it a test run A bootstrap and make check on arm-none-linux-gnueabihf looks fine Fixed on trunk sofar. Author: rguenth Date: Mon Aug 3 07:13:36 2015 New Revision: 226487 URL: https://gcc.gnu.org/viewcvs?rev=226487&root=gcc&view=rev Log: 2015-08-03 Richard Biener <rguenther@suse.de> PR tree-optimization/66917 * tree-vectorizer.h (struct dataref_aux): Add base_element_aligned field. (DR_VECT_AUX): New macro. (set_dr_misalignment): Adjust. (dr_misalignment): Likewise. * tree-vect-data-refs.c (vect_compute_data_ref_alignment): Compute whether the base is at least element aligned. * tree-vect-stmts.c (ensure_base_align): Adjust. (vectorizable_store): If the base is not element aligned preserve alignment of the original access if misalignment is unknown. (vectorizable_load): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-vect-data-refs.c trunk/gcc/tree-vect-stmts.c trunk/gcc/tree-vectorizer.h Author: ctice Date: Thu Sep 24 20:56:29 2015 New Revision: 228099 URL: https://gcc.gnu.org/viewcvs?rev=228099&root=gcc&view=rev Log: Backport r226487 from GCC FSF mainline: r226487 | rguenth | 2015-08-03 00:13:36 -0700 (Mon, 03 Aug 2015) | 15 lines 2015-08-03 Richard Biener <rguenther@suse.de> PR tree-optimization/66917 * tree-vectorizer.h (struct dataref_aux): Add base_element_aligned field. (DR_VECT_AUX): New macro. (set_dr_misalignment): Adjust. (dr_misalignment): Likewise. * tree-vect-data-refs.c (vect_compute_data_ref_alignment): Compute whether the base is at least element aligned. * tree-vect-stmts.c (ensure_base_align): Adjust. (vectorizable_store): If the base is not element aligned preserve alignment of the original access if misalignment is unknown. (vectorizable_load): Likewise. Modified: branches/google/gcc-4_9-mobile/gcc/tree-vect-data-refs.c branches/google/gcc-4_9-mobile/gcc/tree-vect-stmts.c branches/google/gcc-4_9-mobile/gcc/tree-vectorizer.h Author: rguenth Date: Mon Sep 28 10:45:55 2015 New Revision: 228199 URL: https://gcc.gnu.org/viewcvs?rev=228199&root=gcc&view=rev Log: 2015-09-28 Richard Biener <rguenther@suse.de> Backport from mainline 2015-08-03 Richard Biener <rguenther@suse.de> PR tree-optimization/66917 * tree-vectorizer.h (struct dataref_aux): Add base_element_aligned field. (DR_VECT_AUX): New macro. (set_dr_misalignment): Adjust. (dr_misalignment): Likewise. * tree-vect-data-refs.c (vect_compute_data_ref_alignment): Compute whether the base is at least element aligned. * tree-vect-stmts.c (ensure_base_align): Adjust. (vectorizable_store): If the base is not element aligned preserve alignment of the original access if misalignment is unknown. (vectorizable_load): Likewise. 2015-09-16 Richard Biener <rguenther@suse.de> PR middle-end/67442 * fold-const.c (extract_muldiv_1): Properly extend multiplication result before builting a tree via wide_int_to_tree. * gcc.dg/torture/pr67442.c: New testcase. Added: branches/gcc-5-branch/gcc/testsuite/gcc.dg/torture/pr67442.c Modified: branches/gcc-5-branch/gcc/ChangeLog branches/gcc-5-branch/gcc/fold-const.c branches/gcc-5-branch/gcc/testsuite/ChangeLog branches/gcc-5-branch/gcc/tree-vect-data-refs.c branches/gcc-5-branch/gcc/tree-vect-stmts.c branches/gcc-5-branch/gcc/tree-vectorizer.h Any plans to backport to 4.9, or should this be closed? Fixed in GCC 5.3+ |