Attached is a simplified test case from dalvik VM. Compile it with options -march=armv7-a -mthumb -Os -finline-functions, gcc generates Dalvik_sun_misc_Unsafe_getObject: push {r0, r1, r2, r4, r5, lr} ldr r5, [r0, #4] mov r4, r1 movs r2, #8 add r1, r0, #8 mov r0, sp bl memcpy ldr r3, [sp, #0] ldr r3, [r5, r3] str r3, [r4, #0] pop {r1, r2, r3, r4, r5, pc} If we inline memcpy, the result can be LDRD r2,r0,[r0,#4] LDR r0,[r2,r0] STR r0,[r1,#0] BX lr There are at least 2 problems prevent it to be inlined for target arm. 1. In function emit_block_move_hints, for some unknown reason, the alignment is wrongly computed as 8 bits. 2. The MOVE_RATIO defined in arm.h is too small. #define MOVE_RATIO(speed) (arm_tune_xscale ? 4 : 2) It means only when we move 1 unit of memory, then the memcpy can get a chance to be inlined.
Created attachment 20435 [details] test case
Created attachment 20438 [details] reduced test case With this reduced test case I see the missed-optimization for both ARM and Thumb modes with gcc 4.5 and 4.4. The cause seems to be the non-zero offset in the src array; if I change memcpy's src operand from &src[1] to &src[0] the memcpy does get inlined. However, even then the allocation of the long long tmp does not get eliminated, leaving two redundant SP adjustments in the final code.
Created attachment 20439 [details] reduced test case, corrected Oops, I attached the wrong version of the test case.
I'm working on this.
Richard, I'm now looking at the failure to inline memcpy() to copy 8 bytes in == unsigned long get_ull(const unsigned int *src) { unsigned long long tmp; __builtin_memcpy(&tmp, &src[0], 8); return tmp; } == One of the underlying problems that prevent inlining of memcpy() is pointer alignment information for SRC. Pointer information for SRC is conservatively initialized in get_ptr_info() and then never made more precise. With conservative alignment of 1-byte expand estimates that it would take 8 instructions to copy the data, so a call to memcpy() is expanded instead. After your patch [*] pointer alignment is primarily handled in tree-ssa-ccp.c. For the above example the SSA_NAMEs for both 'src' and 'tmp' are classified as VARYING, so the alignment stays the conservative 1-byte. IIUC, the only source of improving alignment information for SRC and TMP in the above example is type information. Do you see any other way we could infer better alignment? What do you think would be a good place to improve pointer alignment information for the above case? CCP doesn't seem like the right place as is starts gathering and tracking information on the assignments and not from the function arguments. [*] http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00831.html Thanks.
On Fri, 12 Nov 2010, mkuvyrkov at gcc dot gnu.org wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43814 > > Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |rguenth at gcc dot gnu.org > > --- Comment #5 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> 2010-11-12 09:00:56 UTC --- > Richard, > > I'm now looking at the failure to inline memcpy() to copy 8 bytes in > == > unsigned long get_ull(const unsigned int *src) > { > unsigned long long tmp; > > __builtin_memcpy(&tmp, &src[0], 8); > return tmp; > } > == > > One of the underlying problems that prevent inlining of memcpy() is pointer > alignment information for SRC. Pointer information for SRC is conservatively > initialized in get_ptr_info() and then never made more precise. With > conservative alignment of 1-byte expand estimates that it would take 8 > instructions to copy the data, so a call to memcpy() is expanded instead. > > After your patch [*] pointer alignment is primarily handled in tree-ssa-ccp.c. > For the above example the SSA_NAMEs for both 'src' and 'tmp' are classified as > VARYING, so the alignment stays the conservative 1-byte. Correct, that's the most conservative thing we can do. > IIUC, the only source of improving alignment information for SRC and TMP in the > above example is type information. Do you see any other way we could infer > better alignment? No. It is the C language that makes the promise that src points to memory aligned suitable for unsigned int. As the middle-end has to target multiple languages it can't really assess anything from just looking at pointed-to types (and no part of GCC tries to be correct with that, as an example look at the types the C frontend generates for struct X __attribute__((packed)) { int x; }; void foo (struct X *p, int *q) { memcpy(&p->x, q, 4); } - &p->x will have int * type, not int aligned(1) * type. > What do you think would be a good place to improve pointer alignment > information for the above case? CCP doesn't seem like the right place as is > starts gathering and tracking information on the assignments and not from the > function arguments. I considered infering initial alignment from function argument pointers pointed-to types. But I think that's still dangerous. Also alignment can be infered from dereferences, but that doesn't easily fit into the CCP framework (there's no def for the pointer) and is also dangerous as there is a lot of code out there that simply assumes misaligned accesses are ok. In short, it's not easy. I'd consider the pointer argument thing for 4.6, but would it really help in practice (as opposed to simple artificial testcases)? Richard.
(In reply to comment #6) ... < , as an example look at the types the C frontend > generates for struct X __attribute__((packed)) { int x; }; > void foo (struct X *p, int *q) { memcpy(&p->x, q, 4); } - &p->x > will have int * type, not int aligned(1) * type. ... > I considered infering initial alignment from function argument pointers > pointed-to types. But I think that's still dangerous. Also alignment > can be infered from dereferences, but that doesn't easily fit into the > CCP framework (there's no def for the pointer) and is also dangerous > as there is a lot of code out there that simply assumes misaligned > accesses are ok. > > In short, it's not easy. I'd consider the pointer argument thing > for 4.6, Do I understand correctly that alignment of &p->x from the above example can be correctly inferred from argument 'p'? > but would it really help in practice (as opposed to > simple artificial testcases)? The above example was reduced from Dalvik VM, used in Android, and, generally, expanding __builtin_memcpy and company to 4-byte loads/stores instead of 1-byte loads/store is a problem worth fixing. From the front-end point of view, are there any inherit problems that prevent front-ends setting reliable alignment information on the types that middle-end could trust? The information is there, right? It's just that middle-end cannot trust it due to several unfortunate instances when front-end sets it wrong.
On Thu, 2 Dec 2010, mkuvyrkov at gcc dot gnu.org wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43814 > > --- Comment #7 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> 2010-12-02 16:42:40 UTC --- > (In reply to comment #6) > ... > < , as an example look at the types the C frontend > > generates for struct X __attribute__((packed)) { int x; }; > > void foo (struct X *p, int *q) { memcpy(&p->x, q, 4); } - &p->x > > will have int * type, not int aligned(1) * type. > ... > > I considered infering initial alignment from function argument pointers > > pointed-to types. But I think that's still dangerous. Also alignment > > can be infered from dereferences, but that doesn't easily fit into the > > CCP framework (there's no def for the pointer) and is also dangerous > > as there is a lot of code out there that simply assumes misaligned > > accesses are ok. > > > > In short, it's not easy. I'd consider the pointer argument thing > > for 4.6, > > Do I understand correctly that alignment of &p->x from the above example can be > correctly inferred from argument 'p'? No, not at the moment. > > but would it really help in practice (as opposed to > > simple artificial testcases)? > > The above example was reduced from Dalvik VM, used in Android, and, generally, > expanding __builtin_memcpy and company to 4-byte loads/stores instead of 1-byte > loads/store is a problem worth fixing. > > From the front-end point of view, are there any inherit problems that prevent > front-ends setting reliable alignment information on the types that middle-end > could trust? The information is there, right? It's just that middle-end > cannot trust it due to several unfortunate instances when front-end sets it > wrong. The frontend does not try to set alignment information on types used as structure members, and those leak through address-taking. That's the main exisiting issue (there may be others, but that's the one I know). We also have the issue that for example Ada uses VIEW_CONVERT_EXPRs where I am not sure if they have correct alignment information on them. In the end we _should_ be able to use alignment information of the types used at the access (that's also more reliable as compared to use alignment information from pointer argument types). But we already do that in MEM_REF expansion: align = MAX (TYPE_ALIGN (TREE_TYPE (exp)), get_object_alignment (exp, BIGGEST_ALIGNMENT)); Richard.
(In reply to comment #8) > In the end we _should_ be able to use alignment information of the > types used at the access (that's also more reliable as compared to > use alignment information from pointer argument types). > > But we already do that in MEM_REF expansion: > > align = MAX (TYPE_ALIGN (TREE_TYPE (exp)), > get_object_alignment (exp, BIGGEST_ALIGNMENT)); Indeed. SRC will be considered 4-byte aligned in == unsigned long long get_ull(const unsigned int *src) { unsigned long long tmp; tmp = src[0]; return tmp; } == , but only 1-byte aligned in == unsigned long long get_ull(const unsigned int *src) { unsigned long long tmp; __builtin_memcpy(&tmp, &src[0], 8); return tmp; } == It seems we should really use for MEM_REFs == align = get_pointer_alignment (exp, BIGGEST_ALIGNMENT); == without relying on type alignment. Unfortunately, that will worsen code generation even more.
On Sat, 4 Dec 2010, mkuvyrkov at gcc dot gnu.org wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43814 > > --- Comment #9 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> 2010-12-04 18:09:15 UTC --- > (In reply to comment #8) > > > In the end we _should_ be able to use alignment information of the > > types used at the access (that's also more reliable as compared to > > use alignment information from pointer argument types). > > > > But we already do that in MEM_REF expansion: > > > > align = MAX (TYPE_ALIGN (TREE_TYPE (exp)), > > get_object_alignment (exp, BIGGEST_ALIGNMENT)); > > Indeed. SRC will be considered 4-byte aligned in > == > unsigned long long get_ull(const unsigned int *src) > { > unsigned long long tmp; > > tmp = src[0]; Because here the C language specifies that *src has to be properly aligned. > return tmp; > } > == > , but only 1-byte aligned in > == > unsigned long long get_ull(const unsigned int *src) > { > unsigned long long tmp; > > __builtin_memcpy(&tmp, &src[0], 8); And memcpy has to deal with unaligned memory. > return tmp; > } > == > > It seems we should really use for MEM_REFs > == > align = get_pointer_alignment (exp, BIGGEST_ALIGNMENT); > == > without relying on type alignment. Unfortunately, that will worsen code > generation even more. That was what I did originally ...
I looked into CCP a bit and it seems like no alignment analysis whatsoever is done for either &tmp nor &src[0] in == __builtin_memcpy (&tmp, &src[0], 8); == That is because 1. ccp_initialize() sees __builtin_memcpy, 2. infers that its return value varies, 3. exits without processing the arguments. If the code was structured like so == D1 = &tmp; D2 = &src[0]; __builtin_memcpy (D1, D2, 8); == CCP would have a better chance inferring alignment for D1 and D2. I'm mostly walking in the dark trying improve CCP, so any pointers would be appreciated.
On Sat, 4 Dec 2010, mkuvyrkov at gcc dot gnu.org wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43814 > > --- Comment #11 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> 2010-12-04 19:04:29 UTC --- > I looked into CCP a bit and it seems like no alignment analysis whatsoever is > done for either &tmp nor &src[0] in > == > __builtin_memcpy (&tmp, &src[0], 8); > == > > That is because > 1. ccp_initialize() sees __builtin_memcpy, > 2. infers that its return value varies, > 3. exits without processing the arguments. > > If the code was structured like so > == > D1 = &tmp; > D2 = &src[0]; > __builtin_memcpy (D1, D2, 8); > == > CCP would have a better chance inferring alignment for D1 and D2. > > I'm mostly walking in the dark trying improve CCP, so any pointers would be > appreciated. &tmp and &src[0] are irrelevant for CCP. get_{pointer,object}_alignment does a proper job on them already. You can only improve CCP by handling SSA name default defs according to their type (thus, give them initial alignment). But that's dangerous in my opinion. If CCP wants to seed alignment from dereferences like d.134_2 = *p_4; then it needs to insert stuff like assert-exprs like VRP does to have SSA name defs at the point of the dereference. Of course it also needs to retain the so introduced copies. But I haven't wrapped my brain about this enough to say if this approach would make sense or would be beneficial.
Dup. *** This bug has been marked as a duplicate of bug 40060 ***