Bug 43814 - gcc failed to inline memcpy
Summary: gcc failed to inline memcpy
Status: RESOLVED DUPLICATE of bug 40060
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.6.0
: P3 enhancement
Target Milestone: ---
Assignee: Maxim Kuvyrkov
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2010-04-20 09:02 UTC by Carrot
Modified: 2012-01-16 14:01 UTC (History)
6 users (show)

See Also:
Host: i686-linux
Target: arm-eabi
Build: i686-linux
Known to work:
Known to fail: 4.4.3, 4.5.0
Last reconfirmed: 2010-04-20 13:06:31


Attachments
test case (319 bytes, text/x-csrc)
2010-04-20 09:03 UTC, Carrot
Details
reduced test case (108 bytes, text/plain)
2010-04-20 09:39 UTC, Mikael Pettersson
Details
reduced test case, corrected (108 bytes, text/plain)
2010-04-20 09:44 UTC, Mikael Pettersson
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Carrot 2010-04-20 09:02:29 UTC
Attached is a simplified test case from dalvik VM. Compile it with options -march=armv7-a -mthumb -Os -finline-functions, gcc generates

Dalvik_sun_misc_Unsafe_getObject:
        push    {r0, r1, r2, r4, r5, lr}
        ldr     r5, [r0, #4]
        mov     r4, r1
        movs    r2, #8
        add     r1, r0, #8
        mov     r0, sp
        bl      memcpy
        ldr     r3, [sp, #0]
        ldr     r3, [r5, r3]
        str     r3, [r4, #0]
        pop     {r1, r2, r3, r4, r5, pc}

If we inline memcpy, the result can be

        LDRD     r2,r0,[r0,#4]
        LDR      r0,[r2,r0]
        STR      r0,[r1,#0]
        BX       lr


There are at least 2 problems prevent it to be inlined for target arm.

1. In function emit_block_move_hints, for some unknown reason, the alignment is wrongly computed as 8 bits.

2. The MOVE_RATIO defined in arm.h is too small.

#define MOVE_RATIO(speed) (arm_tune_xscale ? 4 : 2)

It means only when we move 1 unit of memory, then the memcpy can get a chance to be inlined.
Comment 1 Carrot 2010-04-20 09:03:02 UTC
Created attachment 20435 [details]
test case
Comment 2 Mikael Pettersson 2010-04-20 09:39:34 UTC
Created attachment 20438 [details]
reduced test case

With this reduced test case I see the missed-optimization for both ARM and Thumb modes with gcc 4.5 and 4.4.  The cause seems to be the non-zero offset in the src array; if I change memcpy's src operand from &src[1] to &src[0] the memcpy does get inlined.  However, even then the allocation of the long long tmp does not get eliminated, leaving two redundant SP adjustments in the final code.
Comment 3 Mikael Pettersson 2010-04-20 09:44:25 UTC
Created attachment 20439 [details]
reduced test case, corrected

Oops, I attached the wrong version of the test case.
Comment 4 Maxim Kuvyrkov 2010-11-12 07:55:15 UTC
I'm working on this.
Comment 5 Maxim Kuvyrkov 2010-11-12 09:00:56 UTC
Richard,

I'm now looking at the failure to inline memcpy() to copy 8 bytes in
==
unsigned long get_ull(const unsigned int *src)
{
    unsigned long long tmp;

    __builtin_memcpy(&tmp, &src[0], 8);
    return tmp;
}
==

One of the underlying problems that prevent inlining of memcpy() is pointer alignment information for SRC.  Pointer information for SRC is conservatively initialized in get_ptr_info() and then never made more precise.  With conservative alignment of 1-byte expand estimates that it would take 8 instructions to copy the data, so a call to memcpy() is expanded instead.

After your patch [*] pointer alignment is primarily handled in tree-ssa-ccp.c.  For the above example the SSA_NAMEs for both 'src' and 'tmp' are classified as VARYING, so the alignment stays the conservative 1-byte.

IIUC, the only source of improving alignment information for SRC and TMP in the above example is type information.  Do you see any other way we could infer better alignment?

What do you think would be a good place to improve pointer alignment information for the above case?  CCP doesn't seem like the right place as is starts gathering and tracking information on the assignments and not from the function arguments.

[*] http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00831.html

Thanks.
Comment 6 rguenther@suse.de 2010-11-12 11:23:51 UTC
On Fri, 12 Nov 2010, mkuvyrkov at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43814
> 
> Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |rguenth at gcc dot gnu.org
> 
> --- Comment #5 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> 2010-11-12 09:00:56 UTC ---
> Richard,
> 
> I'm now looking at the failure to inline memcpy() to copy 8 bytes in
> ==
> unsigned long get_ull(const unsigned int *src)
> {
>     unsigned long long tmp;
> 
>     __builtin_memcpy(&tmp, &src[0], 8);
>     return tmp;
> }
> ==
> 
> One of the underlying problems that prevent inlining of memcpy() is pointer
> alignment information for SRC.  Pointer information for SRC is conservatively
> initialized in get_ptr_info() and then never made more precise.  With
> conservative alignment of 1-byte expand estimates that it would take 8
> instructions to copy the data, so a call to memcpy() is expanded instead.
> 
> After your patch [*] pointer alignment is primarily handled in tree-ssa-ccp.c. 
> For the above example the SSA_NAMEs for both 'src' and 'tmp' are classified as
> VARYING, so the alignment stays the conservative 1-byte.

Correct, that's the most conservative thing we can do.

> IIUC, the only source of improving alignment information for SRC and TMP in the
> above example is type information.  Do you see any other way we could infer
> better alignment?

No.  It is the C language that makes the promise that src points to
memory aligned suitable for unsigned int.  As the middle-end has to
target multiple languages it can't really assess anything from just
looking at pointed-to types (and no part of GCC tries to be
correct with that, as an example look at the types the C frontend
generates for struct X __attribute__((packed)) { int x; };
void foo (struct X *p, int  *q) { memcpy(&p->x, q, 4); } - &p->x
will have int * type, not int aligned(1) * type.

> What do you think would be a good place to improve pointer alignment
> information for the above case?  CCP doesn't seem like the right place as is
> starts gathering and tracking information on the assignments and not from the
> function arguments.

I considered infering initial alignment from function argument pointers
pointed-to types.  But I think that's still dangerous.  Also alignment
can be infered from dereferences, but that doesn't easily fit into the
CCP framework (there's no def for the pointer) and is also dangerous
as there is a lot of code out there that simply assumes misaligned
accesses are ok.

In short, it's not easy.  I'd consider the pointer argument thing
for 4.6, but would it really help in practice (as opposed to
simple artificial testcases)?

Richard.
Comment 7 Maxim Kuvyrkov 2010-12-02 16:42:40 UTC
(In reply to comment #6)
...
< , as an example look at the types the C frontend
> generates for struct X __attribute__((packed)) { int x; };
> void foo (struct X *p, int  *q) { memcpy(&p->x, q, 4); } - &p->x
> will have int * type, not int aligned(1) * type.
...
> I considered infering initial alignment from function argument pointers
> pointed-to types.  But I think that's still dangerous.  Also alignment
> can be infered from dereferences, but that doesn't easily fit into the
> CCP framework (there's no def for the pointer) and is also dangerous
> as there is a lot of code out there that simply assumes misaligned
> accesses are ok.
> 
> In short, it's not easy.  I'd consider the pointer argument thing
> for 4.6,

Do I understand correctly that alignment of &p->x from the above example can be correctly inferred from argument 'p'?

> but would it really help in practice (as opposed to
> simple artificial testcases)?

The above example was reduced from Dalvik VM, used in Android, and, generally, expanding __builtin_memcpy and company to 4-byte loads/stores instead of 1-byte loads/store is a problem worth fixing.

From the front-end point of view, are there any inherit problems that prevent front-ends setting reliable alignment information on the types that middle-end could trust?  The information is there, right?  It's just that middle-end cannot trust it due to several unfortunate instances when front-end sets it wrong.
Comment 8 rguenther@suse.de 2010-12-02 17:01:08 UTC
On Thu, 2 Dec 2010, mkuvyrkov at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43814
> 
> --- Comment #7 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> 2010-12-02 16:42:40 UTC ---
> (In reply to comment #6)
> ...
> < , as an example look at the types the C frontend
> > generates for struct X __attribute__((packed)) { int x; };
> > void foo (struct X *p, int  *q) { memcpy(&p->x, q, 4); } - &p->x
> > will have int * type, not int aligned(1) * type.
> ...
> > I considered infering initial alignment from function argument pointers
> > pointed-to types.  But I think that's still dangerous.  Also alignment
> > can be infered from dereferences, but that doesn't easily fit into the
> > CCP framework (there's no def for the pointer) and is also dangerous
> > as there is a lot of code out there that simply assumes misaligned
> > accesses are ok.
> > 
> > In short, it's not easy.  I'd consider the pointer argument thing
> > for 4.6,
> 
> Do I understand correctly that alignment of &p->x from the above example can be
> correctly inferred from argument 'p'?

No, not at the moment.

> > but would it really help in practice (as opposed to
> > simple artificial testcases)?
> 
> The above example was reduced from Dalvik VM, used in Android, and, generally,
> expanding __builtin_memcpy and company to 4-byte loads/stores instead of 1-byte
> loads/store is a problem worth fixing.
>
> From the front-end point of view, are there any inherit problems that prevent
> front-ends setting reliable alignment information on the types that middle-end
> could trust?  The information is there, right?  It's just that middle-end
> cannot trust it due to several unfortunate instances when front-end sets it
> wrong.

The frontend does not try to set alignment information on types used as
structure members, and those leak through address-taking.  That's the
main exisiting issue (there may be others, but that's the one I know).

We also have the issue that for example Ada uses VIEW_CONVERT_EXPRs
where I am not sure if they have correct alignment information on them.

In the end we _should_ be able to use alignment information of the
types used at the access (that's also more reliable as compared to
use alignment information from pointer argument types).

But we already do that in MEM_REF expansion:

        align = MAX (TYPE_ALIGN (TREE_TYPE (exp)),
                     get_object_alignment (exp, BIGGEST_ALIGNMENT));

Richard.
Comment 9 Maxim Kuvyrkov 2010-12-04 18:09:15 UTC
(In reply to comment #8)

> In the end we _should_ be able to use alignment information of the
> types used at the access (that's also more reliable as compared to
> use alignment information from pointer argument types).
> 
> But we already do that in MEM_REF expansion:
> 
>         align = MAX (TYPE_ALIGN (TREE_TYPE (exp)),
>                      get_object_alignment (exp, BIGGEST_ALIGNMENT));

Indeed.  SRC will be considered 4-byte aligned in
==
unsigned long long get_ull(const unsigned int *src)
{
    unsigned long long tmp;

    tmp = src[0];
    return tmp;
}
==
, but only 1-byte aligned in
==
unsigned long long get_ull(const unsigned int *src)
{
    unsigned long long tmp;

    __builtin_memcpy(&tmp, &src[0], 8);
    return tmp;
}
==

It seems we should really use for MEM_REFs
==
align = get_pointer_alignment (exp, BIGGEST_ALIGNMENT);
==
without relying on type alignment.  Unfortunately, that will worsen code generation even more.
Comment 10 rguenther@suse.de 2010-12-04 18:49:34 UTC
On Sat, 4 Dec 2010, mkuvyrkov at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43814
> 
> --- Comment #9 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> 2010-12-04 18:09:15 UTC ---
> (In reply to comment #8)
> 
> > In the end we _should_ be able to use alignment information of the
> > types used at the access (that's also more reliable as compared to
> > use alignment information from pointer argument types).
> > 
> > But we already do that in MEM_REF expansion:
> > 
> >         align = MAX (TYPE_ALIGN (TREE_TYPE (exp)),
> >                      get_object_alignment (exp, BIGGEST_ALIGNMENT));
> 
> Indeed.  SRC will be considered 4-byte aligned in
> ==
> unsigned long long get_ull(const unsigned int *src)
> {
>     unsigned long long tmp;
> 
>     tmp = src[0];

Because here the C language specifies that *src has to be properly
aligned.

>     return tmp;
> }
> ==
> , but only 1-byte aligned in
> ==
> unsigned long long get_ull(const unsigned int *src)
> {
>     unsigned long long tmp;
> 
>     __builtin_memcpy(&tmp, &src[0], 8);

And memcpy has to deal with unaligned memory.

>     return tmp;
> }
> ==
> 
> It seems we should really use for MEM_REFs
> ==
> align = get_pointer_alignment (exp, BIGGEST_ALIGNMENT);
> ==
> without relying on type alignment.  Unfortunately, that will worsen code
> generation even more.

That was what I did originally ...
Comment 11 Maxim Kuvyrkov 2010-12-04 19:04:29 UTC
I looked into CCP a bit and it seems like no alignment analysis whatsoever is done for either &tmp nor &src[0] in
==
__builtin_memcpy (&tmp, &src[0], 8);
==

That is because
1. ccp_initialize() sees __builtin_memcpy,
2. infers that its return value varies,
3. exits without processing the arguments.

If the code was structured like so
==
D1 = &tmp;
D2 = &src[0];
__builtin_memcpy (D1, D2, 8);
==
CCP would have a better chance inferring alignment for D1 and D2.

I'm mostly walking in the dark trying improve CCP, so any pointers would be appreciated.
Comment 12 rguenther@suse.de 2010-12-04 21:12:35 UTC
On Sat, 4 Dec 2010, mkuvyrkov at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43814
> 
> --- Comment #11 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> 2010-12-04 19:04:29 UTC ---
> I looked into CCP a bit and it seems like no alignment analysis whatsoever is
> done for either &tmp nor &src[0] in
> ==
> __builtin_memcpy (&tmp, &src[0], 8);
> ==
> 
> That is because
> 1. ccp_initialize() sees __builtin_memcpy,
> 2. infers that its return value varies,
> 3. exits without processing the arguments.
> 
> If the code was structured like so
> ==
> D1 = &tmp;
> D2 = &src[0];
> __builtin_memcpy (D1, D2, 8);
> ==
> CCP would have a better chance inferring alignment for D1 and D2.
> 
> I'm mostly walking in the dark trying improve CCP, so any pointers would be
> appreciated.

&tmp and &src[0] are irrelevant for CCP.  get_{pointer,object}_alignment
does a proper job on them already.

You can only improve CCP by handling SSA name default defs according
to their type (thus, give them initial alignment).  But that's
dangerous in my opinion.

If CCP wants to seed alignment from dereferences like

 d.134_2 = *p_4;

then it needs to insert stuff like assert-exprs like VRP does
to have SSA name defs at the point of the dereference.  Of course
it also needs to retain the so introduced copies.

But I haven't wrapped my brain about this enough to say if this
approach would make sense or would be beneficial.
Comment 13 Richard Biener 2012-01-16 14:01:48 UTC
Dup.

*** This bug has been marked as a duplicate of bug 40060 ***