The expansion of __builtin_mempcpy is inefficient on many targets (eg. AArch64, ARM, PPC). The issue is due to not using the same expansion options that memcpy uses in builtins.c. As a result GCC6 produces for __builtin_mempcpy(x, y, 32): PPC: 0: 38 a0 00 20 li r5,32 4: 48 00 00 00 b 4 <foo+0x4> 4: R_PPC_REL24 mempcpy 8: 60 00 00 00 nop c: 60 42 00 00 ori r2,r2,0 AArch64: mov x2, 32 b mempcpy A second issue is that GCC always calls mempcpy. mempcpy is not supported or implemented efficiently in many (if not most) library/target combinations. GLIBC only has 3 targets which implement an optimized mempcpy, so GLIBC currently inlines mempcpy into memcpy by default unless a target explicitly disables this. It seems better to do this in GCC so it works for all libraries.
I'll do that for GCC8.
I've just taken look at that and please confirm that I understand that correctly: 1) we want to ideally a same function for expansion of memcpy and mempcpy, where for later one we'll append calculation of return value (dest + n)? 2) I'm bit confused with 'GLIBC currently inlines mempcpy into memcpy'. Do is mean that you basically do not want to emit any call to mempcpy and prefer rather: int my_mempcpy(void) { return __builtin_mempcpy (a, b, SIZE); } $ ./xgcc -B. /tmp/mem.c -O2 -S -DSIZE=100000 -o /dev/stdout my_mempcpy: .LFB1: .cfi_startproc subq $8, %rsp .cfi_def_cfa_offset 16 movq b(%rip), %rsi movq a(%rip), %rdi movl $100000, %edx call memcpy addq $8, %rsp .cfi_def_cfa_offset 8 addq $100000, %rax ret .cfi_endproc ? Thanks
(In reply to Martin Liška from comment #2) > I've just taken look at that and please confirm that I understand that > correctly: > > 1) we want to ideally a same function for expansion of memcpy and mempcpy, > where for later one we'll append calculation of return value (dest + n)? Yes, I think the current movmem expander does not deal with the return value so could be reused for mempcpy expansion. > 2) I'm bit confused with 'GLIBC currently inlines mempcpy into memcpy'. Do > is mean that you basically do not want to emit any call to mempcpy and > prefer rather: Yes. Ignoring GLIBC internals, mempcpy is used quite infrequently. As a result not many targets have added highly optimized implementations. The targets that do have the issue that mempcpy is typically not in the cache when called, while memcpy is more likely already cached. If there are targets that do not want this, we could add a target define. Note that I'm not concerned about the corner case of a mempcpy being tailcalled which isn't possible anymore after we optimize mempcpy into memcpy. In almost all targets mempcpy is a C function that calls memcpy, ie. you already have extra overhead on every mempcpy call. See glibc/string/mempcpy.c: void * MEMPCPY (void *dest, const void *src, size_t len) { return memcpy (dest, src, len) + len; }
(In reply to Wilco from comment #3) > Yes. Ignoring GLIBC internals, mempcpy is used quite infrequently. As a > result not many targets have added highly optimized implementations. The > targets that do have the issue that mempcpy is typically not in the cache > when called, while memcpy is more likely already cached. > FWIW, in x86-64 glibc, memcpy and mempcpy share the same function body: 0000000000000010 <__mempcpy>: 10: 48 89 f8 mov %rdi,%rax 13: 48 01 d0 add %rdx,%rax 16: eb 1b jmp 33 <__memcpy+0x3>
In r249278 GCC was changed to transform early on calls to bcmp, bcopy and bzero to memcmp, memcpy, and memset. The main motivation was to simplify their handling and ensure they are optimized equivalently by the rest of the compiler. (This was suggested here: https://gcc.gnu.org/ml/gcc-patches/2017-06/msg00114.html). It might make sense to take the same approach with mempcpy. (The second part of r249278, i.e., the removal of the bcmp, bcopy and bzero handling, is yet to be done.)
Created attachment 41772 [details] Patch candidate I'm going to prepare some test-cases for that. Does it look good?
(In reply to Martin Liška from comment #6) > Created attachment 41772 [details] > Patch candidate > > I'm going to prepare some test-cases for that. Does it look good? Yes, it now inlines small constant sizes. However large and variable sized copies have the wrong return value: void *f1(void *p, void *q) { return __builtin_mempcpy(p, q, 256); } f1: mov x2, 256 b memcpy
(In reply to Wilco from comment #7) > (In reply to Martin Liška from comment #6) > > Created attachment 41772 [details] > > Patch candidate > > > > I'm going to prepare some test-cases for that. Does it look good? > > Yes, it now inlines small constant sizes. However large and variable sized > copies have the wrong return value: > > void *f1(void *p, void *q) { return __builtin_mempcpy(p, q, 256); } > > f1: > mov x2, 256 > b memcpy Yep, I've noticed. It's strange for me why it's not working. I've just asked at GCC ML: https://gcc.gnu.org/ml/gcc/2017-07/msg00144.html
(In reply to Martin Liška from comment #8) > (In reply to Wilco from comment #7) > > (In reply to Martin Liška from comment #6) > > > Created attachment 41772 [details] > > > Patch candidate > > > > > > I'm going to prepare some test-cases for that. Does it look good? > > > > Yes, it now inlines small constant sizes. However large and variable sized > > copies have the wrong return value: > > > > void *f1(void *p, void *q) { return __builtin_mempcpy(p, q, 256); } > > > > f1: > > mov x2, 256 > > b memcpy > > Yep, I've noticed. It's strange for me why it's not working. I've just asked > at GCC ML: https://gcc.gnu.org/ml/gcc/2017-07/msg00144.html It's marked as a tailcall so anything you generate afterwards will be ignored: (call_insn/j 13 12 14 2 (parallel [ (set (reg:DI 0 x0) (call (mem:DI (symbol_ref:DI ("memcpy") [flags 0x41] <function_decl 0xffffb7acc700 __builtin_memcpy>) [0 __builtin_memcpy S8 A8]) (const_int 0 [0]))) (return) ]) "mempcpy.c":3 -1 Also check this case: void f4(void *p, void *q, int i) { __builtin_mempcpy(p, q, i); }
> > > > Yep, I've noticed. It's strange for me why it's not working. I've just asked > > at GCC ML: https://gcc.gnu.org/ml/gcc/2017-07/msg00144.html > > It's marked as a tailcall so anything you generate afterwards will be > ignored: Thanks, knowing that it was easy to fix ;) I'm going to test the patch and will send it to mailing list.
(In reply to Martin Liška from comment #10) > > > > > > Yep, I've noticed. It's strange for me why it's not working. I've just asked > > > at GCC ML: https://gcc.gnu.org/ml/gcc/2017-07/msg00144.html > > > > It's marked as a tailcall so anything you generate afterwards will be > > ignored: > > Thanks, knowing that it was easy to fix ;) I'm going to test the patch and > will send it to mailing list. Well the new patch looks great - I tried it on GLIBC and it ends up generating better code than inlining mempcpy in the string header. In total 111 calls to mempcpy are turned into memcpy, and although there are 166 extra instructions (most add/mov to deal with the return value), there are just 12 extra callee-saves.
Author: marxin Date: Tue Aug 1 11:59:27 2017 New Revision: 250771 URL: https://gcc.gnu.org/viewcvs?rev=250771&root=gcc&view=rev Log: Make mempcpy more optimal (PR middle-end/70140). 2017-08-01 Martin Liska <mliska@suse.cz> PR middle-end/70140 * gcc.dg/string-opt-1.c: Adjust test-case to scan for memcpy. 2017-08-01 Martin Liska <mliska@suse.cz> PR middle-end/70140 * builtins.c (expand_builtin_memcpy_args): Remove. (expand_builtin_memcpy): Call newly added function expand_builtin_memory_copy_args. (expand_builtin_memcpy_with_bounds): Likewise. (expand_builtin_mempcpy): Remove last argument. (expand_builtin_mempcpy_with_bounds): Likewise. (expand_builtin_memory_copy_args): New function created from expand_builtin_mempcpy_args with small modifications. (expand_builtin_mempcpy_args): Remove. (expand_builtin_stpcpy): Remove unused argument. (expand_builtin): Likewise. (expand_builtin_with_bounds): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/builtins.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/string-opt-1.c
Fixed.
On Fedora 26/x86-64, I got FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O0 FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O1 FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O1 FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O2 FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O2 FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O2 -flto -fno-use-linker-plugin -flto-partition=none FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O2 -flto -fno-use-linker-plugin -flto-partition=none FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O3 -g FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -O3 -g FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -Og -g FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -Og -g FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -Os FAIL: gcc.c-torture/execute/builtins/mempcpy-2.c execution, -Os FAIL: gcc.c-torture/execute/builtins/mempcpy.c compilation, -O1 (internal compiler error) FAIL: gcc.c-torture/execute/builtins/mempcpy.c compilation, -O2 -flto -fno-use-linker-plugin -flto-partition=none (internal compiler error) FAIL: gcc.c-torture/execute/builtins/mempcpy.c compilation, -O2 (internal compiler error) FAIL: gcc.c-torture/execute/builtins/mempcpy.c compilation, -O3 -g (internal compiler error) FAIL: gcc.c-torture/execute/builtins/mempcpy.c compilation, -Og -g (internal compiler error) FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O1 FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O1 FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O2 FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O2 FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O2 -flto -fno-use-linker-plugin -flto-partition=none FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O2 -flto -fno-use-linker-plugin -flto-partition=none FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O3 -g FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -O3 -g FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -Og -g FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -Og -g FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -Os FAIL: gcc.c-torture/execute/builtins/mempcpy-chk.c execution, -Os FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -O1 FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -O2 FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -O2 FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -O2 -flto -fno-use-linker-plugin -flto-partition=none FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -O2 -flto -fno-use-linker-plugin -flto-partition=none FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -O3 -g FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -O3 -g FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -Og -g FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -Os FAIL: gcc.c-torture/execute/builtins/pr23484-chk.c execution, -Os FAIL: gcc.c-torture/execute/builtins/strpcpy-2.c execution, -Os FAIL: gcc.dg/builtin-stringop-chk-4.c (internal compiler error) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for excess errors) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 196) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 211) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 213) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 215) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 216) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 217) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 222) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 237) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 239) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 241) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 242) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 243) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 248) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 263) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 302) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 305) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 319) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 320) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 321) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 322) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 323) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 324) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 326) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 328) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 346) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 347) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 353) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 354) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 361) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 362) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 363) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 369) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 370) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 395) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 401) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 403) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 405) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 421) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 425) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 427) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 428) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 429) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 430) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 436) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 437) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 471) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 472) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 473) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 475) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 479) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 480) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 493) FAIL: gcc.dg/builtin-stringop-chk-4.c (test for warnings, line 503) FAIL: gcc.dg/string-opt-1.c scan-assembler memcpy FAIL: gcc.dg/strlenopt-14g.c execution test FAIL: gcc.dg/strlenopt-14gf.c execution test FAIL: gcc.dg/tree-prof/val-prof-7.c compilation, -fprofile-generate -D_PROFILE_GENERATE (internal compiler error) FAIL: gcc.dg/tree-prof/val-prof-7.c compilation, -fprofile-generate -D_PROFILE_GENERATE (internal compiler error) FAIL: gcc.target/i386/chkp-stropt-16.c (internal compiler error) FAIL: gcc.target/i386/chkp-stropt-16.c (test for excess errors)
Sorry for the breakage, I'm going to take a look.
So I accidentally installed an old version of patch, reverted in r250788.
Author: marxin Date: Tue Aug 1 17:21:29 2017 New Revision: 250789 URL: https://gcc.gnu.org/viewcvs?rev=250789&root=gcc&view=rev Log: Make mempcpy more optimal (PR middle-end/70140). 2017-08-01 Martin Liska <mliska@suse.cz> PR middle-end/70140 * gcc.dg/string-opt-1.c: Adjust test-case to scan for memcpy. 2017-08-01 Martin Liska <mliska@suse.cz> PR middle-end/70140 * builtins.c (expand_builtin_memcpy_args): Remove. (expand_builtin_memcpy): Call newly added function expand_builtin_memory_copy_args. (expand_builtin_memcpy_with_bounds): Likewise. (expand_builtin_mempcpy): Remove last argument. (expand_builtin_mempcpy_with_bounds): Likewise. (expand_builtin_memory_copy_args): New function created from expand_builtin_mempcpy_args with small modifications. (expand_builtin_mempcpy_args): Remove. (expand_builtin_stpcpy): Remove unused argument. (expand_builtin): Likewise. (expand_builtin_with_bounds): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/builtins.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/string-opt-1.c
Now it should be fine.
Author: aldyh Date: Wed Sep 13 16:13:04 2017 New Revision: 252219 URL: https://gcc.gnu.org/viewcvs?rev=252219&root=gcc&view=rev Log: Make mempcpy more optimal (PR middle-end/70140). 2017-08-01 Martin Liska <mliska@suse.cz> PR middle-end/70140 * gcc.dg/string-opt-1.c: Adjust test-case to scan for memcpy. 2017-08-01 Martin Liska <mliska@suse.cz> PR middle-end/70140 * builtins.c (expand_builtin_memcpy_args): Remove. (expand_builtin_memcpy): Call newly added function expand_builtin_memory_copy_args. (expand_builtin_memcpy_with_bounds): Likewise. (expand_builtin_mempcpy): Remove last argument. (expand_builtin_mempcpy_with_bounds): Likewise. (expand_builtin_memory_copy_args): New function created from expand_builtin_mempcpy_args with small modifications. (expand_builtin_mempcpy_args): Remove. (expand_builtin_stpcpy): Remove unused argument. (expand_builtin): Likewise. (expand_builtin_with_bounds): Likewise. Modified: branches/range-gen2/gcc/ChangeLog branches/range-gen2/gcc/builtins.c branches/range-gen2/gcc/testsuite/ChangeLog branches/range-gen2/gcc/testsuite/gcc.dg/string-opt-1.c
Author: aldyh Date: Wed Sep 13 16:16:21 2017 New Revision: 252233 URL: https://gcc.gnu.org/viewcvs?rev=252233&root=gcc&view=rev Log: Make mempcpy more optimal (PR middle-end/70140). 2017-08-01 Martin Liska <mliska@suse.cz> PR middle-end/70140 * gcc.dg/string-opt-1.c: Adjust test-case to scan for memcpy. 2017-08-01 Martin Liska <mliska@suse.cz> PR middle-end/70140 * builtins.c (expand_builtin_memcpy_args): Remove. (expand_builtin_memcpy): Call newly added function expand_builtin_memory_copy_args. (expand_builtin_memcpy_with_bounds): Likewise. (expand_builtin_mempcpy): Remove last argument. (expand_builtin_mempcpy_with_bounds): Likewise. (expand_builtin_memory_copy_args): New function created from expand_builtin_mempcpy_args with small modifications. (expand_builtin_mempcpy_args): Remove. (expand_builtin_stpcpy): Remove unused argument. (expand_builtin): Likewise. (expand_builtin_with_bounds): Likewise. Modified: branches/range-gen2/gcc/ChangeLog branches/range-gen2/gcc/builtins.c branches/range-gen2/gcc/testsuite/ChangeLog branches/range-gen2/gcc/testsuite/gcc.dg/string-opt-1.c