[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's
gpnuma at centaurean dot com
gcc-bugzilla@gcc.gnu.org
Mon Mar 5 18:37:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719
--- Comment #5 from gpnuma at centaurean dot com ---
Which gcc and which clang?
On my platform, if you isolate the 3-byte and 5-byte copies in the above code
(by manually unrolling them), gcc is far slower than clang.
Or maybe it is the interaction with the bit masking that causes the problem?
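
For illustration only, here is a minimal sketch of the kind of isolated test
meant by "3 bytes at a time" with manual unrolling and masking. It is not the
code from the original report. The names load3_masked, load5_masked and sum3
are hypothetical, and a little-endian target is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy 3 bytes into a zeroed 64-bit word, then mask.
   The size is a compile-time constant, so the compiler may inline the copy. */
static inline uint64_t load3_masked(const uint8_t *p)
{
    uint64_t v = 0;
    __builtin_memcpy(&v, p, 3);
    return v & 0xFFFFFFu;          /* keep the low 24 bits (little-endian) */
}

/* Same idea for 5 bytes. */
static inline uint64_t load5_masked(const uint8_t *p)
{
    uint64_t v = 0;
    __builtin_memcpy(&v, p, 5);
    return v & 0xFFFFFFFFFFull;    /* keep the low 40 bits (little-endian) */
}

/* Sum consecutive 3-byte groups, manually unrolled by two.
   Trailing bytes that do not form a complete group are ignored. */
uint64_t sum3(const uint8_t *buf, size_t len)
{
    uint64_t acc = 0;
    size_t i = 0;

    for (; i + 6 <= len; i += 6) {
        acc += load3_masked(buf + i);
        acc += load3_masked(buf + i + 3);
    }
    for (; i + 3 <= len; i += 3)
        acc += load3_masked(buf + i);

    return acc;
}
```

Comparing the assembly each compiler emits for sum3 (for example with -O2 -S)
would show whether the 3-byte copy or the masking is handled differently.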
(In reply to H.J. Lu from comment #4)
> I compared __builtin_memcpy one size at a time. Here are results in
> cycles:
>
> clang 1 bytes: 17193410146
> gcc 1 bytes: 15440244966
> clang 2 bytes: 8997535880
> gcc 2 bytes: 8147449530
> clang 3 bytes: 6002276628
> gcc 3 bytes: 5430387704
> clang 4 bytes: 4497121282
> gcc 4 bytes: 4069604454
> clang 5 bytes: 3644879742
> gcc 5 bytes: 3258094970
> clang 6 bytes: 3045612708
> gcc 6 bytes: 2728410608
> clang 7 bytes: 2574110178
> gcc 7 bytes: 2330365680
> clang 8 bytes: 969894432
> gcc 8 bytes: 6436950208
>
> GCC is faster except for 8 byte size.