[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

Mon Mar 5 18:37:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #5 from gpnuma at centaurean dot com ---
Which gcc and which clang versions did you use?
On my platform, with the code above, if you isolate the 3-bytes-at-a-time and
5-bytes-at-a-time cases (by manual unrolling), gcc is much slower than clang.
Or maybe it is the interaction with the bit masking that causes the problem?
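
To make "isolating" a size concrete, here is a minimal sketch of fixed-size
copies of 3 and 5 bytes. The helper names copy_chunks_3 and copy_chunks_5 are
hypothetical and are not taken from the code attached to this bug. It leaves
out the bit masking and only shows the constant-size __builtin_memcpy calls
whose expansion is in question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the bug's test case): copy n_chunks
   consecutive chunks of exactly 3 bytes. The size argument is a
   compile-time constant, so the compiler may expand the call inline. */
void copy_chunks_3(uint8_t *dst, const uint8_t *src, size_t n_chunks)
{
    for (size_t i = 0; i < n_chunks; i++)
        __builtin_memcpy(dst + 3 * i, src + 3 * i, 3);
}

/* Same idea with a constant size of 5 bytes. */
void copy_chunks_5(uint8_t *dst, const uint8_t *src, size_t n_chunks)
{
    for (size_t i = 0; i < n_chunks; i++)
        __builtin_memcpy(dst + 5 * i, src + 5 * i, 5);
}
```

Timing each helper separately under both compilers, with the same
optimization flags, would show whether the slowdown comes from the fixed-size
copies themselves or only appears once the masking is added.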

(In reply to H.J. Lu from comment #4)
> I compared __builtin_memcpy one size at a time.  Here are results in
> cycles:
> 
> clang 1 bytes: 17193410146
> gcc   1 bytes: 15440244966
> clang 2 bytes: 8997535880
> gcc   2 bytes: 8147449530
> clang 3 bytes: 6002276628
> gcc   3 bytes: 5430387704
> clang 4 bytes: 4497121282
> gcc   4 bytes: 4069604454
> clang 5 bytes: 3644879742
> gcc   5 bytes: 3258094970
> clang 6 bytes: 3045612708
> gcc   6 bytes: 2728410608
> clang 7 bytes: 2574110178
> gcc   7 bytes: 2330365680
> clang 8 bytes: 969894432
> gcc   8 bytes: 6436950208
> 
> GCC is faster except for 8 byte size.