The logic in sh-mem.cc does not suppress inlining of memcpy when the size is constant but large. This prevents use of a library memcpy, which may be much faster than the inline version once the size is large enough to amortize the function-call overhead. At present the reason this is problematic on J2 is that the cache is direct mapped, so when source and dest are aligned mod large powers of two (typical when page-aligned), each write to dest evicts src from the cache, making memcpy 4-5x slower than it should be. A library memcpy can handle this by copying a cache line or more at a time, but the inline memcpy can't. Even if we have a set-associative cache on J-core in the future, I plan to have Linux provide a vdso memcpy function that can use DMA transfers, which are several times faster than anything achievable with a cpu-driven memcpy and which free up the cpu for other work. However, it's impossible for such a function to get called as long as gcc is inlining it. Using -fno-builtin-memcpy is not desirable because we certainly want inline memcpy for small transfers that would be dominated by function-call time (or where the actual memory accesses can be optimized out entirely and the copy performed in registers, as with memcpy for type punning), just not for large copies.
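To illustrate the aliasing problem described above: a minimal sketch of a copy loop that reads a whole cache line through a temporary buffer before writing it out, so the writes to dest cannot evict the line of src that is still being read. The 32-byte line size and the function name are assumptions for illustration, not the actual library implementation.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 32  /* assumed line size, for illustration only */

/* Copy one cache line at a time through a small stack buffer: all
 * reads from src for a given line complete before any write to dst
 * can evict that line from a direct-mapped cache. */
static void *linewise_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= CACHE_LINE) {
        unsigned char line[CACHE_LINE];
        memcpy(line, s, CACHE_LINE);  /* read the whole line first */
        memcpy(d, line, CACHE_LINE);  /* then write it out */
        s += CACHE_LINE;
        d += CACHE_LINE;
        n -= CACHE_LINE;
    }
    if (n)
        memcpy(d, s, n);              /* copy the tail */
    return dst;
}
```

An inline expansion emitted by the compiler copies word by word and has no way to apply this kind of line-granularity scheduling.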
(In reply to Rich Felker from comment #0) Do you have any idea of a reasonable size? Or would you prefer options like -mno-prefer-expand-block-move / -mexpand-block-move-threshold= ?
Unless you expect the inline memcpy to be a size savings (and it does not seem to be), the size threshold can just be chosen so that function-call time is negligible compared to copying time. I suspect that's already true around 256 bytes or so. I'm testing a patch that uses 256 as the limit; it made the Linux kernel very slightly faster (~1-2%) and does not seem to hurt anywhere. Major differences are unlikely to be seen unless the library memcpy does something fancy like DMA (or just avoids aliasing in direct-mapped caches).
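The policy being proposed can be sketched in plain C: expand the copy inline only when the size is a small compile-time constant, and otherwise force an out-of-line call that a library (or vdso) memcpy could satisfy. The 256-byte cutoff matches the discussion; the macro and wrapper names are invented for illustration.

```c
#include <string.h>

#define COPY_THRESHOLD 256  /* assumed cutoff, matching the patch */

/* Out-of-line copy: never inlined, so a smarter library or vdso
 * implementation could be substituted at link or run time. */
__attribute__((noinline))
static void *big_memcpy(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

/* Small constant sizes go to the builtin, which the compiler may
 * expand inline (or optimize away entirely for type punning);
 * everything else takes the function call. */
#define COPY(d, s, n)                                   \
    ((__builtin_constant_p(n) && (n) <= COPY_THRESHOLD) \
         ? memcpy((d), (s), (n))                        \
         : big_memcpy((d), (s), (n)))
```

This is only a model of what the backend threshold would do internally; in gcc itself the decision is made during block-move expansion, not in user code.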
Created attachment 39642 [details]
patch to add default threshold

256 sounds reasonable to me. Does the attached patch work for you?
(In reply to Rich Felker from comment #0)
> Even if we have a set-associative cache on J-core in the future, I plan to
> have Linux provide a vdso memcpy function that can use DMA transfers, which
> are several times faster than what you can achieve with any cpu-driven
> memcpy and which free up the cpu for other work. However it's impossible
> for such a function to get called as long as gcc is inlining it.

Just a note on the side... the above can also be done on an off-the-shelf SH MCU. However, it is only going to be beneficial for large memory blocks, since you'd have to synchronize (i.e. flush) the data cache lines of the memcpy'ed regions. For small blocks the DMA packet setup time will dominate, unless you've got a dedicated DMA channel sitting around just waiting for memcpy commands. Normally it's better to avoid copying large memory blocks at all and to use reference-counted buffers or something like that instead. That is, of course, unless you've got some special cache-coherent DMA machinery ready at hand and memory is very fast :)

(In reply to Rich Felker from comment #2)
> I'm testing a patch where I used 256 as the limit and it made the Linux
> kernel very slightly faster (~1-2%) and does not seem to hurt anywhere.

I'm curious: how did you measure this performance of the kernel? Which part in particular got faster, and in which situation?
Of course, a fancy memcpy in general is only a win beyond a certain size. Regarding DMA, I did not mean that I want to use DMA for any size beyond gcc's proposed function-call threshold. Rather, the vdso-provided function would choose what to do appropriately for the hardware. But on J2 (nommu, no special kernel mode) I suspect DMA could be a win at sizes as low as 256 bytes, with spin-to-completion and a lock shared between user (vdso) and kernel rather than a syscall (not sure this is justified, though). Using a syscall with sleep-during-dma would have a significantly larger threshold before it's worthwhile.

Regarding how I measured the kernel performance increase: I was just looking at boot timing with printk timestamps enabled. The main time consumer is unpacking initramfs.
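The shape of the vdso function described above might look like the following sketch: a CPU copy below some cutoff, and a submit-then-spin DMA path above it. dma_submit(), dma_done(), and DMA_THRESHOLD are invented stand-ins for hardware- and kernel-specific machinery (including the shared user/kernel lock), not a real interface; the stub here just falls back to a CPU copy so the control flow can be exercised.

```c
#include <stddef.h>
#include <string.h>

#define DMA_THRESHOLD 256  /* assumed crossover point, per the discussion */

/* Stand-in: real code would take the lock shared between user (vdso)
 * and kernel, then program the DMA engine. Here we just copy. */
static int dma_submit(void *d, const void *s, size_t n)
{
    memcpy(d, s, n);
    return 0;
}

/* Stand-in for polling the DMA completion status. */
static int dma_done(void)
{
    return 1;
}

void *vdso_memcpy(void *d, const void *s, size_t n)
{
    if (n < DMA_THRESHOLD)
        return memcpy(d, s, n);  /* CPU copy for small sizes */
    dma_submit(d, s, n);
    while (!dma_done())          /* spin-to-completion, no syscall */
        ;
    return d;
}
```

The point of the gcc change is simply that large copies compile to a call, so this function can ever be reached at all.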
(In reply to Rich Felker from comment #5)
> Of course, fancy memcpy in general is only a win beyond a certain size. For
> DMA I did not mean I want to use DMA for any size beyond gcc's proposed
> function-call threshold. Rather, the vdso-provided function would choose
> what to do appropriately for the hardware. But on J2 (nommu, no special
> kernel mode) I suspect DMA could be a win at sizes as low as 256 bytes, with
> spin-to-completion and a lock shared between user (vdso) and kernel rather
> than using a syscall (not sure this is justified, though). Using a syscall
> with sleep-during-dma would have a significantly larger threshold before
> it's worthwhile.

I see. Anyway, I agree that something like attachment 39642 [details] is useful.

> Regarding how I measured kernel performance increase, I was just looking at
> boot timing with printk timestamps enabled. The main time consumer is
> unpacking initramfs.

Ah right, that sounds like copying memory around. Do you happen to have any other runtime measurements?