The logic in sh-mem.cc does not suppress inlining of memcpy when the size is constant but large. This prevents use of a library memcpy, which may be much faster than the inline version once the size is large enough to amortize the function-call overhead. At present the reason this is problematic on J2 is that the cache is direct mapped, so when source and dest are aligned mod large powers of two (typical when page-aligned), each write to dest evicts src from the cache, making memcpy 4-5x slower than it should be. A library memcpy can handle this by copying a cache line or more at a time, but the inline memcpy can't. Even if we have a set-associative cache on J-core in the future, I plan to have Linux provide a vdso memcpy function that can use DMA transfers, which are several times faster than anything achievable with a cpu-driven memcpy and which free up the cpu for other work. However, it's impossible for such a function to get called as long as gcc is inlining it. Using -fno-builtin-memcpy is not desirable because we certainly want inline memcpy for small transfers that would be dominated by function-call time (or where the actual memory accesses can be optimized out entirely and the copy performed in registers, as with memcpy for type punning), just not for large copies.
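To illustrate the aliasing problem described above: a minimal sketch of a copy loop that reads a whole cache line through a temporary buffer before writing it out, so the writes to dest cannot evict the line of src that is still being read. The 32-byte line size and the function name are assumptions for illustration, not the actual library implementation.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 32  /* assumed line size, for illustration only */

/* Copy one cache line at a time through a small stack buffer: all
 * reads from src for a given line complete before any write to dst
 * can evict that line from a direct-mapped cache. */
static void *linewise_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= CACHE_LINE) {
        unsigned char line[CACHE_LINE];
        memcpy(line, s, CACHE_LINE);  /* read the whole line first */
        memcpy(d, line, CACHE_LINE);  /* then write it out */
        s += CACHE_LINE;
        d += CACHE_LINE;
        n -= CACHE_LINE;
    }
    if (n)
        memcpy(d, s, n);              /* copy the tail */
    return dst;
}
```

An inline expansion emitted by the compiler copies word by word and has no way to apply this kind of line-granularity scheduling.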
(In reply to Rich Felker from comment #0) Do you have any idea of a reasonable size? Or would you prefer options like -mno-prefer-expand-block-move / -mexpand-block-move-threshold= ?
Unless you expect the inline memcpy to be a size savings (and it does not seem to be), the size threshold can just be chosen so that function-call time is negligible compared to copying time. I suspect that's already true around 256 bytes or so. I'm testing a patch that uses 256 as the limit; it made the Linux kernel very slightly faster (~1-2%) and does not seem to hurt anywhere. Major differences are unlikely to be seen unless the library memcpy does something fancy like DMA (or just avoids aliasing in direct-mapped caches).
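The policy being proposed can be sketched in plain C: expand the copy inline only when the size is a small compile-time constant, and otherwise force an out-of-line call that a library (or vdso) memcpy could satisfy. The 256-byte cutoff matches the discussion; the macro and wrapper names are invented for illustration.

```c
#include <string.h>

#define COPY_THRESHOLD 256  /* assumed cutoff, matching the patch */

/* Out-of-line copy: never inlined, so a smarter library or vdso
 * implementation could be substituted at link or run time. */
__attribute__((noinline))
static void *big_memcpy(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

/* Small constant sizes go to the builtin, which the compiler may
 * expand inline (or optimize away entirely for type punning);
 * everything else takes the function call. */
#define COPY(d, s, n)                                   \
    ((__builtin_constant_p(n) && (n) <= COPY_THRESHOLD) \
         ? memcpy((d), (s), (n))                        \
         : big_memcpy((d), (s), (n)))
```

This is only a model of what the backend threshold would do internally; in gcc itself the decision is made during block-move expansion, not in user code.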
Created attachment 39642 [details]
patch to add default threshold

256 sounds reasonable to me. Does the attached patch work for you?
(In reply to Rich Felker from comment #0)
> Even if we have a set-associative cache on J-core in the future, I plan to
> have Linux provide a vdso memcpy function that can use DMA transfers, which
> are several times faster than what you can achieve with any cpu-driven
> memcpy and which free up the cpu for other work. However it's impossible
> for such a function to get called as long as gcc is inlining it.

Just a note on the side... the above can also be done on an off-the-shelf SH MCU. However, it is only going to be beneficial for large memory blocks, since you'd have to synchronize (i.e. flush) the data cache lines of the memcpy'ed regions. For small blocks the DMA packet setup time will dominate, unless you've got a dedicated DMA channel sitting around just waiting for memcpy commands. Normally it's better to avoid copying large memory blocks at all and to use reference-counted buffers or something like that instead. That is, of course, unless you've got some special cache-coherent DMA machinery ready at hand and memory is very fast :)

(In reply to Rich Felker from comment #2)
> I'm testing a patch where I used 256 as the limit and it made the Linux
> kernel very slightly faster (~1-2%) and does not seem to hurt anywhere.

I'm curious: how did you measure this performance of the kernel? Which part in particular got faster, and in which situation?
Of course, a fancy memcpy in general is only a win beyond a certain size. Regarding DMA, I did not mean that I want to use DMA for any size beyond gcc's proposed function-call threshold. Rather, the vdso-provided function would choose what to do appropriately for the hardware. But on J2 (nommu, no special kernel mode) I suspect DMA could be a win at sizes as low as 256 bytes, with spin-to-completion and a lock shared between user (vdso) and kernel rather than a syscall (not sure this is justified, though). Using a syscall with sleep-during-dma would have a significantly larger threshold before it's worthwhile.

Regarding how I measured the kernel performance increase: I was just looking at boot timing with printk timestamps enabled. The main time consumer is unpacking initramfs.
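The shape of the vdso function described above might look like the following sketch: a CPU copy below some cutoff, and a submit-then-spin DMA path above it. dma_submit(), dma_done(), and DMA_THRESHOLD are invented stand-ins for hardware- and kernel-specific machinery (including the shared user/kernel lock), not a real interface; the stub here just falls back to a CPU copy so the control flow can be exercised.

```c
#include <stddef.h>
#include <string.h>

#define DMA_THRESHOLD 256  /* assumed crossover point, per the discussion */

/* Stand-in: real code would take the lock shared between user (vdso)
 * and kernel, then program the DMA engine. Here we just copy. */
static int dma_submit(void *d, const void *s, size_t n)
{
    memcpy(d, s, n);
    return 0;
}

/* Stand-in for polling the DMA completion status. */
static int dma_done(void)
{
    return 1;
}

void *vdso_memcpy(void *d, const void *s, size_t n)
{
    if (n < DMA_THRESHOLD)
        return memcpy(d, s, n);  /* CPU copy for small sizes */
    dma_submit(d, s, n);
    while (!dma_done())          /* spin-to-completion, no syscall */
        ;
    return d;
}
```

The point of the gcc change is simply that large copies compile to a call, so this function can ever be reached at all.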
(In reply to Rich Felker from comment #5)
> Of course, fancy memcpy in general is only a win beyond a certain size. For
> DMA I did not mean I want to use DMA for any size beyond gcc's proposed
> function-call threshold. Rather, the vdso-provided function would choose
> what to do appropriately for the hardware. But on J2 (nommu, no special
> kernel mode) I suspect DMA could be a win at sizes as low as 256 bytes, with
> spin-to-completion and a lock shared between user (vdso) and kernel rather
> than using a syscall (not sure this is justified, though). Using a syscall
> with sleep-during-dma would have a significantly larger threshold before
> it's worthwhile.

I see. Anyway, I agree that something like attachment 39642 [details] is useful.

> Regarding how I measured kernel performance increase, I was just looking at
> boot timing with printk timestamps enabled. The main time consumer is
> unpacking initramfs.

Ah right, that sounds like copying memory around. Do you happen to have any other runtime measurements?