This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug target/77610] [sh] memcpy is wrongly inlined even for large copies

From: "olegendo at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Tue, 20 Sep 2016 13:06:41 +0000
Subject: [Bug target/77610] [sh] memcpy is wrongly inlined even for large copies
Auto-submitted: auto-generated
References: <bug-77610-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77610

Oleg Endo <olegendo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |olegendo at gcc dot gnu.org

--- Comment #4 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Rich Felker from comment #0)
> 
> Even if we have a set-associative cache on J-core in the future, I plan to
> have Linux provide a vdso memcpy function that can use DMA transfers, which
> are several times faster than what you can achieve with any cpu-driven
> memcpy and which free up the cpu for other work. However it's impossible
> for such a function to get called as long as gcc is inlining it.

Just a side note... the above can also be done on an off-the-shelf SH
MCU.  However, it is only going to be beneficial for large memory blocks, since
you'd have to synchronize (i.e. flush) the data cache lines of the memcpy'ed
regions.  For small blocks, DMA packet setup time will dominate, unless you've
got one dedicated DMA channel sitting around just waiting for memcpy commands.
Normally it's better to avoid copying large memory blocks at all and use
reference-counted buffers or something like that instead.  That is, of course,
unless you've got some special cache-coherent DMA machinery ready at hand and
memory is very fast :)
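
To illustrate the size trade-off described above, here is a rough sketch of a
copy routine that only hands large blocks to a DMA path.  Everything in it is
an assumption made for illustration: the names copy_dispatch and dma_copy_stub,
the counter, and the 4096-byte cutoff are hypothetical, and the "DMA" function
is only a stand-in that calls memcpy.  A real one would have to write back and
invalidate the affected cache lines and program an actual DMA channel.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff.  The right value depends on the DMA setup cost and
   the cache synchronization cost of the actual hardware. */
#define DMA_COPY_THRESHOLD 4096

/* Counts how often the DMA path was taken (for illustration only). */
static unsigned long dma_copy_calls;

/* Stand-in for a DMA-backed copy.  A real version would write back the
   source cache lines, program a DMA channel, wait for completion and
   invalidate the destination cache lines.  This stub only calls memcpy. */
static void *dma_copy_stub(void *dst, const void *src, size_t n)
{
    dma_copy_calls++;
    return memcpy(dst, src, n);
}

/* Small blocks are copied by the CPU, where setup overhead would dominate.
   Blocks at or above the threshold go to the DMA path. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n < DMA_COPY_THRESHOLD)
        return memcpy(dst, src, n);
    return dma_copy_stub(dst, src, n);
}
```

With this split, a 64-byte copy stays on the CPU while an 8 KiB copy takes the
DMA path.  Where the cutoff really belongs would have to be measured on the
target.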


(In reply to Rich Felker from comment #2)
> I'm testing a patch where I used 256 as the limit and it made the Linux
> kernel very slightly faster (~1-2%) and does not seem
> to hurt anywhere.
> 

I'm curious, how did you measure the kernel's performance?  Which part in
particular got faster, and in which situation?

References:
- [Bug target/77610] New: [sh] memcpy is wrongly inlined even for large copies
  - From: bugdal at aerifal dot cx
