This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Variable sized i386 string operations


> On Thu, Jan 27, 2000 at 06:53:05PM +0100, Jan Hubicka wrote:
> > Weak spot of this approach is code size. The generic memcpy is roughly
> > 40 bytes for -mi386 (compared to 50 bytes used by glibc) and faster for
> > small counts.  With -mother_cpu the single string operations are expanded
> > and memcpy gets up to 100 bytes.
> 
> Ouch.  Something that large probably hurts more than it helps.  At
> some point you should just give up and call the library routine.  
> I think glibc is wrong to do as much inline expansion as it does
> by default as well.
> 
> > I've also measured 40% speedup over library memcpy and 60% speedup
> > over glibc's inline on XaoS benchmarks.
> 
> Is this due to algorithmic changes or due to extra information
> available to the compiler (eg alignment)? 
> 
> If it's not due to extra information, I would limit inline expansion
> to "rep stos" and otherwise fix the library.

There are some notable speedups that are caused by alignment. Basically,
gcc is able to prove later that the values are always aligned and that no
tail handling is necessary on truecolor copies (since all sizes are multiples of 4).
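
To illustrate the truecolor case, here is a rough C model (not the actual
expander output; the function name copy_pixels32 is made up for this
example). Each pixel is one aligned 32-bit word, so the byte count is
always a multiple of 4 and the copy needs neither an alignment prologue
nor a byte tail:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy npixels 32-bit (truecolor) pixels.
   The pointer types already promise 4-byte alignment and the byte
   count is 4 * npixels, so a plain word loop covers everything;
   this is the shape a single "rep movsl" would handle. */
void copy_pixels32(uint32_t *dst, const uint32_t *src, size_t npixels)
{
    for (size_t i = 0; i < npixels; i++)
        dst[i] = src[i];
}
```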

Speedups for other pixel sizes come basically from avoiding the function call
overhead of the library version and from using rep movsl instead of rep movsb
in the short version. The problem is that this is a perfect example where the
information is not visible to the expander.

I also agree that inlining the generic case is too costly, and that's why
I didn't integrate this part into any of the earlier cases.
The 100 bytes is the maximum (i.e. unknown alignment and size), and I don't
have much idea what the average size is.

Perhaps we can develop some customizability for this (a switch to enable full
inlining and a switch to disable alignment handling, for instance) and default
to something like rep movsl + at most two branches.
The last extension I was thinking of is to implement an assembly language
version of memcpy in libgcc2 and use a non-standard API to call it. But this is
probably too involved.
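
For the "rep movsl + two branches" default, a minimal C sketch of the idea
could look like the following. This is only a model of the control flow, not
what the expander emits; the name copy_words_then_tail is hypothetical, and
the fixed-size memcpy calls merely stand in for single word and halfword
moves without assuming alignment:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model: bulk copy in 4-byte words, then at most two
   conditional steps for the remaining 0..3 bytes. */
void *copy_words_then_tail(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t words = n >> 2;

    /* Models "rep movsl" with the count divided by 4. */
    while (words--) {
        memcpy(d, s, 4);
        d += 4;
        s += 4;
    }
    /* First branch: one 2-byte move if bit 1 of the count is set. */
    if (n & 2) {
        memcpy(d, s, 2);
        d += 2;
        s += 2;
    }
    /* Second branch: one 1-byte move if bit 0 of the count is set. */
    if (n & 1)
        *d = *s;
    return dst;
}
```

When the count is known to be a multiple of 4, both tests are false at
compile time and only the word loop remains.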

Note that our strlen unroller runs into the same problem with code size.

Honza
> 
> 
> 
> r~
