i386 alignment tweaks...

Fri Jan 14 00:37:00 GMT 2000

In article <20000113132733.A24240@atrey.karlin.mff.cuni.cz>,
Jan Hubicka  <hubicka@atrey.karlin.mff.cuni.cz> wrote:
>
>I've been experimenting a bit with using fild/fist instruction to
>move DImode values and using them in memset/memcpy expanders.

Please don't do this.

This is a CLASSIC example of doing something that makes a targeted
benchmark run faster, and makes everything else slower in ways that are
REALLY hard to see and understand. 

Right now, very few programs use floating point regularly. Which means
that pretty much every operating system on the x86 does lazy FP state
saves and restores, or at least optimizes them away for processes that
do not touch their FP state during a timeslice. 

If you start sprinkling random FP state into binaries, you suddenly get
inexplicable slowdowns in critical areas - yet your benchmarks that you
use to "validate" your work will clearly show that it's a win. 
Especially if you run such a benchmark on a system where the "normal"
binaries do not use FP.

Then, a year later, nobody will understand why things have slowed down.
It's just because, in the meantime, _everybody_ is now dirtying their FP
state, and suddenly it really shows up.
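
To make the effect concrete, here is a toy model of lazy FP switching. It
is only a sketch: the function name count_fp_swaps and the round-robin
scheduling are assumptions made for illustration, not how any particular
kernel is written. It counts how often the FP state has to be swapped when
only the first nfp of ntasks tasks touch FP during their timeslice.

```c
/* Toy model (hypothetical, for illustration only): tasks run round-robin,
 * one timeslice each. Tasks 0..nfp-1 touch FP in every slice they get;
 * the rest never do. With lazy switching, the FP registers are only
 * saved/reloaded when a task touches FP while another task's state is
 * still loaded. Returns the number of such swaps. */
static unsigned count_fp_swaps(int ntasks, int nfp, int slices)
{
    int owner = -1;        /* task whose FP state is currently loaded */
    unsigned swaps = 0;

    if (ntasks <= 0)
        return 0;

    for (int s = 0; s < slices; s++) {
        int t = s % ntasks;            /* task running in this slice */
        if (t < nfp && owner != t) {   /* touches FP, state not loaded */
            swaps++;                   /* save old state, load t's */
            owner = t;
        }
    }
    return swaps;
}
```

In this model, a system with a single FP user pays for one swap in total,
while a system where every task dirties its FP state pays for one swap per
timeslice.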

It's also going to just suck _incredibly_ badly on old i386 machines
with coprocessor emulation etc. 

In order for it to make sense using FP, you have to basically prove that
the time you win by using a 64-bit copy is more than what you lose on
doing a full FP context reload.  Which simply won't be true until you
hit the several kilobyte mark, if even then. 
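
The break-even argument above can be written down as a small calculation.
This is a sketch under stated assumptions: the helper fp_copy_break_even
and its three parameters are hypothetical, and the cycle counts are inputs
that would have to be measured on real hardware, not known constants.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest copy size in bytes at which an FP-based
 * copy pays for the FP context reload it causes.
 *   int_cycles_per_byte  cost of the plain integer copy (assumed input)
 *   fp_cycles_per_byte   cost of the FP-based copy (assumed input)
 *   fp_context_cycles    cost of the extra FP save/restore (assumed input)
 * Returns SIZE_MAX if the FP copy is not faster per byte at all. */
static size_t fp_copy_break_even(double int_cycles_per_byte,
                                 double fp_cycles_per_byte,
                                 double fp_context_cycles)
{
    double saved = int_cycles_per_byte - fp_cycles_per_byte;
    double bytes;
    size_t n;

    if (saved <= 0.0)
        return SIZE_MAX;               /* never pays off */

    bytes = fp_context_cycles / saved;
    n = (size_t)bytes;
    if ((double)n < bytes)
        n++;                           /* round up to a whole byte */
    return n;
}
```

With made-up inputs of 1.0 and 0.5 cycles per byte and a 1000-cycle
context reload, the break-even point comes out at 2000 bytes; smaller
copies lose.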

Note that on many P6 cores, a simple "rep movsl" is _faster_ than a
clever FP copy. The "rep movsl" gets optimized by the microcode engine
to do the right thing wrt cache lines etc (ie it avoids reading in a
cache line that gets fully overwritten, and things like that). You have
to work quite hard at making FP or MMX be faster than a plain simple
"rep movsl", and you have the I$ footprint to consider too.
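
For reference, a plain "rep movsl" copy can be sketched with GCC inline
assembly as below. The wrapper name copy_words is hypothetical, and the
memcpy fallback for non-x86 targets is an addition for portability only;
the sketch makes no claim about how fast this is on any given core.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: copy nwords 32-bit words from src to dst.
 * The buffers must not overlap. */
static void copy_words(void *dst, const void *src, size_t nwords)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* rep movsl copies (E/R)CX dwords from [(E/R)SI] to [(E/R)DI].
     * It relies on the direction flag being clear, which the calling
     * convention guarantees on function entry. */
    __asm__ volatile("rep movsl"
                     : "+D"(dst), "+S"(src), "+c"(nwords)
                     :
                     : "memory");
#else
    /* Fallback for other compilers and targets. */
    memcpy(dst, src, nwords * 4);
#endif
}
```

The whole copy is a single instruction plus register setup, which is also
why its I$ footprint is so small compared with an unrolled FP or MMX loop.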

(Many people think that I$ footprints don't matter, because in many
small benchmark cores they really don't - the whole dang benchmark is in
the cache anyway. REAL APPLICATIONS DO NOT WORK THAT WAY!)

		Linus