This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Unreviewed patch
- To: Jan Hubicka <jh at suse dot cz>
- Subject: Re: Unreviewed patch
- From: Daniel Berlin <dberlin at cygnus dot com>
- Date: Sat, 30 Dec 2000 20:45:39 -0800 (PST)
- cc: rth at cygnus dot com, gcc-patches at gcc dot gnu dot org, patches at x86-64 dot org
On Sat, 30 Dec 2000, Jan Hubicka wrote:
> Hi
> For the x86_64 port I need to implement SSE/SSE-2 based floating point
> arithmetic.
You'll need to fix up the stack alignment stuff I'm about to send to
gcc-patches if you want to be able to use movaps.
It does the proper 16-byte stack alignment.
Two problems.
1. It's not quite correct: I'm forgetting to do something, and we crash
without -fomit-frame-pointer.
2. I can't do what Intel suggests and have both an aligned and an
unaligned entry point. Doing the aligned entry point requires emitting a
label, which really pisses off the scheduler, and other things.
I'm guessing we are just doing prologues/epilogues too late.
It's quite clear from the dumps that it's trying to schedule a label
("{code label}"). This trips a check that makes sure we scheduled
everything in a region. I removed the check, and then we crash in flow.
I gave up and just removed the extra aligned entry point, and the crashes
went away.
You also need a small modification of i386.md to allow pushing
"eliminable" registers after reload: we want to use esi as our new
arg pointer, because we need to align the stack and thus need to store
the old stack pointer somewhere to access the arguments.
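To make the idea concrete, here is a rough C model of what such a prologue
computes. It is only a sketch, not the code from the patch: the names
align_down_16, struct frame and enter_aligned are made up for illustration,
and arg_pointer merely stands in for the role esi plays above.

```c
#include <stdint.h>

/* Hypothetical helper: round an address down to a 16-byte boundary.
   The x86 stack grows downward, so aligning it means clearing the
   low four bits of the stack pointer. */
static uintptr_t align_down_16(uintptr_t sp)
{
    return sp & ~(uintptr_t)15;
}

/* Hypothetical model of the state after the prologue has run. */
struct frame {
    uintptr_t arg_pointer;   /* incoming stack pointer, kept to reach the arguments */
    uintptr_t stack_pointer; /* realigned stack pointer, usable for movaps */
};

static struct frame enter_aligned(uintptr_t incoming_sp)
{
    struct frame f;
    f.arg_pointer = incoming_sp;                  /* save before realigning */
    f.stack_pointer = align_down_16(incoming_sp); /* now 16-byte aligned */
    return f;
}
```

Once the stack pointer has been rounded down, the distance to the incoming
arguments is no longer a compile-time constant, which is why the old value
has to live somewhere else.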
I also cleaned up xmmintrin.h and mmintrin.h, which have never been
seen in public before.
Assuming I'm allowed, I'll submit these too (I think they were meant
to be submitted, just no one had the time).
The P4, however, would most benefit from
A. Not doing strength reduction, if we do (IIRC, we don't for x86; I
could be misremembering). The latencies for shifts changed, so it's
cheaper to do adds now.
B. Scheduling for the function units, rather than the decoder. There is
only one decoder on the P4.
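As a small illustration of point A, these are the two equivalent forms in
question, written out in C. The function names are invented for the example,
and which form is actually faster depends on the target's shift and add
latencies; nothing here is measured.

```c
#include <stdint.h>

/* Multiply by 4 with a single shift. */
static uint32_t times4_shift(uint32_t x)
{
    return x << 2;
}

/* Multiply by 4 with two dependent adds instead of a shift. */
static uint32_t times4_adds(uint32_t x)
{
    uint32_t twice = x + x; /* 2 * x */
    return twice + twice;   /* 4 * x */
}
```

Both return the same value for every uint32_t input, including on wraparound,
so the choice is purely a cost question. An optimizing compiler is free to
rewrite either form into the other.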
I already took the existing SSE intrinsics, grouped them logically
(sse_packed, sse_single, sse_logical, etc.), then added a function unit for
SSE, with the right latencies/throughput for each group, just to get some
semblance of scheduling.
> I would like to do so on the existing P4 hardware and i386 port
> first and then merge it to the mainline before the rest of the x86_64
> stuff, since it will most likely be a beneficial optimization for P4 too.
>
> Before branching the tree for this project, I would like to get floating
> point into sync with mainline. There is one other FP patch pending that
> makes long double 128bit. This is required by the x86_64 ABI and also
> noticeably speeds up i386/PPro. Even though its usability on the current
> i386 backend is limited, would it be possible to review it soon? This
> would save me from having 3 independent development branches. Thank you
> very much!
>
> The patch is:
> http://gcc.gnu.org/ml/gcc-patches/2000-12/msg00719.html
>
> Honza
>