Incidentally, I've thought that it might be interesting to experiment
with an x86_64 ABI which used ILP32, but which used the extra eight
registers, passed parameters in registers, and only preserved the low
32 bits of the callee-saved registers. For applications which could
live in a 32-bit address space, that would save the memory traffic
required to save and restore 64-bit registers on function entry and
exit. The result would be pretty similar to i386 code, but with 8
more registers. It should definitely run faster than i386 code, and I
think that due to reduced memory traffic it would also run faster
than standard x86_64 code. But I haven't actually tried it.