This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug target/82267] New: x32: unnecessary address-size prefixes. Why isn't -maddress-mode=long the default?
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 20 Sep 2017 05:06:23 +0000
- Subject: [Bug target/82267] New: x32: unnecessary address-size prefixes. Why isn't -maddress-mode=long the default?
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267
Bug ID: 82267
Summary: x32: unnecessary address-size prefixes. Why isn't
-maddress-mode=long the default?
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: ABI, missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*
x32 defaults to using 32-bit address-size everywhere, it seems. (Apparently
introduced by rev 185396 for bug 50797, which introduced -maddress-mode=short
and made it the default.)
This takes an extra 1-byte prefix on every instruction with a memory operand.
It's not just code-size; this is potentially a big throughput problem on Intel
Silvermont where more than 3 prefixes (including mandatory prefixes and 0F
escape bytes for SSE and other instructions) cause a stall. These are exactly
the systems where a memory-saving ABI might be most useful. (I'm not building
one, I just think x32 is a good idea if implemented optimally.)
long long doublederef(long long **p){
    return **p;
}
// https://godbolt.org/g/NHbURq
gcc8 -mx32 -O3
        movl    (%edi), %eax    # 0x67 prefix
        movq    (%eax), %rax    # 0x67 prefix
        ret
The second instruction is 1 byte longer for no reason: it needs a 0x67
address-size prefix to encode.
But we know for certain that the address is already zero-extended into %rax,
because we just put it there. Also, the ABI requires p to be zero-extended to
64 bits, so it would be safe to use `movl (%rdi), %eax` as the first
instruction.
Even (%rsp) is avoided for some reason, even though -mx32 still uses
push/pop/call/ret which use the full %rsp, so it has to be valid.
int stackuse(void) {
    volatile int foo = 2;
    return foo * 3;
}
        movl    $2, -4(%esp)            # 0x67 prefix
        movl    -4(%esp), %eax          # 0x67 prefix
        leal    (%rax,%rax,2), %eax     # no prefixes
        ret
Compiling with -maddress-mode=long appears to generate optimal code for all the
simple test cases I looked at, e.g.
        movl    $2, -4(%rsp)            # no prefixes
        movl    -4(%rsp), %eax          # no prefixes
        leal    (%rax,%rax,2), %eax     # no prefixes
        ret
-maddress-mode=long still uses an address-size prefix instead of an LEA to make
sure addresses wrap at 4G, and to ignore high garbage in registers:
long long fooi(long long *arr, int offset){
    return arr[offset];
}
        movq    (%edi,%esi,8), %rax     # same for mode=short or long.
        ret
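To make the "wrap at 4G, ignore high garbage" point concrete, here is a minimal C model of what the address-size prefix does to a base + index*scale computation. The function names ea_addr32 and ea_addr64 are made up for this illustration, and the model only covers the arithmetic, not the actual memory access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: effective address with a 0x67 address-size prefix.
   The sum is truncated to 32 bits and then zero-extended. */
uint64_t ea_addr32(uint64_t base_reg, uint64_t index_reg, unsigned scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return (uint32_t)(base_reg + index_reg * scale);
}

/* Hypothetical model: the same computation with the default 64-bit
   address size, so nothing is truncated. */
uint64_t ea_addr64(uint64_t base_reg, uint64_t index_reg, unsigned scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base_reg + index_reg * scale;
}
```

With arr = 0x1000 and offset = -1 held in the low half of a register whose high half contains garbage, ea_addr32 gives 0xff8 (arr - 8) regardless of the garbage, while ea_addr64 does not.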
Are there still cases where -maddress-mode=long makes worse code?
----
Is it really necessary for an unsigned offset to wrap at 4G? Does ISO C or
GNU C guarantee that large unsigned values work like negative signed integers
when used for pointer arithmetic?
// 64-bit offset so it won't have high garbage
long long fooull(long long *arr, unsigned long long offset){
    return arr[offset];
}
        movq    (%edi,%esi,8), %rax     # but couldn't this be (%rdi,%rsi,8)
        ret
Allowing 64-bit addressing modes with unsigned indexes could potentially save
significant code-size, couldn't it?
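As a rough sketch of what is at stake, the following C helper checks whether a 64-bit addressing mode would reach the same address as the 32-bit one for a zero-extended pointer and a full 64-bit unsigned index. The name addr64_matches_addr32 is hypothetical, and this is only an arithmetic model; it says nothing about what ISO C or GNU C actually require.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if (%rdi,%rsi,scale) and (%edi,%esi,scale)
   would compute the same address, given a zero-extended 32-bit pointer
   and a 64-bit unsigned index. */
bool addr64_matches_addr32(uint32_t arr, uint64_t index, unsigned scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    uint64_t full = (uint64_t)arr + index * scale; /* 64-bit address size */
    uint64_t wrapped = (uint32_t)full;             /* 0x67 prefix: wrap at 4G */
    return full == wrapped;
}
```

Under this model, small indexes agree and an index of 0x20000000 with scale 8 does not (the scaled offset is exactly 4G). An all-ones 64-bit index still agrees when arr is at least 8, because the 64-bit sum itself wraps back to arr - 8; the modes only diverge for large values that do not wrap at 64 bits.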
address-mode=long already allows constant offsets to go outside 4G, for
example:
foo_constant:                           # return arr[123456];
        movq    987648(%rdi), %rax
        ret
But it does treat the offset as signed, so an index of 0xffffffffULL compiles to
movq -8(%rdi), %rax.
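The two displacements above can be reproduced with a little arithmetic. This is a sketch that assumes the constant index is scaled by the element size, truncated to 32 bits, and reinterpreted as signed; that matches the observed output but is not taken from GCC's source. The name folded_disp32 is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: displacement produced by folding a constant index,
   assuming scale, keep the low 32 bits, then reinterpret as signed. */
int64_t folded_disp32(uint64_t index, unsigned elem_size)
{
    assert(elem_size > 0);
    uint32_t low = (uint32_t)(index * elem_size);
    /* Portable two's-complement reinterpretation of the low 32 bits. */
    return low < 0x80000000u ? (int64_t)low : (int64_t)low - 0x100000000LL;
}
```

folded_disp32(123456, 8) is 987648, and folded_disp32(0xffffffffULL, 8) is -8, because 0xffffffff * 8 = 0x7fffffff8 and its low 32 bits are 0xfffffff8.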
The ABI doc (https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI) doesn't
specify anything about C pointer-wrapping semantics, and I don't know where
else to look to find out what behaviour is required/guaranteed and what is just
how the current implementation happens to work.
Anyway, this is a side-track from the issue of not using address-size prefixes
in single-pointer cases where it's already zero extended.
---------
SSSE3 and later instructions need 66 0F 3A/38 before the opcode, so an
address-size or REX prefix will cause a decode stall on Silvermont. With the
default x32 behaviour, even SSE2 instructions (66 0F opcode) will cause decode
stalls with a REX and address-size prefix. e.g. paddb (%r8d), %xmm8 or even
movdqa (but not movaps or other SSE1 instructions). Fortunately KNL isn't
really affected: VEX/EVEX is fine unless there's a segment prefix before it,
and Agner Fog seems to be saying that other prefixes are fine.
In integer code, REX + operand-size + address-size + a 0F escape byte would be
a problem for Silvermont/KNL, e.g. imul (%edi), %r10w needs all 4. movbe %ax,
(%edi) has 4 prefixes, including the 2 mandatory escape bytes: 67 66 0f 38 f1
07.
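For checking encodings like these by hand, here is a small C sketch that counts prefixes the way the Silvermont limit is described above: legacy prefixes, one REX prefix, and the 0F / 0F 38 / 0F 3A escape bytes. It is a simplified model for 64-bit mode only, it does not handle VEX/EVEX, and the names is_legacy_prefix and count_prefix_bytes are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 for the x86 legacy prefix bytes. */
static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x66: case 0x67:             /* operand-size, address-size */
    case 0xF0: case 0xF2: case 0xF3:  /* lock, repne, rep */
    case 0x26: case 0x2E: case 0x36:  /* es, cs, ss overrides */
    case 0x3E: case 0x64: case 0x65:  /* ds, fs, gs overrides */
        return 1;
    default:
        return 0;
    }
}

/* Hypothetical helper: number of leading bytes that count toward the
   prefix limit in this simplified model (legacy + REX + escape bytes). */
size_t count_prefix_bytes(const uint8_t *insn, size_t len)
{
    assert(insn != NULL);
    size_t i = 0;
    while (i < len && is_legacy_prefix(insn[i]))
        i++;
    if (i < len && (insn[i] & 0xF0) == 0x40)   /* REX, 64-bit mode only */
        i++;
    if (i < len && insn[i] == 0x0F) {          /* first escape byte */
        i++;
        if (i < len && (insn[i] == 0x38 || insn[i] == 0x3A))
            i++;                               /* second escape byte */
    }
    return i;
}
```

Fed the movbe bytes 67 66 0f 38 f1 07 it returns 4, which is over the 3-prefix limit.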
In-order Atom also has "severe delays" (according to
http://agner.org/optimize/) with more than 3 prefixes, but unlike Silvermont,
that apparently doesn't include mandatory prefixes for SSE instructions.
Similarly, Bulldozer-family has a 3-prefix limit, but doesn't count escape
bytes, and VEX only counts as 0 or 1 (for 2/3 byte VEX).