[Bug target/85038] New: x32: unnecessary address-size prefix when a pointer register is already zero-extended

peter at cordes dot ca
Thu Mar 22 12:12:00 GMT 2018


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038

            Bug ID: 85038
           Summary: x32: unnecessary address-size prefix when a pointer
                    register is already zero-extended
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Bug 82267 was fixed for RSP only (or was interpreted narrowly as being only
about RSP vs. ESP).

This bug is about the general case of emitting address-size prefixes where we
could prove they're not needed: either because out-of-bounds is UB, so we don't
care about wrapping vs. going outside 4GiB, or (simpler) the single-register
case where we know the pointer is already zero-extended.  Maybe we want
separate bugs to track the parts that can be fixed with separate patches, but I
won't consider this fixed until -mx32 emits optimal code for all the cases
listed here.

I realize this won't be any time soon, but it's still code-size (and thus
indirectly performance) that gcc is leaving on the table.  Being smarter about
using 64-bit address-size is even more useful for AArch64 -mabi=ilp32, because
it doesn't have 32-bit address-size overrides, so it always costs an extra
instruction every time we fail to prove that 64-bit is safe.  (And AArch64
ILP32 may get more use than x32 these days).  I intended this bug to be about
x32, though.

--------

Useless 0x67 address-size override prefixes hurt code-size and thus performance
on everything, with more serious problems on some CPUs that have trouble with
more than 3 prefixes (especially Silvermont).  See Bug 82267 for the details
which I won't repeat.


We still have tons of useless 0x67 prefixes in the default -maddress-mode=short
mode (for every memory operand that isn't RSP-based or RIP-relative), and
-maddress-mode=long has lots of missed optimizations resulting in wasted LEA
instructions, so neither one is good.


float doublederef(float **p){
        return **p;
}
 // https://godbolt.org/g/exb74t
 // gcc 8.0.1 (trunk) -O3 -mx32 -march=haswell -maddress-mode=short
        movl    (%edi), %eax
        vmovss  (%eax), %xmm0    # could/should be (%rax)
        ret

-maddress-mode=long gets that right, using (%rax), and also (%rdi), because the
ABI doc specifies that x32 passes pointers zero-extended.  mode=short still
upholds that guarantee on the caller side, so failing to take advantage of it
is still a missed optimization.

Note that clang -mx32 violates that ABI guarantee by compiling
pass_arg(unsigned long long ptr) { ext_func((void*)ptr); } to just a tailcall
(while gcc does zero-extend).  See output in the godbolt link above.  IDK if we
care about being bug-compatible with clang for that corner case for this rare
ABI, though.  A less contrived case would be a struct arg or return value
packed into a register and then passed on as just a pointer.
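
Spelled out as a complete translation unit (ext_func's prototype is my guess;
only the integer-to-pointer conversion matters):

void ext_func(void *p);                  // assumed prototype, for illustration
void pass_arg(unsigned long long ptr){
        ext_func((void*)ptr);            // gcc -mx32 zero-extends the low 32 bits
                                         // before the (tail)call; clang just jmps
}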


-----

// arr+offset*4 is strictly within the low 32 bits because of range limits

float safe_offset(float *arr, unsigned offset){
    unsigned tmp = (unsigned)arr;
    arr = (void*)(tmp & -4096);  // round down to a page
    offset &= 0xf;
    return arr[offset];
}
   // on the above godbolt link
    #mode=short
        andl    $-4096, %edi
        andl    $15, %esi
        vmovss  (%edi,%esi,4), %xmm0
        # (%rdi,%rsi,4) would have been safe, but that's maybe not worth
        # looking for: most cases have less pointer alignment than offset range

    #mode=long
        andl    $-4096, %edi
        andl    $15, %esi
        leal    (%rdi,%rsi,4), %eax
        vmovss  (%eax), %xmm0         # 32-bit addrmode after using a separate LEA

So mode=long is just braindead here.  It gets the worst of both worlds, using a
separate LEA but then not taking advantage of the zero-extended pointer.  The
only way this could be worse is if the LEA's operand-size were 64-bit.

Without the masking, both modes just use  vmovss (%edi,%esi,4), %xmm0, but the
extra operations defeat mode=long's attempts to recognize this case, and it
picks an LEA instead of (or as well as?!?) an address-size prefix.

-------

With a 64-bit offset, and a pointer that's definitely zero-extended to 64 bits:

                   // same for signed or unsigned
float ptr_and_offset_zext(float **p, unsigned long long offset){
        float *arr = *p;
        return arr[offset];
}

    # mode=short
        movl    (%edi), %eax          # mode=long uses (%rdi) here
        vmovss  (%eax,%esi,4), %xmm0  # but still 32-bit here.
        ret

Why are we using an address-size prefix to stop a base+index calculation from
going outside 4GiB, when doing so is already out-of-bounds UB?  (%rax,%rsi,4)
should work for a signed or unsigned 64-bit offset when the pointer is known to
be zero-extended.

ISO C11 says that pointer+integer produces a result of pointer type, with UB if
the result goes outside the array.  It does *not* say that the integer has to
be truncated to pointer width *first*:

> n1570 6.5.6 Additive operators, point 8:
> ...
> If both the pointer operand and the result point to elements of the same
> array object, or one past the last element of the array object, the
> evaluation shall not produce an overflow;
>   **otherwise, the behavior is undefined.**

So it's perfectly valid to use all the bits of a wide integer array index,
because it is UB if the high bits take the result outside of any object, even
if truncating the input or the result to 32 bits would have produced a valid
pointer.

This allows optimizations with 64-bit offsets, and with 32-bit or narrower
offsets that are correctly extended to 64 bits.  (But I suspect gcc's internals
make it hard to take advantage of this if we don't have a concept of
"correctly extended to 64 bits in a register" before we lose the signed vs.
unsigned info.)
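
For example, a sketch of the narrow-offset case (the function name and the
explicit widening are mine, for illustration):

float idx_i32_ext(float *arr, int offset){
        long long off64 = offset;     // explicit sign-extension to 64 bits
        return arr[off64];            // (%rdi,%rsi,4) with no 0x67 prefix would be
                                      // valid for any non-UB offset, even a negative
                                      // one: arr + off64 must land inside an object,
                                      // and every x32 object is in the low 4GiB
}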


Richard Biener pointed out
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267#c1) that although
wrap-around of pointer-math is not required, RTL doesn't know whether 32-bit
offsets are signed or unsigned and thus has to consider the case of a signed
32-bit offset (where 64-bit address size with 32-bit signed values
zero-extended in 64-bit registers wouldn't work).
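
To make that concrete, a sketch with made-up numbers (the addresses are mine,
not from a real compile):

// Illustrative numbers only:
//   arr    = 0x00401000  (valid x32 pointer, zero-extended in %rdi)
//   offset = -1          (32-bit int, bit pattern 0xFFFFFFFF in %esi, upper
//                         half of %rsi zero)
// 32-bit address size, (%edi,%esi,4):
//   (0x00401000 + 0xFFFFFFFF*4) mod 2^32 = 0x00400FFC  = arr - 4, correct
// 64-bit address size, (%rdi,%rsi,4), with the offset only zero-extended:
//    0x00401000 + 0xFFFFFFFF*4           = 0x400400FFC  outside 4GiB, wrong
// So without sign information (or a real sign-extension), RTL has to keep the
// 0x67 prefix or use a separate sign-extending LEA.
float idx_signed(float *arr, int offset){
        return arr[offset];
}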

But here the pointer and offset are already *64* bits (with mode=long) or
correctly extended to 64 (with mode=short), so (%eax,%esi,4) is the same
effective-address as (%rax,%rsi,4) for any offset that doesn't cause UB by
going outside an object.  i.e. we can assume that array + offset fits in 32
bits, even if offset is a 64-bit negative integer (if array is a pointer
zero-extended to 64-bit).  If the resulting 64-bit address is outside the low
4GiB, the access was UB because we know there are no objects there, so we don't
have to care about such inputs.

Thus it doesn't matter whether we use 64-bit address size and let addressing
mode generate a valid address with the upper 32 bits zero, or whether we
truncate the address calculation to 32 bits.  The only difference will be for
out-of-bounds offset values, which could wrap back to a valid address on
truncation to 32 bits, instead of faulting on an attempt to access far beyond
the end of an array.
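
A concrete instance of that difference (made-up numbers again):

// arr = 0xFFFFF000 (zero-extended), 64-bit offset = 0x2000 floats = 0x8000
// bytes, which runs past the end of whatever object arr points into:
//   64-bit address size:  0xFFFFF000 + 0x8000 = 0x100007000
//                         -> faults; nothing can be mapped there under x32
//   32-bit truncation:    (0xFFFFF000 + 0x8000) mod 2^32 = 0x00007000
//                         -> silently hits whatever happens to live there
// Both behaviours are acceptable: the access already left the object, so it
// was UB either way.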


----


Estimate of the code-size impact: maybe 4% machine-code size for pointer-heavy
code like gcc's own cc1 executable.

Looking at a binary compiled with -m64: how much worse would it be with -mx32?
(Arch Linux doesn't support x32, so I don't have any x32 binaries sitting around.)

objdump -drwC -Mintel --section=.text /usr/lib/gcc/x86_64-pc-linux-gnu/7.3.0/cc1 |
  egrep -v '^[^ ]|^$|\Wnop\W|\Wlea\W' | egrep ' .*(\[r[^is]|\[rsi)' -c

545101 instructions with a memory operand that gcc -mx32 would use a prefix
for, and thus 545101 bytes of addr32 prefixes.

(out of ~3380113 total instructions in the .text section)

The first egrep filters out non-instruction lines, plus NOP and LEA.  The
second counts register addressing modes other than [rip+... and [rsp+...,
which current gcc already knows not to use an address-size prefix for.

On the over-optimistic assumption that every address-size prefix could be
avoided, this 23MiB compiler executable (from gcc7.3 on Arch Linux) would have
~0.5MiB of address-size prefixes, or ~2% of the total size of the executable
(which I think is mostly code, not data).  Or 4% of the .text section (13695266
bytes).

That doesn't account for the code also getting smaller from needing fewer REX
prefixes, which is an error in the opposite direction from assuming that every
addr32 prefix can be avoided.

----

There are two approaches to improve the situation:

* teach -maddress-mode=long to use 32-bit addressing modes instead of extra
instructions whenever it can't prove that a 64-bit address-size is safe, and
make it the default mode.

* teach -maddress-mode=short to look for cases where it *can* prove that 64-bit
address-size is safe, and omit the 0x67 prefix in that case.

Is either of these feasible?  Or is gcc just not designed for ILP32 ABIs on
64-bit CPUs at all?

One possible peephole is RBP+constant addressing modes when RBP is being used
as a frame pointer.  (Related: mode=long uses mov %rsp, %rbp instead of
mov %esp, %ebp.)  Is the information about whether RBP is a frame pointer
available at the same point where the RSP peephole check is done?  If not, the
implementation would have to be very different.
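
A sketch of that case (assuming -fno-omit-frame-pointer, so the local is
addressed relative to the frame pointer):

int frame_pointer_case(int x){
        volatile int local = x;   // forces a real stack slot, typically -4(%ebp)
        return local;             // %rbp was copied from an already zero-extended
                                  // stack pointer, so -4(%rbp) would be just as
                                  // correct, minus the 0x67 prefix
}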

But really -mx32 should at least avoid prefixes on single-register addressing
modes.  That should be easy to prove correct without reasoning about negative
integers.  It's *very* common that a pointer in a register is already
zero-extended.
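
The simplest instance (sketch): an incoming pointer arg, which the ABI already
guarantees is zero-extended:

float deref(float *p){
        return *p;        // mode=short uses (%edi) here, as in doublederef above;
                          // (%rdi) would be equally correct and one byte shorter
}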

