[Bug target/82668] New: could use BMI2 rorx for unpacking struct { int a,b }; from a register (SysV ABI)

peter at cordes dot ca gcc-bugzilla@gcc.gnu.org
Mon Oct 23 03:05:00 GMT 2017


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82668

            Bug ID: 82668
           Summary: could use BMI2 rorx for unpacking struct { int a,b };
                    from a register (SysV ABI)
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*

struct twoint {
        int a, b;
};

int bar(struct twoint s) {
        return s.a + s.b;
}

https://godbolt.org/g/4ygAMm

        movq    %rdi, %rax
        sarq    $32, %rax
        addl    %edi, %eax
        ret

But we could have used

    rorx   $32, %rdi, %rax       # 1 uop 1c latency
    add    %edi, %eax
    ret

rorxq is only 1 uop, vs. 2 for mov + sar.  It also saves a byte: a single
6-byte rorx replaces a 3-byte MOV plus a 4-byte SAR.
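
A rotate is safe to substitute for the arithmetic shift here because the
final addl only reads the low 32 bits: the stale copy of the low half that
rorx leaves in the upper 32 bits (where sarq would put sign bits) is never
consumed.  In C terms, roughly (my sketch of the equivalence, not part of the
testcase):

#include <stdint.h>

/* packed = struct twoint in a register: a in bits 0-31, b in bits 32-63 */
static int bar_equiv(uint64_t packed)
{
        /* what rorx $32 computes: b lands in the low 32 bits, the old low
           half (a) lands in the high 32 bits */
        uint64_t rotated = (packed >> 32) | (packed << 32);

        /* the add only consumes the low 32 bits of each operand, so the
           sign extension that sarq provides is irrelevant */
        return (int)(uint32_t)rotated + (int)(uint32_t)packed;
}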

Without BMI2, we can shorten the critical path from 3 to 2 cycles when mov
isn't zero-latency (and save a byte, since movl needs no REX prefix):

        movl    %edi, %eax
        sarq    $32, %rdi
        addl    %edi, %eax
        ret

This would be a better choice in general, especially for tune=generic.



Also related (let me know if I should report this separately, or whether
teaching gcc to use a rotate to swap struct members would fix this too):

// only needs one call-preserved reg and a rotate.
void struct_arg(struct twoint);   /* prototype assumed; callee is external */

long foo(int a /* edi */, int b /* esi */)
{
    struct_arg((struct twoint){a, b});
    struct_arg((struct twoint){b, a});
    return 0;
}

gcc saves two call-preserved registers so it can keep a and b in separate
registers, and shift+ORs them together again before each call:

        pushq   %rbp
        movl    %edi, %ebp
        pushq   %rbx
        movl    %esi, %ebx
        movq    %rbx, %rdi
        salq    $32, %rdi
        subq    $8, %rsp
        orq     %rbp, %rdi
        call    struct_arg
        movq    %rbp, %rdi
        salq    $32, %rdi
        orq     %rbx, %rdi
        call    struct_arg
        addq    $8, %rsp
        xorl    %eax, %eax
        popq    %rbx
        popq    %rbp
        ret


This is sub-optimal in two ways.  First, on Intel SnB-family (but not
Silvermont or any AMD), SHRD is efficient (1 uop, 1c latency, and it runs
only on port 1, vs. p06 for other shifts/rotates), so SHL + SHRD may be
better than mov + shl + or.

Second, instead of redoing the creation of the struct for the second call, we
could just rotate the first one.  Even writing the source as a swap of the
members of an existing struct (instead of creating a new struct) doesn't help.
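
For concreteness, the member-swap variant I mean is something along these
lines (an illustrative sketch, not verbatim what I tested; it uses the same
struct twoint and external struct_arg as above):

long foo_swap(int a, int b)
{
        struct twoint s = {a, b};
        struct_arg(s);

        int tmp = s.a;  s.a = s.b;  s.b = tmp;  /* swap members in place */
        struct_arg(s);
        return 0;
}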

Anyway, I think this would be better:

        pushq   %rbx
        shl     $32, %rdi
        shrd    $32, %rsi, %rdi   # SnB-family alternative to mov+shl+or

        rorx    $32, %rdi, %rbx   # arg for 2nd call
        call    struct_arg
        movq    %rbx, %rdi
        call    struct_arg

        xorl    %eax, %eax
        popq    %rbx
        ret

I didn't check whether I got the correct arg as the high half, but that's not
the point.
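
(For reference, the packed layout and the rotate-swaps-members identity are
easy to sanity-check in C.  This is just my sketch, assuming the usual SysV
small-struct layout: a in the low 32 bits, b in the high 32.)

#include <assert.h>
#include <stdint.h>

/* SysV packs struct twoint with a in the low half, b in the high half */
static uint64_t pack(int a, int b)
{
        return (uint32_t)a | ((uint64_t)(uint32_t)b << 32);
}

int main(void)
{
        uint64_t ab = pack(1, 2);
        uint64_t rotated = (ab >> 32) | (ab << 32);   /* what rorx $32 does */
        assert(rotated == pack(2, 1));                /* rotate == members swapped */
        return 0;
}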

