Bug 57529

Summary: Redundant masking of zero-extended values
Product: gcc
Reporter: Jeremiah Willcock <jewillco>
Component: target
Assignee: Not yet assigned to anyone <unassigned>
Status: UNCONFIRMED
Severity: normal
Keywords: missed-optimization
Priority: P3
Version: 4.9.0
Target Milestone: ---
Host:
Target: x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed:

Description Jeremiah Willcock 2013-06-04 18:36:36 UTC
Using version "gcc (GCC) 4.9.0 20130519 (experimental)" with target "x86_64-unknown-linux-gnu" and the flags "-Ofast -std=gnu99 -march=bdver1", the following code:

#include <stdint.h>

void foo(const uint16_t* restrict indexes, const uint64_t* restrict bits, unsigned int* restrict sum, int count) {
  for (int i = 0; i < count; ++i) {
    unsigned int val = indexes[i];
    if (bits[val / 64] & (1UL << (val % 64))) {sum[val] += 1;}
  }
}

produces two shifts to implement the "val / 64" operation instead of one, seemingly because the compiler is trying to mask val to 16 bits even though the value was loaded with movzwl and is thus already masked and zero-extended.  Here is the assembly for the function body:

        testl   %ecx, %ecx      # count
        movl    %ecx, %r9d      # count, count
        jle     .L8     #,
        xorl    %eax, %eax      # ivtmp.5
        .p2align 4,,10
        .p2align 3
.L4:
        movzwl  (%rdi,%rax,2), %ecx     # MEM[base: indexes_8(D), index: ivtmp.5_52, step: 2, offset: 0B], D.2242
        movq    %rcx, %r8       # D.2242, D.2244
# **************** Redundant masking operation:
        salq    $48, %r8        #, D.2244
        shrq    $54, %r8        #, D.2244
# ****************
        movq    (%rsi,%r8,8), %r8       # *_16, D.2244
# ++++++++++++++++
        shrq    %cl, %r8        # D.2242, D.2244
        andl    $1, %r8d        #, D.2244
# ++++++++++++++++
        je      .L3     #,
# xxxxxxxxxxxxxxxx
        movzwl  %cx, %r8d       # D.2242, D.2244
# xxxxxxxxxxxxxxxx
        incl    (%rdx,%r8,4)    # *_25
.L3:
        incq    %rax    # ivtmp.5
        cmpl    %eax, %r9d      # ivtmp.5, count
        jg      .L4     #,
.L8:
        rep; ret

The seemingly unnecessary operation is marked with stars: a single shrq by 6 would perform the unsigned division correctly, but two instructions are emitted to both mask the value to 16 bits and shift it.  The zero-extension marked with x's is also unnecessary, since %rcx could have been used directly in the index expression.  On a somewhat unrelated note, the code marked with +'s looks sub-optimal as well and could probably be replaced by a bt instruction (GCC 4.4.7 emits "btq" there with -O3 and the same -march flag).
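
For reference, here is a hand-written sketch (not compiler output) of roughly what the loop body could look like with all three issues addressed; the register assignments and labels simply mirror the dump above and are only illustrative:

.L4:
        movzwl  (%rdi,%rax,2), %ecx     # val = indexes[i]; zero-extends into all of %rcx
        movq    %rcx, %r8
        shrq    $6, %r8                 # val / 64: one logical shift, no masking needed
        movq    (%rsi,%r8,8), %r8       # bits[val / 64]
        btq     %rcx, %r8               # test bit val % 64 (bt only uses the low 6 bits of the offset)
        jnc     .L3                     # bit clear -> skip the increment
        incl    (%rdx,%rcx,4)           # sum[val] += 1; %rcx reused directly, no extra movzwl
.L3:
        incq    %rax
        cmpl    %eax, %r9d
        jg      .L4

Whether bt is actually a win over the shift-and-mask sequence on bdver1 is a separate tuning question; the sketch is only meant to show that the masking and the second zero-extension are redundant.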