Bug 57529

Summary: Redundant masking of zero-extended values
Product: gcc
Reporter: Jeremiah Willcock <jewillco>
Component: target
Assignee: Not yet assigned to anyone <unassigned>
Status: UNCONFIRMED
Severity: normal
Keywords: missed-optimization
Priority: P3
Version: 4.9.0
Target Milestone: ---
Host:
Target: x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed:

Description Jeremiah Willcock 2013-06-04 18:36:36 UTC
Using version "gcc (GCC) 4.9.0 20130519 (experimental)" with target "x86_64-unknown-linux-gnu" and the flags "-Ofast -std=gnu99 -march=bdver1", the following code:

#include <stdint.h>

void foo(const uint16_t* restrict indexes, const uint64_t* restrict bits, unsigned int* restrict sum, int count) {
  for (int i = 0; i < count; ++i) {
    unsigned int val = indexes[i];
    if (bits[val / 64] & (1UL << (val % 64))) {sum[val] += 1;}
  }
}

produces two shifts to implement the "val / 64" operation instead of one, seemingly because the compiler is trying to mask val to 16 bits even though the value was loaded with movzwl and is thus already masked and zero-extended.  Here is the assembly for the function body:

        testl   %ecx, %ecx      # count
        movl    %ecx, %r9d      # count, count
        jle     .L8     #,
        xorl    %eax, %eax      # ivtmp.5
        .p2align 4,,10
        .p2align 3
.L4:
        movzwl  (%rdi,%rax,2), %ecx     # MEM[base: indexes_8(D), index: ivtmp.5_52, step: 2, offset: 0B], D.2242
        movq    %rcx, %r8       # D.2242, D.2244
# **************** Redundant masking operation:
        salq    $48, %r8        #, D.2244
        shrq    $54, %r8        #, D.2244
# ****************
        movq    (%rsi,%r8,8), %r8       # *_16, D.2244
# ++++++++++++++++
        shrq    %cl, %r8        # D.2242, D.2244
        andl    $1, %r8d        #, D.2244
# ++++++++++++++++
        je      .L3     #,
# xxxxxxxxxxxxxxxx
        movzwl  %cx, %r8d       # D.2242, D.2244
# xxxxxxxxxxxxxxxx
        incl    (%rdx,%r8,4)    # *_25
.L3:
        incq    %rax    # ivtmp.5
        cmpl    %eax, %r9d      # ivtmp.5, count
        jg      .L4     #,
.L8:
        rep; ret

The seemingly unnecessary operation is marked with stars: a single shrq by 6 would perform the unsigned division correctly, but two instructions are emitted to both mask the value to 16 bits and shift it.  The zero-extension marked with x's is also unnecessary, since %rcx could have been used directly in the index expression.  On a somewhat unrelated note, the code marked with +'s looks sub-optimal as well and could probably be replaced by a bt instruction (GCC 4.4.7 emits "btq" there with -O3 and the same -march flag).
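
For reference, here is a hand-written sketch (not compiler output) of roughly what the loop body could look like with all three issues addressed; the register assignments and labels simply mirror the dump above and are only illustrative:

.L4:
        movzwl  (%rdi,%rax,2), %ecx     # val = indexes[i]; zero-extends into all of %rcx
        movq    %rcx, %r8
        shrq    $6, %r8                 # val / 64: one logical shift, no masking needed
        movq    (%rsi,%r8,8), %r8       # bits[val / 64]
        btq     %rcx, %r8               # test bit val % 64 (bt only uses the low 6 bits of the offset)
        jnc     .L3                     # bit clear -> skip the increment
        incl    (%rdx,%rcx,4)           # sum[val] += 1; %rcx reused directly, no extra movzwl
.L3:
        incq    %rax
        cmpl    %eax, %r9d
        jg      .L4

Whether bt is actually a win over the shift-and-mask sequence on bdver1 is a separate tuning question; the sketch is only meant to show that the masking and the second zero-extension are redundant.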