Bug 53133 - XOR AL,AL to zero lower 8 bits of EAX/RAX causes partial register stall (Intel Core 2)
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.7.0
Importance: P3 normal
Target Milestone: 6.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2012-04-27 03:42 UTC by Adam Warner
Modified: 2021-08-15 05:20 UTC

See Also:
Host:
Target: i?86-*-* x86_64-*-*
Build:
Known to work: 6.1.0
Known to fail: 4.3.6, 4.6.2, 4.7.0, 5.5.0
Last reconfirmed: 2012-04-27 00:00:00


Description Adam Warner 2012-04-27 03:42:16 UTC
Processor is Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz

#include <stdint.h>
#include <stdio.h>

uint32_t mem = 0;

int main(void) {
  uint64_t sum=0;
  for (uint32_t i=3000000000; i>0; --i) {
    asm volatile ("" : : : "memory"); //load data from memory each time
    uint64_t data = mem;

    //partial register stall
    sum += (data & UINT64_C(0xFFFFFFFFFFFFFF00)) >> 2;

    //no partial register stall
    //sum += (data >> 2) & UINT64_C(0xFFFFFFFFFFFFFFC0);
  }
  printf("sum is %llu\n", (unsigned long long)sum); //cast: uint64_t is unsigned long on x86_64 Linux
}

$ gcc-4.7 -O3 -std=gnu99 partial_register_stall.c && time ./a.out 
sum is 0

real	0m4.504s
user	0m4.500s
sys	0m0.000s

Each loop iteration takes 4.5 cycles (4.5 s at 3.00 GHz over 3,000,000,000 iterations).

Relevant assembly code:

  400410:       8b 05 ee 04 20 00       mov    eax,DWORD PTR [rip+0x2004ee]        # 600904 <mem>
  400416:       30 c0                   xor    al,al
  400418:       48 c1 e8 02             shr    rax,0x2
  40041c:       48 01 c6                add    rsi,rax
  40041f:       83 ea 01                sub    edx,0x1
  400422:       75 ec                   jne    400410 <main+0x10>

mem is zero-extended into RAX. The lower 8 bits of RAX are zeroed via XOR AL, AL. The result is shifted down by two.

An equivalent way of computing this is to first shift down by two and then mask the lower six bits to zero. That is, replace the line:
   sum += (data & UINT64_C(0xFFFFFFFFFFFFFF00)) >> 2;
with:
   sum += (data >> 2) & UINT64_C(0xFFFFFFFFFFFFFFC0);

$ gcc-4.7 -O3 -std=gnu99 partial_register_stall.c && time ./a.out 
sum is 0

real	0m2.002s
user	0m2.000s
sys	0m0.000s

Each loop iteration now takes 2 cycles.

Relevant assembly code:

  400410:       8b 05 fe 04 20 00       mov    eax,DWORD PTR [rip+0x2004fe]        # 600914 <mem>
  400416:       48 c1 e8 02             shr    rax,0x2
  40041a:       48 83 e0 c0             and    rax,0xffffffffffffffc0
  40041e:       48 01 c6                add    rsi,rax
  400421:       83 ea 01                sub    edx,0x1
  400424:       75 ea                   jne    400410 <main+0x10>
Comment 1 Richard Biener 2012-04-27 09:14:45 UTC
Confirmed.
Comment 2 Uroš Bizjak 2012-04-30 13:19:56 UTC
This is due to the following splitter in i386.md:

(define_split
  [(set (match_operand 0 "ext_register_operand")
	(and (match_dup 0)
	     (const_int -256)))
   (clobber (reg:CC FLAGS_REG))]
  "(!TARGET_PARTIAL_REG_STALL || optimize_function_for_size_p (cfun))
   && reload_completed"
  [(set (strict_low_part (match_dup 1)) (const_int 0))]
  "operands[1] = gen_lowpart (QImode, operands[0]);")

However, the Core architecture is not listed under X86_TUNE_PARTIAL_REG_STALL, although my documentation says that the following latency should be added due to partial register stalls:

PPro, P2, P3  : 5
Core          : 1-5
Core2, Corei7 : 1-6

H.J., should we consider these processors as affected by partial reg stall?
Comment 3 H.J. Lu 2012-04-30 13:26:32 UTC
(In reply to comment #2)
> 
> H.J., should we consider these processors as affected by partial reg stall?

We will investigate.
Comment 4 H.J. Lu 2012-05-01 16:42:32 UTC
(In reply to comment #2)
> However, Core architecture is not listed under X86_TUNE_PARTIAL_REG_STALL,
> although my documentation says that following latency should be added due to
> partial reg stall:
> 
> PPro, P2, P3  : 5
> Core          : 1-5
> Core2, Corei7 : 1-6
> 
> H.J., should we consider these processors as affected by partial reg stall?

8-bit/16-bit load ops need to save and restore the upper bits when
updating the lower 8/16 bits.  They are expensive ops on Intel
Core, Core 2 and Core i7 processors.  We will check the overall
impact of X86_TUNE_PARTIAL_REG_STALL on Core i7.
Comment 5 Andrew Pinski 2021-08-15 05:20:50 UTC
.L2:
        movl    mem(%rip), %eax
        shrq    $2, %rax
        andq    %rcx, %rax
        addq    %rax, %rsi
        subl    $1, %edx
        jne     .L2

Both versions now generate the same code because of r6-3841,
so closing as fixed.