Created attachment 27233 [details]
small C file reproducing the issue

The output below of "gcc -O2 -S gcc-bug.c" shows gcc generating pointless
mask instructions: an and with all bits set.

	.file	"gcc-bug.c"
	.text
	.p2align 4,,15
	.globl	mult_u128
	.type	mult_u128, @function
mult_u128:
.LFB1:
	.cfi_startproc
	movq	%rdi, %r8
	movq	%rsi, %rcx
	andl	$4294967295, %edi
	shrq	$32, %r8
	shrq	$32, %rcx
	andl	$4294967295, %esi
	movq	%rcx, %rax
	movq	%rsi, %rdx
	imulq	%rdi, %rcx
	imulq	%r8, %rsi
	imulq	%r8, %rax
	salq	$32, %rcx
	salq	$32, %rsi
	imulq	%rdi, %rdx
	addq	%rsi, %rcx
	adcq	$0, %rax
	addq	%rcx, %rdx
	adcq	$0, %rax
	ret
	.cfi_endproc
.LFE1:
	.size	mult_u128, .-mult_u128
	.ident	"GCC: (GNU) 4.7.1 20120425 (prerelease)"
	.section	.note.GNU-stack,"",@progbits
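The attachment itself is not inlined above, so here is a reconstruction of
the testcase sketched from the assembly (the u128/u64 names come from the
thread; the body is my guess, not the actual attachment). Note it never
adds the high halves of the cross products back into r.hi, which matches
the bare adcq $0 pattern in the assembly and the missing-line bug the
reporter owns up to later in the thread:

typedef unsigned long long u64;
typedef struct { u64 hi, lo; } u128;

u128 mult_u128(u64 a, u64 b)
{
	u64 a_lo = a & 0xffffffffULL, a_hi = a >> 32;
	u64 b_lo = b & 0xffffffffULL, b_hi = b >> 32;
	u64 cross;
	u128 r;

	r.lo = a_lo * b_lo;		/* cf. imulq %rdi, %rdx */
	r.hi = a_hi * b_hi;		/* cf. imulq %r8, %rax */

	cross = a_hi * b_lo;		/* cf. imulq %r8, %rsi */
	r.lo += cross << 32;		/* cf. salq $32 / addq */
	r.hi += r.lo < (cross << 32);	/* cf. adcq $0, %rax */

	cross = a_lo * b_hi;		/* cf. imulq %rdi, %rcx */
	r.lo += cross << 32;
	r.hi += r.lo < (cross << 32);

	return r;
}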
	movq %rdi, %r8
	movq %rsi, %rcx
	andl $4294967295, %edi

I'm not sure it's pointless - %rdi contains a 64-bit value and we want to
access the zero-extended %rdi later:

	imulq %rdi, %rcx

It could have used a move to another register, which implicitly
zero-extends, or maybe a zero-extension on the same register (if that
exists). The and is only 3 bytes in size. A mov %edi, %edi is 2 bytes,
but I'm not sure that implicitly zero-extends.
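For the record, on x86-64 a write to a 32-bit register does clear bits
63:32 of the full register, so mov %edi, %edi is a valid zero-extension.
A minimal sketch to check this, assuming GNU-style inline asm (the harness
and names are mine):

#include <stdint.h>
#include <stdio.h>

static uint64_t zext_via_mov(uint64_t x)
{
	uint64_t r;
	/* %k asks for the 32-bit name of the chosen register; the
	   32-bit write architecturally clears the upper 32 bits. */
	__asm__ ("movl %k1, %k0" : "=r" (r) : "r" (x));
	return r;
}

int main(void)
{
	printf("%#llx\n",
	       (unsigned long long) zext_via_mov(0xdeadbeefcafebabeULL));
	/* prints 0xcafebabe */
	return 0;
}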
I'll have to let Linus and Peter Anvin argue this (they're on CC); this
report is the direct result of their complaints:

"If you can *ever* get gcc to generate those andl instructions on x86,
then please file a gcc bug report: the constant is 0xffffffff and those
are plain zero-extension instructions which is much better done with mov."
https://lkml.org/lkml/2012/4/24/524

"What insane version of gcc is that? Can you please make a gcc
bug-report?"
https://lkml.org/lkml/2012/4/24/549

I'm just the sorry sod who wrote the code... :-)
For comparison, this is what clang generates:

	.file	"gcc-bug.c"
	.text
	.globl	mult_u128
	.align	16, 0x90
	.type	mult_u128,@function
mult_u128:                              # @mult_u128
	.cfi_startproc
# BB#0:
	movabsq	$-4294967296, %rdx      # imm = 0xFFFFFFFF00000000
	andq	%rdi, %rdx
	movl	%esi, %ecx
	movq	%rdi, %rax
	shrq	$32, %rax
	shrq	$32, %rsi
	imulq	%rsi, %rax
	imulq	%rcx, %rdx
	imull	%edi, %esi
	shlq	$32, %rsi
	addq	%rdx, %rsi
	adcq	$0, %rax
	movl	%edi, %edx
	imulq	%rcx, %rdx
	addq	%rsi, %rdx
	adcq	$0, %rax
	ret
.Ltmp0:
	.size	mult_u128, .Ltmp0-mult_u128
	.cfi_endproc
	.section	".note.GNU-stack","",@progbits
I'm not sure what the function computes, but for

u128 mult_u128(u64 a, u64 b)
{
	u128 t1;
	__uint128_t r = (__uint128_t)a * (__uint128_t)b;
	memcpy (&t1, &r, sizeof (u128));
	return t1;
}

we generate

mult_u128:
.LFB1:
	.cfi_startproc
	movq	%rsi, %rax
	mulq	%rdi
	movq	%rax, -56(%rsp)
	movq	%rdx, -48(%rsp)
	movq	-48(%rsp), %rdx
	ret
Yes, that's what it computes... and no, the function won't ever get used
on x86_64, but I ran it through the compiler anyway :-)

Thing is, we need u128 to work on all archs Linux supports, on all gcc
versions we support, so we can't use __int128.

Anyway, if you're interested in what we're doing and want to see the
current patches, click: https://lkml.org/lkml/2012/4/25/149
OK, rectification: that's what it _should_ compute. I just noticed
add_u128() is missing:

	a.hi += b.hi;

Anyway, that's all beside the point; the issue is that gcc shouldn't
generate those andl $0xFFFFFFFF, %reg thingies, regardless of whether
the C makes sense or not.
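For completeness, a corrected helper would look something like this (a
sketch only; the hi/lo limb layout and the missing line come from the
comment above, the carry handling is the standard unsigned-wraparound
idiom):

typedef unsigned long long u64;
typedef struct { u64 hi, lo; } u128;

static u128 add_u128(u128 a, u128 b)
{
	a.hi += b.hi;		/* the line that was missing */
	a.lo += b.lo;
	if (a.lo < b.lo)	/* unsigned wraparound == carry out */
		a.hi++;
	return a;
}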
Created attachment 27235 [details]
gcc48-pr53110.patch

Totally untested patch. We already have a splitter to handle and by
0xffffffff, 0xffff and 0xff, but it only triggers if the destination
register is different from the source, and IMHO it runs quite late
anyway. This attempts to optimize this already during expansion, perhaps
giving the register allocator more freedom (unfortunately it doesn't seem
to improve the code in this case, and the movl %edi, %edi (and the one
for %esi) stays). Ideally, if the RA swapped the two, it could use

	movl %edi, %r8d
	shrq $32, %rdi

instead of

	movq %rdi, %r8
	shrq $32, %r8
	movl %edi, %edi
Jakub's patch seems to improve the situation:

--- gcc-bug-4.7.s	2012-04-25 14:58:21.494815266 +0200
+++ gcc-bug-4.7+.s	2012-04-25 15:14:13.784243427 +0200
@@ -22,12 +22,12 @@
 	.cfi_startproc
 	movq	%rdi, %r8
 	movq	%rsi, %r9
-	andl	$4294967295, %esi
+	movl	%esi, %esi
 	shrq	$32, %r9
 	shrq	$32, %r8
 	movq	%rsi, %rdx
 	imulq	%r8, %rdx
-	andl	$4294967295, %edi
+	movl	%edi, %edi
 	movq	%r9, %rax
 	imulq	%rdi, %rax
 	imulq	%r9, %r8
Author: jakub
Date: Wed Apr 25 19:40:31 2012
New Revision: 186839

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=186839
Log:
	PR target/53110
	* config/i386/i386.md (and<mode>3): For andq $0xffffffff, reg
	instead expand it as zero extension.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/config/i386/i386.md
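In source terms, the transformation this commit performs at expansion
time corresponds to the following equivalence (a sketch of my own; the
patch itself rewrites the and<mode>3 expander in i386.md, not C):

#include <stdint.h>

/* Both functions compute the same value. After r186839 the first is
   expanded as the zero-extension the second spells out, so the backend
   can emit a movl instead of andl $0xffffffff. */
uint64_t low32_via_and(uint64_t x)  { return x & 0xffffffffULL; }
uint64_t low32_via_zext(uint64_t x) { return (uint64_t)(uint32_t)x; }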
There still seems to be a redundant copy in there, but that's pretty common in gcc-generated code; the movl %esi, %esi could get completely elided and the zero extend folded into "movq %rsi, %rdx" by changing it to movl %esi, %edx. This is a second-order issue though...
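Concretely, with the schedule in the diff above (the instructions are
taken from that diff; the folded form is what the comment proposes):

	movl	%esi, %esi	# zero-extend low half of %rsi in place
	...
	movq	%rsi, %rdx	# then copy it

could become just

	movl	%esi, %edx	# one zero-extending move does both jobs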
CCing Vlad for the RA decision.
Author: uros
Date: Sun Jul 15 08:13:47 2012
New Revision: 189491

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189491
Log:
	PR target/53961
	Backport from mainline
	2012-04-25  Jakub Jelinek  <jakub@redhat.com>

	PR target/53110
	* config/i386/i386.md (and<mode>3): For andq $0xffffffff, reg
	instead expand it as zero extension.

Modified:
	branches/gcc-4_7-branch/gcc/ChangeLog
	branches/gcc-4_7-branch/gcc/config/i386/i386.md
Author: tejohnson
Date: Wed Jul 25 20:11:13 2012
New Revision: 189866

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189866
Log:
Backport the following patches from trunk to ensure that
"andw $0xff, $reg" is always converted to a zero extend to avoid
LCP stalls on core2/corei7 (b/6615073):
r184891, r186839, r186979, r186993, r188630, r188634, r188648

r184891:
2012-03-04  Uros Bizjak  <ubizjak@gmail.com>

	* config/i386/constraints.md (Ya): New internal constraint.
	* config/i386/i386.md (zero_extendsidi2): Remove expansion.
	(*zero_extendsidi2_rex64): Add x,x alternative.
	(*zero_extendsidi2): Ditto.  Add o,0 alternative.
	Remove flags reg clobber.  Adjust corresponding splits.
	(zero_extend<mode>si2): Macroize expander from zero_extendhisi2 and
	zero_extendqisi2 expanders using SWI12 mode iterator.
	(zero_extend<mode>si2_and): Macroize insn from zero_extendhisi2_and
	and zero_extendqisi2_and.  Merge corresponding splitters.
	(*zero_extend<mode>si2): Macroize insn from *zero_extendhisi2_movzbl
	and *zero_extendqisi2_movzbl.
	(*zero_extend*2_movzbl_and): Remove insn patterns.
	(zero_extendqihi2_and): Merge corresponding splitter.
	(*zero_extendqihi2): Rename from *zero_extendqihi2_movzbl.
	(*zero_extend*2_movzbl_and): Remove insn patterns.
	(*anddi_1): Split TYPE_IMOVX instructions.
	(*andsi_1): Use Ya for alternative 2.  Split TYPE_IMOVX instructions.
	(*andhi_1): Ditto.
	(and->zext splitter): Add splitter pattern.
	(zero extend with andsi3 splitter): Adjust zero_extend pattern.

r186839:
2012-04-25  Jakub Jelinek  <jakub@redhat.com>

	PR target/53110
	* config/i386/i386.md (and<mode>3): For andq $0xffffffff, reg
	instead expand it as zero extension.

r186979:
2012-04-30  Uros Bizjak  <ubizjak@gmail.com>

	* config/i386/i386.md (and<mode>3): Expand masking operations with
	0xff, 0xffff or 0xffffffff immediates to corresponding zero_extend
	RTX.
	(and splitter): Split to DImode zero_extend RTX for DImode
	operand[0].

r186993:
2012-04-30  Uros Bizjak  <ubizjak@gmail.com>

	* config/i386/i386.md (and<mode>3): Change runtime operand mode
	checks to compile-time "mode == <MODE>mode" checks.
	(and splitter): Ditto.

r188630:
2012-06-14  Uros Bizjak  <ubizjak@gmail.com>

	* config/i386/i386.md (*zero_extendsidi2): Remove x,x alternative.
	(*zero_extendsidi2_rex64): Ditto.  Remove isa attribute.

r188634:
2012-06-14  Uros Bizjak  <ubizjak@gmail.com>

	Fix my previous commit to:
	* config/i386/i386.md (*zero_extendsidi2): Remove x,x alternative.
	(*zero_extendsidi2_rex64): Ditto.  Remove isa attribute.

r188648:
2012-06-14  Uros Bizjak  <ubizjak@gmail.com>

	(*zero_extendsidi2_rex64): Remove isa attribute.

Modified:
	branches/google/gcc-4_7/gcc/ChangeLog.google-4_7
	branches/google/gcc-4_7/gcc/config/i386/constraints.md
	branches/google/gcc-4_7/gcc/config/i386/i386.md
The trunk no longer generates masking with andl for this testcase. I didn't try to track it down any further than that.