[hjl@gnu-tools-1 bitwise-1]$ cat z.i void foo (long long ixi) { if (ixi != 14348907) __builtin_abort (); } [hjl@gnu-tools-1 bitwise-1]$ make z.s z1.s /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2 -m32 -S -o z.s z.i /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2 -m32 -mno-stv -S -o z1.s z.i [hjl@gnu-tools-1 bitwise-1]$ cat z.s .file "z.i" .text .p2align 4,,15 .globl foo .type foo, @function foo: .LFB0: .cfi_startproc subl $12, %esp .cfi_def_cfa_offset 16 movl 20(%esp), %edx movl 16(%esp), %eax xorb $0, %dh xorl $14348907, %eax movl %edx, %ecx orl %eax, %ecx jne .L5 addl $12, %esp .cfi_remember_state .cfi_def_cfa_offset 4 ret .L5: .cfi_restore_state call abort .cfi_endproc .LFE0: .size foo, .-foo .ident "GCC: (GNU) 6.0.0 20160318 (experimental)" .section .note.GNU-stack,"",@progbits [hjl@gnu-tools-1 bitwise-1]$ cat z1.s .file "z.i" .text .p2align 4,,15 .globl foo .type foo, @function foo: .LFB0: .cfi_startproc subl $12, %esp .cfi_def_cfa_offset 16 movl 16(%esp), %eax xorl $14348907, %eax orl 20(%esp), %eax jne .L5 addl $12, %esp .cfi_remember_state .cfi_def_cfa_offset 4 ret .L5: .cfi_restore_state call abort .cfi_endproc .LFE0: .size foo, .-foo .ident "GCC: (GNU) 6.0.0 20160318 (experimental)" .section .note.GNU-stack,"",@progbits [hjl@gnu-tools-1 bitwise-1]$ STV generates: movl 20(%esp), %edx movl 16(%esp), %eax xorb $0, %dh xorl $14348907, %eax movl %edx, %ecx orl %eax, %ecx vs movl 16(%esp), %eax xorl $14348907, %eax orl 20(%esp), %eax
"and" is also less optimized: [hjl@gnu-tools-1 bitwise-1]$ cat and.i extern long long x; void foo (long long ixi) { x = ixi & 14348907; } [hjl@gnu-tools-1 bitwise-1]$ make and.s /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -S -o and.s and.i [hjl@gnu-tools-1 bitwise-1]$ make and1.s /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -mno-stv -S -o and1.s and.i [hjl@gnu-tools-1 bitwise-1]$ cat and.s .file "and.i" .text .p2align 4,,15 .globl foo .type foo, @function foo: movl 4(%esp), %eax movl 8(%esp), %edx andl $14348907, %eax andl $0, %edx movl %eax, x movl %edx, x+4 ret .size foo, .-foo .ident "GCC: (GNU) 6.0.0 20160318 (experimental)" .section .note.GNU-stack,"",@progbits [hjl@gnu-tools-1 bitwise-1]$ cat and1.s .file "and.i" .text .p2align 4,,15 .globl foo .type foo, @function foo: movl 4(%esp), %eax movl $0, x+4 andl $14348907, %eax movl %eax, x ret .size foo, .-foo .ident "GCC: (GNU) 6.0.0 20160318 (experimental)" .section .note.GNU-stack,"",@progbits [hjl@gnu-tools-1 bitwise-1]$
or and xor have the same issue: [hjl@gnu-tools-1 bitwise-1]$ cat or.i extern long long x; void foo (long long ixi) { x = ixi | 14348907; } [hjl@gnu-tools-1 bitwise-1]$ make or.s /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -S -o or.s or.i [hjl@gnu-tools-1 bitwise-1]$ make or1.s /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -mno-stv -S -o or1.s or.i [hjl@gnu-tools-1 bitwise-1]$ cat or.s .file "or.i" .text .p2align 4,,15 .globl foo .type foo, @function foo: movl 4(%esp), %eax movl 8(%esp), %edx orl $14348907, %eax orb $0, %dh movl %eax, x movl %edx, x+4 ret .size foo, .-foo .ident "GCC: (GNU) 6.0.0 20160318 (experimental)" .section .note.GNU-stack,"",@progbits [hjl@gnu-tools-1 bitwise-1]$ cat or1.s .file "or.i" .text .p2align 4,,15 .globl foo .type foo, @function foo: movl 4(%esp), %eax orl $14348907, %eax movl %eax, x movl 8(%esp), %eax movl %eax, x+4 ret .size foo, .-foo .ident "GCC: (GNU) 6.0.0 20160318 (experimental)" .section .note.GNU-stack,"",@progbits [hjl@gnu-tools-1 bitwise-1]$
Should STV split DImode early if STV is known to not profitable?
Created attachment 38057 [details] gcc6-pr70321.patch I'm afraid for GCC 6.x we can't do much more than attached patch, which improves the generated code, but there is no combiner post RA that would improve it even more. For GCC 6.x, I think it is out of question to move STV pass somewhere else. For stage1, I wonder if it can't move earlier, say before the combiner. If it could, then we could split the TARGET_STV patterns that weren't changed into vector insns, and let the combiner deal with that. Is there a reason why it is done after the combiner now?
(In reply to Jakub Jelinek from comment #4) > For stage1, I wonder if it can't move earlier, say before the combiner. If > it could, then we could split the TARGET_STV patterns that weren't changed > into vector insns, and let the combiner deal with that. Is there a reason > why it is done after the combiner now? Yes, STV searches for ANDN pattern and also for DI comparison with zero executed via OR. Those are produced by combiner. I believe some STV tests would fail if we move it before combine.
(In reply to Ilya Enkovich from comment #5) > (In reply to Jakub Jelinek from comment #4) > > For stage1, I wonder if it can't move earlier, say before the combiner. If > > it could, then we could split the TARGET_STV patterns that weren't changed > > into vector insns, and let the combiner deal with that. Is there a reason > > why it is done after the combiner now? > > Yes, STV searches for ANDN pattern and also for DI comparison with zero > executed via OR. Those are produced by combiner. I believe some STV tests > would fail if we move it before combine. Why couldn't STV just "vectorize" AND and NOT patterns and let the combiner combine that in the vectorized code?
(In reply to Jakub Jelinek from comment #6) > Why couldn't STV just "vectorize" AND and NOT patterns and let the combiner > combine that in the vectorized code? I think the only thing we miss for that is corresponding NOT pattern in DI mode. For comparison we have {r94:SI=r89:DI#4|r89:DI#0;clobber flags:CC;} flags:CCZ=cmp(r94:SI,0) combined into {flags:CCZ=cmp(r89:DI#4|r89:DI#0,0);clobber scratch;} which is converted into PTEST by STV.
(In reply to Ilya Enkovich from comment #7) > (In reply to Jakub Jelinek from comment #6) > > Why couldn't STV just "vectorize" AND and NOT patterns and let the combiner > > combine that in the vectorized code? > > I think the only thing we miss for that is corresponding NOT pattern in DI > mode. It is PR 70322.
Author: jakub Date: Wed Mar 23 09:49:12 2016 New Revision: 234416 URL: https://gcc.gnu.org/viewcvs?rev=234416&root=gcc&view=rev Log: PR target/70321 * config/i386/i386.md (*anddi3_doubleword, *<code>di3_doubleword): Optimize TARGET_STV splitters, if high or low word of last argument is 0 or -1. Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/i386.md
Are the changes I've made sufficient enough that we can postpone the rest of this for stage1?
(In reply to Jakub Jelinek from comment #10) > Are the changes I've made sufficient enough that we can postpone the rest of > this for stage1? Now we got [hjl@gnu-6 pr70321]$ cat z.i void foo (long long ixi) { if (ixi != 14348907) __builtin_abort (); } [hjl@gnu-6 pr70321]$ make z.s /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -S -o z.s z.i [hjl@gnu-6 pr70321]$ make z1.s /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -mno-stv -S -o z1.s z.i [hjl@gnu-6 pr70321]$ cat z.s .file "z.i" .text .p2align 4,,15 .globl foo .type foo, @function foo: subl $12, %esp movl 16(%esp), %eax movl 20(%esp), %edx xorl $14348907, %eax movl %edx, %ecx orl %eax, %ecx jne .L5 addl $12, %esp ret .L5: call abort .size foo, .-foo .ident "GCC: (GNU) 6.0.0 20160323 (experimental)" .section .note.GNU-stack,"",@progbits [hjl@gnu-6 pr70321]$ cat z1.s .file "z.i" .text .p2align 4,,15 .globl foo .type foo, @function foo: subl $12, %esp movl 16(%esp), %eax xorl $14348907, %eax orl 20(%esp), %eax jne .L5 addl $12, %esp ret .L5: call abort .size foo, .-foo .ident "GCC: (GNU) 6.0.0 20160323 (experimental)" .section .note.GNU-stack,"",@progbits It is an improvement. But we should fix it properly for GCC 7.
Ok, deferring the rest for GCC 7.
So shall we still consider moving 32-bit stv pass before the combiner instead of after it? The argument about andn is no longer relevant, we have a one_cmpldi3_doubleword pattern now.
(In reply to Jakub Jelinek from comment #13) > So shall we still consider moving 32-bit stv pass before the combiner > instead of after it? The argument about andn is no longer relevant, we have > a one_cmpldi3_doubleword pattern now. Previously moving the pass caused regressions. If no regressions appear now then we should do it. It might enable some additional optimizations.
I think it is too late to change this for GCC7, we shall try to do that in early stage1.
We didn't do that in stage1, stage4 is too risky for that. Can somebody from Intel try that for GCC9 stage1? I'm willing to help, but will have lots of work on OpenMP 5.0, so can't drive that.
(In reply to Jakub Jelinek from comment #16) > We didn't do that in stage1, stage4 is too risky for that. Can somebody > from Intel try that for GCC9 stage1? > I'm willing to help, but will have lots of work on OpenMP 5.0, so can't > drive that. We will look into it. Thanks.
(In reply to H.J. Lu from comment #17) > (In reply to Jakub Jelinek from comment #16) > > We didn't do that in stage1, stage4 is too risky for that. Can somebody > > from Intel try that for GCC9 stage1? > > I'm willing to help, but will have lots of work on OpenMP 5.0, so can't > > drive that. > > We will look into it. Thanks. It's gcc9 stage1 now...
GCC 9.1 has been released.
GCC 9.2 has been released.
GCC 9.3.0 has been released, adjusting target milestone.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Patch proposed at https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593174.html (waiting for GCC 13 stage 1).
GCC 9 branch is being closed
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>: https://gcc.gnu.org/g:43201f2c2173894bf7c423cad6da1c21567e06c0 commit r13-855-g43201f2c2173894bf7c423cad6da1c21567e06c0 Author: Roger Sayle <roger@nextmovesoftware.com> Date: Mon May 30 21:20:09 2022 +0100 PR target/70321: Split double word equality/inequality after STV on x86. This patch resolves the last piece of PR target/70321 a code quality (P2 regression) affecting mainline. Currently, for HJ's testcase: void foo (long long ixi) { if (ixi != 14348907) __builtin_abort (); } GCC with -m32 -O2 generates four instructions for the comparison: movl 16(%esp), %eax movl 20(%esp), %edx xorl $14348907, %eax orl %eax, %edx but with this patch it now requires only three, making better use of x86's addressing modes: movl 16(%esp), %eax xorl $14348907, %eax orl 20(%esp), %eax The solution is to expand "doubleword" equality/inequality expressions using flag setting COMPARE instructions for the early RTL passes, and then split them during split1, after STV and before reload. Hence on x86_64, we now see/allow things like: (insn 11 8 12 2 (set (reg:CCZ 17 flags) (compare:CCZ (reg/v:TI 84 [ x ]) (reg:TI 96))) "cmpti.c":2:43 30 {*cmpti_doubleword} This allows the STV pass to decide whether it's preferrable to perform this comparison using vector operations, i.e. a pxor/ptest sequence, or as scalar integer operations, i.e. a xor/xor/or sequence. Alas this required tweaking of the STV pass to recognize the "new" form of these comparisons and split out the pxor operation itself. To confirm this still works as expected I've added a new STV test case: long long a[1024]; long long b[1024]; int foo() { for (int i=0; i<1024; i++) { long long t = (a[i]<<8) | (b[i]<<24); if (t == 0) return 1; } return 0; } where with -m32 -O2 -msse4.1 the above comparison with zero should look like: punpcklqdq %xmm0, %xmm0 ptest %xmm0, %xmm0 Although this patch includes one or two minor tweaks to provide all the necessary infrastructure to support conversion of TImode comparisons to V1TImode (and SImode comparisons to V4SImode), STV doesn't yet implement these transformations, but this is something that can be considered after stage 4. Indeed the new convert_compare functionality is split out into a method to simplify its potential reuse by the timode_scalar_chain class. 2022-05-30 Roger Sayle <roger@nextmovesoftware.com> gcc/ChangeLog PR target/70321 * config/i386/i386-expand.cc (ix86_expand_branch): Don't decompose DI mode equality/inequality using XOR here. Instead generate a COMPARE for doubleword modes (DImode on !TARGET_64BIT or TImode). * config/i386/i386-features.cc (gen_gpr_to_xmm_move_src): Use gen_rtx_SUBREG when NUNITS is 1, i.e. for TImode to V1TImode. (general_scalar_chain::convert_compare): New function to convert scalar equality/inequality comparison into vector operations. (general_scalar_chain::convert_insn) [COMPARE]: Refactor. Call new convert_compare helper method. (convertible_comparion_p): Update to match doubleword COMPARE of two register, memory or integer constant operands. * config/i386/i386-features.h (general_scalar_chain::convert_compare): Prototype/declare member function here. * config/i386/i386.md (cstore<mode>4): Change mode to SDWIM, but only allow new doubleword modes for EQ and NE operators. (*cmp<dwi>_doubleword): New define_insn_and_split, to split a doubleword comparison into a pair of XORs followed by an IOR to set the (zero) flags register, optimizing the XORs if possible. * config/i386/sse.md (V_AVX): Include V1TI and V2TI in mode iterator; V_AVX is (currently) only used by ptest. (sse4_1 mode attribute): Update to support V1TI and V2TI. gcc/testsuite/ChangeLog PR target/70321 * gcc.target/i386/pr70321.c: New test case. * gcc.target/i386/sse4_1-stv-1.c: New test case.
This should now be fixed on mainline.
The master branch has been updated by Rainer Orth <ro@gcc.gnu.org>: https://gcc.gnu.org/g:ee7011eab67d8272f88e954bbab2e42bf6353441 commit r14-8690-gee7011eab67d8272f88e954bbab2e42bf6353441 Author: Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE> Date: Thu Feb 1 12:48:57 2024 +0100 testsuite: i386: Fix gcc.target/i386/pr70321.c on 32-bit Solaris/x86 gcc.target/i386/pr70321.c FAILs on 32-bit Solaris/x86 since its introduction in commit 43201f2c2173894bf7c423cad6da1c21567e06c0 Author: Roger Sayle <roger@nextmovesoftware.com> Date: Mon May 30 21:20:09 2022 +0100 PR target/70321: Split double word equality/inequality after STV on x86. FAIL: gcc.target/i386/pr70321.c scan-assembler-times mov 1 The failure happens because 32-bit Solaris/x86 defaults to -fno-omit-frame-pointer. Fixed by specifying -fomit-frame-pointer explicitly. Tested on i386-pc-solaris2.11 and i686-pc-linux-gnu. 2024-01-23 Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE> gcc/testsuite: * gcc.target/i386/pr70321.c: Add -fomit-frame-pointer to dg-options.