#define N 1024
long a[N], b[N], c[N];
int d[N], e[N], f[N];

void
foo (void)
{
  int i;
  for (i = 0; i < N; i++)
    a[i] = b[i] & ~c[i];
}

void
bar (void)
{
  int i;
  for (i = 0; i < N; i++)
    d[i] = e[i] & ~f[i];
}

doesn't use *andn* insns (e.g. vandnp[sd] for -O3 -mavx). The problem is that the combiner doesn't help here, because

(insn 42 18 33 2 (set (reg:V4DI 94)
        (mem/u/c:V4DI (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [2 S32 A256])) -1
     (expr_list:REG_EQUAL (const_vector:V4DI [
                (const_int -1 [0xffffffffffffffff])
                (const_int -1 [0xffffffffffffffff])
                (const_int -1 [0xffffffffffffffff])
                (const_int -1 [0xffffffffffffffff])
            ])
        (nil)))

is before the loop and thus in a different bb, so the combiner substitutes the all ones constant neither into the xor (which should fail anyway, since i?86 doesn't have a *not* SSE/AVX insn) nor later on, when the xor is substituted into the and (at that point it could figure out that (and (xor x -1) y) is andn).

Wonder if we should change the combiner somehow for the cases where a REG_N_SETS == 1 pseudo has a REG_EQUAL note, or if we want instead to handle this during expansion (introduce optional andnotM3 standard patterns?).
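The identity the combiner would have to spot can be checked in plain C. This is only a sketch and not part of the report; the helper names andn_via_not, andn_via_xor, count_mismatches and check_identity are hypothetical. It writes the same operation both ways: with an explicit NOT, as in the source loop, and as an XOR with an all ones constant followed by an AND, which is the shape of the expanded code on a target without a vector *not* insn.

```c
#include <assert.h>
#include <stddef.h>

/* b & ~c written with an explicit NOT, as in the source loop.  */
long
andn_via_not (long b, long c)
{
  return b & ~c;
}

/* The same value written as XOR with an all ones constant, then AND.  */
long
andn_via_xor (long b, long c)
{
  const long all_ones = -1L;
  return b & (c ^ all_ones);
}

/* Count the input pairs on which the two forms disagree.  */
int
count_mismatches (void)
{
  static const long samples[] = { 0L, 1L, -1L, 0x55aaL, 0x7fffffffL, -12345L };
  const size_t n = sizeof samples / sizeof samples[0];
  int bad = 0;

  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      if (andn_via_not (samples[i], samples[j])
          != andn_via_xor (samples[i], samples[j]))
        bad++;
  return bad;
}

/* Abort if the two forms ever disagree on the sample inputs.  */
void
check_identity (void)
{
  assert (count_mismatches () == 0);
}
```

Calling check_identity () from a test driver is enough to exercise it. The check covers only a handful of scalar values; it illustrates the algebra, not what any pass does with the RTL.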
I'd humbly suggest adding a "not"-recognizer anon insn-and-split pattern with a clear comment "combine needs this as a stepping stone to combine into the andnot". BTW, this PR seems target-specific; I don't see it for my vector back-end (using 4.7.1:ish, has "and" and "not" and "andnot" V4SI).
Such a define_insn_and_split isn't going to work well, because the hw supported alternative (xor with an all ones vector) needs the vector constant loaded from memory, which is much preferable to do before the loop, and nothing post combine is going to move it before the loop again.

The combiner can already look at the REG_EQUAL note:

(insn 25 21 27 3 (set (reg:V4DI 90 [ vect_var_.18 ])
        (xor:V4DI (mem:V4DI (plus:DI (reg:DI 78 [ ivtmp.28 ])
                    (symbol_ref:DI ("c") <var_decl 0x7f09fb364280 c>)) [2 MEM[symbol: c, index: ivtmp.28_16, offset: 0B]+0 S32 A256])
            (reg:V4DI 94))) v2.c:10 1587 {*xorv4di3}
     (expr_list:REG_EQUAL (not:V4DI (mem:V4DI (plus:DI (symbol_ref:DI ("c") <var_decl 0x7f09fb364280 c>)
                    (reg:DI 78 [ ivtmp.28 ])) [2 MEM[symbol: c, index: ivtmp.28_16, offset: 0B]+0 S32 A256]))
        (nil)))
(insn 27 25 28 3 (set (reg:V4DI 93 [ vect_var_.19 ])
        (and:V4DI (reg:V4DI 90 [ vect_var_.18 ])
            (mem:V4DI (plus:DI (reg:DI 78 [ ivtmp.28 ])
                    (symbol_ref:DI ("b") <var_decl 0x7f09fb3641e0 b>)) [2 MEM[symbol: b, index: ivtmp.28_16, offset: 0B]+0 S32 A256]))) v2.c:10 1585 {*andv4di3}
     (expr_list:REG_DEAD (reg:V4DI 90 [ vect_var_.18 ])
        (nil)))

but doesn't use that. The additional complication here is that the XOR operand (also the operand of the REG_EQUAL not note) and the other AND operand are both MEMs, while andn on x86_64/i?86 only supports one of the operands as MEM. The combiner would then need to split that into a load followed by andn, in place of the 3 insns (one load before the loop, the xor and the and).
Ran into this again in context of PR91204, there is another case that isn't matched for a different reason:

int a, b, c[64];

void
foo (void)
{
  int i;
  for (i = 0; i < 64; i++)
    c[i] = ~c[i] & b;
}

In this case the loop has been unrolled and combiner even tries to match

(set (reg:V4SI 137 [ vect__4.8 ])
    (and:V4SI (not:V4SI (mem/c:V4SI (symbol_ref:DI ("c") [flags 0x2] <var_decl 0x7f4a11107bd0 c>) [1 MEM <vector(4) int> [(int *)&c]+0 S16 A128]))
        (reg:V4SI 132)))

but doesn't match that as memory operand is not allowed in the andnot patterns (perhaps it should and we should just wait for reload to cure it up).
(In reply to Jakub Jelinek from comment #3)
> Ran into this again in context of PR91204, there is another case that isn't
> matched for a different reason:
> int a, b, c[64];
>
> void
> foo (void)
> {
>   int i;
>   for (i = 0; i < 64; i++)
>     c[i] = ~c[i] & b;
> }
> In this case the loop has been unrolled and combiner even tries to match
> (set (reg:V4SI 137 [ vect__4.8 ])
>     (and:V4SI (not:V4SI (mem/c:V4SI (symbol_ref:DI ("c") [flags 0x2]
> <var_decl 0x7f4a11107bd0 c>) [1 MEM <vector(4) int> [(int *)&c]+0 S16 A128]))
>         (reg:V4SI 132)))
> but doesn't match that as memory operand is not allowed in the andnot
> patterns (perhaps it should and we should just wait for reload to cure it
> up).

It should also accept a memory operand; this is the way we trick the combiner in several other places.
It might work a lot better if it didn't have to load that all-ones vector in a separate insn. Because it does, you need a 3->3 combination (which we do not currently support): the memory load has to be done in a separate insn, and the insn doing the constant load has to be kept as well (it isn't dead yet, later insns use that same value again). So that would mean having insns (that split) for doing a NOT.
*** Bug 56876 has been marked as a duplicate of this bug. ***
Simplified testcase from PR 56876:

typedef unsigned long long vec __attribute__((vector_size(16)));
vec g;
vec f1(vec a, vec b){ return ~a&b; }
vec f2(vec a, vec b){ return ~g&b; }

f2 is similar to the testcase referenced in comment #0.
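To confirm that f1 and f2 compute the same thing before comparing their code generation, a small harness along these lines might help. It is a sketch only: lanes_equal and check_f1_f2 are hypothetical helpers that are not part of either PR, and it relies on the GCC vector extension already used by the testcase (also accepted by Clang).

```c
#include <assert.h>

typedef unsigned long long vec __attribute__((vector_size(16)));

vec g;

vec f1 (vec a, vec b) { return ~a & b; }
vec f2 (vec a, vec b) { (void) a; return ~g & b; }

/* Hypothetical helper: compare two vectors lane by lane.  */
int
lanes_equal (vec x, vec y)
{
  return x[0] == y[0] && x[1] == y[1];
}

/* Hypothetical check: with g set to a, f1 and f2 must agree.  */
void
check_f1_f2 (void)
{
  vec a = { 0x0fULL, ~0ULL };
  vec b = { 0xffULL, 0x1234ULL };
  vec expect = { 0xf0ULL, 0ULL };

  g = a;  /* f2 reads the complemented operand from memory.  */
  assert (lanes_equal (f1 (a, b), expect));
  assert (lanes_equal (f2 (a, b), expect));
}
```

The harness only checks results. Whether pandn is actually emitted has to be read off the assembly, for example from the output of gcc -O2 -S on an x86_64 target.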
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:4bb79e27c02c5cd57d5781bef20e70982d898c40

commit r12-6428-g4bb79e27c02c5cd57d5781bef20e70982d898c40
Author: Haochen Jiang <haochen.jiang@intel.com>
Date:   Thu Dec 30 15:47:58 2021 +0800

    Extend predicate of operands[1] from register_operand to vector_operand for andnot insn.

    This can do optimization like

    -       pcmpeqd %xmm0, %xmm0
    -       pxor    g(%rip), %xmm0
    -       pand    %xmm1, %xmm0
    +       movdqa  g(%rip), %xmm0
    +       pandn   %xmm1, %xmm0

    gcc/ChangeLog:

            PR target/53652
            * config/i386/sse.md (*andnot<mode>3): Extend predicate of
            operands[1] from register_operand to vector_operand.

    gcc/testsuite/ChangeLog:

            PR target/53652
            * gcc.target/i386/pr53652-1.c: New test.
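The two instruction sequences in the commit message can be modelled in portable C to see why they are interchangeable. This is an illustrative sketch, not code from the commit: struct xmm and every model_* function are hypothetical stand-ins for the registers and insns, using the AT&T operand order (source first, destination second). The key point is that pandn complements its destination operand, so loading g into the destination first yields ~g & b.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a 128-bit register as two 64-bit lanes.  */
struct xmm { uint64_t lane[2]; };

/* pcmpeqd %r, %r: every lane equals itself, so the result is all ones.  */
struct xmm
model_pcmpeqd_self (void)
{
  struct xmm r = { { UINT64_MAX, UINT64_MAX } };
  return r;
}

/* pxor src, dst: dst = dst ^ src.  */
struct xmm
model_pxor (struct xmm src, struct xmm dst)
{
  for (int i = 0; i < 2; i++)
    dst.lane[i] ^= src.lane[i];
  return dst;
}

/* pand src, dst: dst = dst & src.  */
struct xmm
model_pand (struct xmm src, struct xmm dst)
{
  for (int i = 0; i < 2; i++)
    dst.lane[i] &= src.lane[i];
  return dst;
}

/* pandn src, dst: dst = ~dst & src (the destination is complemented).  */
struct xmm
model_pandn (struct xmm src, struct xmm dst)
{
  for (int i = 0; i < 2; i++)
    dst.lane[i] = ~dst.lane[i] & src.lane[i];
  return dst;
}

/* Old code: pcmpeqd, pxor g, pand b.  */
struct xmm
old_sequence (struct xmm g, struct xmm b)
{
  struct xmm x0 = model_pcmpeqd_self ();
  x0 = model_pxor (g, x0);
  return model_pand (b, x0);
}

/* New code: movdqa g, pandn b.  */
struct xmm
new_sequence (struct xmm g, struct xmm b)
{
  struct xmm x0 = g;  /* movdqa g(%rip), %xmm0 */
  return model_pandn (b, x0);
}

/* Abort unless both sequences produce ~g & b on one sample input.  */
void
check_sequences (void)
{
  struct xmm g = { { 0x0f, UINT64_MAX } };
  struct xmm b = { { 0xff, 0x1234 } };
  struct xmm o = old_sequence (g, b);
  struct xmm n = new_sequence (g, b);

  assert (o.lane[0] == 0xf0 && o.lane[1] == 0);
  assert (n.lane[0] == o.lane[0] && n.lane[1] == o.lane[1]);
}
```

The model ignores everything except the bitwise result; it says nothing about latency, register pressure or how the pattern is matched in sse.md.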
Fixed in GCC12.
Fixed.