53652 – *andn* does not always get used with simd and loading from memory

Bug 53652 - *andn* does not always get used with simd and loading from memory

Summary: *andn* does not always get used with simd and loading from memory

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization (show other bugs)
Version:	4.7.0

Importance:	P3 enhancement
Target Milestone:	12.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Duplicates (1):	56876 (view as bug list)
Depends on:
Blocks:

Reported:	2012-06-13 06:49 UTC by Jakub Jelinek
Modified:	2022-01-11 02:25 UTC (History)
CC List:	7 users (show)

See Also:
Host:
Target:	x86_64-linux
Build:
Known to work:
Known to fail:
Last reconfirmed:	2021-12-27 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jakub Jelinek 2012-06-13 06:49:06 UTC

#define N 1024
long a[N], b[N], c[N];
int d[N], e[N], f[N];

void
foo (void)
{
  int i;
  for (i = 0; i < N; i++)
    a[i] = b[i] & ~c[i];
}

void
bar (void)
{
  int i;
  for (i = 0; i < N; i++)
    d[i] = e[i] & ~f[i];
}

doesn't use *andn* insns (e.g. vandnp[sd] for -O3 -mavx).  The problem is that
combiner doesn't help here, because
(insn 42 18 33 2 (set (reg:V4DI 94)
        (mem/u/c:V4DI (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [2 S32 A256])) -1
     (expr_list:REG_EQUAL (const_vector:V4DI [
                (const_int -1 [0xffffffffffffffff])
                (const_int -1 [0xffffffffffffffff])
                (const_int -1 [0xffffffffffffffff])
                (const_int -1 [0xffffffffffffffff])
            ])
        (nil)))
is before the loop and thus in a different bb,
so the combiner doesn't substitute the all ones constant into the xor (which should fail, i?86 doesn't have a *not* SSE/AVX insn) and later on when the xor is substituted into the and (at that point it could figure that and (xor x -1) y
is andn).  Wonder if we should change the combiner somehow for the cases where REG_N_SETS == 1 pseudo has REG_EQUAL note, or if we want instead to handle this during expansion (introduce optional andnotM3 standard patterns?).

Comment 1 Hans-Peter Nilsson 2012-06-14 03:45:59 UTC

I'd humbly suggest adding a "not"-recognizer anon insn-and-split pattern with a clear comment "combine needs this as a stepping stone to combine into the andnot".
BTW, this PR seems target-specific; I don't see it for my vector back-end (using 4.7.1:ish, has "and" and "not" and "andnot" V4SI).

Comment 2 Jakub Jelinek 2012-06-14 10:23:31 UTC

Such a def_insn_and_split isn't going to work well, because the hw supported alternative (xor with all ones vector) needs the vector constant loaded into memory, which is much preferrable to be done before loop, and nothing post combine is going to move it before the loop again.
The combiner can already look at the REG_EQUAL note:
(insn 25 21 27 3 (set (reg:V4DI 90 [ vect_var_.18 ])
        (xor:V4DI (mem:V4DI (plus:DI (reg:DI 78 [ ivtmp.28 ])
                    (symbol_ref:DI ("c")  <var_decl 0x7f09fb364280 c>)) [2 MEM[symbol: c, index: ivtmp.28_16, offset: 0B]+0 S32 A256])
            (reg:V4DI 94))) v2.c:10 1587 {*xorv4di3}
     (expr_list:REG_EQUAL (not:V4DI (mem:V4DI (plus:DI (symbol_ref:DI ("c")  <var_decl 0x7f09fb364280 c>)
                    (reg:DI 78 [ ivtmp.28 ])) [2 MEM[symbol: c, index: ivtmp.28_16, offset: 0B]+0 S32 A256]))
        (nil)))

(insn 27 25 28 3 (set (reg:V4DI 93 [ vect_var_.19 ])
        (and:V4DI (reg:V4DI 90 [ vect_var_.18 ])
            (mem:V4DI (plus:DI (reg:DI 78 [ ivtmp.28 ])
                    (symbol_ref:DI ("b")  <var_decl 0x7f09fb3641e0 b>)) [2 MEM[symbol: b, index: ivtmp.28_16, offset: 0B]+0 S32 A256]))) v2.c:10 1585 {*andv4di3}
     (expr_list:REG_DEAD (reg:V4DI 90 [ vect_var_.18 ])
        (nil)))
but doesn't use that.  The additional complication here is that both the XOR (and REG_EQUAL not note) and the other AND operand are both MEMs, while andn on x86_64/i?86 only supports one of the operands as MEM.  The combiner would then need to split that into a load followed by andn (in place of the 3 insns (one load before the loop, xor and and).

Comment 3 Jakub Jelinek 2019-07-19 08:45:11 UTC

Ran into this again in context of PR91204, there is another case that isn't matched for a different reason:
int a, b, c[64];

void
foo (void)
{
  int i;
  for (i = 0; i < 64; i++)
    c[i] = ~c[i] & b;
}
In this case the loop has been unrolled and combiner even tries to match
(set (reg:V4SI 137 [ vect__4.8 ])
    (and:V4SI (not:V4SI (mem/c:V4SI (symbol_ref:DI ("c") [flags 0x2]  <var_decl 0x7f4a11107bd0 c>) [1 MEM <vector(4) int> [(int *)&c]+0 S16 A128]))
        (reg:V4SI 132)))
but doesn't match that as memory operand is not allowed in the andnot patterns (perhaps it should and we should just wait for reload to cure it up).

Comment 4 Uroš Bizjak 2019-07-19 09:06:03 UTC

(In reply to Jakub Jelinek from comment #3)
> Ran into this again in context of PR91204, there is another case that isn't
> matched for a different reason:
> int a, b, c[64];
> 
> void
> foo (void)
> {
>   int i;
>   for (i = 0; i < 64; i++)
>     c[i] = ~c[i] & b;
> }
> In this case the loop has been unrolled and combiner even tries to match
> (set (reg:V4SI 137 [ vect__4.8 ])
>     (and:V4SI (not:V4SI (mem/c:V4SI (symbol_ref:DI ("c") [flags 0x2] 
> <var_decl 0x7f4a11107bd0 c>) [1 MEM <vector(4) int> [(int *)&c]+0 S16 A128]))
>         (reg:V4SI 132)))
> but doesn't match that as memory operand is not allowed in the andnot
> patterns (perhaps it should and we should just wait for reload to cure it
> up).

It should also accept memory operand, this is the way we trick combiner in several other places.

Comment 5 Segher Boessenkool 2019-07-19 15:44:37 UTC

It might work a lot better if it didn't have to load that all-ones vector
in a separate insn.  Because it does, you need to do a 3->3 combination
(which we do not currently support) if you need to do the memory load in
a separate insn, as well the the insn needed to keep the constant load
(it isn't dead yet, later insns use that same value again)).

So that would mean having insns (that split) for doing a NOT.

Comment 6 Andrew Pinski 2021-12-27 05:47:26 UTC

*** Bug 56876 has been marked as a duplicate of this bug. ***

Comment 7 Andrew Pinski 2021-12-27 05:48:34 UTC

Simplified testcase from PR 56876:

typedef unsigned long long vec __attribute__((vector_size(16)));
vec g;
vec f1(vec a, vec b){
  return ~a&b;
}
vec f2(vec a, vec b){
  return ~g&b;
}

f2 is similar to the testcase referenced in comment #0.

Comment 8 GCC Commits 2022-01-11 02:13:46 UTC

The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:4bb79e27c02c5cd57d5781bef20e70982d898c40

commit r12-6428-g4bb79e27c02c5cd57d5781bef20e70982d898c40
Author: Haochen Jiang <haochen.jiang@intel.com>
Date:   Thu Dec 30 15:47:58 2021 +0800

    Extend predicate of operands[1] from register_operand to vector_operand for andnot insn.
    
    This can do optimization like
    
    -       pcmpeqd %xmm0, %xmm0
    -       pxor    g(%rip), %xmm0
    -       pand    %xmm1, %xmm0
    +       movdqa  g(%rip), %xmm0
    +       pandn   %xmm1, %xmm0
    
    gcc/ChangeLog:
    
            PR target/53652
            * config/i386/sse.md (*andnot<mode>3): Extend predicate of
            operands[1] from register_operand to vector_operand.
    
    gcc/testsuite/ChangeLog:
    
            PR target/53652
            * gcc.target/i386/pr53652-1.c: New test.

Comment 9 Hongtao.liu 2022-01-11 02:19:10 UTC

Fixed in GCC12.

Comment 10 Andrew Pinski 2022-01-11 02:25:43 UTC

Fixed.