Trying to implement WHILE_ULT for AVX512 I run into optimization issues. Consider

double a[1024], b[1024];
void foo (int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[i] * 3.;
}

compiled with -O3 -march=cascadelake --param vect-partial-vector-usage=2. I get snippets like

        kxnorb  %k1, %k1, %k1
        kortestb        %k1, %k1
        je      .L11

or

        kxorb   %k1, %k1, %k1
        kxnorb  %k1, %k1, %k1

where we fail to simplify the operations (kxnor of a register with itself produces all-ones, so the following kortest/je can never take the branch, and the preceding kxor is dead). Looking at the RTL it looks like missed jump threading, but I do see the ops being

(insn 18 72 74 5 (parallel [
            (set (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86])
                (not:QI (xor:QI (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86])
                        (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86]))))
            (unspec [
                    (const_int 0 [0])
                ] UNSPEC_MASKOP)
        ]) 1912 {kxnorqi}
     (expr_list:REG_EQUAL (const_int -1 [0xffffffffffffffff])
        (nil)))

thus having an UNSPEC in them. When emitting a SET from constm1 I end up with mask<->GPR moves and if-converted code, which isn't optimal either. With -fno-if-conversion I get

.L7:
        vmovapd b(%rax), %ymm1{%k1}
        addl    $4, %ecx
        movl    %edi, %edx
        vmulpd  %ymm2, %ymm1, %ymm0
        subl    %ecx, %edx
        vmovapd %ymm0, a(%rax){%k1}
        kxnorb  %k1, %k1, %k1
        cmpl    $4, %edx
        jge     .L5
        vpbroadcastd    %edx, %xmm0
        vpcmpd  $1, %xmm0, %xmm3, %k1
.L5:
        addq    $32, %rax
        kortestb        %k1, %k1
        jne     .L7

which also doesn't have the desired short-cut from the cmpl $4, %edx.
Created attachment 53645: prototype for WHILE_ULT

I'm playing with the attached. Note it requires the third operand patch for WHILE_ULT (but it's basically approved).
For the UNSPEC part, we can create a new define_insn with the general operation that accepts both GPR and mask alternatives, just like the other logic patterns. The GPR version can be split to xor + not after reload; the mask version can be split to kxnor with the UNSPEC after reload. That should help general simplification of the xor and not.
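As a sketch of the source-level pattern this targets (my example, not the patch's testcase; it assumes the AVX512F mask intrinsics from immintrin.h):

#include <immintrin.h>

/* xor followed by not on a mask value: with a combined pattern this can
   become a single kxnorw when the value is allocated to a mask register,
   or xorl + notl when it is allocated to a GPR.  */
__mmask16
mask_xnor (__mmask16 a, __mmask16 b)
{
  return _knot_mask16 (_kxor_mask16 (a, b));
}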
The master branch has been updated by Hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:498ad738690b3c464f901d63dcd4d0f49a50dd00

commit r13-3218-g498ad738690b3c464f901d63dcd4d0f49a50dd00
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Oct 10 11:31:48 2022 +0800

    Add define_insn_and_split to support general version of "kxnor".

    For general_reg_operand, it will be split into xor + not.  For
    mask_reg_operand, it will be split with UNSPEC_MASK_OP just like
    what we did for other logic operations.  The patch will optimize
    xor + not to kxnor when possible.

    gcc/ChangeLog:

            PR target/107093
            * config/i386/i386.md (*notxor<mode>_1): New post_reload
            define_insn_and_split.
            (*notxorqi_1): Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr107093.c: New test.
change "*k, CBC" to "?k, CBC", in *mov{qi,hi,si,di}_internal. then RA works good to choose kxnor for setting constm1_rtx to mask register, and i got below with your attached patch(change #if 0 to #if 1), seems better than orginal patch. 6foo: 7.LFB0: 8 .cfi_startproc 9 testl %edi, %edi 10 jle .L9 11 kxnorb %k1, %k1, %k1 12 cmpl $4, %edi 13 jl .L11 14.L3: 15 vbroadcastsd .LC2(%rip), %ymm3 16 vmovdqa .LC0(%rip), %xmm2 17 xorl %eax, %eax 18 xorl %ecx, %ecx 19 .p2align 4,,10 20 .p2align 3 21.L7: 22 vmovapd b(%rax), %ymm0{%k1} 23 addl $4, %ecx 24 movl %edi, %edx 25 vmulpd %ymm3, %ymm0, %ymm1 26 subl %ecx, %edx 27 cmpl $4, %edx 28 vmovapd %ymm1, a(%rax){%k1} 29 vpbroadcastd %edx, %xmm1 30 movl $-1, %edx 31 vpcmpd $1, %xmm1, %xmm2, %k1 32 kmovb %k1, %esi 33 cmovge %edx, %esi 34 addq $32, %rax 35 kmovb %esi, %k1 36 kortestb %k1, %k1 37 jne .L7 38 vzeroupper 39.L9: 40 ret 41 .p2align 4,,10 42 .p2align 3 43.L11: 44 vmovdqa .LC0(%rip), %xmm2 45 vpbroadcastd %edi, %xmm1 46 vpcmpd $1, %xmm1, %xmm2, %k1 47 jmp .L3 48 .cfi_endproc
Also I think a masked epilogue (--param=vect-partial-vector-usage=1) should be good for the general case under AVX512, especially when the main loop's vector width is 512 and the remaining trip count is not enough for 256-bit vectorization but OK for 128-bit vectorization.
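For illustration, the loop from the description compiled for a masked epilogue (flags assumed from this discussion, not taken from a posted reproducer):

/* gcc -O3 -march=cascadelake -mprefer-vector-width=512 \
       --param vect-partial-vector-usage=1
   The main loop then runs unmasked at full width and only the
   remainder iterations use a masked epilogue.  */
double a[1024], b[1024];

void
foo (int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[i] * 3.;
}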
(In reply to Hongtao.liu from comment #4)
> Change "*k, CBC" to "?k, CBC" in *mov{qi,hi,si,di}_internal; then RA
> does a good job of choosing kxnor for setting constm1_rtx into a mask
> register.
> [...]
>         kmovb   %k1, %esi
>         cmovge  %edx, %esi

Not sure the round-trip to GPRs for the sake of a cmovge is worth it; I guess a branch would be better.
(In reply to Hongtao.liu from comment #5)
> Also I think a masked epilogue (--param=vect-partial-vector-usage=1)
> should be good for the general case under AVX512, especially when the
> main loop's vector width is 512 and the remaining trip count is not
> enough for 256-bit vectorization but OK for 128-bit vectorization.

Yes, for the fully masked variant I was mostly targeting -O2 with its very-cheap (size-wise) cost model. Since we don't vectorize the epilogue of a vectorized epilogue (yet), going fully masked there should indeed help. Also, once we start to use the unroll hint, the vectorized epilogue might get full-width iterations to handle as well.

One downside of a fully masked body is that we're using masked stores, which usually have higher latency due to the "merge" semantics, meaning an extra memory input plus a merge operation. I'm not sure whether modern uArchs can optimize the all-ones mask case. The vectorizer, for .MASK_STORE, still has the code to change those to emit a mask compare against all-zeros and only conditionally perform the .MASK_STORE. That could be enhanced to single out the all-ones case, at least for the .MASK_STOREs in a fully masked main loop where the mask comes only from the iteration (rather than from conditional execution).
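A minimal sketch of that guard at the source level, assuming AVX512F+VL intrinsics (illustrative only, not the vectorizer's actual output; guarded_mask_store is a hypothetical helper):

#include <immintrin.h>

/* Skip the store entirely when the mask is all-zeros (what the
   vectorizer can already emit for .MASK_STORE) and use a plain store
   when it is all-ones (the suggested enhancement), so the common
   full-vector iteration avoids merge-semantics masked stores.  */
static inline void
guarded_mask_store (double *p, __m256d val, __mmask8 m)
{
  if ((m & 0xf) == 0xf)                /* all four lanes active */
    _mm256_storeu_pd (p, val);
  else if (m != 0)                     /* at least one lane active */
    _mm256_mask_storeu_pd (p, m, val);
}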
> One downside of a fully masked body is that we're using masked stores,
> which usually have higher latency due to the "merge" semantics, meaning
> an extra memory input plus a merge operation. [...]

Also, I guess a masked store won't act as a store-forwarding source even when the load is fully contained within the masked store.
On Tue, 11 Oct 2022, crazylht at gmail dot com wrote:

> Also, I guess a masked store won't act as a store-forwarding source even
> when the load is fully contained within the masked store.

I guess the masking of the store is resolved in the load-store unit, and not by splitting the operation into load, modify, store, because that cannot easily hide exceptions. So yes, a masked store in the store buffer likely cannot act as a forwarding source (though the actual mask should be fully resolved there), since the actual merging takes place later.
Icelake is able to forward a masked store with an all-ones mask; Zen 4 isn't able to do that. Other masked stores indeed do not forward.

There's a related problem also when an outer loop causes a low-trip inner loop to use masked loads/stores on overlapping vectors:

outer iteration 1
  ... = .MASK_LOAD (p, {-1, -1, -1, -1, 0, 0, 0, 0});
  ...
  .MASK_STORE (p, val, {-1, -1, -1, -1, 0, 0, 0, 0});
outer iteration 2
  ... = .MASK_LOAD (p + delta, {-1, -1, -1, -1, 0, 0, 0, 0});
  ...
  .MASK_STORE (p + delta, val, {-1, -1, -1, -1, 0, 0, 0, 0});

with delta causing the next outer iteration to access the masked-out values from the previous iteration. That gets a STLF failure (obviously), but we now also need to wait for the masked store to retire before the masked load of iteration 2 can be carried out. We are hitting this case in SPEC CPU 2017 with masked epilogues (the inner loop just iterates 4 times, vectorized with V8DFmode vectors).

Ideally the implementation (the CPU) would "shorten" loads/stores for trailing sequences of zeros so this hazard doesn't occur. Not sure if that would be allowed by the x86 memory model, though (I didn't find anything specific there with respect to load/store masking). ISTR store buffer entries are usually assigned at instruction issue time; I'm not sure whether the mask is resolved there or whether the size of the store could be adjusted later when it is. The implementation could also somehow ignore the "conflict".

Note I didn't yet fully benchmark masked epilogues with -mprefer-vector-width=512 on Icelake or Sapphire Rapids; maybe Intel CPUs are not affected by this issue.

The original issue in the description seems solved; we now generate the following with the code generation variant that's now on trunk:

.L3:
        vmovapd b(%rax), %ymm0{%k1}
        movl    %edx, %ecx
        subl    $4, %edx
        kmovw   %edx, %k0
        vmulpd  %ymm3, %ymm0, %ymm1{%k1}{z}
        vmovapd %ymm1, a(%rax){%k1}
        vpbroadcastmw2d %k0, %xmm1
        addq    $32, %rax
        vpcmpud $6, %xmm2, %xmm1, %k1
        cmpw    $4, %cx
        ja      .L3

which avoids using the slow mask ops for loop control. It oddly does

        subl    $4, %edx
        kmovw   %edx, %k0
        vpbroadcastmw2d %k0, %xmm1

with -march=cascadelake - with -march=znver4 I get the expected

        subl    $8, %edx
        vpbroadcastw    %edx, %xmm1

but I can reproduce the mask register "spill" with -mprefer-vector-width=256. We expand to

(insn 14 13 15 (set (reg:V4SI 96)
        (vec_duplicate:V4SI (reg:SI 93 [ _27 ])))
     8167 {*avx512vl_vec_dup_gprv4si}
     (nil))

I'll file a separate bug report for this.
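A minimal C sketch of the problematic shape (hypothetical code, not the SPEC source): a four-iteration inner loop vectorized as one masked V8DF access, where the outer-loop stride of 4 doubles makes iteration j+1's masked load fall inside the 64-byte region covered by iteration j's masked store:

/* Each outer iteration updates 4 doubles; with V8DF vectors the inner
   loop becomes a masked load/store pair with mask
   {-1,-1,-1,-1,0,0,0,0}.  The next iteration's load at p + 4 overlaps
   the previous masked store's 64-byte region, so it must wait for that
   store to retire (after the STLF failure).  */
void
outer (double *p, int n)
{
  for (int j = 0; j < n; ++j, p += 4)
    for (int i = 0; i < 4; ++i)        /* low trip count -> masked */
      p[i] = p[i] * 2.0;
}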