Bug 105504 - Fails to break dependency for vcvtss2sd xmm, xmm, mem
Summary: Fails to break dependency for vcvtss2sd xmm, xmm, mem
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 12.1.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2022-05-06 09:12 UTC by Alexander Monakov
Modified: 2023-08-12 13:50 UTC (History)
4 users (show)

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
testcase (1.74 KB, text/x-csrc)
2022-05-06 09:12 UTC, Alexander Monakov
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alexander Monakov 2022-05-06 09:12:40 UTC
Created attachment 52933 [details]
testcase

Hit by core-math team at https://gcc.gnu.org/pipermail/gcc-help/2022-May/141480.html

Compile the attached testcase with -O2 -march=haswell (other AVX-capable Intel families except Alderlake are affected too) and observe that the big basic block begins with

.L6:
        vcvtss2sd       xmm1, xmm1, DWORD PTR [rsp-4]

This creates a false dependency on the previous assignment into xmm1, resulting in wildly varying (and suboptimal) throughput figures depending on how long the CPU stalls waiting for the previous assignment to complete.

GCC has code to emit such instructions in a manner that avoids false dependencies (see e.g. PR89071), but here it doesn't seem to work.


Also there's a potentially related issue that GCC copies the initial xmm0 value to eax via stack in the beginning of the function:

cr_exp10f:
        vmovss  DWORD PTR [rsp-4], xmm0
        mov     eax, DWORD PTR [rsp-4]

This seems wrong since xmm-reg moves on Haswell are 1 cycle afaict.
Comment 1 Hongtao.liu 2022-05-07 03:02:48 UTC
Pass remove_partial_avx_dependency is before RA, which we have

(insn 128 127 129 22 (set (reg/v:DF 99 [ z ])
        (float_extend:DF (reg/v:SF 117 [ x ]))) "test.c":43:10 163 {*extendsfdf2}

and attr avx_partial_xmm_update is false, since we don't need to handle

vcvtss2sd %xmm, %xmm, %xmm

But after RA, 117 is allocate as mem which causes partial xmm dependency.
Comment 2 Hongtao.liu 2022-05-07 03:16:06 UTC
After set remove_partial_avx_dependency to true for register alternative, we get

       vxorps  %xmm3, %xmm3, %xmm3
        vmovsd  .LC16(%rip), %xmm6
        vmovsd  .LC14(%rip), %xmm5
        vcvtss2sd       %xmm0, %xmm3, %xmm0
        vmulsd  .LC12(%rip), %xmm0, %xmm1
        vroundsd        $9, %xmm1, %xmm3, %xmm2
        vcvttsd2siq     %xmm2, %rax
        vsubsd  %xmm2, %xmm1, %xmm1
        vfmadd231sd     .LC13(%rip), %xmm0, %xmm1
        vfmadd213sd     .LC17(%rip), %xmm1, %xmm6
        vmovsd  .LC18(%rip), %xmm0
        vfmadd213sd     .LC19(%rip), %xmm1, %xmm0
        vfmadd213sd     .LC15(%rip), %xmm1, %xmm5
        movq    %rax, %rdx
        sarq    $4, %rax
        vmulsd  %xmm1, %xmm1, %xmm4
        addq    $1023, %rax
        andl    $15, %edx
        salq    $52, %rax
        vmovq   %rax, %xmm7
        vmulsd  tb.1(,%rdx,8), %xmm7, %xmm2
        vfmadd132sd     %xmm4, %xmm6, %xmm0
        vmulsd  %xmm2, %xmm1, %xmm1
        vfmadd132sd     %xmm4, %xmm5, %xmm0
        vfmadd132sd     %xmm1, %xmm2, %xmm0
        vcvtsd2ss       %xmm0, %xmm3, %xmm0

Also  
>Also there's a potentially related issue that GCC copies the initial xmm0 value >to eax via stack in the beginning of the function:

This issue disappears(should still be latent after fix).
Comment 3 Hongtao.liu 2022-05-07 03:36:12 UTC
The overhead for setting attr remove_partial_avx_dependency to true for register alternative is:

For simple 
double
foo (float a)
{
        return a;
}


Now we generates:

foo:
.LFB0:
        .cfi_startproc
        vmovaps %xmm0, %xmm1
        vxorps  %xmm0, %xmm0, %xmm0
        vcvtss2sd       %xmm1, %xmm0, %xmm0
Comment 4 Hongtao.liu 2022-05-07 04:04:45 UTC
Another possible solution is add a little bit dislike for "m" alternative(like ?m) to avoid potential spill.
Comment 5 Alexander Monakov 2022-05-07 08:28:18 UTC
The strange xmm0 spill issue may affect more code, so I reported an isolated testcase: PR 105513 (regression vs. gcc-8, the complete testcase in this PR also does not spill with gcc-8).
Comment 6 GCC Commits 2022-06-08 03:24:53 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5

commit r13-1009-g5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon May 30 15:30:51 2022 +0800

    Disparages SSE_REGS alternatives sligntly with ?v instead of *v in *mov{si,di}_internal.
    
    So alternative v won't be igored in record_reg_classess.
    
    Similar for *r alternatives in some vector patterns.
    
    It helps testcase in the PR, also RA now makes better decisions for
    gcc.target/i386/extract-insert-combining.c
    
            movd    %esi, %xmm0
            movd    %edi, %xmm1
    -       movl    %esi, -12(%rsp)
            paddd   %xmm0, %xmm1
            pinsrd  $0, %esi, %xmm0
            paddd   %xmm1, %xmm0
    
    The patch has no big impact on SPEC2017 for both O2 and Ofast
    march=native run.
    
    And I noticed there's some changes in SPEC2017 from code like
    
    mov mem, %eax
    vmovd %eax, %xmm0
    ..
    mov %eax, 64(%rsp)
    
    to
    
    vmovd mem, %xmm0
    ..
    vmovd %xmm0, 64(%rsp)
    
    Which should be exactly what we want?
    
    gcc/ChangeLog:
    
            PR target/105513
            PR target/105504
            * config/i386/i386.md (*movsi_internal): Change alternative
            from *v to ?v.
            (*movdi_internal): Ditto.
            * config/i386/sse.md (vec_set<mode>_0): Change alternative *r
            to ?r.
            (*vec_extractv4sf_mem): Ditto.
            (*vec_extracthf): Ditto.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/i386/pr105513-1.c: New test.
            * gcc.target/i386/extract-insert-combining.c: Add new
            scan-assembler-not for spill.
Comment 7 Eric Gallager 2023-08-05 15:05:41 UTC
(In reply to CVS Commits from comment #6)
> The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
> 
> https://gcc.gnu.org/g:5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5
> 
> commit r13-1009-g5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5
> Author: liuhongt <hongtao.liu@intel.com>
> Date:   Mon May 30 15:30:51 2022 +0800
> 
>     Disparages SSE_REGS alternatives sligntly with ?v instead of *v in
> *mov{si,di}_internal.
>     
>     So alternative v won't be igored in record_reg_classess.
>     
>     Similar for *r alternatives in some vector patterns.
>     
>     It helps testcase in the PR, also RA now makes better decisions for
>     gcc.target/i386/extract-insert-combining.c
>     
>             movd    %esi, %xmm0
>             movd    %edi, %xmm1
>     -       movl    %esi, -12(%rsp)
>             paddd   %xmm0, %xmm1
>             pinsrd  $0, %esi, %xmm0
>             paddd   %xmm1, %xmm0
>     
>     The patch has no big impact on SPEC2017 for both O2 and Ofast
>     march=native run.
>     
>     And I noticed there's some changes in SPEC2017 from code like
>     
>     mov mem, %eax
>     vmovd %eax, %xmm0
>     ..
>     mov %eax, 64(%rsp)
>     
>     to
>     
>     vmovd mem, %xmm0
>     ..
>     vmovd %xmm0, 64(%rsp)
>     
>     Which should be exactly what we want?
>     
>     gcc/ChangeLog:
>     
>             PR target/105513
>             PR target/105504
>             * config/i386/i386.md (*movsi_internal): Change alternative
>             from *v to ?v.
>             (*movdi_internal): Ditto.
>             * config/i386/sse.md (vec_set<mode>_0): Change alternative *r
>             to ?r.
>             (*vec_extractv4sf_mem): Ditto.
>             (*vec_extracthf): Ditto.
>     
>     gcc/testsuite/ChangeLog:
>     
>             * gcc.target/i386/pr105513-1.c: New test.
>             * gcc.target/i386/extract-insert-combining.c: Add new
>             scan-assembler-not for spill.

Did this fix it?
Comment 8 Hongtao.liu 2023-08-07 01:55:26 UTC
(In reply to Eric Gallager from comment #7)
> (In reply to CVS Commits from comment #6)
> > The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
> > 
> > https://gcc.gnu.org/g:5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5
> > 
> > commit r13-1009-g5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5
> > Author: liuhongt <hongtao.liu@intel.com>
> > Date:   Mon May 30 15:30:51 2022 +0800
> > 
> >     Disparages SSE_REGS alternatives sligntly with ?v instead of *v in
> > *mov{si,di}_internal.
> >     
> >     So alternative v won't be igored in record_reg_classess.
> >     
> >     Similar for *r alternatives in some vector patterns.
> >     
> >     It helps testcase in the PR, also RA now makes better decisions for
> >     gcc.target/i386/extract-insert-combining.c
> >     
> >             movd    %esi, %xmm0
> >             movd    %edi, %xmm1
> >     -       movl    %esi, -12(%rsp)
> >             paddd   %xmm0, %xmm1
> >             pinsrd  $0, %esi, %xmm0
> >             paddd   %xmm1, %xmm0
> >     
> >     The patch has no big impact on SPEC2017 for both O2 and Ofast
> >     march=native run.
> >     
> >     And I noticed there's some changes in SPEC2017 from code like
> >     
> >     mov mem, %eax
> >     vmovd %eax, %xmm0
> >     ..
> >     mov %eax, 64(%rsp)
> >     
> >     to
> >     
> >     vmovd mem, %xmm0
> >     ..
> >     vmovd %xmm0, 64(%rsp)
> >     
> >     Which should be exactly what we want?
> >     
> >     gcc/ChangeLog:
> >     
> >             PR target/105513
> >             PR target/105504
> >             * config/i386/i386.md (*movsi_internal): Change alternative
> >             from *v to ?v.
> >             (*movdi_internal): Ditto.
> >             * config/i386/sse.md (vec_set<mode>_0): Change alternative *r
> >             to ?r.
> >             (*vec_extractv4sf_mem): Ditto.
> >             (*vec_extracthf): Ditto.
> >     
> >     gcc/testsuite/ChangeLog:
> >     
> >             * gcc.target/i386/pr105513-1.c: New test.
> >             * gcc.target/i386/extract-insert-combining.c: Add new
> >             scan-assembler-not for spill.
> 
> Did this fix it?

Yes.
Comment 9 Eric Gallager 2023-08-12 13:50:02 UTC
(In reply to Hongtao.liu from comment #8)
> (In reply to Eric Gallager from comment #7)
> > 
> > Did this fix it?
> 
> Yes.

OK, closing, then