Created attachment 58126 [details]
reproducer.c source code

I'm seeing a ~10% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04.

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

cking@skylake:~$ CFLAGS="" gcc-13 reproducer.c; ./a.out
4.92 secs duration, 2130.379 Mfp-ops/sec
cking@skylake:~$ CFLAGS="" gcc-14 reproducer.c; ./a.out
5.46 secs duration, 1921.799 Mfp-ops/sec

The original issue appeared when regression testing the stress-ng vecfp stressor [1] using the floating point vector 16 add stressor method. I've managed to extract the attached reproducer (reproducer.c) from the original code.

Salient points to focus on:

1. The issue is dependent on the OPTIMIZE3 macro in the reproducer being defined as __attribute__((optimize("-O3"))).
2. The issue is also dependent on the TARGET_CLONES macro being defined as __attribute__((target_clones("mmx,avx,default"))) - the avx target clone appears to be necessary to reproduce the problem.

Attached are the reproducer.c C source and the disassembled object code. The stress_vecfp_float_add_16.avx code from gcc-13 is significantly different from the gcc-14 code.

References:
[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-vecfp.c
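For readers without the attachment, the hot function is assumed to have roughly the following shape (a sketch only, reconstructed from the description above and from stress-ng; the attached reproducer.c is authoritative, and the LOOPS_PER_CALL value here is illustrative):

#define OPTIMIZE3      __attribute__((optimize("-O3")))
#define TARGET_CLONES  __attribute__((target_clones("mmx,avx,default")))
#define LOOPS_PER_CALL 65536   /* illustrative value, not taken from reproducer.c */

typedef struct {
        float v __attribute__((vector_size(sizeof(float) * 16)));
} stress_vecfp_float_16_t;

float OPTIMIZE3 TARGET_CLONES
stress_vecfp_float_add_16(stress_vecfp_float_16_t r, stress_vecfp_float_16_t add)
{
        int i;

        for (i = 0; i < LOOPS_PER_CALL; i++)
                r.v = r.v + add.v;

        return r.v[0];   /* keep the result live */
}

The 16-float vector is wider than the 256-bit registers available to the avx clone, so the compiler has to split it; the regression is in how that split accumulator is kept across loop iterations.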
Created attachment 58127 [details]
gcc-13 disassembly
Created attachment 58128 [details]
gcc-14 disassembly
perf report from gcc-13 of the stress_vecfp_float_add_16.avx compute loop:

 57.93 │200:   vaddps  0xc0(%rsp),%ymm3,%ymm5
 11.11 │       vaddps  0xe0(%rsp),%ymm2,%ymm6
  0.02 │       vmovaps %ymm5,0x60(%rsp)
  2.92 │       mov     0x60(%rsp),%rax
       │       mov     0x68(%rsp),%rdx
  0.37 │       vmovaps %ymm6,0x40(%rsp)
       │       vmovaps %ymm5,0x80(%rsp)
  6.30 │       vmovq   %rax,%xmm1
  4.11 │       mov     0x40(%rsp),%rax
       │       vmovdqa 0x90(%rsp),%xmm5
       │       vmovaps %ymm6,0xa0(%rsp)
  3.27 │       vpinsrq $0x1,%rdx,%xmm1,%xmm1
       │       mov     0x48(%rsp),%rdx
       │       vmovdqa 0xb0(%rsp),%xmm6
  3.22 │       vmovdqa %xmm1,0xc0(%rsp)
  0.42 │       vmovq   %rax,%xmm0
       │       vmovdqa %xmm5,0xd0(%rsp)
  6.80 │       vpinsrq $0x1,%rdx,%xmm0,%xmm0
  3.52 │       vmovdqa %xmm0,0xe0(%rsp)
       │       vmovdqa %xmm6,0xf0(%rsp)
       │       sub     $0x1,%ecx
       │     ↑ jne     200

perf report from gcc-14 of the stress_vecfp_float_add_16.avx compute loop:

 65.79 │200:   vaddps  0xc0(%rsp),%ymm3,%ymm5
  3.26 │       vaddps  0xe0(%rsp),%ymm2,%ymm6
  0.00 │       vmovaps %ymm5,0x60(%rsp)
  9.25 │       mov     0x60(%rsp),%rax
  0.00 │       mov     0x68(%rsp),%rdx
       │       vmovaps %ymm6,0x40(%rsp)
       │       vmovaps %ymm5,0x80(%rsp)
  6.49 │       vmovq   %rax,%xmm1
  0.00 │       mov     0x40(%rsp),%rax
  0.00 │       vmovaps %ymm6,0xa0(%rsp)
  3.02 │       vpinsrq $0x1,%rdx,%xmm1,%xmm1
       │       mov     0x48(%rsp),%rdx
  0.35 │       vmovdqa %xmm1,0xc0(%rsp)
  0.68 │       vmovq   %rax,%xmm0
  0.00 │       vmovdqa 0x90(%rsp),%xmm1
  5.18 │       vpinsrq $0x1,%rdx,%xmm0,%xmm0
  3.00 │       vmovdqa %xmm0,0xe0(%rsp)
       │       vmovdqa 0xb0(%rsp),%xmm0
       │       vmovdqa %xmm1,0xd0(%rsp)
       │       vmovdqa %xmm0,0xf0(%rsp)
       │       sub     $0x1,%ecx
  2.94 │     ↑ jne     200
I can't reproduce a slowdown on a Zen2 CPU. The difference seems to be merely instruction scheduling.

I do note we're not doing a good job handling

  for (i = 0; i < LOOPS_PER_CALL; i++) {
          r.v = r.v + add.v;
  }

where r.v and add.v are AVX512-sized vectors, when emulating them with AVX vectors. We end up with

  r_v_lsm.48_48 = r.v;
  _11 = add.v;

  <bb 3> [local count: 1063004408]:
  # r_v_lsm.48_50 = PHI <_12(3), r_v_lsm.48_48(2)>
  # ivtmp_56 = PHI <ivtmp_55(3), 65536(2)>
  _16 = BIT_FIELD_REF <_11, 256, 0>;
  _37 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 0>;
  _29 = _16 + _37;
  _387 = BIT_FIELD_REF <_11, 256, 256>;
  _375 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 256>;
  _363 = _387 + _375;
  _12 = {_29, _363};
  ivtmp_55 = ivtmp_56 - 1;
  if (ivtmp_55 != 0)
    goto <bb 3>; [98.99%]
  else
    goto <bb 4>; [1.01%]

  <bb 4> [local count: 10737416]:

after lowering from 512-bit to 256-bit vectors, and there's no pass that would demote the 512-bit reduction value to two 256-bit ones. There are also weird things going on in the target/on RTL.

A smaller testcase illustrating the code generation issue is

typedef float v16sf __attribute__((vector_size(sizeof(float)*16)));

void foo (v16sf * __restrict r, v16sf *a, int n)
{
  for (int i = 0; i < n; ++i)
    *r = *r + *a;
}

So confirmed for non-optimal code, but I don't see how it's a regression.
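To illustrate what such a demotion would do at the source level, here is a hand-demoted variant of the testcase above (a sketch only; the v8sf typedef, the foo_demoted name and the use of __builtin_shufflevector are mine, not anything GCC produces): the accumulator is kept as two native 256-bit halves, so the loop body is just two ymm adds with no per-iteration reconstruction of the 512-bit value.

typedef float v16sf __attribute__((vector_size(sizeof(float)*16)));
typedef float v8sf  __attribute__((vector_size(sizeof(float)*8)));

void foo_demoted (v16sf * __restrict r, v16sf *a, int n)
{
  /* Split the 512-bit accumulator and addend into two 256-bit halves
     once, before the loop.  */
  v8sf r_lo = __builtin_shufflevector(*r, *r, 0, 1, 2, 3, 4, 5, 6, 7);
  v8sf r_hi = __builtin_shufflevector(*r, *r, 8, 9, 10, 11, 12, 13, 14, 15);
  v8sf a_lo = __builtin_shufflevector(*a, *a, 0, 1, 2, 3, 4, 5, 6, 7);
  v8sf a_hi = __builtin_shufflevector(*a, *a, 8, 9, 10, 11, 12, 13, 14, 15);

  for (int i = 0; i < n; ++i)
    {
      r_lo = r_lo + a_lo;   /* each half stays in one ymm register */
      r_hi = r_hi + a_hi;
    }

  /* Recombine into the 512-bit value only after the loop.  */
  *r = __builtin_shufflevector(r_lo, r_hi,
                               0, 1, 2, 3, 4, 5, 6, 7,
                               8, 9, 10, 11, 12, 13, 14, 15);
}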
What I have found is that the regression between the GCC 13 and GCC 14 binaries reproduces on Cascade Lake and Skylake, but when I copy the same binaries to Icelake it does not. It seems Icelake may avoid this at the microarchitectural level.

I tried moving "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)", rebuilt the binary, and that recovers about half of the regression.
> I tried moving "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)",
> rebuilt the binary, and that recovers about half of the regression.

 57.93 │200:   vaddps  0xc0(%rsp),%ymm3,%ymm5
 11.11 │       vaddps  0xe0(%rsp),%ymm2,%ymm6
        ...
  3.22 │       vmovdqa %xmm1,0xc0(%rsp)
       │       vmovdqa %xmm5,0xd0(%rsp)
  3.52 │       vmovdqa %xmm0,0xe0(%rsp)
       │       vmovdqa %xmm6,0xf0(%rsp)

I guess there are specific patterns in the SKX microarchitecture for store-to-load forwarding (STLF); the main difference is the order of those xmm stores. From the compiler side, the worthwhile thing to do is PR107916.
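The generic pattern at play here, for reference, is two narrow stores later reloaded by one wide load of the same bytes: in the loop above, the xmm stores to 0xc0(%rsp)/0xd0(%rsp) feed the ymm vaddps load from 0xc0(%rsp) in the next iteration. A minimal illustration of that pattern (names and function are made up for this sketch, not taken from the reproducer):

#include <immintrin.h>

__m256 stlf_blocked_reload(__m128 lo, __m128 hi)
{
        _Alignas(32) float buf[8];

        _mm_store_ps(buf, lo);        /* 16-byte store to buf[0..3] */
        _mm_store_ps(buf + 4, hi);    /* 16-byte store to buf[4..7] */
        return _mm256_load_ps(buf);   /* 32-byte load spanning both stores:
                                         store-to-load forwarding fails and
                                         the load stalls until the stores
                                         reach the cache */
}

How much that stall costs, and how sensitive it is to the exact store order, is microarchitecture-dependent, which would fit the Skylake/Cascade Lake vs. Icelake observation above.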
Furthermore, when I build with GCC 11, the codegen is much better:

        vaddps  0xc0(%rsp),%ymm5,%ymm2
        vaddps  0xe0(%rsp),%ymm4,%ymm1
        vmovaps %ymm2,0x80(%rsp)
        vmovdqa 0x90(%rsp),%xmm6
        vmovaps %ymm1,0xa0(%rsp)
        vmovdqa 0xb0(%rsp),%xmm7
        vmovdqa %xmm2,0xc0(%rsp)
        vmovdqa %xmm6,0xd0(%rsp)
        vmovdqa %xmm1,0xe0(%rsp)
        vmovdqa %xmm7,0xf0(%rsp)
        sub     $0x1,%eax
        jne     401e00 <stress_vecfp_float_add_16.avx.1+0x1e0>

It seems we might have two separate issues behind this regression.