Check this: https://www.phoronix.com/review/gcc13-clang16-raptorlake/3
Created attachment 55052 [details] Graphs
This bug report and the other ones are useless, really. Please read https://gcc.gnu.org/bugs/ and file a proper bug report.
According to the latest Phoronix test, which can be easily downloaded, run and reproduced, GCC 13.1 loses to Clang by a wide margin; in certain workloads it is ~30% (!) slower. I just wanted to alert its developers to a widening performance gap versus Clang. I'm not a developer; I'm simply no one. My previous bug reports for performance regressions and deficiencies weren't met with such ... words, so I'm sorry, I'm not in the mood to prove anything. I'll just go ahead and close this as useless, annoying and maybe even outright invalid.
Thanks for reporting this. Unfortunately, a single report like this cannot help us. Would you mind filing a bug with a simple piece of code that reproduces the issue and that matters for the benchmark? Besides, I have read this report; I think this may be an x86 backend issue. We (downstream RISC-V GCC) have tested various workloads, and it turns out GCC is better than Clang in traditional CPU benchmarks, while Clang is much better than GCC in AI program benchmarks (for example MLPerf). Starting with the benchmark you mentioned (GraphicsMagick), could you post the most important piece of code belonging to this benchmark? Thanks.
All of the benchmarks in that report are from https://github.com/phoronix-test-suite/phoronix-test-suite.

For GraphicsMagick, the relevant benchmark seems to be
https://github.com/phoronix-test-suite/phoronix-test-suite/blob/dea5e68ba7bc0eaa3646713a8e07100ffab929b5/ob-cache/test-profiles/pts/graphics-magick-1.6.1/test-definition.xml
(it might be a different version of the test, but note that '1.6.1' does NOT equal the GraphicsMagick version) with a script at
https://github.com/phoronix-test-suite/phoronix-test-suite/blob/dea5e68ba7bc0eaa3646713a8e07100ffab929b5/ob-cache/test-profiles/pts/graphics-magick-1.6.1/install.sh#L25.

I think it runs individual commands of the form OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png $@ null, so:

* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png -colorspace HWB null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png -blur 0x1.0 null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png -lat 10x10-5% null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png -resize 50% HWB null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png -sharpen 0x1.0 HWB null

with GraphicsMagick (gm) built with -fopenmp -O3 -march=native -flto -ltiff -lfreetype -ljpeg -lXext -lSM -lICE -lX11 -lbz2 -lz -lzstd -lpthread.

But I can't actually find the test image DSC_6782.png, so... I think we really need more information here before it's actionable. Perhaps the reporter could reach out to Michael Larabel and ask him to comment here.
I installed the Phoronix test suite and uploaded the sample data it uses to http://www.ucw.cz/~hubicka/sample-photo-6000x4000-1.zip. I doubt the input images make much difference, especially for resizing.
On zen3 hardware I get:

GCC:
GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [17:00 UTC]
        Started Run 1 @ 16:57:17
        Started Run 2 @ 16:58:22
        Started Run 3 @ 16:59:26

    Operation: Resizing:
        1390
        1386
        1383

    Average: 1386 Iterations Per Minute
    Deviation: 0.25%

clang16:
GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [16:54 UTC]
        Started Run 1 @ 16:51:48
        Started Run 2 @ 16:52:52
        Started Run 3 @ 16:53:56

    Operation: Resizing:
        180
        180
        180

    Average: 180 Iterations Per Minute
    Deviation: 0.00%

GCC profile:
    52.07%  VerticalFilter._omp_fn.0
    24.59%  HorizontalFilter._omp_fn.0
    11.78%  ReadCachePixels.isra.0

The clang build does not seem to have OpenMP in it, so to get comparable runs I added OMP_THREAD_LIMIT=1 to the GCC run. With this I get:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [17:17 UTC]
        Started Run 1 @ 17:14:14
        Started Run 2 @ 17:15:18
        Started Run 3 @ 17:16:22

    Operation: Resizing:
        184
        186
        186

    Average: 185 Iterations Per Minute
    Deviation: 0.62%

so the GCC build is still a bit faster.

The internal loop of VerticalFilter is:

  0.00  4a0:  mov    0x8(%rdx),%rax
  1.33        vmovsd (%rdx),%xmm1
  1.58        add    $0x10,%rdx
  0.00        sub    %r13,%rax
  4.77        imul   %r11,%rax
  1.01        add    %rcx,%rax
  0.04        movzbl 0x2(%r15,%rax,4),%r10d
  8.38        vcvtsi2sd %r10d,%xmm2,%xmm0
  2.44        movzbl 0x1(%r15,%rax,4),%r10d
  1.55        movzbl (%r15,%rax,4),%eax
  0.00        vfmadd231sd %xmm0,%xmm1,%xmm4
 13.91        vcvtsi2sd %r10d,%xmm2,%xmm0
  1.86        vfmadd231sd %xmm0,%xmm1,%xmm5
 13.00        vcvtsi2sd %eax,%xmm2,%xmm0
  2.02        vfmadd231sd %xmm0,%xmm1,%xmm3
 12.54        cmp    %rdx,%rdi
  0.00        jne    4a0

HorizontalFilter:

  0.01  520:  mov    0x8(%r8),%rdx
  0.96        vmovsd (%r8),%xmm1
  1.93        add    $0x10,%r8
  0.50        sub    %r15,%rdx
  4.02        add    %r11,%rdx
  2.26        movzbl 0x2(%r14,%rdx,4),%ebx
  0.09        vcvtsi2sd %ebx,%xmm2,%xmm0
 10.10        movzbl 0x1(%r14,%rdx,4),%ebx
  0.92        movzbl (%r14,%rdx,4),%edx
  1.84        vfmadd231sd %xmm0,%xmm1,%xmm4
  6.82        vcvtsi2sd %ebx,%xmm2,%xmm0
 11.15        vfmadd231sd %xmm0,%xmm1,%xmm3
 13.81        vcvtsi2sd %edx,%xmm2,%xmm0
  6.16        vfmadd231sd %xmm0,%xmm1,%xmm5
  8.61        cmp    %rsi,%r8
  1.56        jne    520

ReadCachePixels:

        2e0:  mov    (%rbx,%rax,4),%edx
 83.03        mov    %edx,(%r12,%rax,4)
 12.34        inc    %rax
  0.02        cmp    %rsi,%rax

With Clang I get:

    49.08%  VerticalFilter
    24.66%  HorizontalFilter
    18.41%  ReadCachePixels
     6.75%  SyncCacheViewPixels

  0.00  1c50: mov    (%rdx,%rsi,1),%r9
  0.09        vmovddup -0x8(%rdx,%rsi,1),%xmm3
  0.00        add    $0x10,%rsi
  0.75        sub    %rdi,%r9
  0.00        imul   %rcx,%r9
  1.07        add    %r11,%r9
  0.81        movzbl 0x2(%r14,%r9,4),%r10d
  3.73        movzwl (%r14,%r9,4),%r9d
  0.00        vcvtsi2sd %r10d,%xmm14,%xmm2
  0.11        vfmadd231sd %xmm2,%xmm3,%xmm1
  2.57        vmovd  %r9d,%xmm2
  0.00        vpmovzxbd %xmm2,%xmm2
  0.95        vcvtdq2pd %xmm2,%xmm2
  0.74        vfmadd231pd %xmm2,%xmm3,%xmm0
 11.46        cmp    %rsi,%r8

        1b50: mov    (%r10,%rdi,1),%rcx
  0.76        vmovddup -0x8(%r10,%rdi,1),%xmm3
  0.00        add    $0x10,%rdi
  0.05        sub    %r8,%rcx
  0.30        add    %rsi,%rcx
  0.27        movzbl 0x2(%r14,%rcx,4),%ebp
  0.28        movzwl (%r14,%rcx,4),%ecx
  4.51        vcvtsi2sd %ebp,%xmm13,%xmm2
  0.75        vfmadd231sd %xmm2,%xmm3,%xmm1
  0.99        vmovd  %ecx,%xmm2
  0.00        vpmovzxbd %xmm2,%xmm2
  0.29        vcvtdq2pd %xmm2,%xmm2
  0.27        vfmadd231pd %xmm2,%xmm3,%xmm0
 12.37        cmp    %rdi,%r9
  0.16        jne    1b50
  0.01        test   %r10,%r10
  0.01        jle    28b4
              lea    0x0(,%r15,4),%rcx
  0.01        mov    0xd8(%rsp),%r10
  0.00        lea    (%rcx,%r8,4),%rcx
  0.01        lea    (%rcx,%rbp,4),%rcx
  0.01        lea    (%rcx,%rdi,4),%rcx
  0.01        lea    (%rcx,%rax,4),%rcx
  0.02        lea    (%rcx,%rdx,4),%rcx
  0.01        lea    (%rcx,%r10,4),%rcx
  0.02        mov    0xc8(%rsp),%r10
  0.01        lea    (%rcx,%r9,4),%rcx
  0.01        lea    (%rcx,%r13,4),%rcx
  0.01        lea    (%rcx,%r11,4),%rcx
  0.01        lea    (%rcx,%r12,4),%rcx
  0.01        lea    (%rcx,%r10,4),%rcx
  0.02        mov    0xb8(%rsp),%r10
  0.01        lea    (%rcx,%r10,4),%rcx
  0.03        mov    0xb0(%rsp),%r10
  0.01        lea    (%rcx,%r10,4),%rcx
  0.01        mov    0xa8(%rsp),%r10
  0.00        lea    (%rcx,%r10,4),%rcx
  0.02        mov    0x98(%rsp),%r10
  0.00        lea    (%rcx,%rsi,4),%rcx
  0.03        lea    (%rcx,%r10,4),%rcx
  0.02        mov    0x88(%rsp),%r10
  0.01        lea    (%rcx,%r10,4),%rcx
  0.02        mov    0xa0(%rsp),%r10
  0.00        lea    (%rcx,%r10,4),%rcx
  0.01        mov    0x90(%rsp),%r10
  0.00        lea    (%rcx,%r10,4),%rcx
  0.02        mov    0x58(%rsp),%r10
  0.02        lea    (%rcx,%r10,4),%rcx
  0.03        mov    0x50(%rsp),%r10
  0.00        lea    (%rcx,%r10,4),%rcx
  0.02        mov    0x48(%rsp),%r10
  0.01        lea    (%rcx,%r10,4),%rcx
  0.02        mov    0x40(%rsp),%r10
  0.00        lea    (%rcx,%r10,4),%rcx
  0.03        mov    0x38(%rsp),%r10
  0.02        lea    (%rcx,%r10,4),%rcx
  0.03        mov    0x60(%rsp),%r10
  0.01        lea    (%rcx,%r10,4),%rcx
  0.02        mov    0x68(%rsp),%r10
  0.00        lea    (%rcx,%r10,4),%rcx
  0.03        mov    0x70(%rsp),%r10
  0.00        lea    (%rcx,%r10,4),%rcx
  0.02        mov    0x78(%rsp),%r10
  0.03        lea    (%rcx,%r10,4),%rcx
  0.03        mov    0x80(%rsp),%r10
  0.01        lea    (%rcx,%r10,4),%rcx
  0.03        add    0x28(%rsp),%rcx
  0.03        mov    %rcx,0xf0(%rsp)
  0.00        xor    %ecx,%ecx
  0.00        xor    %ecx,%ecx
        2584: mov    0xf0(%rsp),%r10
  0.01        mov    (%r10,%rcx,4),%r10d
  3.58        inc    %rcx
  0.03        mov    %r10d,(%r14)
  0.02        mov    0x30(%rsp),%r10
  0.01        add    $0x4,%r14
  0.01        mov    (%r10),%r10
  0.06        cmp    %r10,%rcx
  0.05        jl     2584

So I suppose the filter loops are vectorized while the memcpy is unrolled (in a very odd way). I guess the vectorization does not help on zen3 but may help on Raptor Lake.
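To make the profiles above easier to follow, here is a rough, paraphrased C sketch of what the filter inner loop does (the type and function names here, e.g. filter_point and struct contrib, are my own simplifications, not the literal resize.c source; the GIMPLE quoted in the following comments shows the real statements). Each output channel accumulates weight * 8-bit channel through an indirection table, which is the movzbl + vcvtsi2sd + vfmadd231sd pattern in both compilers' code above:

  struct pixel { unsigned char red, green, blue, opacity; };
  struct contrib { double weight; long pixel; };

  static void filter_point(const struct pixel *src, const struct contrib *c,
                           long n, double out[3])
  {
    double r = 0.0, g = 0.0, b = 0.0;
    for (long i = 0; i < n; i++)
      {
        const struct pixel *p = src + c[i].pixel;   /* indirect pixel load */
        r += c[i].weight * p->red;                  /* byte -> double, then FMA */
        g += c[i].weight * p->green;
        b += c[i].weight * p->blue;
      }
    out[0] = r;
    out[1] = g;
    out[2] = b;
  }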
Created attachment 55178 [details] Preprocessed source of VerticalFilter and HorizontalFilter
Oddly enough, a simplified version of the loop SLP vectorizes for me:

struct rgb {unsigned char r,g,b;} *rgbs;
int *addr;
double *weights;
struct drgb {double r,g,b;};

struct drgb sum()
{
  struct drgb r;
  for (int i = 0; i < 100000; i++)
    {
      int j = addr[i];
      double w = weights[i];
      r.r += rgbs[j].r * w;
      r.g += rgbs[j].g * w;
      r.b += rgbs[j].b * w;
    }
  return r;
}

I get:

.L2:
        movslq  (%r9,%rdx,4), %rax
        vmovsd  (%r8,%rdx,8), %xmm1
        incq    %rdx
        leaq    (%rax,%rax,2), %rax
        addq    %rsi, %rax
        movzbl  (%rax), %ecx
        vmovddup        %xmm1, %xmm4
        vmovd   %ecx, %xmm0
        movzbl  1(%rax), %ecx
        movzbl  2(%rax), %eax
        vpinsrd $1, %ecx, %xmm0, %xmm0
        vcvtdq2pd       %xmm0, %xmm0
        vfmadd231pd     %xmm4, %xmm0, %xmm2
        vcvtsi2sdl      %eax, %xmm5, %xmm0
        vfmadd231sd     %xmm1, %xmm0, %xmm3
        cmpq    $100000, %rdx
        jne     .L2

I think the actual loop is:

  <bb 53> [local count: 44202554]:
  _106 = _262->pixel;
  _109 = *source_231(D).columns;

  <bb 54> [local count: 401841405]:
  # pixel$green_332 = PHI <_124(89), pixel$green_265(53)>
  # i_357 = PHI <i_298(89), 0(53)>
  # pixel$red_371 = PHI <_119(89), pixel$red_263(53)>
  # pixel$blue_377 = PHI <_129(89), pixel$blue_267(53)>
  i.51_102 = (long unsigned int) i_357;
  _103 = i.51_102 * 16;
  _104 = _262 + _103;
  _105 = _104->pixel;
  _107 = _105 - _106;
  _108 = (long unsigned int) _107;
  _110 = _108 * _109;
  _112 = _110 + _621;
  weight_297 = _104->weight;
  _113 = _112 * 4;
  _114 = _276 + _113;
  _115 = _114->red;
  _116 = (int) _115;
  _117 = (double) _116;
  _118 = _117 * weight_297;
  _119 = _118 + pixel$red_371;
  _120 = _114->green;
  _121 = (int) _120;
  _122 = (double) _121;
  _123 = _122 * weight_297;
  _124 = _123 + pixel$green_332;
  _125 = _114->blue;
  _126 = (int) _125;
  _127 = (double) _126;
  _128 = _127 * weight_297;
  _129 = _128 + pixel$blue_377;
  i_298 = i_357 + 1;
  if (n_195 > i_298)
    goto <bb 89>; [89.00%]
  else
    goto <bb 118>; [11.00%]

  <bb 118> [local count: 44202554]:
  # _607 = PHI <_124(54)>
  # _606 = PHI <_119(54)>
  # _605 = PHI <_129(54)>
  goto <bb 55>; [100.00%]

  <bb 89> [local count: 357638851]:
  goto <bb 54>; [100.00%]

and the SLP vectorizer seems to claim:

../magick/resize.c:1284:52: note:   _125 = _114->blue;
../magick/resize.c:1284:52: note:   _120 = _114->green;
../magick/resize.c:1284:52: note:   _115 = _114->red;
../magick/resize.c:1284:52: missed: not consecutive access weight_297 = _104->weight;
../magick/resize.c:1284:52: missed: not consecutive access _105 = _104->pixel;
../magick/resize.c:1284:52: missed: not consecutive access _134->red = iftmp.57_207;
../magick/resize.c:1284:52: missed: not consecutive access _134->green = iftmp.60_208;
../magick/resize.c:1284:52: missed: not consecutive access _134->blue = iftmp.63_209;
../magick/resize.c:1284:52: missed: not consecutive access _134->opacity = 0;
../magick/resize.c:1284:52: missed: not consecutive access _63 = *source_231(D).columns;
../magick/resize.c:1284:52: missed: not consecutive access _60 = _262->pixel;

Not sure if that is related to the real testcase: adding the unused o field, as in

struct rgb {unsigned char r,g,b;} *rgbs;
int *addr;
double *weights;
struct drgb {double r,g,b,o;};

struct drgb sum()
{
  struct drgb r;
  for (int i = 0; i < 100000; i++)
    {
      int j = addr[i];
      double w = weights[i];
      r.r += rgbs[j].r * w;
      r.g += rgbs[j].g * w;
      r.b += rgbs[j].b * w;
    }
  return r;
}

makes us miss the vectorization even though there is nothing using drgb->o:

sum:
.LFB0:
        .cfi_startproc
        movq    %rdi, %r8
        movq    weights(%rip), %rsi
        movq    addr(%rip), %rdi
        vxorps  %xmm2, %xmm2, %xmm2
        movq    rgbs(%rip), %rcx
        xorl    %edx, %edx
        .p2align 4
        .p2align 3
.L2:
        movslq  (%rdi,%rdx,4), %rax
        vmovsd  (%rsi,%rdx,8), %xmm0
        incq    %rdx
        leaq    (%rax,%rax,2), %rax
        addq    %rcx, %rax
        movzbl  (%rax), %r9d
        vcvtsi2sdl      %r9d, %xmm2, %xmm1
        movzbl  1(%rax), %r9d
        movzbl  2(%rax), %eax
        vfmadd231sd     %xmm0, %xmm1, %xmm3
        vcvtsi2sdl      %r9d, %xmm2, %xmm1
        vfmadd231sd     %xmm0, %xmm1, %xmm5
        vcvtsi2sdl      %eax, %xmm2, %xmm1
        vfmadd231sd     %xmm0, %xmm1, %xmm4
        cmpq    $100000, %rdx
        jne     .L2
        vmovq   %xmm4, %xmm4
        vunpcklpd       %xmm5, %xmm3, %xmm0
        movq    %r8, %rax
        vinsertf128     $0x1, %xmm4, %ymm0, %ymm0
        vmovupd %ymm0, (%r8)
        vzeroupper
        ret
This is a benchmarkable version of the simplified testcase:

jan@localhost:/tmp> cat t.c
#define N 10000000
struct rgb {unsigned char r,g,b;} rgbs[N];
int *addr;
struct drgb {double r,g,b;
#ifdef OPACITY
  double o;
#endif
};

struct drgb sum(double w)
{
  struct drgb r;
  for (int i = 0; i < N; i++)
    {
      r.r += rgbs[i].r * w;
      r.g += rgbs[i].g * w;
      r.b += rgbs[i].b * w;
    }
  return r;
}
jan@localhost:/tmp> cat q.c
struct drgb {double r,g,b;
#ifdef OPACITY
  double o;
#endif
};
struct drgb sum(double w);
int
main()
{
  for (int i = 0; i < 1000; i++)
    sum(i);
}

jan@localhost:/tmp> gcc t.c q.c -march=native -O3 -g ; objdump -d a.out | grep vfmadd231pd ; perf stat ./a.out
  40119d:       c4 e2 d9 b8 d1          vfmadd231pd %xmm1,%xmm4,%xmm2

 Performance counter stats for './a.out':

         12,148.04 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               736      page-faults:u             #   60.586 /sec
    50,018,421,148      cycles:u                  #    4.117 GHz
           220,502      stalled-cycles-frontend:u #    0.00% frontend cycles idle
    39,950,154,369      stalled-cycles-backend:u  #   79.87% backend cycles idle
   120,000,191,713      instructions:u            #    2.40  insn per cycle
                                                  #    0.33  stalled cycles per insn
    10,000,048,918      branches:u                #  823.182 M/sec
             7,959      branch-misses:u           #    0.00% of all branches

      12.149466078 seconds time elapsed

      12.149084000 seconds user
       0.000000000 seconds sys

jan@localhost:/tmp> gcc t.c q.c -march=native -O3 -g -DOPACITY ; objdump -d a.out | grep vfmadd231pd ; perf stat ./a.out

 Performance counter stats for './a.out':

         12,141.11 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               735      page-faults:u             #   60.538 /sec
    50,018,839,129      cycles:u                  #    4.120 GHz
           185,034      stalled-cycles-frontend:u #    0.00% frontend cycles idle
    29,963,999,798      stalled-cycles-backend:u  #   59.91% backend cycles idle
   120,000,191,729      instructions:u            #    2.40  insn per cycle
                                                  #    0.25  stalled cycles per insn
    10,000,048,913      branches:u                #  823.652 M/sec
             7,311      branch-misses:u           #    0.00% of all branches

      12.142252354 seconds time elapsed

      12.138237000 seconds user
       0.004000000 seconds sys

So on zen2 hardware I get the same performance for both. It may be interesting to test it on Raptor Lake.
Hello Hubicka and Artem,

I tried to reproduce this issue on Raptor Lake. With -fopenmp -O3 -flto I hit the following link error; with -fopenmp -O3 and no -flto the build is OK. Could you help me?

libtool: link: /home/sdp/jun/gcc0/install/bin/gcc -fopenmp -O3 -flto -march=native -Wall -o utilities/gm utilities/gm.o -L/home/sdp/jun/omp/Ofast/pts_g_gomp/install/.phoronix-test-suite/installed-tests/pts/graphics-magick-2.1.0/gm_/lib magick/.libs/libGraphicsMagick.a -lfreetype -ljbig -ltiff -ljpeg -lXext -lSM -lICE -lX11 -llzma -lbz2 -lz -lzstd -lm -lpthread -fopenmp
/home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in function `main':
<artificial>:(.text.startup+0x1): undefined reference to `GMCommand'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:6411: utilities/gm] Error 1
make[1]: Leaving directory
> /home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in function `main':
> <artificial>:(.text.startup+0x1): undefined reference to `GMCommand'

I wonder if your linker plugin is configured correctly. Can you try to build with -flto -fuse-linker-plugin?
The only difference between the slp vectorizations is:

-  # _68 = PHI <_5(3)>
-  # _67 = PHI <_11(3)>
-  # _66 = PHI <_16(3)>
-  <retval>.r = _68;
-  <retval>.g = _67;
-  <retval>.b = _66;
+  # _70 = PHI <_5(3)>
+  # _69 = PHI <_11(3)>
+  # _68 = PHI <_16(3)>
+  <retval>.r = _70;
+  <retval>.g = _69;
+  <retval>.b = _68;
+  <retval>.o = r$o_33(D);

so SRA invents r$o_33(D) even though that variable is undefined. The SLP vectorizer then sees it as interleaving stores:

-t.c:19:16: note: _1 = rgbs[i_35].r;
-t.c:19:16: note: _7 = rgbs[i_35].g;
-t.c:19:16: note: _12 = rgbs[i_35].b;
-t.c:19:16: note: Detected interleaving store of size 3
-t.c:19:16: note: <retval>.r = _68;
-t.c:19:16: note: <retval>.g = _67;
-t.c:19:16: note: <retval>.b = _66;
+t.c:19:16: note: _1 = rgbs[i_37].r;
+t.c:19:16: note: _7 = rgbs[i_37].g;
+t.c:19:16: note: _12 = rgbs[i_37].b;
+t.c:19:16: note: Detected interleaving store of size 4
+t.c:19:16: note: <retval>.r = _70;
+t.c:19:16: note: <retval>.g = _69;
+t.c:19:16: note: <retval>.b = _68;
+t.c:19:16: note: <retval>.o = r$o_33(D);

For the first case it first tries to vectorize for a vector of 3 doubles and fails:

-t.c:19:16: note: <retval>.r = _68;
-t.c:19:16: note: <retval>.g = _67;
-t.c:19:16: note: <retval>.b = _66;
-t.c:19:16: note: starting SLP discovery for node 0x2cb4fe8
-t.c:19:16: note: Build SLP for <retval>.r = _68;
-t.c:19:16: note: get vectype for scalar type (group size 3): double
-t.c:19:16: note: vectype: vector(2) double
-t.c:19:16: note: nunits = 2
-t.c:19:16: missed: Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note: Build SLP for <retval>.g = _67;
-t.c:19:16: note: get vectype for scalar type (group size 3): double
-t.c:19:16: note: vectype: vector(2) double
-t.c:19:16: note: nunits = 2
-t.c:19:16: missed: Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note: Build SLP for <retval>.b = _66;
-t.c:19:16: note: get vectype for scalar type (group size 3): double
-t.c:19:16: note: vectype: vector(2) double
-t.c:19:16: note: nunits = 2
-t.c:19:16: missed: Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note: SLP discovery for node 0x2cb4fe8 failed

And later it tries to vectorize the first 2 items:

-t.c:19:16: note: Splitting SLP group at stmt 2
-t.c:19:16: note: Split group into 2 and 1
-t.c:19:16: note: Starting SLP discovery for
-t.c:19:16: note: <retval>.r = _68;
-t.c:19:16: note: <retval>.g = _67;
-t.c:19:16 ...

and after a lot of blablabla succeeds. If the opacity field is present we start with a vector of size 4:

+t.c:19:16: note: <retval>.r = _70;
+t.c:19:16: note: <retval>.g = _69;
+t.c:19:16: note: <retval>.b = _68;
+t.c:19:16: note: <retval>.o = r$o_33(D);
+t.c:19:16: note: vect_is_simple_use: operand _70 = PHI <_5(3)>, type of def: internal
+t.c:19:16: note: vect_is_simple_use: operand _69 = PHI <_11(3)>, type of def: internal
+t.c:19:16: note: vect_is_simple_use: operand _68 = PHI <_16(3)>, type of def: internal
+t.c:19:16: note: vect_is_simple_use: operand r$o_33(D), type of def: external
+t.c:19:16: missed: treating operand as external
+t.c:19:16: note: SLP discovery for node 0x2e80058 succeeded
+t.c:19:16: note: SLP size 1 vs. limit 23.
+t.c:19:16: note: Final SLP tree for instance 0x2def840:
+t.c:19:16: note: node 0x2e80058 (max_nunits=4, refcnt=2) vector(4) double
+t.c:19:16: note: op template: <retval>.r = _70;
+t.c:19:16: note:   stmt 0 <retval>.r = _70;
+t.c:19:16: note:   stmt 1 <retval>.g = _69;
+t.c:19:16: note:   stmt 2 <retval>.b = _68;
+t.c:19:16: note:   stmt 3 <retval>.o = r$o_33(D);
+t.c:19:16: note:   children 0x2e800d8
+t.c:19:16: note: node (external) 0x2e800d8 (max_nunits=1, refcnt=1)
+t.c:19:16: note:   { _70, _69, _68, r$o_33(D) }

So it seems to succeed vectorizing with 4 entries, but it does so only for the single return statement:

  <bb 3> [local count: 1063004409]:
  # i_37 = PHI <i_22(5), 0(2)>
  # r$r_40 = PHI <_5(5), r$r_25(D)(2)>
  # r$g_42 = PHI <_11(5), r$g_26(D)(2)>
  # r$b_44 = PHI <_16(5), r$b_27(D)(2)>
  # ivtmp_67 = PHI <ivtmp_66(5), 10000000(2)>
  _1 = rgbs[i_37].r;
  _2 = (int) _1;
  _3 = (double) _2;
  _4 = _3 * w_21(D);
  _5 = _4 + r$r_40;
  _7 = rgbs[i_37].g;
  _8 = (int) _7;
  _9 = (double) _8;
  _10 = _9 * w_21(D);
  _11 = _10 + r$g_42;
  _12 = rgbs[i_37].b;
  _13 = (int) _12;
  _14 = (double) _13;
  _15 = _14 * w_21(D);
  _16 = _15 + r$b_44;
  i_22 = i_37 + 1;
  ivtmp_66 = ivtmp_67 - 1;
  if (ivtmp_66 != 0)
    goto <bb 5>; [99.00%]
  else
    goto <bb 4>; [1.00%]

  <bb 5> [local count: 1052374367]:
  goto <bb 3>; [100.00%]

  <bb 4> [local count: 10737416]:
  # _70 = PHI <_5(3)>
  # _69 = PHI <_11(3)>
  # _68 = PHI <_16(3)>
  _65 = {_70, _69, _68, r$o_33(D)};
  MEM <vector(4) double> [(double *)&<retval>] = _65;

which seems somewhat pointless. If one adds code initializing the opacity field, then vectorization works well. So perhaps the SLP vectorizer needs to be told how to deal with uninitialized variables, which may be common in code like this after SRA?

Richi, it is not clear to me where the SLP vectorizer discards the idea of vectorizing the loop body in this case. But I think one needs to address:

+t.c:19:16: missed: treating operand as external

I wonder if the loop would work faster if it used vectors of size 4 with the last field unused.
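For reference, a minimal sketch (my own variant, not from the attached dumps) of the "initialized opacity" case mentioned above; per the analysis in this comment, once r.o has a defined value the <retval> store group no longer contains an undefined operand and SLP vectorization of the loop is expected to succeed:

  #define N 10000000
  struct rgb {unsigned char r, g, b;} rgbs[N];
  struct drgb {double r, g, b, o;};

  struct drgb sum(double w)
  {
    struct drgb r = {0.0, 0.0, 0.0, 0.0};   /* all fields, in particular r.o, initialized */
    for (int i = 0; i < N; i++)
      {
        r.r += rgbs[i].r * w;
        r.g += rgbs[i].g * w;
        r.b += rgbs[i].b * w;
      }
    return r;
  }

(Unlike the comment #10 testcase, this also initializes r.r/r.g/r.b, which removes the uninitialized reads as well.)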
(In reply to Jan Hubicka from comment #13)
> The only difference between slp vectorization is:
>
> -  # _68 = PHI <_5(3)>
> -  # _67 = PHI <_11(3)>
> -  # _66 = PHI <_16(3)>
> -  <retval>.r = _68;
> -  <retval>.g = _67;
> -  <retval>.b = _66;
> +  # _70 = PHI <_5(3)>
> +  # _69 = PHI <_11(3)>
> +  # _68 = PHI <_16(3)>
> +  <retval>.r = _70;
> +  <retval>.g = _69;
> +  <retval>.b = _68;
> +  <retval>.o = r$o_33(D);
>
> so SRA invents r$o_33(D) even if that variable is undefined.

Is this the testcase from comment #10? I don't see r$o in my dumps.
Oh, that is because I missed the -DOPACITY in the second command line. The reason SRA creates the replacement is total scalarization :-/
Shouldn't we DCE something = x_N(D); stores when x is a VAR_DECL, at least provided something can't trap? I mean, the previous content is one of the possible uninitialized values.
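As an editorial illustration (a hypothetical example, not taken from the bug), the kind of store being discussed looks like this at the source level; after SRA's total scalarization the write of the never-initialized member becomes a GIMPLE store whose right-hand side is an undefined x_N(D) value, so deleting that store cannot change any defined behavior:

  struct drgb { double r, g, b, o; };

  struct drgb make(double v)
  {
    struct drgb x;
    x.r = v;
    x.g = v;
    x.b = v;
    /* x.o is never written; returning x still stores all four doubles,
       including the undefined x.o - that is the store DCE could drop.  */
    return x;
  }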
I was also thinking of DCE. It looks like a plausible idea. It may lead to a surprise where you store the same undefined variable to two places and later compare them for equality, but that is undefined anyway.
One interesting observation: clang is able to do this:

  0.09        vmovddup -0x8(%rdx,%rsi,1),%xmm3
  ...
  0.11        vfmadd231sd %xmm2,%xmm3,%xmm1
  ...
  0.74        vfmadd231pd %xmm2,%xmm3,%xmm0

It figures out that the duplicated V2DFmode value in %xmm3 can also be accessed as a DFmode value in the same register. OTOH, current gcc does:

        vmovsd  (%rsi,%rax,8), %xmm1
        ...
        vmovddup        %xmm1, %xmm4
        ...
        vfmadd231pd     %xmm4, %xmm0, %xmm2
        ...
        vfmadd231sd     %xmm1, %xmm0, %xmm3

The above code needs two registers.
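A small intrinsics sketch (mine, not generated code; the helper name accum and its parameters are hypothetical) of the trick described above: the weight is loaded once with vmovddup and the same register then feeds both the packed and the scalar FMA, so no second copy of the value is needed. Compile with -mfma:

  #include <immintrin.h>

  /* acc_rg accumulates two channels, acc_b the third; w2 = {w, w}. */
  static inline void accum(__m128d *acc_rg, __m128d *acc_b,
                           __m128d px_rg, __m128d px_b, const double *weight)
  {
    __m128d w2 = _mm_loaddup_pd(weight);          /* vmovddup              */
    *acc_rg = _mm_fmadd_pd(px_rg, w2, *acc_rg);   /* vfmadd231pd           */
    *acc_b  = _mm_fmadd_sd(px_b, w2, *acc_b);     /* vfmadd231sd, low lane */
  }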
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:e1e127de18dbee47b88fa0ce74a1c7f4d658dc68

commit r14-4571-ge1e127de18dbee47b88fa0ce74a1c7f4d658dc68
Author: Zhang, Jun <jun.zhang@intel.com>
Date:   Fri Sep 22 23:56:37 2023 +0800

    x86: set spincount 1 for x86 hybrid platform

    By test, we find in hybrid platform spincount 1 is better.

    Use '-march=native -Ofast -funroll-loops -flto',
    results as follows:

    spec2017 speed       RPL       ADL
    657.xz_s            0.00%     0.50%
    603.bwaves_s       10.90%    26.20%
    607.cactuBSSN_s     5.50%    72.50%
    619.lbm_s           2.40%     2.50%
    621.wrf_s          -7.70%     2.40%
    627.cam4_s          0.50%     0.70%
    628.pop2_s         48.20%   153.00%
    638.imagick_s      -0.10%     0.20%
    644.nab_s           2.30%     1.40%
    649.fotonik3d_s     8.00%    13.80%
    654.roms_s          1.20%     1.10%
    Geomean-int         0.00%     0.50%
    Geomean-fp          6.30%    21.10%
    Geomean-all         5.70%    19.10%

    omp2012              RPL       ADL
    350.md             -1.81%    -1.75%
    351.bwaves          7.72%    12.50%
    352.nab            14.63%    19.71%
    357.bt331          -0.20%     1.77%
    358.botsalgn        0.00%     0.00%
    359.botsspar        0.00%     0.65%
    360.ilbdc           0.00%     0.25%
    362.fma3d           2.66%    -0.51%
    363.swim           10.44%     0.00%
    367.imagick         0.00%     0.12%
    370.mgrid331        2.49%    25.56%
    371.applu331        1.06%     4.22%
    372.smithwa         0.74%     3.34%
    376.kdtree         10.67%    16.03%
    GEOMEAN             3.34%     5.53%

    include/ChangeLog:

            PR target/109812
            * spincount.h: New file.

    libgomp/ChangeLog:

            * env.c (initialize_env): Use do_adjust_default_spincount.
            * config/linux/x86/spincount.h: New file.
On zen4 hardware I now get:

GCC 13 with -O3 -flto -march=native -fopenmp:
    Operation: Resizing: 2163 2161 2153; Average: 2159 Iterations Per Minute

clang 17 with -O3 -flto -march=native -fopenmp:
    Operation: Resizing: 2004 1988 1991; Average: 1994 Iterations Per Minute

trunk with -O3 -flto -march=native -fopenmp:
    Operation: Resizing: 2126 2135 2123; Average: 2128 Iterations Per Minute

So no big changes here...
The main gap comes from OpenMP behavior on hybrid machines.