I created 2 C programs that match newlines in a file (the file is src/Sema.zig from Zig 0.14); see https://godbolt.org/z/v9hqzPv4b.

Both programs behave the same. The only difference is at line 56, where the first C program has no if condition. GCC adds SIMD when no if condition is used, as seen in the second C program. Clang optimizes both with SIMD. The difference seems to show up in the -fdump-tree-optimized output.

Gentoo GCC 14.2 was used, and both C programs were optimized with:
  -std=gnu23 -O3 -march=icelake-client -D_FILE_OFFSET_BITS=64 -flto

uname -a is:
  Linux tux 6.6.67-gentoo-gentoo-dist #4 SMP PREEMPT_DYNAMIC Sun Jan 26 03:15:41 EST 2025 x86_64 Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz GenuineIntel GNU/Linux

The results were measured with poop and show the following speedups:

./poop './main2' './main1' -d 60000

Benchmark 1 (10000 runs): ./main2
  measurement         mean ± σ           min … max          outliers     delta
  wall_time           4.58ms ± 972us     2.11ms … 6.88ms       0 ( 0%)   0%
  peak_rss            3.10MB ± 64.4KB    2.78MB … 3.20MB       1 ( 0%)   0%
  cpu_cycles          4.97M ± 110K       4.47M … 6.18M      1090 (11%)   0%
  instructions        12.0M ± 1.19       12.0M … 12.0M       799 ( 8%)   0%
  cache_references    31.4K ± 528        30.1K … 32.9K         0 ( 0%)   0%
  cache_misses        4.26K ± 808        2.73K … 10.8K       170 ( 2%)   0%
  branch_misses       28.1K ± 285        10.4K … 28.2K       153 ( 2%)   0%

Benchmark 2 (10000 runs): ./main1
  measurement         mean ± σ           min … max          outliers     delta
  wall_time           3.28ms ± 310us     1.54ms … 4.61ms    1807 (18%)   - 28.4% ± 0.4%
  peak_rss            3.10MB ± 64.0KB    2.78MB … 3.20MB       2 ( 0%)   -  0.0% ± 0.1%
  cpu_cycles          2.06M ± 28.2K      2.02M … 2.72M       602 ( 6%)   - 58.6% ± 0.0%
  instructions        2.37M ± 1.14       2.37M … 2.37M         5 ( 0%)   - 80.2% ± 0.0%
  cache_references    31.4K ± 378        30.5K … 32.8K         5 ( 0%)   +  0.3% ± 0.0%
  cache_misses        4.25K ± 809        2.71K … 15.6K       246 ( 2%)   -  0.3% ± 0.5%
  branch_misses       2.16K ± 35.0       1.44K … 2.32K       110 ( 1%)   - 92.3% ± 0.0%
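For readers without access to the godbolt link, here is a minimal sketch of the kind of difference being described. It is not the attached test programs: the function names, the 64-byte chunk size, and the mask representation are all assumptions made for illustration. Both functions compute the same newline bitmask; one shifts the comparison result directly, the other guards the shift with an if.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the attachments): build a 64-bit mask
 * with bit i set when chunk[i] is a newline. No if condition: the
 * comparison result (0 or 1) is shifted into place. */
uint64_t newline_mask_no_if(const unsigned char *chunk)
{
    assert(chunk != NULL);
    uint64_t mask = 0;
    for (size_t i = 0; i < 64; i++)
        mask |= (uint64_t)(chunk[i] == '\n') << i;
    return mask;
}

/* Same result, but written with an if condition, so the value being
 * shifted is the constant 1 rather than the comparison result. */
uint64_t newline_mask_with_if(const unsigned char *chunk)
{
    assert(chunk != NULL);
    uint64_t mask = 0;
    for (size_t i = 0; i < 64; i++) {
        if (chunk[i] == '\n')
            mask |= (uint64_t)1 << i;
    }
    return mask;
}
```

Both functions should return identical masks for any 64-byte input; whether a given compiler vectorizes either one has to be checked in the generated assembly or the tree dumps.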
Created attachment 60710
test program #1

Next time attach the testcases.
Created attachment 60711
test program #2
Created attachment 60712
Test program #1 -fdump-tree-optimized
Created attachment 60713
functions in one file to compare
Created attachment 60714
Test program #2 -fdump-tree-optimized
/app/example.cpp:17:18: missed: unusable type for last operand in vector/vector shift/rotate.
/app/example.cpp:20:22: missed: not vectorized: relevant stmt not supported: _4 = 1 << _3;

Basically we can vectorize `bool<<i` but not `1<<i`.
Created attachment 60715
testcase to show that the issue is `1<<n` vs `bool<<n`

If WORKS is defined, then this loop can be vectorized and we can shift the bool. But if WORKS is not defined, t becomes 1 and we don't vectorize the loop.
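A rough sketch of what such a reduced testcase could look like, for illustration only. This is not attachment 60715: the function name match_mask32, the 32-element width, and the exact loop body are assumptions; only the WORKS macro and the `bool<<n` versus `1<<n` distinction come from the comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction: set bit i of the result when p[i] == needle.
 * Build with -DWORKS to shift a bool; build without it to shift the
 * constant 1 under an if condition. */
uint32_t match_mask32(const unsigned char *p, unsigned char needle)
{
    assert(p != NULL);
    uint32_t mask = 0;
    for (size_t i = 0; i < 32; i++) {
#ifdef WORKS
        /* t is a bool, so the shifted operand is the comparison result. */
        bool t = (p[i] == needle);
        mask |= (uint32_t)t << i;
#else
        /* Here the shifted operand is the constant 1. */
        if (p[i] == needle)
            mask |= (uint32_t)1 << i;
#endif
    }
    return mask;
}
```

Compiling a file like this twice, with and without -DWORKS, and adding -fopt-info-vec-missed to the options from the report should show whether GCC emits "missed" notes like the ones quoted earlier. The result is the same in both configurations; only the form of the shift differs.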
*** Bug 119262 has been marked as a duplicate of this bug. ***