Bug 119223 - GCC does not optimize with AVX in bitshift with if condition
Summary: GCC does not optimize with AVX in bitshift with if condition
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 14.2.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Duplicates: 119262
Depends on:
Blocks: vectorizer
Reported: 2025-03-11 17:52 UTC by Kael Franco
Modified: 2025-03-13 12:52 UTC (History)
CC List: 0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2025-03-11 00:00:00


Attachments
test program #1 (653 bytes, text/plain)
2025-03-11 17:54 UTC, Andrew Pinski
test program #2 (660 bytes, text/plain)
2025-03-11 17:54 UTC, Andrew Pinski
Test program #1 -fdump-tree-optimized (1.87 KB, text/plain)
2025-03-11 17:59 UTC, Kael Franco
functions in one file to compare (183 bytes, text/plain)
2025-03-11 18:00 UTC, Andrew Pinski
Test program #2 -fdump-tree-optimized (1.19 KB, text/plain)
2025-03-11 18:01 UTC, Kael Franco
testcase to show that the issue is `1<<n` vs `bool<<n` (195 bytes, text/plain)
2025-03-11 18:09 UTC, Andrew Pinski

Description Kael Franco 2025-03-11 17:52:13 UTC
I created two C programs that match newlines in a file (the file is src/Sema.zig from Zig 0.14); see https://godbolt.org/z/v9hqzPv4b. Both programs behave the same.

The only difference is at line 56, where the first C program has no if condition.
GCC emits SIMD code only when no if condition is used (the first program); the second program, which has the if condition, is not vectorized. Clang optimizes both with SIMD. The difference appears to show up in the -fdump-tree-optimized output.
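For readers without access to the attachments, the two loop shapes can be sketched roughly as follows. This is a hypothetical reconstruction, not the attached source: the function names `mask_with_if` and `mask_without_if`, the 64-byte chunk size, and the `uint64_t` mask are all assumptions made for illustration. Whether GCC vectorizes either form depends on the compiler version and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch (not the attached test programs): build a bitmask
 * with bit i set when byte i of the chunk is a newline. At most the first
 * 64 bytes are examined so that the shift count stays below 64. */

/* Form with the if condition: a constant 1 is shifted under a branch. */
uint64_t mask_with_if(const unsigned char *p, size_t n)
{
    uint64_t mask = 0;
    for (size_t i = 0; i < n && i < 64; i++) {
        if (p[i] == '\n')
            mask |= (uint64_t)1 << i;
    }
    return mask;
}

/* Form without the if condition: the comparison result (0 or 1) is
 * shifted directly, so the loop body contains no branch. */
uint64_t mask_without_if(const unsigned char *p, size_t n)
{
    uint64_t mask = 0;
    for (size_t i = 0; i < n && i < 64; i++)
        mask |= (uint64_t)(p[i] == '\n') << i;
    return mask;
}
```

Both functions return the same mask for the same input; they differ only in how the set bit is produced.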

Gentoo GCC 14.2 was used, and both C programs were compiled with -std=gnu23 -O3 -march=icelake-client -D_FILE_OFFSET_BITS=64 -flto.
uname -a is Linux tux 6.6.67-gentoo-gentoo-dist #4 SMP PREEMPT_DYNAMIC Sun Jan 26 03:15:41 EST 2025 x86_64 Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz GenuineIntel GNU/Linux
The results, measured with poop, show the following speedups:
./poop './main2' './main1' -d 60000
Benchmark 1 (10000 runs): ./main2
measurement          mean ± σ            min … max           outliers         delta
wall_time          4.58ms ±  972us    2.11ms … 6.88ms          0 ( 0%)        0%
peak_rss           3.10MB ± 64.4KB    2.78MB … 3.20MB          1 ( 0%)        0%
cpu_cycles         4.97M  ±  110K     4.47M  … 6.18M        1090 (11%)        0%
instructions       12.0M  ± 1.19      12.0M  … 12.0M         799 ( 8%)        0%
cache_references   31.4K  ±  528      30.1K  … 32.9K           0 ( 0%)        0%
cache_misses       4.26K  ±  808      2.73K  … 10.8K         170 ( 2%)        0%
branch_misses      28.1K  ±  285      10.4K  … 28.2K         153 ( 2%)        0%
Benchmark 2 (10000 runs): ./main1
measurement          mean ± σ            min … max           outliers         delta
wall_time          3.28ms ±  310us    1.54ms … 4.61ms       1807 (18%)        - 28.4% ±  0.4%
peak_rss           3.10MB ± 64.0KB    2.78MB … 3.20MB          2 ( 0%)          -  0.0% ±  0.1%
cpu_cycles         2.06M  ± 28.2K     2.02M  … 2.72M         602 ( 6%)        - 58.6% ±  0.0%
instructions       2.37M  ± 1.14      2.37M  … 2.37M           5 ( 0%)        - 80.2% ±  0.0%
cache_references   31.4K  ±  378      30.5K  … 32.8K           5 ( 0%)          +  0.3% ±  0.0%
cache_misses       4.25K  ±  809      2.71K  … 15.6K         246 ( 2%)          -  0.3% ±  0.5%
branch_misses      2.16K  ± 35.0      1.44K  … 2.32K         110 ( 1%)        - 92.3% ±  0.0%
Comment 1 Andrew Pinski 2025-03-11 17:54:09 UTC
Created attachment 60710 [details]
test program #1

Next time attach the testcases.
Comment 2 Andrew Pinski 2025-03-11 17:54:46 UTC
Created attachment 60711 [details]
test program #2
Comment 3 Kael Franco 2025-03-11 17:59:21 UTC
Created attachment 60712 [details]
Test program #1 -fdump-tree-optimized
Comment 4 Andrew Pinski 2025-03-11 18:00:00 UTC
Created attachment 60713 [details]
functions in one file to compare
Comment 5 Kael Franco 2025-03-11 18:01:19 UTC
Created attachment 60714 [details]
Test program #2 -fdump-tree-optimized
Comment 6 Andrew Pinski 2025-03-11 18:06:58 UTC
/app/example.cpp:17:18: missed:   unusable type for last operand in vector/vector shift/rotate.
/app/example.cpp:20:22: missed:   not vectorized: relevant stmt not supported: _4 = 1 << _3;

Basically we can vectorize `bool<<i` but not `1<<i`.
Comment 7 Andrew Pinski 2025-03-11 18:09:54 UTC
Created attachment 60715 [details]
testcase to show that the issue is `1<<n` vs `bool<<n`

If WORKS is defined, then this loop can be vectorized and we can shift the bool. But if WORKS is not defined, t becomes 1 and we don't vectorize the loop.
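
A minimal sketch of what such a testcase could look like is given below. It is not attachment 60715: the function name `shift_flags`, the parameter types, and the masking of the shift count to 0..31 are assumptions for illustration. Only the `WORKS` macro and the variable name `t` come from the comment above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch, not the attached testcase. For each element,
 * store (in[i] == '\n') shifted left by sh[i]. The shift count is masked
 * to 0..31 (an assumption) to keep the shift well defined. */
void shift_flags(unsigned *restrict out, const unsigned char *restrict in,
                 const unsigned char *restrict sh, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        bool t = in[i] == '\n';
#ifdef WORKS
        /* The bool itself is shifted: the `bool<<n` form. */
        out[i] = (unsigned)t << (sh[i] & 31);
#else
        /* On the taken path t is known to be 1, so this is the
         * `1<<n` form. */
        out[i] = t ? 1u << (sh[i] & 31) : 0u;
#endif
    }
}
```

Both variants store the same values. Compiling each with -O3 and -fopt-info-vec-missed is one way to check whether a given GCC version reports a missed vectorization like the one quoted in comment 6.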
Comment 8 Andrew Pinski 2025-03-13 02:17:53 UTC
*** Bug 119262 has been marked as a duplicate of this bug. ***