[Bug middle-end/85740] Non-optimal determining maximum in a loop
tkoenig at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Sat May 12 11:31:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740
--- Comment #10 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Created attachment 44121
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121&action=edit
Test case for vectorizing, inc. assembly
This test case, written by Nicolas König, shows a proof of concept for
vectorizing a maxloc loop using AVX2. It lacks loop peeling, so the
overhead for real-life cases will be a little higher.
Compiled with
$ gcc -Ofast -funroll-loops -Wall main-gp.c maxloc_unroll.s maxloc_nounroll.s
with a Ryzen 1700, the results are
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.179272 0.107563 0.537815 0.537815
256 0.396285 0.442907 1.075630 1.254902
512 0.456328 0.654731 1.368984 1.254902
1024 0.501961 0.912656 1.771626 1.434174
2048 0.552617 1.368984 1.434174 1.771626
4096 0.568257 1.338562 1.673203 2.041874
8192 0.529541 1.392724 1.661663 1.958871
16384 0.533056 1.342291 1.204706 1.617055
32768 0.535128 1.478167 1.451453 1.832252
65536 0.435206 1.479301 1.612995 2.077079
131072 0.586499 1.366558 1.603602 2.081565
262144 0.533831 1.366315 1.708424 2.071499
524288 0.582885 1.458868 1.601936 2.053294
1048576 0.580800 1.367224 1.563047 2.060289
2097152 0.567541 1.120453 1.552151 1.420402
4194304 0.555527 0.934199 1.001712 1.210831
8388608 0.553388 0.968962 1.211283 1.010533
16777216 0.529886 0.972008 1.143200 1.291568
33554432 0.538509 0.971257 1.303704 1.245031
67108864 0.567889 1.016425 1.320057 1.413633
134217728 0.572604 1.043561 1.338406 1.437528
268435456 0.578321 1.046748 1.420888 1.450485
536870912 0.572145 1.042577 1.445350 1.424409
536870912 0.458679 1.072158 1.451617 1.442926
268435456 0.460019 1.011306 1.380031 1.404121
134217728 0.460833 1.007872 1.318771 1.409553
67108864 0.457754 1.008473 1.281166 1.387048
33554432 0.425367 0.933973 1.378792 1.078957
16777216 0.449178 0.903576 1.371796 1.416321
8388608 0.436478 1.043172 1.298123 1.344617
4194304 0.421925 1.023954 1.214096 1.288334
2097152 0.458274 1.068309 1.060794 1.147894
1048576 0.470394 1.488727 1.247491 1.777959
524288 0.473536 1.493919 2.073448 2.096280
262144 0.476521 1.504707 2.261032 2.067056
131072 0.473536 1.494209 2.189131 2.060427
65536 0.432861 1.477034 2.238710 2.018355
32768 0.468757 1.482715 2.072612 2.024716
16384 0.470129 1.455838 2.068165 1.974928
8192 0.469671 1.417301 2.190374 2.041874
4096 0.474294 1.368984 2.007843 1.974928
2048 0.459811 1.075630 2.007843 2.077079
1024 0.406995 1.254902 1.882353 1.505882
512 0.406995 0.792570 1.673203 0.941176
256 0.327366 0.627451 1.254902 0.752941
128 0.342246 0.941176 0.941176 0.537815
so
a) __builtin_expect is a big win
b) AVX2 is an even bigger win
c) Unrolling the AVX2 loop is a win for intermediate sizes, at large sizes
(outside of any cache)
Dunno how realistic it is to emit this kind of code for the general case,
or what we could do in the Fortran front end to encourage vectorization
like that. Use vector data types and do the loop peeling by hand, I presume.
More information about the Gcc-bugs
mailing list