[Bug middle-end/85740] Non-optimal determining maximum in a loop

Sat May 12 11:31:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

--- Comment #10 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Created attachment 44121
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121&action=edit
Test case for vectorizing, inc. assembly

This test case, written by Nicolas König, shows a proof of concept for
vectorizing a maxloc loop using AVX2. It lacks loop peeling, so the
overhead for real-life cases will be a little higher.

Compiled with

$ gcc -Ofast -funroll-loops -Wall main-gp.c maxloc_unroll.s maxloc_nounroll.s

with a Ryzen 1700, the results are

# Ints per cycle
#           n        normal        expect          AVX2   AVX2_unroll
          128      0.179272      0.107563      0.537815      0.537815
          256      0.396285      0.442907      1.075630      1.254902
          512      0.456328      0.654731      1.368984      1.254902
         1024      0.501961      0.912656      1.771626      1.434174
         2048      0.552617      1.368984      1.434174      1.771626
         4096      0.568257      1.338562      1.673203      2.041874
         8192      0.529541      1.392724      1.661663      1.958871
        16384      0.533056      1.342291      1.204706      1.617055
        32768      0.535128      1.478167      1.451453      1.832252
        65536      0.435206      1.479301      1.612995      2.077079
       131072      0.586499      1.366558      1.603602      2.081565
       262144      0.533831      1.366315      1.708424      2.071499
       524288      0.582885      1.458868      1.601936      2.053294
      1048576      0.580800      1.367224      1.563047      2.060289
      2097152      0.567541      1.120453      1.552151      1.420402
      4194304      0.555527      0.934199      1.001712      1.210831
      8388608      0.553388      0.968962      1.211283      1.010533
     16777216      0.529886      0.972008      1.143200      1.291568
     33554432      0.538509      0.971257      1.303704      1.245031
     67108864      0.567889      1.016425      1.320057      1.413633
    134217728      0.572604      1.043561      1.338406      1.437528
    268435456      0.578321      1.046748      1.420888      1.450485
    536870912      0.572145      1.042577      1.445350      1.424409

    536870912      0.458679      1.072158      1.451617      1.442926
    268435456      0.460019      1.011306      1.380031      1.404121
    134217728      0.460833      1.007872      1.318771      1.409553
     67108864      0.457754      1.008473      1.281166      1.387048
     33554432      0.425367      0.933973      1.378792      1.078957
     16777216      0.449178      0.903576      1.371796      1.416321
      8388608      0.436478      1.043172      1.298123      1.344617
      4194304      0.421925      1.023954      1.214096      1.288334
      2097152      0.458274      1.068309      1.060794      1.147894
      1048576      0.470394      1.488727      1.247491      1.777959
       524288      0.473536      1.493919      2.073448      2.096280
       262144      0.476521      1.504707      2.261032      2.067056
       131072      0.473536      1.494209      2.189131      2.060427
        65536      0.432861      1.477034      2.238710      2.018355
        32768      0.468757      1.482715      2.072612      2.024716
        16384      0.470129      1.455838      2.068165      1.974928
         8192      0.469671      1.417301      2.190374      2.041874
         4096      0.474294      1.368984      2.007843      1.974928
         2048      0.459811      1.075630      2.007843      2.077079
         1024      0.406995      1.254902      1.882353      1.505882
          512      0.406995      0.792570      1.673203      0.941176
          256      0.327366      0.627451      1.254902      0.752941
          128      0.342246      0.941176      0.941176      0.537815

so

a) __builtin_expect is a big win

b) AVX2 is an even bigger win

c) Unrolling the AVX2 loop is a win for intermediate sizes, at large sizes
   (outside of any cache)

Dunno how realistic it is to emit this kind of code for the general case,
or what we could do in the Fortran front end to encourage vectorization
like that. Use vector data types and do the loop peeling by hand, I presume.