Bug 90993 - simd integer division not optimized
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2019-06-25 10:40 UTC by Matthias Kretz (Vir)
Modified: 2020-02-27 20:07 UTC (History)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-06-25 00:00:00


Description Matthias Kretz (Vir) 2019-06-25 10:40:03 UTC
Test case (https://godbolt.org/z/CYipz7):

template <class T> using V [[gnu::vector_size(16)]] = T;

V<char > f(V<char > a, V<char > b) { return a / b; }
V<short> f(V<short> a, V<short> b) { return a / b; }
V<int  > f(V<int  > a, V<int  > b) { return a / b; }
V<unsigned char > f(V<unsigned char > a, V<unsigned char > b) { return a / b; }
V<unsigned short> f(V<unsigned short> a, V<unsigned short> b) { return a / b; }
V<unsigned int  > f(V<unsigned int  > a, V<unsigned int  > b) { return a / b; }

(You can extend the test case to 32- and 64-byte vectors.)

x86 has no SIMD instruction for any of these divisions. However, conversion to float or double vectors is lossless (char & short -> float, int -> double) and enables an implementation via divps/divpd. This yields a considerable speedup (especially in throughput), even when the cost of the conversions is included. Division by 0 is UB (http://eel.is/c++draft/expr.mul#4), so it doesn't matter that a potential SIGFPE turns into "whatever". ;-)
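
A minimal sketch of the conversion approach, written with GCC's vector extensions. The type names (short8, float8, int4, double4) and the functions div_via_float, div_via_double and check_examples are made up for this illustration; this is not the code behind the godbolt links. It assumes a compiler that provides __builtin_convertvector (GCC 9 or later, or Clang). The wider float/double vectors are kept as locals, so the compiler may split them into 128-bit halves when AVX is not enabled.

```cpp
#include <cassert>

// GNU vector types; the names are illustrative only.
typedef short  short8  __attribute__((vector_size(16)));  // 8 x 16-bit int
typedef float  float8  __attribute__((vector_size(32)));  // 8 x float
typedef int    int4    __attribute__((vector_size(16)));  // 4 x 32-bit int
typedef double double4 __attribute__((vector_size(32)));  // 4 x double

// 16-bit integers fit exactly into float's 24-bit significand, and the
// float -> short conversion truncates toward zero, like integer division.
// A zero divisor is undefined here, just as it is for a / b.
short8 div_via_float(short8 a, short8 b) {
    float8 fa = __builtin_convertvector(a, float8);
    float8 fb = __builtin_convertvector(b, float8);
    return __builtin_convertvector(fa / fb, short8);
}

// 32-bit integers need double (53-bit significand) to convert losslessly.
int4 div_via_double(int4 a, int4 b) {
    double4 da = __builtin_convertvector(a, double4);
    double4 db = __builtin_convertvector(b, double4);
    return __builtin_convertvector(da / db, int4);
}

// Compare a few inputs against the compiler's own element-wise division.
void check_examples() {
    short8 a = {32767, -32768, 100, -7, 7, 1, 0, 12345};
    short8 b = {1, 3, 7, 2, -2, 32767, 5, -123};
    short8 q = div_via_float(a, b);
    short8 r = a / b;
    for (int i = 0; i < 8; ++i) assert(q[i] == r[i]);

    int4 c = {2147483647, -2147483647 - 1, 1000000007, -9};
    int4 d = {1, 3, 2147483647, 4};
    int4 s = div_via_double(c, d);
    int4 t = c / d;
    for (int i = 0; i < 4; ++i) assert(s[i] == t[i]);
}
```

Calling check_examples() from main exercises both helpers. The sketch only demonstrates that the results agree with integer division for these inputs; it says nothing about the instruction sequence a given GCC version emits for the conversions.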

For reference, this is the result of my library implementation: https://godbolt.org/z/Xgo9Pk.

And benchmark results on Skylake i7:
                  TYPE            Latency     Speedup     Throughput     Speedup
                            [cycles/call]              [cycles/call]
 schar,                              24.5                       9.81
 schar, simd_abi::__sse              32.3        12.1           9.19        17.1
 schar, vector_size(16)               128        3.06            125        1.26
 schar, simd_abi::__avx              40.3        19.4           18.7        16.8
 schar, vector_size(32)               255        3.07            256        1.23
--------------------------------------------------------------------------------
 uchar,                              20.8                       7.55
 uchar, simd_abi::__sse              31.9        10.4            9.5        12.7
 uchar, vector_size(16)               121        2.74            116        1.04
 uchar, simd_abi::__avx              39.9        16.7           18.8        12.8
 uchar, vector_size(32)               230         2.9            224        1.08
--------------------------------------------------------------------------------
 short,                              22.7                        6.4
 short, simd_abi::__sse              23.6         7.7           4.52        11.3
 short, vector_size(16)              62.6        2.91           58.4       0.877
 short, simd_abi::__avx              30.6        11.9           9.55        10.7
 short, vector_size(32)               120        3.03            114         0.9
--------------------------------------------------------------------------------
ushort,                              19.4                       7.37
ushort, simd_abi::__sse              23.7        6.55           4.55        12.9
ushort, vector_size(16)              61.3        2.53           57.4        1.03
ushort, simd_abi::__avx              30.6        10.1           8.86        13.3
ushort, vector_size(32)               116        2.67            114        1.03
--------------------------------------------------------------------------------
   int,                              23.2                       7.14
   int, simd_abi::__sse              24.7        3.75           7.24        3.95
   int, vector_size(16)              40.3         2.3           30.9       0.924
   int, simd_abi::__avx              35.6        5.22           14.5        3.95
   int, vector_size(32)              64.2         2.9           61.4        0.93
--------------------------------------------------------------------------------
  uint,                              20.5                       7.14
  uint, simd_abi::__sse                44        1.86           7.73        3.69
  uint, vector_size(16)              39.7        2.07           30.9       0.925
  uint, simd_abi::__avx              56.9        2.89             16        3.57
  uint, vector_size(32)              71.4         2.3           71.5       0.798
--------------------------------------------------------------------------------

I have not investigated whether the same optimization makes sense for targets other than x86.

Since this optimization requires optimized vector conversions, PR85048 is relevant.
Comment 1 Richard Biener 2019-06-25 12:12:22 UTC
Confirmed.  Implementation will be a bit awkward unless done as a pattern, where it should be straightforward.  I wouldn't suggest the backend lie and implement vector integer division patterns.
Comment 2 jsm-csl@polyomino.org.uk 2019-08-01 23:24:56 UTC
Note that this is only valid with -fno-trapping-math (since integer division is not permitted to raise the "inexact" exception flag).  (With AVX-512 there are instruction variants that suppress exceptions and so could be used even with -ftrapping-math.)
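
To illustrate the concern, here is a small sketch showing that a floating-point division with an inexact quotient sets the "inexact" flag, which a scalar integer division does not touch. The helper names float_div_raises_inexact and check_inexact_flag are made up for this example, and the volatile qualifiers are only there to keep the compiler from evaluating the division at compile time.

```cpp
#include <cassert>
#include <cfenv>

// Returns true if computing a / b in float raised FE_INEXACT.
bool float_div_raises_inexact(float a, float b) {
    volatile float va = a, vb = b;      // force a run-time division
    std::feclearexcept(FE_ALL_EXCEPT);  // start from clean status flags
    volatile float q = va / vb;
    (void)q;
    return std::fetestexcept(FE_INEXACT) != 0;
}

void check_inexact_flag() {
    // 1/3 is not representable in binary floating point: flag is set.
    assert(float_div_raises_inexact(1.0f, 3.0f));
    // 6/3 == 2 exactly: flag stays clear.
    assert(!float_div_raises_inexact(6.0f, 3.0f));
}
```

So an integer division such as 1 / 3, if implemented through a float division, would leave FE_INEXACT set where the integer instruction would not.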
Comment 3 Matthias Kretz (Vir) 2020-02-27 20:07:46 UTC
IIUC, AVX512 only allows overriding the rounding-mode from div instructions. So that wouldn't help.

What standard requires that "integer division is not permitted to raise the "inexact" exception flag"? C++ points to C which says:

"Certain programming conventions support the intended model of use for the floating-point environment:
— a function call does not alter its caller’s floating-point control modes, clear its caller’s floating-point status flags, nor depend on the state of its caller’s floating-point status flags unless the function is so documented;
— a function call is assumed to require default floating-point control modes, unless its documentation promises otherwise;
— a function call is assumed to have the potential for raising floating-point exceptions, unless its documentation promises otherwise." [§7.6 p3]

Thus it's a valid implementation to use floating point division for a SIMD type library, but not valid for auto-vectorization of integer division? (For vector_size(N) types the spec is up to you, no?)