Test case (https://godbolt.org/z/CYipz7):

template <class T> using V [[gnu::vector_size(16)]] = T;

V<char          > f(V<char          > a, V<char          > b) { return a / b; }
V<short         > f(V<short         > a, V<short         > b) { return a / b; }
V<int           > f(V<int           > a, V<int           > b) { return a / b; }
V<unsigned char > f(V<unsigned char > a, V<unsigned char > b) { return a / b; }
V<unsigned short> f(V<unsigned short> a, V<unsigned short> b) { return a / b; }
V<unsigned int  > f(V<unsigned int  > a, V<unsigned int  > b) { return a / b; }

(You can extend the test case to 32- and 64-bit vectors.)

None of these divisions has a SIMD instruction on x86. However, conversion to float or double vectors is lossless (char & short -> float, int -> double) and enables an implementation via divps/divpd. This gives a considerable speedup (especially on divider throughput), even with the cost of the conversions. Division by 0 is UB (http://eel.is/c++draft/expr.mul#4), so it doesn't matter that a potential SIGFPE turns into "whatever". ;-)

For reference, this is the result of my library implementation: https://godbolt.org/z/Xgo9Pk.
And benchmark results on a Skylake i7:

TYPE                         Latency  Speedup  Throughput  Speedup
                       [cycles/call]        [cycles/call]
schar,                          24.5                 9.81
schar, simd_abi::__sse          32.3     12.1        9.19     17.1
schar, vector_size(16)           128     3.06         125     1.26
schar, simd_abi::__avx          40.3     19.4        18.7     16.8
schar, vector_size(32)           255     3.07         256     1.23
--------------------------------------------------------------------------------
uchar,                          20.8                 7.55
uchar, simd_abi::__sse          31.9     10.4        9.5      12.7
uchar, vector_size(16)           121     2.74         116     1.04
uchar, simd_abi::__avx          39.9     16.7        18.8     12.8
uchar, vector_size(32)           230     2.9          224     1.08
--------------------------------------------------------------------------------
short,                          22.7                 6.4
short, simd_abi::__sse          23.6     7.7         4.52     11.3
short, vector_size(16)          62.6     2.91        58.4     0.877
short, simd_abi::__avx          30.6     11.9        9.55     10.7
short, vector_size(32)           120     3.03         114     0.9
--------------------------------------------------------------------------------
ushort,                         19.4                 7.37
ushort, simd_abi::__sse         23.7     6.55        4.55     12.9
ushort, vector_size(16)         61.3     2.53        57.4     1.03
ushort, simd_abi::__avx         30.6     10.1        8.86     13.3
ushort, vector_size(32)          116     2.67         114     1.03
--------------------------------------------------------------------------------
int,                            23.2                 7.14
int, simd_abi::__sse            24.7     3.75        7.24     3.95
int, vector_size(16)            40.3     2.3         30.9     0.924
int, simd_abi::__avx            35.6     5.22        14.5     3.95
int, vector_size(32)            64.2     2.9         61.4     0.93
--------------------------------------------------------------------------------
uint,                           20.5                 7.14
uint, simd_abi::__sse             44     1.86        7.73     3.69
uint, vector_size(16)           39.7     2.07        30.9     0.925
uint, simd_abi::__avx           56.9     2.89          16     3.57
uint, vector_size(32)           71.4     2.3         71.5     0.798
--------------------------------------------------------------------------------

I have not investigated whether the same optimization makes sense for targets other than x86. Since this optimization requires optimized vector conversions, PR85048 is relevant.
Confirmed. The implementation will be a bit awkward unless done as a pattern, where it should be straightforward. I wouldn't suggest the backend lie and implement vector integer division patterns.
Note that this is only valid with -fno-trapping-math (since integer division is not permitted to raise the "inexact" exception flag). (With AVX-512 there are instruction variants that suppress exceptions and so could be used even with -ftrapping-math.)
IIUC, AVX-512 only allows overriding the rounding mode for div instructions, so that wouldn't help.

What standard requires that "integer division is not permitted to raise the 'inexact' exception flag"? C++ points to C, which says [§7.6 p3]:

  "Certain programming conventions support the intended model of use for the floating-point environment:
  — a function call does not alter its caller’s floating-point control modes, clear its caller’s floating-point status flags, nor depend on the state of its caller’s floating-point status flags unless the function is so documented;
  — a function call is assumed to require default floating-point control modes, unless its documentation promises otherwise;
  — a function call is assumed to have the potential for raising floating-point exceptions, unless its documentation promises otherwise."

Thus it's a valid implementation to use floating-point division for a SIMD type library, but not valid for auto-vectorization of integer division? (For vector_size(N) types the spec is up to you, no?)