Test case (https://godbolt.org/z/CYipz7):

template <class T> using V [[gnu::vector_size(16)]] = T;

V<char          > f(V<char          > a, V<char          > b) { return a / b; }
V<short         > f(V<short         > a, V<short         > b) { return a / b; }
V<int           > f(V<int           > a, V<int           > b) { return a / b; }
V<unsigned char > f(V<unsigned char > a, V<unsigned char > b) { return a / b; }
V<unsigned short> f(V<unsigned short> a, V<unsigned short> b) { return a / b; }
V<unsigned int  > f(V<unsigned int  > a, V<unsigned int  > b) { return a / b; }

(You can extend the test case to 32- and 64-bit vectors.)

None of these divisions has a SIMD instruction on x86. However, conversion to float or double vectors is lossless (char & short -> float, int -> double) and enables an implementation via divps/divpd. This gives a considerable speedup (especially on divider throughput), even with the cost of the conversions. Division by 0 is UB (http://eel.is/c++draft/expr.mul#4), so it doesn't matter that a potential SIGFPE turns into "whatever". ;-)

For reference, this is the result of my library implementation: https://godbolt.org/z/Xgo9Pk.
And benchmark results on a Skylake i7:

TYPE                         Latency  Speedup  Throughput  Speedup
                       [cycles/call]        [cycles/call]
schar,                          24.5                 9.81
schar, simd_abi::__sse          32.3     12.1        9.19     17.1
schar, vector_size(16)           128     3.06         125     1.26
schar, simd_abi::__avx          40.3     19.4        18.7     16.8
schar, vector_size(32)           255     3.07         256     1.23
--------------------------------------------------------------------------------
uchar,                          20.8                 7.55
uchar, simd_abi::__sse          31.9     10.4        9.5      12.7
uchar, vector_size(16)           121     2.74         116     1.04
uchar, simd_abi::__avx          39.9     16.7        18.8     12.8
uchar, vector_size(32)           230     2.9          224     1.08
--------------------------------------------------------------------------------
short,                          22.7                 6.4
short, simd_abi::__sse          23.6     7.7         4.52     11.3
short, vector_size(16)          62.6     2.91        58.4     0.877
short, simd_abi::__avx          30.6     11.9        9.55     10.7
short, vector_size(32)           120     3.03         114     0.9
--------------------------------------------------------------------------------
ushort,                         19.4                 7.37
ushort, simd_abi::__sse         23.7     6.55        4.55     12.9
ushort, vector_size(16)         61.3     2.53        57.4     1.03
ushort, simd_abi::__avx         30.6     10.1        8.86     13.3
ushort, vector_size(32)          116     2.67         114     1.03
--------------------------------------------------------------------------------
int,                            23.2                 7.14
int, simd_abi::__sse            24.7     3.75        7.24     3.95
int, vector_size(16)            40.3     2.3         30.9     0.924
int, simd_abi::__avx            35.6     5.22        14.5     3.95
int, vector_size(32)            64.2     2.9         61.4     0.93
--------------------------------------------------------------------------------
uint,                           20.5                 7.14
uint, simd_abi::__sse             44     1.86        7.73     3.69
uint, vector_size(16)           39.7     2.07        30.9     0.925
uint, simd_abi::__avx           56.9     2.89          16     3.57
uint, vector_size(32)           71.4     2.3         71.5     0.798
--------------------------------------------------------------------------------

I have not investigated whether the same optimization makes sense for targets other than x86. Since this optimization requires optimized vector conversions, PR85048 is relevant.
Confirmed. The implementation will be a bit awkward unless done as a pattern, where it should be straightforward. I wouldn't suggest the backend lie and implement vector integer division patterns.
Note that this is only valid with -fno-trapping-math (since integer division is not permitted to raise the "inexact" exception flag). (With AVX-512 there are instruction variants that suppress exceptions and so could be used even with -ftrapping-math.)
IIUC, AVX-512 only allows overriding the rounding mode for div instructions, so that wouldn't help.

What standard requires that "integer division is not permitted to raise the 'inexact' exception flag"? C++ points to C, which says [§7.6 p3]:

  "Certain programming conventions support the intended model of use for the floating-point environment:
  — a function call does not alter its caller’s floating-point control modes, clear its caller’s floating-point status flags, nor depend on the state of its caller’s floating-point status flags unless the function is so documented;
  — a function call is assumed to require default floating-point control modes, unless its documentation promises otherwise;
  — a function call is assumed to have the potential for raising floating-point exceptions, unless its documentation promises otherwise."

Thus it's a valid implementation to use floating-point division for a SIMD type library, but not valid for auto-vectorization of integer division? (For vector_size(N) types the spec is up to you, no?)