A sequence of calls to sqrt() is not vectorized. I found Bug 21466, which claims that this was fixed in GCC 4.3, but it looks like that change was reverted: at least in 4.4.7 the loop is not vectorized either. I suspect that after that change errors were not reported correctly. The non-vectorized code uses sqrtsd, and for negative numbers it also calls sqrt for its side effects.

I wrote the following code snippet as a possible solution using SSE instructions. I did not check all the details of how errors should be reported for a sequence of sqrt calls, so it may need some changes.

[code]
#include <emmintrin.h>
#include <math.h>

#define SIZE 8

double d1[SIZE];
double d2[SIZE];

void test()
{
    int m = 0;  /* accumulated mask of lanes that held a negative input */
    for (int n = 0; n < SIZE; n += 2) {
        __m128d v = _mm_load_pd(&d1[n]);
        __m128d vs = _mm_sqrt_pd(v);
        __m128d vn = _mm_cmplt_pd(v, _mm_setzero_pd());
        m |= _mm_movemask_pd(vn);
        _mm_store_pd(&d2[n], vs);
    }
    if (m)
        sqrt(-1.0);  /* called only for its errno side effect */
}
[/code]
I think there are patches floating around which turn -fmath-errno off by default. They might have gone into the GCC trunk sources already.
I checked that godbolt.org uses g++ (GCC-Explorer-Build) 9.0.0 20181110 (experimental). This version does not have such a patch merged. Anyway, code compiled with -fmath-errno enabled would still benefit from vectorization if it can be done.
We can't vectorize sqrt () here unless we know nobody looks at errno later.
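To illustrate the concern, here is a minimal sketch of a caller that does look at errno after the loop. The function name is made up for this example, and it assumes a libm that reports domain errors through errno (that is, math_errhandling & MATH_ERRNO is nonzero, as with glibc by default).

```cpp
#include <cerrno>
#include <cmath>
#include <cstddef>

// Hypothetical helper: computes out[i] = sqrt(in[i]) and reports whether
// any of the calls signalled a domain error through errno.
bool sqrt_all_checked(const double* in, double* out, std::size_t n) {
    errno = 0;  // library functions never reset errno to zero themselves
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::sqrt(in[i]);
    // If the loop were replaced by a plain sqrtpd sequence, errno would
    // stay 0 here and this check would silently change its result.
    return errno == EDOM;
}
```

A caller like this is why the plain vectorized form is only valid when the compiler can prove errno is not inspected afterwards.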
I checked the man page for errno and it has the following sentence:

"Valid error numbers are all nonzero; errno is never set to zero by any system call or library function."

This means that code like mine from Comment 0 should do the trick: it checks every processed value for being negative, accumulates the status in a temporary variable, and calls sqrt(-1) once at the end if any of these values was negative.

I have created a small benchmark:

[code]
#include <benchmark/benchmark.h>
#include <math.h>
#include <emmintrin.h>

#define SIZE 160

double src[SIZE];
double dest[SIZE];

static void BM_sqrt(benchmark::State& state) {
    for (auto _ : state) {
        for (int n = 0; n < SIZE; ++n)
            dest[n] = sqrt(src[n]);
        benchmark::ClobberMemory();
    }
}
// Register the function as a benchmark
BENCHMARK(BM_sqrt);

static void BM_sse_sqrt_errno(benchmark::State& state) {
    for (auto _ : state) {
        int m = 0;
        for (int n = 0; n < SIZE; n += 2) {
            __m128d v = _mm_load_pd(&src[n]);
            __m128d vs = _mm_sqrt_pd(v);
            __m128d vn = _mm_cmplt_pd(v, _mm_setzero_pd());
            m |= _mm_movemask_pd(vn);
            _mm_store_pd(&dest[n], vs);
        }
        if (m)
            sqrt(-1.0);
        benchmark::ClobberMemory();
    }
}
// Register the function as a benchmark
BENCHMARK(BM_sse_sqrt_errno);

static void BM_sse_sqrt(benchmark::State& state) {
    for (auto _ : state) {
        for (int n = 0; n < SIZE; n += 2) {
            __m128d v = _mm_load_pd(&src[n]);
            __m128d vs = _mm_sqrt_pd(v);
            _mm_store_pd(&dest[n], vs);
        }
        benchmark::ClobberMemory();
    }
}
// Register the function as a benchmark
BENCHMARK(BM_sse_sqrt);

BENCHMARK_MAIN();
[/code]

This code was compiled using gcc 4.8.5 with the following options:

g++ -std=c++11 -o test test.cc -O3 -I/benchmark/include/ -L/benchmark/lib/ -lbenchmark

Results for SIZE = 16 (loops unrolled):

---------------------------------------------------------
Benchmark                  Time           CPU  Iterations
---------------------------------------------------------
BM_sqrt                   86 ns         86 ns     7188074
BM_sse_sqrt_errno         15 ns         15 ns    48084834
BM_sse_sqrt               15 ns         15 ns    47797778

Results for SIZE = 160 (loops not unrolled):

---------------------------------------------------------
Benchmark                  Time           CPU  Iterations
---------------------------------------------------------
BM_sqrt                  995 ns        995 ns      839866
BM_sse_sqrt_errno        156 ns        156 ns     4348870
BM_sse_sqrt              144 ns        144 ns     4549107

As you can see, the results for BM_sse_sqrt_errno are much better than for BM_sqrt (roughly 6x faster in both runs) and close to those for BM_sse_sqrt. If the optimization implemented in BM_sse_sqrt_errno satisfies the error handling requirements for sqrt(), it is definitely worth implementing in gcc.
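The errno argument can be checked with a small sketch that runs a scalar loop and a deferred-errno SSE2 loop on the same input and compares both the results and the final errno value. The function names are hypothetical, unaligned loads are used so that the sketch makes no alignment assumption, and it assumes an x86 target with SSE2. It only compares the errno value seen after the loop; whether that is everything the standard requires is still the open question here.

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <cstddef>
#include <emmintrin.h>

// Scalar reference: every call may set errno to EDOM on a negative input.
// Returns the errno value observed after the loop.
int sqrt_scalar(const double* src, double* dest, std::size_t n) {
    errno = 0;
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = std::sqrt(src[i]);
    return errno;
}

// SSE2 variant with a single deferred errno update. n must be even.
int sqrt_sse_deferred_errno(const double* src, double* dest, std::size_t n) {
    assert(n % 2 == 0);
    errno = 0;
    int negative_seen = 0;
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(&src[i]);
        // The compare is false for NaN and for -0.0, matching the inputs
        // for which scalar sqrt() does not report a domain error.
        __m128d neg = _mm_cmplt_pd(v, _mm_setzero_pd());
        negative_seen |= _mm_movemask_pd(neg);
        _mm_storeu_pd(&dest[i], _mm_sqrt_pd(v));
    }
    if (negative_seen) {
        // volatile keeps the compiler from folding the call away; the call
        // is made only for its errno side effect.
        volatile double minus_one = -1.0;
        double ignored = std::sqrt(minus_one);
        (void)ignored;
    }
    return errno;
}
```

Calling both functions on an array that contains at least one negative value should leave the same errno behind, since sqrt() never resets errno to zero once an earlier call has set it.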