Bug 88153 - sqrt() is not vectorized
Summary: sqrt() is not vectorized
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 8.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2018-11-22 16:55 UTC by Daniel Fruzynski
Modified: 2021-10-01 03:23 UTC (History)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Daniel Fruzynski 2018-11-22 16:55:27 UTC
A sequence of calls to sqrt() is not vectorized.

I found Bug 21466, which claims that this was fixed in GCC 4.3, but it looks like that change was reverted: at least in 4.4.7 the calls are not vectorized either. I suspect that after that change errors were not reported correctly. The non-vectorized code uses sqrtsd and, for negative numbers, also calls sqrt for its side effects.

I wrote the following code snippet as a possible solution using SSE instructions. I did not check all the details of how errors should be reported for a sequence of sqrt calls, so it may need some changes.

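#include <emmintrin.h>
#include <math.h>

#define SIZE 8
double d1[SIZE];
double d2[SIZE];

void test()
{
    int m = 0;  /* collects a bit for every negative input seen */
    for (int n = 0; n < SIZE; n += 2)
    {
        __m128d v = _mm_load_pd(&d1[n]);
        __m128d vs = _mm_sqrt_pd(v);
        /* compare both lanes against zero and pack the results into bits */
        __m128d vn = _mm_cmplt_pd(v, _mm_setzero_pd());
        m |= _mm_movemask_pd(vn);
        _mm_store_pd(&d2[n], vs);
    }

    /* at least one input was negative: call sqrt once for its errno side effect */
    if (m)
        sqrt(-1.0);
}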
Comment 1 Andrew Pinski 2018-11-22 20:02:44 UTC
I think there are patches floating around which turn -fmath-errno off by default.  They might have gone into the GCC trunk sources already.
Comment 2 Daniel Fruzynski 2018-11-22 20:15:11 UTC
I checked that godbolt.org uses g++ (GCC-Explorer-Build) 9.0.0 20181110 (experimental). This version does not have such a patch merged.

Anyway, code compiled with -fmath-errno enabled would benefit from vectorization if it can be done.
Comment 3 Richard Biener 2018-11-23 08:19:43 UTC
We can't vectorize sqrt () here unless we know nobody looks at errno later.
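
To illustrate what "looking at errno later" means, here is a minimal sketch that is not part of the original report. The helper names sqrt_all and saw_domain_error are hypothetical, and the sketch assumes a C library that reports math errors through errno (math_errhandling & MATH_ERRNO, as glibc does).

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <cstddef>

// Hypothetical helper: the plain scalar loop discussed in this report.
void sqrt_all(const double* in, double* out, std::size_t count)
{
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < count; ++i)
        out[i] = std::sqrt(in[i]);  // may set errno to EDOM when in[i] < 0
}

// Hypothetical caller that inspects errno after the loop has finished.
// This is the kind of observer the compiler has to keep working.
bool saw_domain_error(const double* in, double* out, std::size_t count)
{
    errno = 0;                 // library functions never reset errno to zero
    sqrt_all(in, out, count);
    return errno == EDOM;      // true if any call reported a domain error
}
```

A caller written like saw_domain_error depends on errno being set somewhere inside the loop, so any vectorized replacement has to leave errno in the same final state.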
Comment 4 Daniel Fruzynski 2018-11-26 09:59:26 UTC
I checked the man page for errno and it has the following sentence:

"Valid error numbers are all nonzero; errno is never set to zero by any system call or library function."

This means that code like mine from Comment 0 should do the trick: it checks every processed value for being negative, stores the status in a temporary variable, and calls sqrt(-1) once at the end if any of these values was negative.
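
One way to sanity-check this reasoning is to compare the final errno value of the scalar loop with that of the deferred variant. The sketch below is only an illustration under stated assumptions: an x86 target with SSE2, an element count that is a multiple of 2, and hypothetical function names errno_after_scalar and errno_after_deferred that do not come from GCC or from the report.

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics; assumes an x86 target

// Hypothetical reference: scalar loop, each failing sqrt call sets errno itself.
int errno_after_scalar(const double* in, double* out, std::size_t count)
{
    errno = 0;
    for (std::size_t i = 0; i < count; ++i)
        out[i] = std::sqrt(in[i]);
    return errno;
}

// Hypothetical deferred variant: vector square roots, one scalar call at the end.
int errno_after_deferred(const double* in, double* out, std::size_t count)
{
    assert(count % 2 == 0);  // assumption: no remainder handling in this sketch
    errno = 0;
    int negative_mask = 0;
    for (std::size_t i = 0; i < count; i += 2)
    {
        __m128d v = _mm_loadu_pd(&in[i]);  // unaligned load, no alignment assumed
        // Record which lanes are negative; NaN and -0.0 compare false here.
        negative_mask |= _mm_movemask_pd(_mm_cmplt_pd(v, _mm_setzero_pd()));
        _mm_storeu_pd(&out[i], _mm_sqrt_pd(v));
    }
    if (negative_mask)
    {
        // volatile keeps the compiler from folding or dropping the call
        volatile double minus_one = -1.0;
        volatile double sink = std::sqrt(minus_one);
        (void)sink;
    }
    return errno;
}
```

Both functions start from errno == 0, so equal return values for the same input mean that an observer after the loop cannot tell the two variants apart by looking at errno.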

I have created a small benchmark:

[code]
#include <benchmark/benchmark.h>
#include <math.h>
#include <emmintrin.h>

#define SIZE 160

double src[SIZE];
double dest[SIZE];

static void BM_sqrt(benchmark::State& state)
{
    for (auto _ : state)
    {
        for (int n = 0; n < SIZE; ++n)
            dest[n] = sqrt(src[n]);
        benchmark::ClobberMemory();
    }
}
// Register the function as a benchmark
BENCHMARK(BM_sqrt);

static void BM_sse_sqrt_errno(benchmark::State& state)
{
    for (auto _ : state)
    {
        int m = 0;
        for (int n = 0; n < SIZE; n += 2)
        {
            __m128d v = _mm_load_pd(&src[n]);
            __m128d vs = _mm_sqrt_pd(v);
            __m128d vn = _mm_cmplt_pd(v, _mm_setzero_pd());
            m |= _mm_movemask_pd(vn);
            _mm_store_pd(&dest[n], vs);
        }
        if (m)
            sqrt(-1.0);
        benchmark::ClobberMemory();
    }
}
// Register the function as a benchmark
BENCHMARK(BM_sse_sqrt_errno);

static void BM_sse_sqrt(benchmark::State& state)
{
    for (auto _ : state)
    {
        for (int n = 0; n < SIZE; n += 2)
        {
            __m128d v = _mm_load_pd(&src[n]);
            __m128d vs = _mm_sqrt_pd(v);
            _mm_store_pd(&dest[n], vs);
        }
        benchmark::ClobberMemory();
    }
}
// Register the function as a benchmark
BENCHMARK(BM_sse_sqrt);

BENCHMARK_MAIN();
[/code]

This code was compiled using gcc 4.8.5 with the following options:
g++ -std=c++11 -o test test.cc -O3 -I/benchmark/include/ -L/benchmark/lib/ -lbenchmark

Results for SIZE = 16 (loops unrolled):

---------------------------------------------------------
Benchmark                  Time           CPU Iterations
---------------------------------------------------------
BM_sqrt                   86 ns         86 ns    7188074
BM_sse_sqrt_errno         15 ns         15 ns   48084834
BM_sse_sqrt               15 ns         15 ns   47797778

Results for SIZE = 160 (loops not unrolled):

---------------------------------------------------------
Benchmark                  Time           CPU Iterations
---------------------------------------------------------
BM_sqrt                  995 ns        995 ns     839866
BM_sse_sqrt_errno        156 ns        156 ns    4348870
BM_sse_sqrt              144 ns        144 ns    4549107

As you can see, the results for BM_sse_sqrt_errno are much better than those for BM_sqrt (roughly 6 times faster in both runs) and close to those for BM_sse_sqrt. If the optimization implemented in BM_sse_sqrt_errno satisfies the error handling requirements for sqrt(), it is definitely worth implementing in gcc.