Bug 40770 - Vectorization of complex types, vectorization of sincos missing
Summary: Vectorization of complex types, vectorization of sincos missing
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.5.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
: 50534 (view as bug list)
Depends on:
Blocks: vectorizer 40766
  Show dependency treegraph
 
Reported: 2009-07-16 09:42 UTC by Richard Biener
Modified: 2020-09-22 11:31 UTC (History)
7 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2009-07-16 09:43:14


Attachments
test source file (651 bytes, text/plain)
2020-09-22 10:15 UTC, 康 珊
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Richard Biener 2009-07-16 09:42:04 UTC
The following should be vectorized with proper -mveclib:

float xf[1024];
float sf[1024];
float cf[1024];
void foo (void)
{
  int i;
  for (i = 0; i < 1024; ++i)
    {
      sf[i] = __builtin_sinf (xf[i]);
      cf[i] = __builtin_cosf (xf[i]);
    }
}

double xd[1024];
double sd[1024];
double cd[1024];
void bar (void)
{
  int i;
  for (i = 0; i < 1024; ++i)
    {
      sd[i] = __builtin_sin (xd[i]);
      cd[i] = __builtin_cos (xd[i]);
    }
}
Comment 1 Richard Biener 2009-07-16 09:44:27 UTC
The middle-end presents the vectorizer with

<bb 3>:
  # i_13 = PHI <i_7(4), 0(2)>
  # ivtmp.26_8 = PHI <ivtmp.26_16(4), 1024(2)>
  D.1623_3 = xd[i_13];
  sincostmp.21_1 = __builtin_cexpi (D.1623_3);
  D.1624_4 = IMAGPART_EXPR <sincostmp.21_1>;
  sd[i_13] = D.1624_4;
  D.1625_6 = REALPART_EXPR <sincostmp.21_1>;
  cd[i_13] = D.1625_6;
  i_7 = i_13 + 1;
  ivtmp.26_16 = ivtmp.26_8 - 1;
  if (ivtmp.26_16 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

which has first of all complex types (they should be recognized as V2DF
with vectorization factor 1, thus SLP-able).

For the float case

<bb 3>:
  # i_13 = PHI <i_7(4), 0(2)>
  # ivtmp.6_8 = PHI <ivtmp.6_16(4), 1024(2)>
  D.1610_3 = xf[i_13];
  sincostmp.1_1 = __builtin_cexpif (D.1610_3);
  D.1611_4 = IMAGPART_EXPR <sincostmp.1_1>;
  sf[i_13] = D.1611_4;
  D.1612_6 = REALPART_EXPR <sincostmp.1_1>;
  cf[i_13] = D.1612_6;
  i_7 = i_13 + 1;
  ivtmp.6_16 = ivtmp.6_8 - 1;
  if (ivtmp.6_16 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

they should be V2SF, thus use V4SF and vectorization factor 2.  Still use
SLP probably.
Comment 2 Ira Rosen 2009-07-16 12:29:12 UTC
pr40770.c:20: note: ==> examining statement: sincostmp.21_1 = __builtin_cexpi (D.1625_3);
pr40770.c:20: note: get vectype for scalar type:  complex double
pr40770.c:20: note: not vectorized: unsupported data-type complex double

make_vector_type returns NULL for this type.
Comment 3 rguenther@suse.de 2009-07-16 13:05:25 UTC
Subject: Re:  Vectorization of complex types,
 vectorization of sincos missing

On Thu, 16 Jul 2009, irar at il dot ibm dot com wrote:

> ------- Comment #2 from irar at il dot ibm dot com  2009-07-16 12:29 -------
> pr40770.c:20: note: ==> examining statement: sincostmp.21_1 = __builtin_cexpi
> (D.1625_3);
> pr40770.c:20: note: get vectype for scalar type:  complex double
> pr40770.c:20: note: not vectorized: unsupported data-type complex double
> 
> make_vector_type returns NULL for this type.

Yes - there is no vector type for complex double.  But the vectorizer
could query for a vector type for the complex component type (double)
and divide the vector element count by 2 (for complex) to get the
vectorization factor which would be 1 here.  Should SLP the be possible
for that loop?

Richard.
Comment 4 Tobias Burnus 2009-07-16 13:32:00 UTC
(In reply to comment #3)
> Yes - there is no vector type for complex double.  But the vectorizer
> could query for a vector type for the complex component type (double)
> and divide the vector element count by 2 (for complex) to get the
> vectorization factor which would be 1 here.

I do not know much about this, but wouldn't that fail if one wants to vectorize true complex functions such as ccosf (assuming that they are in principle vectorizable)?
Comment 5 rguenther@suse.de 2009-07-16 13:57:45 UTC
Subject: Re:  Vectorization of complex types,
 vectorization of sincos missing

On Thu, 16 Jul 2009, burnus at gcc dot gnu dot org wrote:

> ------- Comment #4 from burnus at gcc dot gnu dot org  2009-07-16 13:32 -------
> (In reply to comment #3)
> > Yes - there is no vector type for complex double.  But the vectorizer
> > could query for a vector type for the complex component type (double)
> > and divide the vector element count by 2 (for complex) to get the
> > vectorization factor which would be 1 here.
> 
> I do not know much about this, but wouldn't that fail if one wants to vectorize
> true complex functions such as ccosf (assuming that they are in principle
> vectorizable)?

Well, for ccosf we would have a vectorization factor of 2 left for V4SF.
Of course this assumes that we present the vectorizer with a vectorized
ccosf with the signature v4sf (*)(v4sf).  Or we would need to introduce
complex vector modes - which I'd rather avoid.

Richard.
Comment 6 Ira Rosen 2009-07-16 17:31:00 UTC
(In reply to comment #3)
> > make_vector_type returns NULL for this type.
> Yes - there is no vector type for complex double.  But the vectorizer
> could query for a vector type for the complex component type (double)
> and divide the vector element count by 2 (for complex) to get the
> vectorization factor which would be 1 here.  

I see.

> Should SLP the be possible
> for that loop?

Not with the current implementation - SLP needs strided stores to start. Here the stores are not even adjacent. I think, it would be better to vectorize this loop with regular loop-based vectorization to avoid permutations. I'll take a better look on Sunday.

Ira

> Richard.

Comment 7 Ira Rosen 2009-07-20 11:18:34 UTC
AFAIU, querying for the component type of complex type is not difficult to implement. 
I think, that loop-based vectorization is preferable here, so we should stay with vectorization factor of 2 for doubles.

The next problem is to vectorize 
  D.1611_4 = IMAGPART_EXPR <sincostmp.1_1>;
and
  D.1612_6 = REALPART_EXPR <sincostmp.1_1>;

Currently, we support only loads and stores with IMAGPART/REALPART_EXPR, vectorizing them as strided accesses, with extract odd and even operations for loads. So, we will have to support interleaving of non-memory variables. 

Does __builtin_cexpi have a vector implementation? If so, does it return two vectors?

If not, I guess, we need something like:

  sincostmp.1 = __builtin_cexpi (xd[i]);
  sincostmp.2 = __builtin_cexpi (xd[i+1]);
  v1 = VEC_EXTRACT_EVEN (sincostmp.1, sincostmp.2);
  v2 = VEC_EXTRACT_ODD (sincostmp.1, sincostmp.2);
  sf[i:i+1] = v1;
  cf[i:i+1] = v2;
  i = i + 2;

Or we can use the two vectors from vectorized __builtin_cexpi as parameters of extract operations.
Does that make sense?

Ira
Comment 8 Uroš Bizjak 2009-07-20 12:30:41 UTC
(In reply to comment #7)

> Does __builtin_cexpi have a vector implementation? If so, does it return two
> vectors?

No, only vectorized sincos is implemented (see also links at PR40766, Comment #11).
Comment 9 rguenther@suse.de 2009-07-20 12:55:24 UTC
Subject: Re:  Vectorization of complex types,
 vectorization of sincos missing

On Mon, 20 Jul 2009, irar at il dot ibm dot com wrote:

> 
> 
> ------- Comment #7 from irar at il dot ibm dot com  2009-07-20 11:18 -------
> AFAIU, querying for the component type of complex type is not difficult to
> implement. 
> I think, that loop-based vectorization is preferable here, so we should stay
> with vectorization factor of 2 for doubles.
> 
> The next problem is to vectorize 
>   D.1611_4 = IMAGPART_EXPR <sincostmp.1_1>;
> and
>   D.1612_6 = REALPART_EXPR <sincostmp.1_1>;
> 
> Currently, we support only loads and stores with IMAGPART/REALPART_EXPR,
> vectorizing them as strided accesses, with extract odd and even operations for
> loads. So, we will have to support interleaving of non-memory variables. 
> 
> Does __builtin_cexpi have a vector implementation? If so, does it return two
> vectors?

No, currently cexpi doesn't have a vectorized version.  We could add
an internal builtin for that that takes a vector as argument and
returns a vector with complex components.  And lower this during expansion
to a suitable available form (eventually just two calls).

> If not, I guess, we need something like:
> 
>   sincostmp.1 = __builtin_cexpi (xd[i]);
>   sincostmp.2 = __builtin_cexpi (xd[i+1]);
>   v1 = VEC_EXTRACT_EVEN (sincostmp.1, sincostmp.2);
>   v2 = VEC_EXTRACT_ODD (sincostmp.1, sincostmp.2);
>   sf[i:i+1] = v1;
>   cf[i:i+1] = v2;
>   i = i + 2;

Yes, that was my initial idea.

> Or we can use the two vectors from vectorized __builtin_cexpi as parameters of
> extract operations.
> Does that make sense?

Yes, I think so.  With a vectorized builtin we'd have

  v0 = xd[i:i+1];
  sincostmp.1 = __builtin_vect_cexpi (v0);
  v1 = VEC_EXTRACT_EVEN (sincostmp.1[0], sincostmp.1[1]);
  v2 = VEC_EXTRACT_ODD (sincostmp.1[0], sincostmp.1[1]);
  sf[i:i+1] = v1;
  cf[i:i+1] = v2;
  i = i + 2;

where sincostmp.1[0] would select the lower half of a V4DF and
sincostmp.1[1] the upper half of a V4DF.  But that's probably
more difficult as we'd have both V2DF and V4DF in the IL.

Richard.
Comment 10 vincenzo Innocente 2011-09-13 09:52:53 UTC
resurrecting this:
just checked with gcc version 4.7.0 20110910

-mveclibabi=svml -LwhereverYouhaveIntelSoftware/linux/x86_64/Compiler/11.1/072/lib/intel64/ -lsvml  -lirc

and sincos does not vectorize nor emits svml symbols (sin, cos independently are ok)
Comment 11 Richard Biener 2011-09-27 13:59:57 UTC
*** Bug 50534 has been marked as a duplicate of this bug. ***
Comment 12 Richard Biener 2020-09-22 06:35:13 UTC
*** Bug 97149 has been marked as a duplicate of this bug. ***
Comment 13 康 珊 2020-09-22 07:04:19 UTC
(In reply to Richard Biener from comment #12)
> *** Bug 97149 has been marked as a duplicate of this bug. ***

Will you look at this?
Comment 14 rguenther@suse.de 2020-09-22 08:11:01 UTC
On Tue, 22 Sep 2020, kangshan0910 at hotmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40770
> 
> --- Comment #13 from ? ? <kangshan0910 at hotmail dot com> ---
> (In reply to Richard Biener from comment #12)
> > *** Bug 97149 has been marked as a duplicate of this bug. ***
> 
> Will you look at this?

glibc needs to think of an ABI for OMP SIMD for this, then we can
autovectorize.  Alternatively we could of course use
vector sin and vector cos separately which might be better than
not vectorizing it at all?
Comment 15 Jakub Jelinek 2020-09-22 09:03:44 UTC
OpenMP has a way to express this,
#pragma omp declare simd notinbranch linear(y,z)
double sincos (double x, double *y, double *z);
As can be seen on -O2 -fopenmp-simd:
#pragma omp declare simd notinbranch linear(y, z)
double foo (double x, double *y, double *z);

double a[1024], b[1024], c[1024];

void
bar (void)
{
  #pragma omp simd
  for (int i = 0; i < 1024; i++)
    foo (a[i], &b[i], &c[i]);
}
it works fine with that.  But if #pragma omp simd is commented out, it is not, because the compiler has no assurance on what the function call behavior will be e.g. for data reference analysis (it could e.g. modify the global vars).
Now, for sincos proper, the compiler knows what the function does if not -fno-builtin-sincos.

Unfortunately, I think glibc currently defines sincos in math-vector.h the unuseful way with just simd ("notinbranch") attribute, which effectively means
that the caller is supposed to pass vectors of pointers (well integers with those sizes) and so needs scatter under the hood.
So the best thing would be to change glibc first.
Comment 16 Florian Weimer 2020-09-22 09:37:37 UTC
(In reply to Jakub Jelinek from comment #15)
> Unfortunately, I think glibc currently defines sincos in math-vector.h the
> unuseful way with just simd ("notinbranch") attribute, which effectively
> means
> that the caller is supposed to pass vectors of pointers (well integers with
> those sizes) and so needs scatter under the hood.
> So the best thing would be to change glibc first.

What exactly do we need to change? Factor out the declaration of sincos, so that we can apply “linear(__p1, __p2)” to it?

Should glibc provide a cexpi function? The complex result would avoid having to go through memory at all.
Comment 17 Richard Biener 2020-09-22 09:49:10 UTC
IIRC cepxi has the complication that we have to define the mangling for vector _Complex double.  But cexpi is what the vectorizer currently sees, of course
(we can of course vectorize it as sincos with arguments in memory again).

So yeah, glibc is lacking here.

Can one actually define multiple OMP SIMD variants for a single function?
Comment 18 Jakub Jelinek 2020-09-22 09:58:18 UTC
(In reply to Florian Weimer from comment #16)
> (In reply to Jakub Jelinek from comment #15)
> > Unfortunately, I think glibc currently defines sincos in math-vector.h the
> > unuseful way with just simd ("notinbranch") attribute, which effectively
> > means
> > that the caller is supposed to pass vectors of pointers (well integers with
> > those sizes) and so needs scatter under the hood.
> > So the best thing would be to change glibc first.
> 
> What exactly do we need to change? Factor out the declaration of sincos, so
> that we can apply “linear(__p1, __p2)” to it?

It could provide
#pragma omp declare simd linear(__p1, __p2)
double sincos (double __p0, double *__p1, double *__p2);
prototypes and assembly definition for that function.
Note, right now the simd attribute syntax doesn't allow specifying linear arguments, there have been talks about extending the syntax, but it hasn't been finished yet.  Looking at sysdeps/x86_64/fpu/*sincos*.S I actually see the correct symbols there already, just they aren't exported from the library (the symbols with *l4l4*, *l8l8* etc. in there).
So I guess we first need to extend the simd attribute syntax, then change glibc to use it and export those, and finally benefit.

> Should glibc provide a cexpi function? The complex result would avoid having
> to go through memory at all.

Not sure if that would help, because vectorization of complex types doesn't really work well in GCC unless they are lowered to scalar ops before vectorization.
Comment 19 Jakub Jelinek 2020-09-22 09:58:46 UTC
(In reply to Richard Biener from comment #17)
> Can one actually define multiple OMP SIMD variants for a single function?

Yes.
Comment 20 康 珊 2020-09-22 10:15:47 UTC
Created attachment 49255 [details]
test source file
Comment 21 康 珊 2020-09-22 10:19:53 UTC
(In reply to Jakub Jelinek from comment #15)
> OpenMP has a way to express this,
> #pragma omp declare simd notinbranch linear(y,z)
> double sincos (double x, double *y, double *z);
> As can be seen on -O2 -fopenmp-simd:
> #pragma omp declare simd notinbranch linear(y, z)
> double foo (double x, double *y, double *z);
> 
> double a[1024], b[1024], c[1024];
> 
> void
> bar (void)
> {
>   #pragma omp simd
>   for (int i = 0; i < 1024; i++)
>     foo (a[i], &b[i], &c[i]);
> }
> it works fine with that.  But if #pragma omp simd is commented out, it is
> not, because the compiler has no assurance on what the function call
> behavior will be e.g. for data reference analysis (it could e.g. modify the
> global vars).
> Now, for sincos proper, the compiler knows what the function does if not
> -fno-builtin-sincos.
> 
> Unfortunately, I think glibc currently defines sincos in math-vector.h the
> unuseful way with just simd ("notinbranch") attribute, which effectively
> means
> that the caller is supposed to pass vectors of pointers (well integers with
> those sizes) and so needs scatter under the hood.
> So the best thing would be to change glibc first.

Thanks for your reply.(In reply to Jakub Jelinek from comment #15)
> OpenMP has a way to express this,
> #pragma omp declare simd notinbranch linear(y,z)
> double sincos (double x, double *y, double *z);
> As can be seen on -O2 -fopenmp-simd:
> #pragma omp declare simd notinbranch linear(y, z)
> double foo (double x, double *y, double *z);
> 
> double a[1024], b[1024], c[1024];
> 
> void
> bar (void)
> {
>   #pragma omp simd
>   for (int i = 0; i < 1024; i++)
>     foo (a[i], &b[i], &c[i]);
> }
> it works fine with that.  But if #pragma omp simd is commented out, it is
> not, because the compiler has no assurance on what the function call
> behavior will be e.g. for data reference analysis (it could e.g. modify the
> global vars).
> Now, for sincos proper, the compiler knows what the function does if not
> -fno-builtin-sincos.
> 
> Unfortunately, I think glibc currently defines sincos in math-vector.h the
> unuseful way with just simd ("notinbranch") attribute, which effectively
> means
> that the caller is supposed to pass vectors of pointers (well integers with
> those sizes) and so needs scatter under the hood.
> So the best thing would be to change glibc first.

Thanks for your reply. I tried your method. But it still doesn't work for sincos. 

#pragma omp declare simd notinbranch linear(y, z)
void sincos(double x, double *y, double *z);

for (unsigned int i = 0; i < loopCount; i++)
    {
#pragma omp simd
        for (unsigned int j = i * dim; j < (i + 1) * dim; j++)
        {
            sincos(input_array[j], &result_array[j], &result_array1[j]);
        }
    }

$ gcc -m64 -g -O2 -fopenmp-simd -fPIC -Wall test_sincos.c -o test_sincos -lm -march=skylake-avx512 -ffast-math -ftree-loop-vectorize
$ ldd test_sincos
        linux-vdso.so.1 (0x00007ffd2393f000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f574f49d000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f574f2a2000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f574f601000)
My gcc version is "10.1.1 20200507".
And I have uploaded my test source file.
Comment 22 Richard Biener 2020-09-22 10:43:07 UTC
(In reply to 康 珊 from comment #21)
> $ gcc -m64 -g -O2 -fopenmp-simd -fPIC -Wall test_sincos.c -o test_sincos -lm
> -march=skylake-avx512 -ffast-math -ftree-loop-vectorize

Try adding -fno-builtin-sincos here

> $ ldd test_sincos
>         linux-vdso.so.1 (0x00007ffd2393f000)
>         libm.so.6 => /lib64/libm.so.6 (0x00007f574f49d000)
>         libc.so.6 => /lib64/libc.so.6 (0x00007f574f2a2000)
>         /lib64/ld-linux-x86-64.so.2 (0x00007f574f601000)
> My gcc version is "10.1.1 20200507".
> And I have uploaded my test source file.
Comment 23 康 珊 2020-09-22 11:14:48 UTC
It seems that gcc will use "_ZGVeN8vl8l8_sincos" which doesn't exit in libmvec.
/usr/bin/ld: /tmp/ccTYyEZM.o: in function `main':
/home/pnp/ks/test_sincos.c:38: undefined reference to `_ZGVeN8vl8l8_sincos'
/usr/bin/ld: /home/pnp/ks/test_sincos.c:38: undefined reference to `_ZGVeN8vl8l8_sincos'
collect2: error: ld returned 1 exit status

I'm using clear linux the glibc of which is 2.31 and I also tried Ubuntu 19.10 the glibc of which is 2.30.
Comment 24 rguenther@suse.de 2020-09-22 11:17:09 UTC
On September 22, 2020 1:14:48 PM GMT+02:00, kangshan0910 at hotmail dot com <gcc-bugzilla@gcc.gnu.org> wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40770
>
>--- Comment #23 from 康 珊 <kangshan0910 at hotmail dot com> ---
>It seems that gcc will use "_ZGVeN8vl8l8_sincos" which doesn't exit in
>libmvec.
>/usr/bin/ld: /tmp/ccTYyEZM.o: in function `main':
>/home/pnp/ks/test_sincos.c:38: undefined reference to
>`_ZGVeN8vl8l8_sincos'
>/usr/bin/ld: /home/pnp/ks/test_sincos.c:38: undefined reference to
>`_ZGVeN8vl8l8_sincos'
>collect2: error: ld returned 1 exit status

That's expected and the thing that needs to be fixed. 

>I'm using clear linux the glibc of which is 2.31 and I also tried
>Ubuntu 19.10
>the glibc of which is 2.30.
Comment 25 康 珊 2020-09-22 11:31:00 UTC
(In reply to rguenther@suse.de from comment #24)
> On September 22, 2020 1:14:48 PM GMT+02:00, kangshan0910 at hotmail dot com
> <gcc-bugzilla@gcc.gnu.org> wrote:
> >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40770
> >
> >--- Comment #23 from 康 珊 <kangshan0910 at hotmail dot com> ---
> >It seems that gcc will use "_ZGVeN8vl8l8_sincos" which doesn't exit in
> >libmvec.
> >/usr/bin/ld: /tmp/ccTYyEZM.o: in function `main':
> >/home/pnp/ks/test_sincos.c:38: undefined reference to
> >`_ZGVeN8vl8l8_sincos'
> >/usr/bin/ld: /home/pnp/ks/test_sincos.c:38: undefined reference to
> >`_ZGVeN8vl8l8_sincos'
> >collect2: error: ld returned 1 exit status
> 
> That's expected and the thing that needs to be fixed. 
> 
> >I'm using clear linux the glibc of which is 2.31 and I also tried
> >Ubuntu 19.10
> >the glibc of which is 2.30.

But it has _ZGVeN8vvv_sincos(__m512d, __m512i, __m512i), why gcc not use this call?