Bug 31485 - C complex numbers, amd64 SSE, missed optimization opportunity
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.1.2
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
Depends on: 35252
Blocks: vectorizer
Reported: 2007-04-05 12:29 UTC by Joel Yliluoma
Modified: 2020-04-21 09:06 UTC
CC: 10 users

Last reconfirmed: 2008-08-02 12:21:42


Description Joel Yliluoma 2007-04-05 12:29:39 UTC
Since "complex" essentially turns any basic type into a two-element vector type, complex-number addition and subtraction could use SSE instructions to operate on the real and imaginary parts simultaneously. (This applies only to addition and subtraction.)

Code:

#include <complex.h>

typedef float complex ss1;
typedef float ss2 __attribute__((vector_size(sizeof(ss1))));

ss1 add1(ss1 a, ss1 b) { return a + b; }
ss2 add2(ss2 a, ss2 b) { return a + b; }

Produces:

add1:
        movq    %xmm0, -8(%rsp)
        movq    %xmm1, -16(%rsp)
        movss   -4(%rsp), %xmm0
        movss   -8(%rsp), %xmm1
        addss   -12(%rsp), %xmm0
        addss   -16(%rsp), %xmm1
        movss   %xmm0, -20(%rsp)
        movss   %xmm1, -24(%rsp)
        movq    -24(%rsp), %xmm0
        ret
add2:
        movlps  %xmm0, -16(%rsp)
        movlps  %xmm1, -24(%rsp)
        movaps  -24(%rsp), %xmm0
        addps   -16(%rsp), %xmm0
        movaps  %xmm0, -56(%rsp)
        movlps  -56(%rsp), %xmm0
        ret

Command line:
    gcc -msse  -O3 -S test2.c
    (Results are the same with -ffast-math.)
Architecture:
CPU=AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
CPU features=fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy

GCC is:
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --enable-checking=release x86_64-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
Comment 1 Richard Biener 2007-04-09 18:47:51 UTC
Complex operations are lowered at the tree level, so this would require
vectorization of straight-line code.  Second, the calling conventions are different.
Comment 2 victork 2008-07-29 22:06:25 UTC
Revision 138198 fixes loop-aware SLP vectorization for the addition of complex numbers. So if the addition is done inside a loop, there is now a good chance that it will be vectorized.
Comment 3 Richard Biener 2008-08-02 12:21:42 UTC
Operations in loops should now be vectorized.  The original testcase is
probably not worth vectorizing due to calling convention problems (_Complex T
is not passed as a vector).

Complex lowering could generate vectorized code directly though for operations
not in a loop.
Comment 4 Uroš Bizjak 2008-08-02 13:00:15 UTC
(In reply to comment #3)
> Operations in loops should now be vectorized.  The original testcase is
> probably not worth vectorizing due to calling convention problems (_Complex T
> is not passed as a vector).

Not really. For some unknown reason, _Complex float is passed as a two-element vector in an SSE register. This introduces a (double!) store-forwarding penalty, since we have to split the value into an SSE pair before processing. This is wrong ABI design, as shown by comparing the generated code from the following example:

--cut here--
_Complex float testf (_Complex float a, _Complex float b)
{
  return a + b;
}

_Complex double testd (_Complex double a, _Complex double b)
{
  return a + b;
}
--cut here--

testf:
	movq	%xmm0, -8(%rsp)
	movq	%xmm1, -16(%rsp)
	movss	-8(%rsp), %xmm0
	movss	-4(%rsp), %xmm2
	addss	-16(%rsp), %xmm0
	addss	-12(%rsp), %xmm2
	movss	%xmm0, -24(%rsp)
	movss	%xmm2, -20(%rsp)
	movq	-24(%rsp), %xmm0
	ret

testd:
	addsd	%xmm3, %xmm1
	addsd	%xmm2, %xmm0
	ret
Comment 5 Richard Biener 2008-08-02 13:18:37 UTC
Doh, this is indeed completely broken ;)  I'll experiment with lowering
complex operations to vectorized form a bit.
Comment 6 Daniel Davis 2010-04-17 00:28:02 UTC
Has any work been done on this enhancement?  I'm using gcc 4.3.2, and I noticed that there is still limited use of SSE instructions for complex arithmetic.  

Unless I'm missing something in my understanding, wouldn't the ideal for all _Complex double additions with SSE2 be to use addpd, with movapd or movupd for the memory operations?

Comment 7 Richard Biener 2010-04-17 11:11:02 UTC
We now have basic-block vectorization, but it still works only on memory accesses
(visible at the GIMPLE level).  So it doesn't handle

add1 (ss1 a, ss1 b)
{
  float D.3164;
  float D.3163;
  float b$imag;
  float b$real;
  float a$imag;
  float a$real;
  ss1 D.3154;

<bb 2>:
  a$real_4 = REALPART_EXPR <a_1(D)>;
  a$imag_5 = IMAGPART_EXPR <a_1(D)>;
  b$real_6 = REALPART_EXPR <b_2(D)>;
  b$imag_7 = IMAGPART_EXPR <b_2(D)>;
  D.3163_8 = a$real_4 + b$real_6;
  D.3164_9 = a$imag_5 + b$imag_7;
  D.3154_3 = COMPLEX_EXPR <D.3163_8, D.3164_9>;
  return D.3154_3;

}

though maybe it could be taught to treat REALPART_EXPR/IMAGPART_EXPR as loads
and COMPLEX_EXPR as a store.  Ira?
Comment 8 Ira Rosen 2010-04-21 11:33:46 UTC
Yes, it's possible to add this to SLP. But I don't understand how
D.3154_3 = COMPLEX_EXPR <D.3163_8, D.3164_9>;
should be vectorized. D.3154_3 is complex, and the rhs will be a vector {D.3163_8, D.3164_9}. (Btw, we have to change float to double; otherwise we don't have complete vectors, and that is not supported.)
Comment 9 rguenther@suse.de 2010-04-21 11:44:20 UTC
Subject: Re:  C complex numbers, amd64 SSE, missed
 optimization opportunity

On Wed, 21 Apr 2010, irar at il dot ibm dot com wrote:

> ------- Comment #8 from irar at il dot ibm dot com  2010-04-21 11:33 -------
> Yes, it's possible to add this to SLP. But I don't understand how 
> D.3154_3 = COMPLEX_EXPR <D.3163_8, D.3164_9>;
> should be vectorized. D.3154_3 is complex and the rhs will be a vector
> {D.3163_8, D.3164_9} (btw, we have to change float to double, otherwise, we
> don't have complete vectors and this is not supported).

Depending on how D.3154_3 is used afterwards, it will be much like
an interleaved/strided store (if {D.3163_8, D.3164_9} is in xmm2 and the
complex is in the lower halves of the register pair xmm0 and xmm1,
we'd emit vec_extracts).  On the tree level we can probably
represent this as

 D.3154_3 = VIEW_CONVERT_EXPR <complex_double> (vec_temp_4);

where vec_temp_4 is the {D.3163_8, D.3164_9} vector.
Or similar, but with presently known-to-work trees

 realpart = BIT_FIELD_REF <0, ..> (vec_tmp_4);
 imagpart = BIT_FIELD_REF <64, ..> (vec_tmp_4);
 D.3154_3 = COMPLEX_EXPR <realpart, imagpart>;

One could also see the COMPLEX_EXPR as a root for SLP induction
vectorization (I suppose we don't do SLP induction at the moment,
induction in the sense that we pick arbitrary scalars and combine
them into vectors).

Richard.
Comment 10 Ira Rosen 2010-04-21 18:33:33 UTC
Thanks. So it is not always profitable and requires a cost model.
I am now working on a cost model for basic-block vectorization; I can look at this once we have one.
Comment 11 Joel Yliluoma 2020-04-21 06:43:36 UTC
Looks like this issue has taken a step or two *backwards* over the past years.

Whereas the second function used to be vectorized properly, today it seems neither of them is.

Contrast this with Clang, which compiles *both* functions into a single instruction:

  vaddps xmm0, xmm1, xmm0

or some variant thereof depending on the -m options.

Compiler Explorer link: https://godbolt.org/z/2AKhnt
Comment 12 Richard Biener 2020-04-21 07:07:18 UTC
(In reply to Joel Yliluoma from comment #11)
> Looks like this issue has taken a step or two *backwards* in the past years.
> 
> Where as the second function used to be vectorized properly, today it seems
> neither of them are.

Which version do you see vectorizing the second (add2) function?

> Contrast this with Clang, which compiles *both* functions into a single
> instruction:
> 
>   vaddps xmm0, xmm1, xmm0
> 
> or some variant thereof depending on the -m options.
> 
> Compiler Explorer link: https://godbolt.org/z/2AKhnt

The main issues on the GCC side are
  a) ABI details not exposed at the point of vectorization (several PRs about
     this exist)
  b) "Poor" support for two-element float vectors (an understatement: we have
     some support for MMX, but that's integer-only, and I'm not sure we've
     enabled the 3dnow part to be emulated with SSE)

Oddly enough, even with -mmmx -m3dnow I see add2 lowered by veclower, so
the vector type or the vector add must be unsupported(?).

LLVM is known to support emulating smaller vectors just fine (and is, by
design, also aware of ABI details).
Comment 13 Joel Yliluoma 2020-04-21 07:17:50 UTC
GCC 4.1.2 is indicated in the bug report headers.
Luckily, Compiler Explorer has a copy of that exact version, and it indeed vectorizes the second function: https://godbolt.org/z/DC_SSb

On my own system, the earliest version I have is 4.6. Compiler Explorer has 4.4, and neither it nor anything newer vectorizes either function.
Comment 14 Richard Biener 2020-04-21 07:37:56 UTC
(In reply to Joel Yliluoma from comment #13)
> GCC 4.1.2 is indicated in the bug report headers.
> Luckily, Compiler Explorer has a copy of that exact version, and it indeed
> vectorizes the second function: https://godbolt.org/z/DC_SSb
> 
> On my own system, the earliest I have is 4.6. The Compiler Explorer has 4.4,
> and it, or anything newer than that, no longer vectorizes either function.

Ah, OK - that's before GCC learned vectorization; the code is generated by
RTL-expanding

  return {BIT_FIELD_REF <a, 128, 0> + BIT_FIELD_REF <b, 128, 0>};

so the only vector support was GCC's generic vectors (and intrinsics).  The
generated code is far from perfect, though.  I also think LLVM's code
generation is bogus, since it appears the ABI does not guarantee zeroed
upper elements of the xmm0 argument, which means they could contain sNaNs:

typedef float ss2 __attribute__((vector_size(8)));
typedef float ss4 __attribute__((vector_size(16)));
ss2 add2(ss2 a, ss2 b);
void bar(ss4 a)
{
  volatile ss2 x;
  x = add2 ((ss2){a[0], a[1]}, (ss2){a[0], a[1]});
}

produces

bar:
.LFB1:  
        .cfi_startproc
        subq    $56, %rsp
        .cfi_def_cfa_offset 64
        movdqa  %xmm0, %xmm1
        call    add2
        movq    %xmm0, 24(%rsp)
        addq    $56, %rsp

which means we pass through 'a' unchanged.
Comment 15 Joel Yliluoma 2020-04-21 08:18:17 UTC
(In reply to Richard Biener from comment #14)
> I also think llvms code generation is bogus since it appears the ABI
> does not guarantee zeroed upper elements of the xmm0 argument
> which means they could contain sNaNs:

Why would it matter that the unused portions of the register contain NaNs?
Comment 16 Jakub Jelinek 2020-04-21 08:23:21 UTC
(In reply to Joel Yliluoma from comment #15)
> (In reply to Richard Biener from comment #14)
> > I also think llvms code generation is bogus since it appears the ABI
> > does not guarantee zeroed upper elements of the xmm0 argument
> > which means they could contain sNaNs:
> 
> Why would it matter that the unused portions of the register contain NaNs?

Because it could then raise exceptions that shouldn't be raised?
Comment 17 rguenther@suse.de 2020-04-21 08:29:01 UTC
On Tue, 21 Apr 2020, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485
> 
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |jakub at gcc dot gnu.org
> 
> --- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Joel Yliluoma from comment #15)
> > (In reply to Richard Biener from comment #14)
> > > I also think llvms code generation is bogus since it appears the ABI
> > > does not guarantee zeroed upper elements of the xmm0 argument
> > > which means they could contain sNaNs:
> > 
> > Why would it matter that the unused portions of the register contain NaNs?
> 
> Because it could then raise exceptions that shouldn't be raised?

Note it might be that LLVM actually zeroes the upper half at the caller
(in disagreement with GCC).  Maybe the psABI also specifies that this
should happen and GCC is wrong.  Just at the moment, interoperating
GCC and LLVM is prone to the issue mentioned above.
Comment 18 Jakub Jelinek 2020-04-21 08:32:52 UTC
Note, we could do movq %xmm0, %xmm0; movq %xmm1, %xmm1; addps %xmm1, %xmm0 for the first function in comment #4.
Comment 19 Jakub Jelinek 2020-04-21 08:33:52 UTC
CCing Micha and Honza on the ABI question.
Comment 20 Joel Yliluoma 2020-04-21 08:34:42 UTC
(In reply to Jakub Jelinek from comment #16)
> (In reply to Joel Yliluoma from comment #15)
> > (In reply to Richard Biener from comment #14)
> > > I also think llvms code generation is bogus since it appears the ABI
> > > does not guarantee zeroed upper elements of the xmm0 argument
> > > which means they could contain sNaNs:
> > 
> > Why would it matter that the unused portions of the register contain NaNs?
> 
> Because it could then raise exceptions that shouldn't be raised?

Which exceptions would be generated by data in an unused portion of a register? Does, for example, “addps” generate an exception if one or two of the operands contain NaNs? Which instructions would generate exceptions?

I can only think of divps when dividing by zero, but it does not seem that even LLVM compiles two-element vector division into divps.

If the register is passed as a parameter to a library function, it would not make judgments based on the values of the unused portions of the registers.
Comment 21 Jakub Jelinek 2020-04-21 08:43:06 UTC
(In reply to Joel Yliluoma from comment #20)
> Which exceptions would be generated by data in an unused portion of a
> register?

addps adds four float elements; there is no "unused" portion.
If some of the elements contain garbage, the addition can trigger e.g.
FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW or FE_INEXACT (FE_DIVBYZERO obviously isn't relevant to addition).
Please read the standard about floating-point exceptions, fenv.h, etc.
Comment 22 rguenther@suse.de 2020-04-21 08:47:58 UTC
On Tue, 21 Apr 2020, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485
> 
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |hjl.tools at gmail dot com,
>                    |                            |hubicka at gcc dot gnu.org,
>                    |                            |matz at gcc dot gnu.org
> 
> --- Comment #19 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> CCing Micha and Honza on the ABI question.

The arguments are class SSE (__m64), but I fail to find clarification
as to whether the "unused" parts of argument registers (the SSEUP part
of the %xmmN register) are supposed to be zeroed or have unspecified
contents.
Comment 23 Joel Yliluoma 2020-04-21 08:51:35 UTC
(In reply to Jakub Jelinek from comment #21)
> (In reply to Joel Yliluoma from comment #20)
> > Which exceptions would be generated by data in an unused portion of a
> > register?
> 
> addps adds 4 float elements, there is no "unused" portion.
> If some of the elements contain garbage, it can trigger for e.g. the addition
> FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW or FE_INEXACT (FE_DIVBYZERO obviously
> isn't relevant to addition).
> Please read the standard about floating point exceptions, fenv.h etc.

There is an “unused” portion as far as the data is concerned, just as with padding in structs: the memory is unused because no part of the program relies on its contents, even though the CPU may load those portions into registers when e.g. moving or copying the struct. The CPU doesn't know whether it's used or not.

You mention FE_INVALID etc., but those are concepts within the C standard library, not in the hardware. The C standard library will not make judgments about the upper portions of the register. So if you have two float[2]s, and you add them together into another float[2], and the compiler uses addps for this, what is the mechanism that would supposedly generate an exception, when no part of the software depends on or makes judgments about the irrelevant parts of the register?
Comment 24 Jakub Jelinek 2020-04-21 08:58:07 UTC
Bugzilla is not the right place to educate users.  Of course the C FE_* exceptions map to real hardware exceptions; on x86, read e.g. about the MXCSR register, and see in the description of each instruction which exceptions it can raise.
Comment 25 Joel Yliluoma 2020-04-21 09:06:49 UTC
(In reply to Jakub Jelinek from comment #24)
> on x86 read e.g. about MXCSR register and in the description of each
> instruction on which Exceptions it can raise.

So the quick answer to comment #15 is that the addps instruction may raise exceptions. OK, thanks for clearing that up; my bad. So it seems that LLVM relies on the assumption that the upper portions of the register are zeroed, which is what you said in the first place.