Considering that _Complex effectively turns any basic type into a two-element vector type, complex number addition and subtraction could use SSE instructions to operate on the real and imaginary parts simultaneously. (This applies only to addition and subtraction.)

Code:

    #include <complex.h>

    typedef float complex ss1;
    typedef float ss2 __attribute__((vector_size(sizeof(ss1))));

    ss1 add1(ss1 a, ss1 b) { return a + b; }
    ss2 add2(ss2 a, ss2 b) { return a + b; }

Produces:

    add1:
            movq    %xmm0, -8(%rsp)
            movq    %xmm1, -16(%rsp)
            movss   -4(%rsp), %xmm0
            movss   -8(%rsp), %xmm1
            addss   -12(%rsp), %xmm0
            addss   -16(%rsp), %xmm1
            movss   %xmm0, -20(%rsp)
            movss   %xmm1, -24(%rsp)
            movq    -24(%rsp), %xmm0
            ret

    add2:
            movlps  %xmm0, -16(%rsp)
            movlps  %xmm1, -24(%rsp)
            movaps  -24(%rsp), %xmm0
            addps   -16(%rsp), %xmm0
            movaps  %xmm0, -56(%rsp)
            movlps  -56(%rsp), %xmm0
            ret

Command line: gcc -msse -O3 -S test2.c (results are the same with -ffast-math)

Architecture: AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
CPU features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy

GCC is:
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --enable-checking=release x86_64-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
Complex operations are lowered at the tree level, so this would require vectorizing straight-line code. Second, the calling conventions are different.
Revision 138198 fixes loop-aware SLP vectorization for addition of complex numbers. So if the addition of complex numbers is done inside a loop, there is now a good chance that it will be vectorized.
Operations in loops should now be vectorized. The original testcase is probably not worth vectorizing due to calling-convention problems (_Complex T is not passed as a vector). Complex lowering could, however, generate vectorized code directly for operations that are not in a loop.
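To make the loop case concrete, here is a sketch of complex addition inside a loop, the shape that loop-aware SLP vectorization targets (the function name cadd is made up for illustration):

```c
#include <complex.h>

/* Complex addition over an array: the real/imaginary scalar adds of each
   element form adjacent groups that loop-aware SLP can combine into packed
   vector adds (e.g. addps on SSE). */
void cadd(float complex *restrict dst, const float complex *restrict a,
          const float complex *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Whether this actually gets vectorized at, say, -O3 -msse2 depends on the GCC version and its cost model.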
(In reply to comment #3)
> Operations in loops should now be vectorized. The original testcase is
> probably not worth vectorizing due to calling convention problems (_Complex T
> is not passed as a vector).

Not really. For some unknown reason, _Complex float is passed as a two-element vector in an SSE register. This introduces a (double!) store-forwarding penalty, since we have to split the value into an SSE pair before processing. This is wrong ABI design, as shown by comparing the code generated for the following example:

    --cut here--
    _Complex float testf (_Complex float a, _Complex float b)
    {
      return a + b;
    }

    _Complex double testd (_Complex double a, _Complex double b)
    {
      return a + b;
    }
    --cut here--

    testf:
            movq    %xmm0, -8(%rsp)
            movq    %xmm1, -16(%rsp)
            movss   -8(%rsp), %xmm0
            movss   -4(%rsp), %xmm2
            addss   -16(%rsp), %xmm0
            addss   -12(%rsp), %xmm2
            movss   %xmm0, -24(%rsp)
            movss   %xmm2, -20(%rsp)
            movq    -24(%rsp), %xmm0
            ret

    testd:
            addsd   %xmm3, %xmm1
            addsd   %xmm2, %xmm0
            ret
Doh, this is indeed completely broken ;) I'll experiment with lowering complex operations to vectorized form a bit.
Has any work been done on this enhancement? I'm using gcc 4.3.2, and I noticed that there is still only limited use of SSE instructions for complex arithmetic. Unless I'm missing something, wouldn't the ideal for all _Complex double additions with SSE2 be to use addpd, with movapd or movupd for the memory operations?
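For reference, the addpd/movupd sequence asked about above can be written by hand with SSE2 intrinsics; a sketch (the helper name cadd_pd is hypothetical, and a _Complex double is assumed to have the usual {real, imag} double[2] layout):

```c
#include <complex.h>
#include <emmintrin.h>

/* One addpd adds the real and imaginary parts of a _Complex double at once. */
double complex cadd_pd(const double complex *a, const double complex *b)
{
    __m128d va = _mm_loadu_pd((const double *)a); /* movupd: {real, imag} */
    __m128d vb = _mm_loadu_pd((const double *)b);
    __m128d vr = _mm_add_pd(va, vb);              /* addpd */
    double complex r;
    _mm_storeu_pd((double *)&r, vr);
    return r;
}
```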
We now have basic-block vectorization, but it still works only on memory accesses (visible at the GIMPLE level). So it doesn't handle

    add1 (ss1 a, ss1 b)
    {
      float D.3164;
      float D.3163;
      float b$imag;
      float b$real;
      float a$imag;
      float a$real;
      ss1 D.3154;

    <bb 2>:
      a$real_4 = REALPART_EXPR <a_1(D)>;
      a$imag_5 = IMAGPART_EXPR <a_1(D)>;
      b$real_6 = REALPART_EXPR <b_2(D)>;
      b$imag_7 = IMAGPART_EXPR <b_2(D)>;
      D.3163_8 = a$real_4 + b$real_6;
      D.3164_9 = a$imag_5 + b$imag_7;
      D.3154_3 = COMPLEX_EXPR <D.3163_8, D.3164_9>;
      return D.3154_3;
    }

though maybe it could be taught to treat REALPART_EXPR/IMAGPART_EXPR as loads and COMPLEX_EXPR as a store. Ira?
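The GIMPLE above corresponds roughly to this source-level form using GCC's __real__/__imag__ extensions and C11's CMPLXF; a scalar sketch of what the dump shows, not anything the vectorizer emits:

```c
#include <complex.h>

typedef float complex ss1;

/* Scalar equivalent of the GIMPLE dump: two REALPART/IMAGPART extractions
   per operand, two scalar adds, and a COMPLEX_EXPR rebuilding the result. */
ss1 add1_scalar(ss1 a, ss1 b)
{
    float ar = __real__ a, ai = __imag__ a;
    float br = __real__ b, bi = __imag__ b;
    return CMPLXF(ar + br, ai + bi);
}
```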
Yes, it's possible to add this to SLP. But I don't understand how

    D.3154_3 = COMPLEX_EXPR <D.3163_8, D.3164_9>;

should be vectorized. D.3154_3 is complex, and the rhs will be a vector {D.3163_8, D.3164_9}. (Btw, we have to change float to double; otherwise we don't have complete vectors, and this is not supported.)
Subject: Re: C complex numbers, amd64 SSE, missed optimization opportunity

On Wed, 21 Apr 2010, irar at il dot ibm dot com wrote:

> ------- Comment #8 from irar at il dot ibm dot com 2010-04-21 11:33 -------
> Yes, it's possible to add this to SLP. But I don't understand how
> D.3154_3 = COMPLEX_EXPR <D.3163_8, D.3164_9>;
> should be vectorized. D.3154_3 is complex and the rhs will be a vector
> {D.3163_8, D.3164_9} (btw, we have to change float to double, otherwise, we
> don't have complete vectors and this is not supported).

Depending on how D.3154_3 is used afterwards, it will be much like an interleaved/strided store (if {D.3163_8, D.3164_9} is in xmm2 and the complex is in the lower halves of the register pair xmm0 and xmm1, we'd emit vec_extracts). On the tree level we can probably represent this as

    D.3154_3 = VIEW_CONVERT_EXPR <complex_double> (vec_temp_4);

where vec_temp_4 is the {D.3163_8, D.3164_9} vector. Or similar, but with presently known-to-work trees:

    realpart = BIT_FIELD_REF <0, ..> (vec_tmp_4);
    imagpart = BIT_FIELD_REF <64, ..> (vec_tmp_4);
    D.3154_3 = COMPLEX_EXPR <realpart, imagpart>;

One could also see the COMPLEX_EXPR as a root for SLP induction vectorization (I suppose we don't do SLP induction at the moment; induction in the sense that we pick arbitrary scalars and combine them into vectors).

Richard.
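At the source level, the VIEW_CONVERT_EXPR idea amounts to reinterpreting a two-float vector as a _Complex float in place. A union-based sketch (the helper name vec_to_complex is made up; it assumes the usual layout where both types are 8 bytes with the real part first):

```c
#include <complex.h>

typedef float v2sf __attribute__((vector_size(8)));

/* Type-pun a 2-float vector into a _Complex float without moving data,
   analogous to VIEW_CONVERT_EXPR on the tree level. */
static inline float complex vec_to_complex(v2sf v)
{
    union { v2sf v; float complex c; } u = { .v = v };
    return u.c;
}
```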
Thanks. So it is not always profitable and requires a cost model. I am now working on a cost model for basic-block vectorization; I can look at this once we have one.
Looks like this issue has taken a step or two *backwards* over the past years.

Whereas the second function used to be vectorized properly, today it seems neither of them is.

Contrast this with Clang, which compiles *both* functions into a single instruction:

    vaddps xmm0, xmm1, xmm0

or some variant thereof depending on the -m options.

Compiler Explorer link: https://godbolt.org/z/2AKhnt
(In reply to Joel Yliluoma from comment #11)
> Looks like this issue has taken a step or two *backwards* in the past years.
>
> Where as the second function used to be vectorized properly, today it seems
> neither of them are.

Which version do you see vectorizing the second (add2) function?

> Contrast this with Clang, which compiles *both* functions into a single
> instruction:
>
> vaddps xmm0, xmm1, xmm0
>
> or some variant thereof depending on the -m options.
>
> Compiler Explorer link: https://godbolt.org/z/2AKhnt

The main issues on the GCC side are

a) ABI details are not exposed at the point of vectorization (several PRs about this exist)

b) "poor" support for two-element float vectors (an understatement; we have some support for MMX, but that's integer only, and I'm not sure we've enabled the 3dnow part to be emulated with SSE)

Oddly enough, even with -mmmx -m3dnow I see add2 lowered by veclower, so the vector type or the vector add must be unsupported(?). LLVM is known to support emulating smaller vectors just fine (and by design is also aware of ABI details).
GCC 4.1.2 is indicated in the bug report headers. Luckily, Compiler Explorer has a copy of that exact version, and it indeed vectorizes the second function: https://godbolt.org/z/DC_SSb

On my own system, the earliest version I have is 4.6. Compiler Explorer has 4.4, and neither it nor anything newer vectorizes either function.
(In reply to Joel Yliluoma from comment #13)
> GCC 4.1.2 is indicated in the bug report headers.
> Luckily, Compiler Explorer has a copy of that exact version, and it indeed
> vectorizes the second function: https://godbolt.org/z/DC_SSb
>
> On my own system, the earliest I have is 4.6. The Compiler Explorer has 4.4,
> and it, or anything newer than that, no longer vectorizes either function.

Ah, OK - that's before GCC learned vectorization, and that code is generated by RTL expansion of

    return {BIT_FIELD_REF <a, 128, 0> + BIT_FIELD_REF <b, 128, 0>};

so the only vector support was GCC's generic vectors (and intrinsics). The generated code is far from perfect, though.

I also think llvm's code generation is bogus, since it appears the ABI does not guarantee zeroed upper elements of the xmm0 argument, which means they could contain sNaNs:

    typedef float ss2 __attribute__((vector_size(8)));
    typedef float ss4 __attribute__((vector_size(16)));

    ss2 add2(ss2 a, ss2 b);

    void bar(ss4 a)
    {
      volatile ss2 x;
      x = add2 ((ss2){a[0], a[1]}, (ss2){a[0], a[1]});
    }

produces

    bar:
    .LFB1:
            .cfi_startproc
            subq    $56, %rsp
            .cfi_def_cfa_offset 64
            movdqa  %xmm0, %xmm1
            call    add2
            movq    %xmm0, 24(%rsp)
            addq    $56, %rsp

which means we pass through 'a' unchanged.
(In reply to Richard Biener from comment #14)
> I also think llvms code generation is bogus since it appears the ABI
> does not guarantee zeroed upper elements of the xmm0 argument
> which means they could contain sNaNs:

Why would it matter that the unused portions of the register contain NaNs?
(In reply to Joel Yliluoma from comment #15)
> (In reply to Richard Biener from comment #14)
> > I also think llvms code generation is bogus since it appears the ABI
> > does not guarantee zeroed upper elements of the xmm0 argument
> > which means they could contain sNaNs:
>
> Why would it matter that the unused portions of the register contain NaNs?

Because it could then raise exceptions that shouldn't be raised?
On Tue, 21 Apr 2020, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485
>
> --- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Joel Yliluoma from comment #15)
> > (In reply to Richard Biener from comment #14)
> > > I also think llvms code generation is bogus since it appears the ABI
> > > does not guarantee zeroed upper elements of the xmm0 argument
> > > which means they could contain sNaNs:
> >
> > Why would it matter that the unused portions of the register contain NaNs?
>
> Because it could then raise exceptions that shouldn't be raised?

Note it might be that llvm actually zeros the upper half at the caller (in disagreement with GCC). Maybe the psABI also specifies that this should happen and GCC is wrong. Just at the moment, interoperating GCC and LLVM is prone to the above-mentioned issue.
Note, for the first function in comment #4 (testf) we could do movq %xmm0, %xmm0; movq %xmm1, %xmm1; addps %xmm1, %xmm0.
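In intrinsics, that movq-then-packed-add idea looks roughly like the sketch below (the function name is made up). movq between xmm registers zero-extends the low 64 bits, so the upper lanes add 0.0 + 0.0 and cannot raise spurious exceptions:

```c
#include <emmintrin.h>

/* Zero the upper halves, then one packed add handles both complex parts. */
__m128 add_low_pair(__m128 a, __m128 b)
{
    a = _mm_castsi128_ps(_mm_move_epi64(_mm_castps_si128(a))); /* movq %xmm0, %xmm0 */
    b = _mm_castsi128_ps(_mm_move_epi64(_mm_castps_si128(b))); /* movq %xmm1, %xmm1 */
    return _mm_add_ps(a, b);                                   /* addps */
}
```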
CCing Micha and Honza on the ABI question.
(In reply to Jakub Jelinek from comment #16)
> (In reply to Joel Yliluoma from comment #15)
> > (In reply to Richard Biener from comment #14)
> > > I also think llvms code generation is bogus since it appears the ABI
> > > does not guarantee zeroed upper elements of the xmm0 argument
> > > which means they could contain sNaNs:
> >
> > Why would it matter that the unused portions of the register contain NaNs?
>
> Because it could then raise exceptions that shouldn't be raised?

Which exceptions would be generated by data in an unused portion of a register? Does, for example, addps generate an exception if one or two of the operands contain NaNs? Which instructions would generate exceptions? I can only think of divps, when dividing by zero, but it does not seem that even LLVM compiles the two-element vector division into divps.

If the register is passed as a parameter to a library function, it would not make judgments based on the values of the unused portions of the registers.
(In reply to Joel Yliluoma from comment #20)
> Which exceptions would be generated by data in an unused portion of a
> register?

addps adds 4 float elements; there is no "unused" portion. If some of the elements contain garbage, the addition can trigger e.g. FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW or FE_INEXACT (FE_DIVBYZERO obviously isn't relevant to addition). Please read the standard about floating-point exceptions, fenv.h, etc.
On Tue, 21 Apr 2020, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485
>
> --- Comment #19 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> CCing Micha and Honza on the ABI question.

The arguments are class SSE (__m64), but I fail to find clarification as to whether the "unused" parts of the argument registers (the SSEUP part of the %xmmN register) are supposed to be zeroed or have unspecified contents.
(In reply to Jakub Jelinek from comment #21)
> (In reply to Joel Yliluoma from comment #20)
> > Which exceptions would be generated by data in an unused portion of a
> > register?
>
> addps adds 4 float elements, there is no "unused" portion.
> If some of the elements contain garbage, it can trigger for e.g. the addition
> FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW or FE_INEXACT (FE_DIVBYZERO obviously
> isn't relevant to addition).
> Please read the standard about floating point exceptions, fenv.h etc.

There is an "unused" portion, for the purposes of how the data is used. It is the same as with padding in structs: the memory is unused because no part of the program relies on its contents, even though the CPU may load those portions into registers when, e.g., moving and copying the struct. The CPU won't know whether it's used or not.

You mention FE_INVALID etc., but those are concepts within the C standard library, not in the hardware. The C standard library will not make judgments on the upper portions of the register.

So if you have two float[2]s, and you add them together into another float[2], and the compiler uses addps to achieve this task, what is the mechanism that would supposedly generate an exception, when no part of the software depends on or makes judgments about the irrelevant parts of the register?
Bugzilla is not the right place to educate users. Of course the C FE_* exceptions map to real hardware exceptions; on x86, read e.g. about the MXCSR register, and the description of each instruction lists which exceptions it can raise.
(In reply to Jakub Jelinek from comment #24)
> on x86 read e.g. about MXCSR register and in the description of each
> instruction on which Exceptions it can raise.

So the quick answer to #15 is that the addps instruction may raise exceptions. Ok, thanks for clearing that up. My bad. So it seems that LLVM relies on the assumption that the upper portions of the register are zeroed, which is what you said in the first place.
Note there might be a dup of this bug somewhere too.
And there are two issues here: one is that SLP vectorization is not happening, and the other concerns argument and return-value passing.
*** Bug 104406 has been marked as a duplicate of this bug. ***
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630011.html