When comparing the kind of code ICC outputs vs gcc, it's really obvious gcc could make better use of x86's baroque addressing modes. More specifically, I rarely ever see it using the *8 scale factor, even when addressing nicely power-of-2 sized stuff, and that's definitely a performance problem when dealing with those large SSE vectors. In that testcase the *8 scale factor is only used once, and even if it's questionable whether a fancier mode would help in this particular testcase, there's no doubt it would in the Real World. Also, take note of the horrible code for massage48; in the Real World it's even worse:

4012a6: mov    $0x30,%edx
4012af: imul   %edx,%eax

That's not from the testcase; that's in a loop and edx gets reloaded each time. Tested with today's cvs and something like -O3 -march=k8 -fomit-frame-pointer -mfpmath=sse and -O3 -march=pentium4 -fomit-frame-pointer -mfpmath=sse.
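(For context, the multiply by 0x30 comes from indexing an array of 48-byte records; a minimal sketch of the pattern, with hypothetical names not taken from the attached testcase:)

struct rec48 { float data[12]; };        /* hypothetical 48-byte record */

float first_component(const struct rec48 *r, int i)
{
    /* the address is r + i*48, which gcc computes with the imul shown above */
    return r[i].data[0];
}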
Created attachment 8095 [details] Various address generation snippets
*** This bug has been marked as a duplicate of 18463 ***
No, Steven, this is a different problem. Note that there are not two memory references, as in the other PR.
That said, what are you expecting here for massage48? On K8, the latency of imul for a 32-bit register operand is 3 cycles. Alternatively, since 48 = 3 * 16, we can break this down into

leal (%eax,%eax,2), %eax
sall $4, %eax		# Use eax unscaled

which is 2 cycles latency for the leal and one for the sall, so gcc rightly chooses to use the single multiply instruction, since the alternative is no cheaper. On pentium4, timings are different, and we select the leal+sall sequence.

I don't see anything in this test case at all that could be made to use more complex addressing modes. As for the "Real World" code that uses a variable imul instead of a constant, that would be more interesting to examine.
I'm sorry for providing such a poor testcase. Here's the kind of *48 sequence I'm seeing with k8 codegen; it happens at a point where there's quite some register pressure, and it really doesn't help that another register is needed (gcc has to push/pop whereas icc doesn't).

402520: mov    (%esi),%ecx
402522: mov    $0x30,%eax
402527: mov    0x3c(%esp),%ebx
40252b: add    $0x4,%esi
40252e: imul   %eax,%ecx
402531: add    0x94(%esp),%ecx
402538: mov    (%ecx),%edi

After that there's a long string of vector operations where gcc computes all addresses upfront (with shifts) and then uses a *1 scale factor; ICC only uses *8 for that and is faster.
Created attachment 8105 [details] the whole deal

Ok, here's a self-contained excerpt. There's a variable imul, simple addressing, and some really weird stuff happening to the hit_t structure: it seems that gcc tries to cache it locally but then updates both the cached and original versions. Take note that this intersect-leaf function is part of a bigger function that searches for leaves etc.; in the real app there are even more registers being spilled. It's not pretty, and it's called millions of times per second.
Ah hah. This is a bit of "cleverness" in the backend. It turns out that for K8, imul with an 8-bit immediate is vector decoded, and imul with a register is direct decoded. In theory, splitting out the constant allows greater throughput through the instruction decoder.

Now, we *only* do this split after register allocation, and only if there is in fact a free register. So in no case should this be causing more register spills.

As an aside, if we had proper control over the exact instruction form being used, as opposed to yielding that control to the assembler, we'd emit the imul with a 32-bit immediate (0x69 db 30 00 00 00), since that's direct decoded too. But emitting machine code directly from the compiler is pie-in-the-sky territory.

Other than that, I guess the rest is just generic register-allocation sucking. Fancy attaching the assembly produced with icc for the second test case, so that we know what we're aiming for?
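To make the alternatives concrete, here is a rough sketch of the three imul forms under discussion (the byte encodings are my own reading of the opcode tables, not taken from the testcase):

imul   $0x30,%eax,%eax      # 6b c0 30            8-bit immediate form: vector decoded on K8;
                            #                     this is what gas emits for a small constant
imul   $0x30,%eax,%eax      # 69 c0 30 00 00 00   32-bit immediate form: direct decoded, but
                            #                     we can't force this encoding from the compiler
mov    $0x30,%edx           # the post-RA split gcc currently does: the register form of imul
imul   %edx,%eax            # is direct decoded, but it burns a scratch register (%edx here)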
Created attachment 8107 [details] icc asm dump

Here you are.
But I had to rewrite the hit_t structure in a way much closer to what's found in the original source to avoid the same useless cloning I noted earlier with gcc. Something like:

union float4_t {
  float f[4];
  __m128 v;
  operator const __m128() const { return v; }
  operator __m128() { return v; }
};

union int4_t {
  int i[4];
  __m128i v;
  operator const __m128i() const { return v; }
  operator __m128i() { return v; }
};

struct hit_t {
  float4_t t, u, v;
  int4_t id;
};

to avoid this:

40125f: andnps %xmm4,%xmm0
401262: orps   0x40(%esp),%xmm0
401267: movaps %xmm0,0x40(%esp)
40126c: movaps %xmm0,0x10(%eax)

I really don't get why it's happening, and I'm pretty sure it wasn't there with some previous version of gcc. I'd seriously need a clue. That aside, ICC's output is much cleaner and, despite the P4 way of computing addresses, faster on my K8.
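For what it's worth, the hot-path update through such a union looks roughly like this (a hypothetical sketch assuming <xmmintrin.h>; the names mask and t_new are mine, not from the attachment), which is why the extra movaps to a stack copy of hit_t is pure overhead:

/* hypothetical masked update of the hit record through the vector member */
void update_t(hit_t &hit, __m128 mask, __m128 t_new)
{
    /* blend: keep old lanes where mask is zero, take t_new where mask is all-ones,
       matching the andnps/orps pair in the dump above */
    hit.t.v = _mm_or_ps(_mm_andnot_ps(mask, hit.t.v),
                        _mm_and_ps(mask, t_new));
}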
Created attachment 8108 [details] Take #2

Hmm. I'm attaching another asm dump for ICC because it spits better code for the function as found in the original app. Got to love those temperamental compilers.
Ok, I see what Intel is doing. It's computing an index by 16 by doing

addl %ecx,%ecx
movl (%ebx,%ecx,8), %eax

instead of

sall $4, %ecx
movl (%ebx,%ecx), %eax

(the doubled index with a *8 scale addresses the same %ebx + 16*%ecx), which, considering the suckitude of the P4 shifter, is a win. It should not be a win for any other cpu.
Yes, that's not a win per se, but even with those "unrolled" address computations its encodings end up generally tighter, i.e.:

gcc:
40114d: c1 e1 04              shl    $0x4,%ecx
401150: 8d 41 30              lea    0x30(%ecx),%eax
...
40115a: 0f 58 0c 07           addps  (%edi,%eax,1),%xmm1
...
40116f: 0f 58 04 0f           addps  (%edi,%ecx,1),%xmm0

icc:
4236e4: 03 ed                 add    %ebp,%ebp
4236e6: 0f 28 64 ef 30        movaps 0x30(%edi,%ebp,8),%xmm4
4236eb: 0f 28 0c ef           movaps (%edi,%ebp,8),%xmm1

Small win (and it's hard to follow, as they schedule things very differently and gcc touches the stack a lot more), but it could be even better if shifting were allowed. And in such a lengthy loop, decoding bandwidth is scarce. If gcc weren't so greedily trying to precompute indexes and offsets...

Could you tell me why gcc feels obliged to make a local copy of the hit_t structure on the stack and then update both the copy and the original? Ideally I'd like it to not try to outsmart me :) (or maybe I'm missing something obvious).
Ah! It seems that another temporary isn't eliminated, much like http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19274, this time with _mm_set1_epi32.

40129b: 89 44 24 1c           mov    %eax,0x1c(%esp)
40129f: 66 0f 6e 64 24 1c     movd   0x1c(%esp),%xmm4
4012a5: 66 0f 70 cc 00        pshufd $0x0,%xmm4,%xmm1
If you're still looking at K8, moving through memory is two cycles faster than moving directly between the general register file and the sse register file.
Yes, and I'm not asking for a GPR->SSE transfer. What I'm asking is why gcc feels the urge to copy that memory reference to the stack before fooling around with it. The full sequence is:

401298: 8b 42 28              mov    0x28(%edx),%eax
40129b: 89 44 24 1c           mov    %eax,0x1c(%esp)
40129f: 66 0f 6e 64 24 1c     movd   0x1c(%esp),%xmm4
4012a5: 66 0f 70 cc 00        pshufd $0x0,%xmm4,%xmm1

whereas ICC does:

40241d: 66 0f 6e 44 ca 28     movd   0x28(%edx,%ecx,8),%xmm0
40242c: 66 0f 70 c0 00        pshufd $0x0,%xmm0,%xmm0

I've opened http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19714 for this.
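For reference, the kind of source that hits this is just a broadcast of an int loaded from memory; a minimal sketch (hypothetical function name; the real testcase is attached to PR 19714):

#include <emmintrin.h>

/* broadcast an int field that lives in memory */
__m128i broadcast_field(const int *p)
{
    /* gcc currently bounces *p through a stack slot before the movd;
       icc emits a movd from memory plus pshufd directly */
    return _mm_set1_epi32(*p);
}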
Created attachment 8113 [details] intersect_packet asm dump, icc
Created attachment 8114 [details] intersect_packet asm dump, gcc

I've double-checked what I was doing in the app and in that testcase. In fact, in the app I force gcc to not inline the intersect_packet function, because otherwise the generated code makes no sense at all: silly address generation, a local copy of one structure... you name it.

The function takes 353 bytes for ICC and 387 bytes for gcc. If we factor out the _mm_set1_epi32 problem that's 387 - 12 = 375 bytes, or +22. Looking at SSE register movements (movaps), there are 10 for ICC and 18 for gcc, i.e. +8 for gcc (at 3 bytes apiece), or 24 bytes. So besides scheduling issues (I'd really like gcc to respect my load ordering like ICC does; with a 48-byte size there might be 2 cache lines addressed), the main difference seems to be related to register allocation. I wonder why it's so much worse when inlined, as that in practice requires working around it by adding a call on top.
Created attachment 8115 [details] mark builtins constant

This patch marks builtins const or pure as appropriate, and all of them nothrow. This could result in slightly better code out of the tree optimizers. It could also result in higher register pressure.
Created attachment 8116 [details] hack sse simode inputs

This patch lies a bit to the rtl optimizers, saying that some of the sse insns take memory inputs when they don't. The goal here is to convince the rtl optimizers to feed memory inputs to these instructions directly, which removes a register temporary, and thus the ability of the register allocator to screw up and assign that temporary to the wrong register class.
I think you'll also want to try using -fno-gcse. The gcse pass is hoisting values out of your loop (as it is supposed to), except that we don't have enough registers to hold it all, so the values get spilled back to the stack. This is a known problem. If we had anything approaching a decent register allocator, we'd be able to move these values (known to be loop invariant) back inside the loop instead of spilling them. Of course, we've got multiple passes that hoist values out of loops, and using -fno-gcse only prevents half of the spilled values from being moved.
-fno-gcse is a godsend: instant speedup, and most of the silliness when inlining is gone. Now I've applied both your patches, and while they're promising they also trigger their own nastiness; gcc is so fond of memory inputs that it dumps stuff on the stack only to address it some instructions later (well, that's my interpretation :). For example,

4010c3: 0f 28 6c 13 30        movaps %xmm5,0x30(%ebx,%edx,1)
4010c3: 0f 28 6c 13 30        movaps 0x30(%ebx,%edx,1),%xmm5
4010c8: 0f 28 f9              movaps %xmm1,%xmm7
4010cb: 0f 28 cb              movaps %xmm3,%xmm1
4010ce: 0f 29 6c 24 10        movaps %xmm5,0x10(%esp)
4010d3: 0f 59 ce              mulps  %xmm6,%xmm1
4010d6: 0f 59 c4              mulps  %xmm4,%xmm0
4010d9: 0f 28 6c 16 30        movaps 0x30(%esi,%edx,1),%xmm5
4010de: 0f 59 5c 24 10        mulps  0x10(%esp),%xmm3

or

40119d: 0f c2 c1 01           cmpltps %xmm1,%xmm0
4011a1: 0f 29 04 24           movaps  %xmm0,(%esp)
4011a5: 0f 28 c5              movaps  %xmm5,%xmm0
4011a8: 0f c2 c1 01           cmpltps %xmm1,%xmm0
4011ac: 0f 28 c8              movaps  %xmm0,%xmm1
4011af: 0f 56 0c 24           orps    (%esp),%xmm1

Other than those quirks, it looks better to me. Just to be sure, I've tried that patched version on my app, and it's slower than the unpatched version (both with -fno-gcse).
Hmm, there's something fishy with _mm_set1_epi32. With your patches there's no stack copy anymore but, with the http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19714 testcase, I get:

00401080 <eliminated(int)>:
  401080: 66 0f 6e 4c 24 04   movd   0x4(%esp),%xmm1
  401086: 66 0f 70 c1 00      pshufd $0x0,%xmm1,%xmm0
  40108b: c3                  ret

00401090 <not_eliminated(int const*)>:
  401090: 8b 44 24 04         mov    0x4(%esp),%eax
  401094: 66 0f 6e 08         movd   (%eax),%xmm1
  401098: 66 0f 70 c1 00      pshufd $0x0,%xmm1,%xmm0
  40109d: c3                  ret

00401050 <not_eliminated_bis(int const&)>:
  401050: 8b 44 24 04         mov    0x4(%esp),%eax
  401054: 66 0f 6e 08         movd   (%eax),%xmm1
  401058: 66 0f 70 c1 00      pshufd $0x0,%xmm1,%xmm0
  40105d: c3                  ret

... and that's quite bogus :)
(In reply to comment #22)

No, it isn't. Look at your functions again. The assembly that you pasted is 100% perfect. You cannot improve on that in any way.
(In reply to comment #21)
> 4010ce: 0f 29 6c 24 10        movaps %xmm5,0x10(%esp)
> 4010de: 0f 59 5c 24 10        mulps  0x10(%esp),%xmm3

> 4011a1: 0f 29 04 24           movaps %xmm0,(%esp)
> 4011af: 0f 56 0c 24           orps   (%esp),%xmm1

These instruction patterns didn't change. I only fibbed wrt "int" inputs. It's not impossible that some of this is due to adding the const to the builtin signatures, but...

> Just to be sure, I've tried that patched version on my app, and it's slower
> than the unpatched version (both with -fno-gcse).

So, overall the combination of the two patches is a lose? How about each patch individually?
Oops, my bad. I thought pshufd mixed both operands à la shufps; I'm obviously not familiar with the integer side of SSE.

And yes, the combination is a lose, albeit a small one, around 3%. But I'm timing the whole thing, not just that intersection code. To give you an idea, here's the peak perf for a given tree.

intersection inlined:
  ICC: 16.4 fps
  gcc: 14.7 unpatched, 14.4 patched
  gcc & -fno-gcse: 14.8 unpatched, 14.5 patched

intersection not inlined:
  gcc: 14.9 unpatched, 13.7 patched
  gcc & -fno-gcse: 14.8 unpatched, 14.0 patched

The problem is that the surrounding code (kdtree search) also changes a lot, and gcc doesn't find an optimal way for that either. What's reassuring is that ICC isn't much smarter than gcc on the tree search. Each compiler comes up with a nice trick here and there, but neither gets the whole picture right. Still, ICC doesn't suck as much on average. I'm going to replace all that tree search with asm to settle the issue someday. BTW the difference between ICC and GCC is around ~5% on the scalar path.

I'm going to try each patch in turn as requested.
Created attachment 8119 [details] modify modes_tieable_p for sse

One more thing to try, to induce the register allocator to not be quite so stupid. At least within a single basic block, anyway. But these short-lived registers seem to be the bulk of the complete moronitude to begin with. In particular I'd like to know if this patch makes d-19680-2 unnecessary.
In the previous tests I'd used a crufty string of compilation options; I've removed all that crap in favour of -O3 -march=k8 -mfpmath=sse -fno-gcse -fno-exceptions. The second patch, hack sse simode inputs, is a small win or at least a draw, inlined or not, -fno-gcse or not. The first one, mark builtins constant, is a lose in every condition; it's the one that triggers those stack references/movements noted earlier (and my guess is that's what's causing the perf regression; other than that, the code seems prettier).
Wow! We've got a winner: 15.8 fps with -fno-gcse, inlining, and only d-19680-3.

402680: 66 0f 6f d1           movdqa %xmm1,%xmm2
...
402688: 66 0f db 50 30        pand   0x30(%eax),%xmm2
40268d: 66 0f 6e 41 28        movd   0x28(%ecx),%xmm0
402692: 66 0f 70 c0 00        pshufd $0x0,%xmm0,%xmm0
402697: 66 0f df c8           pandn  %xmm0,%xmm1
40269b: 66 0f eb ca           por    %xmm2,%xmm1
40269f: 0f 29 48 30           movaps %xmm1,0x30(%eax)

That's the final integer update. Perfect. Want me to try that champ in conjunction with d-19680-1?
d-19680-1 + d-19680-3 isn't as good, 14.9 fps, as some silly stack movements are induced; i.e.:

40265f: 0f 29 04 24           movaps  %xmm0,(%esp)
402663: 0f 57 c0              xorps   %xmm0,%xmm0
402666: 0f c2 d0 01           cmpltps %xmm0,%xmm2
40266a: 0f 56 14 24           orps    (%esp),%xmm2
Subject: Bug 19680

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	rth@gcc.gnu.org	2005-02-02 00:30:38

Modified files:
	gcc            : ChangeLog
	gcc/config/i386: i386-protos.h i386.c i386.h

Log message:
	PR target/19680
	* config/i386/i386.h (MODES_TIEABLE_P): Use ix86_modes_tieable_p.
	* config/i386/i386.c (ix86_hard_regno_mode_ok): Change return type
	to bool.
	(ix86_tieable_integer_mode_p, ix86_modes_tieable_p): New.
	* config/i386/i386-protos.h: Update.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7369&r2=2.7370
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386-protos.h.diff?cvsroot=gcc&r1=1.130&r2=1.131
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.c.diff?cvsroot=gcc&r1=1.794&r2=1.795
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.h.diff?cvsroot=gcc&r1=1.420&r2=1.421
I'm going to ping this bug report because there's still some very bad interaction with gcse in current gcc. Just compile the 'packet_intersection.cpp' testcase with, e.g., g++-4.1-4120050501 for ia32 to convince yourself (x86-64 is apparently immune, thanks to the lower register pressure). It takes a bit of luck, or the use of -fno-gcse, to dodge that issue, which really kills performance.