This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations
- From: "jgreenhalgh at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 19 Apr 2018 13:02:31 +0000
- Subject: [Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations
- Auto-submitted: auto-generated
- References: <bug-85466-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466
--- Comment #11 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
With Jonathon's suggested change copied into the original poster's framework
(without -fno-trapping-math), the Clang hot loop (score: 165065,
http://quick-bench.com/6NaD8ay0f8qMh9n0aMriYEiuKNA) is:
0.16% movups 0x61a80(%r15,%rax,4),%xmm6
1.15% movups 0x61a90(%r15,%rax,4),%xmm7
0.60% movaps %xmm1,%xmm3
5.44% cmpltps %xmm6,%xmm3
0.44% movaps %xmm1,%xmm6
0.40% cmpltps %xmm7,%xmm6
0.44% movaps %xmm5,%xmm7
4.97% andps %xmm3,%xmm7
0.20% andnps %xmm4,%xmm3
0.36% orps %xmm7,%xmm3
1.04% movaps %xmm5,%xmm7
4.97% andps %xmm6,%xmm7
0.11% andnps %xmm4,%xmm6
4.95% orps %xmm7,%xmm6
5.53% movups %xmm3,0x61a80(%rbx,%rax,4)
0.47% movups %xmm6,0x61a90(%rbx,%rax,4)
4.42% movups 0x61aa0(%r15,%rax,4),%xmm3
20.42% movups 0x61ab0(%r15,%rax,4),%xmm6
1.00% movaps %xmm1,%xmm7
0.49% cmpltps %xmm3,%xmm7
9.79% movaps %xmm1,%xmm3
0.16% cmpltps %xmm6,%xmm3
2.26% movaps %xmm5,%xmm6
0.60% andps %xmm7,%xmm6
4.20% andnps %xmm4,%xmm7
1.18% orps %xmm6,%xmm7
2.22% movaps %xmm5,%xmm6
0.47% andps %xmm3,%xmm6
4.24% andnps %xmm4,%xmm3
4.88% movups %xmm7,0x61aa0(%rbx,%rax,4)
0.27% orps %xmm6,%xmm3
5.22% movups %xmm3,0x61ab0(%rbx,%rax,4)
6.02% add $0x10,%rax
jne 405b30 <ifStandard(benchmark::State&)+0x4a0>
The GCC hot loop (score: 2385754,
http://quick-bench.com/ehLe-aqkpXkkx2sHLd6TWq_p4g4) is:
0.56% movss 0x0(%rbp,%rdx,1),%xmm0
1.47% xor %eax,%eax
2.00% subss %xmm2,%xmm0
7.02% ucomiss %xmm1,%xmm0
6.77% seta %al
4.96% xor %ecx,%ecx
0.25% ucomiss %xmm0,%xmm1
0.84% pxor %xmm0,%xmm0
0.09% seta %cl
5.40% sub %ecx,%eax
3.22% cvtsi2ss %eax,%xmm0
9.87% ucomiss %xmm0,%xmm1
6.53% ja 4053a8 <ifNoConditional(benchmark::State&)+0x1d8>
10.24% mulss %xmm4,%xmm0
11.55% addss %xmm3,%xmm0
5.46% movss %xmm0,(%rbx,%rdx,1)
2.00% add $0x4,%rdx
cmp $0x61a80,%rdx
jne 405350 <ifNoConditional(benchmark::State&)+0x180>
Daniel Elliott, does that better match your expectations? If so, I think this
can be resolved as a missed optimization of invalid code.
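
For anyone reading the two listings without the benchmark source to hand, here
is a rough sketch of the two coding styles being compared. It is not the
original poster's code: the function names, the sign() helper and the
threshold/hi/lo parameters are all assumptions, chosen only to resemble the
shape of the loops above (a compare-and-select in the Clang output, and a
sign computation followed by a multiply and add in the GCC output).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns -1, 0 or 1 using two comparisons and a
// subtraction instead of an if/else chain.
static inline int sign(float x) {
    return (x > 0.0f) - (x < 0.0f);
}

// "Standard" style: a plain conditional select per element. A vectorizer can
// turn this into compare / and / andn / or on whole vectors.
void select_standard(const std::vector<float>& in, std::vector<float>& out,
                     float threshold, float hi, float lo) {
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] > threshold ? hi : lo;
}

// "Branchless" arithmetic style (assumed form): derive a 0/1 factor from the
// sign of the difference, then scale and offset. This goes through an
// int-to-float conversion, which is harder for the compiler to vectorize.
void select_arithmetic(const std::vector<float>& in, std::vector<float>& out,
                       float threshold, float hi, float lo) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        int s = sign(in[i] - threshold);              // -1, 0 or 1
        float t = static_cast<float>(std::max(s, 0)); // 0 or 1
        out[i] = (hi - lo) * t + lo;
    }
}
```

Both functions are meant to pick hi when an element is above the threshold and
lo otherwise. The arithmetic form can differ in the last bit for values of hi
and lo where (hi - lo) + lo does not round back to hi, which is one more reason
to prefer the plain conditional when the goal is simply a select.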