This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.
optimizing 128bit integer arithmetic on ia32
- From: Mathieu Lacage <mathieu dot lacage at gmail dot com>
- To: gcc-help <gcc-help at gcc dot gnu dot org>
- Date: Mon, 23 Aug 2010 18:37:52 +0200
- Subject: optimizing 128bit integer arithmetic on ia32
hi,
Here is the code I have:
struct uint64x64_t
{
struct {uint64_t hi; uint64_t lo;} _v;
};
inline uint64x64_t &operator += (uint64x64_t &lhs, const uint64x64_t &rhs)
{
lhs._v.hi += rhs._v.hi;
lhs._v.lo += rhs._v.lo;
if (lhs._v.lo < rhs._v.lo)
{
lhs._v.hi++;
}
return lhs;
}
One of the benchmarks I run is a loop that calls the += above a couple
of billion times, compiled with -O3:
- 64x64 class above: 5.5ns per addition
- 64bit addition: 1.3ns per addition
- 32bit addition: 1.3ns per addition
The inner loop of my benchmark looks like this:
#define SIZE 128
uint64x64_t result = a;
for (long int i = 0; i < n; i++)
{
result += array[i%SIZE];
}
A quick look at the assembly generated for it shows this for the += code:
804b6c0: 89 f3 mov %esi,%ebx
804b6c2: 83 e3 7f and $0x7f,%ebx
804b6c5: c1 e3 04 shl $0x4,%ebx
804b6c8: 8b 44 0b 08 mov 0x8(%ebx,%ecx,1),%eax
804b6cc: 8b 54 0b 0c mov 0xc(%ebx,%ecx,1),%edx
804b6d0: 01 47 08 add %eax,0x8(%edi)
804b6d3: 8b 04 0b mov (%ebx,%ecx,1),%eax
804b6d6: 11 57 0c adc %edx,0xc(%edi)
804b6d9: 8b 54 0b 04 mov 0x4(%ebx,%ecx,1),%edx
804b6dd: 01 07 add %eax,(%edi)
804b6df: 11 57 04 adc %edx,0x4(%edi)
804b6e2: 8b 57 0c mov 0xc(%edi),%edx
804b6e5: 3b 54 0b 04 cmp 0x4(%ebx,%ecx,1),%edx
804b6e9: 8b 47 08 mov 0x8(%edi),%eax
804b6ec: 77 13 ja 804b701 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xe1>
804b6ee: 72 05 jb 804b6f5 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xd5>
804b6f0: 3b 04 0b cmp (%ebx,%ecx,1),%eax
804b6f3: 73 0c jae 804b701 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xe1>
804b6f5: 83 c0 01 add $0x1,%eax
804b6f8: 83 d2 00 adc $0x0,%edx
804b6fb: 89 47 08 mov %eax,0x8(%edi)
804b6fe: 89 57 0c mov %edx,0xc(%edi)
804b701: 83 c6 01 add $0x1,%esi
804b704: 39 75 14 cmp %esi,0x14(%ebp)
804b707: 7f b7 jg 804b6c0 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xa0>
That actually looks pretty good, except that instead of generating a
single add+adc+adc+adc sequence, we generate 3 distinct pairs of
add+adc (which is not surprising given the C++ code I wrote). So the
question is: is there a way to improve the generated code to get the
'perfect' add+adc*3 sequence (short of adding support to gcc for
__uint128_t on ia32 or writing inline assembly)?
Mathieu
--
Mathieu Lacage <mathieu.lacage@gmail.com>