Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?
Vincent Diepeveen
diep@xs4all.nl
Wed Mar 26 14:47:00 GMT 2014
On Wed, 26 Mar 2014, Florian Weimer wrote:
> On 03/25/2014 04:51 PM, Vincent Diepeveen wrote:
>
>> a) for example if you use signed 32 bits indexation, for example
>>
>> int i, array[64];
>>
>> i = ...;
>> x = array[i];
>>
>> this goes very fast in 32 bits processor and 32 bits mode yet a lot
>> slower in 64 bits mode, as i needs a sign extension to 64 bits.
>> So the compiler generates 1 additional instruction in 64 bits mode
>> to sign extend i from 32 bits to 64 bits.
>
> Is this relevant in practice? I'm asking because it's a missed optimization
> opportunity—negative subscripts lead to undefined behavior here, so the sign
> extension can be omitted.
Yes, this is very relevant of course, as it is an extra instruction.
It all adds up, you know. Now i don't know whether some modern processors
can secretly fuse this internally - as about 99.9% of all C and C++ source
code in existence just uses 'int' of course.
In the C specification 'int' in fact gets described as having the natural
size suggested by the architecture, which most people read as the fastest
possible datatype.
Well, at x64 it is not. It's a lot slower if you use it to index an array:
factor 2 slower to be precise, as it generates another instruction.
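To make the extra instruction concrete, here is a minimal sketch. The
names table, lookup_int and lookup_wide are made up for illustration and
are not from Diep. On x86-64, GCC at -O2 typically emits a movslq to sign
extend the int argument before the load in lookup_int, while lookup_wide
can use its argument as is. Inside loops the compiler can often drop the
extension, so the real cost depends on the code.

```c
#include <stddef.h>

/* Hypothetical lookup table, only here to have something to index. */
static const int table[64] = { [0] = 10, [5] = 42, [63] = 7 };

/* 32 bits signed index: in 64 bits mode the index usually has to be
   sign extended to 64 bits before the address can be computed. */
int lookup_int(int i)
{
    return table[i];
}

/* Pointer-width index: already 64 bits wide in 64 bits mode, so no
   extension is needed before the address calculation. */
int lookup_wide(size_t i)
{
    return table[i];
}
```

Compiling this with gcc -O2 -S and comparing the two functions shows
whether your compiler version emits the extension.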
If i write normal code, i simply use "int" and standardize upon that.
Writing for speed has not been made easier, because "int" still is a 32
bits datatype whereas we have 64 bits processors nowadays.
Problem would be solved when 'sizeof(int)' suddenly is 8 bytes of course.
That would mean big refactoring of lots of code though, yet one day we
will need to go through that process :)
I tend to remember that back in the day, sizeof(long) on the DEC Alpha
was 8 bytes already.
Now i'm not suggesting, not even indicating, this would be a wise change.
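For anyone who wants to check what their own target uses, a small sketch
(print_type_sizes is a made-up helper name). On a typical Linux x86-64
build with gcc -m64 this reports 4, 8 and 8 bytes; built with gcc -m32 it
reports 4, 4 and 4. Other platforms can differ, for example 64 bits
Windows keeps long at 4 bytes.

```c
#include <stdio.h>

/* Print the widths the current build target gives to the types
   discussed above. The values depend on the ABI, not on the CPU. */
void print_type_sizes(void)
{
    printf("int: %zu, long: %zu, void *: %zu\n",
           sizeof(int), sizeof(long), sizeof(void *));
}
```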
>> b) some processors can 'issue' more 32 bits instructions a clock than 64
>> bits instructions.
>
> Some earlier processors also support more µop optimization in 32 bit mode.
I'm not a big expert on how the decoding and transport phase of processors
nowadays works - it all has become so very complex.
Yet the decoding and delivery of the instructions is the bottleneck at
today's processors. They all have plenty of execution units.
They just cannot decode+deliver enough bytes per clock.
>> My chessprogram Diep which is deterministic integer code (so no vector
>> codes) compiled 32 bits versus 64 bits is about 10%-12% slower in 64
>> bits than in 32 bits. This where it does use a few 64 bits datatypes
>> (very little though). In 64 bits the datasize used doesn't grow,
>> instruction wise it grows immense of course.
>
> Well, chess programs used to be the prototypical example for 64 bit
> architectures ...
Only when a bunch of CIA related organisations got involved in funding a
bunch of programs - as it's easier then to copy source code if you
write it for a sneaky organisation anyway.
The original top chess engines are all 32 bits based, as they can
execute 32 bits instructions faster of course, and most mobile phones
still are 32 bits anyway.
You cannot just cut and paste source codes from others and get away with
it in a commercial setting.
Commercially seen it's too expensive to cut n paste other persons' work
because of all the court cases, and you bet they will be there - only when
governments got involved did i see, for the first time in history, a bunch
of guys work together who otherwise would poke out each other's eyes at
any given occasion :)
I made another chessprogram here a while ago which gets close to 10
million nps on a single core. No 64 bits engine will ever manage that :)
Those extra instructions you can execute are deadly. And we're NOT
speaking about vector instructions here - just integers.
The reason why 64 bits is interesting is not because it is any faster - it
is not. It's slower in terms of executing instructions.
Yet algorithmically you can use a huge hashtable with all cores together,
so that speeds you up bigtime then.
More than a decade ago i was happy to use 200 GB there at the SGI
supercomputer. It really helps... not as much as some would guess it
helps, yet a factor 2 really is a lot :)
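As a rough sketch of why the address space matters here: a hashtable
indexed by a 64 bits position key, where the table size is only limited by
size_t. Everything below (tt_entry, tt_table, the field layout, the
always-replace policy) is a hypothetical illustration and not how Diep
does it, and it has no locking, so it is not safe to share between cores
as written.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 16 byte entry. */
typedef struct {
    uint64_t key;    /* full hash, used to detect index collisions */
    int32_t  score;
    int32_t  depth;
} tt_entry;

typedef struct {
    tt_entry *slots;
    size_t    mask;  /* entry count minus one; count is a power of two */
} tt_table;

/* Allocate a zeroed table. Returns 0 on success, -1 on failure. */
int tt_init(tt_table *t, size_t count_pow2)
{
    if (count_pow2 == 0 || (count_pow2 & (count_pow2 - 1)) != 0)
        return -1;
    t->slots = calloc(count_pow2, sizeof *t->slots);
    if (t->slots == NULL)
        return -1;
    t->mask = count_pow2 - 1;
    return 0;
}

/* Always-replace store: the low bits of the key select the slot. */
void tt_store(tt_table *t, uint64_t key, int32_t score, int32_t depth)
{
    tt_entry *e = &t->slots[(size_t)key & t->mask];
    e->key = key;
    e->score = score;
    e->depth = depth;
}

/* Returns the entry when the stored key matches, else NULL.
   Note: key 0 would match an empty slot; a real table handles that. */
const tt_entry *tt_probe(const tt_table *t, uint64_t key)
{
    const tt_entry *e = &t->slots[(size_t)key & t->mask];
    return e->key == key ? e : NULL;
}

void tt_free(tt_table *t)
{
    free(t->slots);
    t->slots = NULL;
    t->mask = 0;
}
```

In a 64 bits build mask is 64 bits wide, so the table can grow past the
4 GB that a 32 bits process can address.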
>> Besides the above reasons another reason why 32 bits programs compiled
>> 64 bits can be a lot slower in case of Diep is:
>>
>> c) the larger code size causes more L1 instruction cache misses.
>
> This really depends on the code. Not everything is larger. Typically it's
> the increased pointer size that cause increased data cache misses, which then
> casues slowdowns.
Really a lot changes to 64 bits of course, as the above chess software is
mainly busy with array lookups and branches in between them.
You need those lookups everywhere. Arrays are really important. Not only
because you want to look something up, but also because they avoid
writing out another bunch of lines of code to get to the same result :)
Also the index into the array needs to be 64 bits of course. Which means
that in the end every value gets converted to 64 bits in 64 bits mode,
which makes sense.
Now i'm sure you define all array lookups as lookups into a pointer so
we're on the same page then :)
Please also note that suddenly lots of branches in chessprograms also
tend to get slower. Some in fact might go from a penalty of say around a
clock or 5 to a penalty of 30 clocks, because the conditional jump and
the spot where it might jump to are more bytes apart.
That you really feel bigtime.
GCC always has been world champion in rewriting branches to something
that is slower up front than the straightforward manner - and even the
PGO phase couldn't improve upon that. It slowed things down most on AMD.
I tend to remember a discussion between a GCC guy and Linus there,
where Linus said there was no excuse to not now and then generate CMOV's
at modern processors like core2 and opteron - where the GCC team member (a
Polish name i didn't recognize) argued that crippling GCC was needed as he
owned a P4 :)
That was not long after i posted some similar code in forums showing how
FUBAR gcc was with branches - yet "by accident" that got a 25-30 clocks
penalty at AMD and not at intel.
That piece of code goes better nowadays.
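For readers who have not seen the branch versus CMOV question in code,
a small sketch with made-up function names. It is not the code i posted
back then. Whether GCC turns the first two into a conditional move or
into a jump depends on the version, the flags and the -march setting;
the third one avoids the branch by hand, assuming two's complement
integers.

```c
/* Written as an explicit branch. */
int max_branch(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* Written as a select; a natural candidate for cmov on x86-64. */
int max_select(int a, int b)
{
    return a > b ? a : b;
}

/* Branchless by construction: m is all ones when a > b, else zero. */
int max_mask(int a, int b)
{
    int m = -(a > b);
    return (a & m) | (b & ~m);
}
```

Looking at the output of gcc -O2 -S for these three shows what your
compiler actually picks.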
Where GCC needs major improvements is in the PGO phase right now.
The difference is just abnormal: something like 3% speedup using PGO in
GCC versus 20-25% speedup with other compilers, among them Intel C++.
I do not know what causes it - yet there should be tons of source code
available that has the same problem.
> --
> Florian Weimer / Red Hat Product Security Team