This is the mail archive of the mailing list for the GCC project.

Re: Problem with extremely large procedures and 64-bit code

Thanks Richard for your input, much appreciated.

I followed up on your suggestions. Unfortunately, the -Wdisabled-optimization option you suggested did not produce any warnings, and I am still trying the --param options one by one without success. I did get a new hint, though: running the same examples on a MacBook, I don't see the issue at all. There, 64-bit is consistently about 10% faster than 32-bit in both the optimize and debug builds. My guess is that some compiler flags differ between the two setups; perhaps you or someone else knows which flags are set differently by default. It is hard to compare the actual speeds because the hardware is different. Here are the specs on the Mac:

gcc: Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn). I don't know what that means; I expected a version number like 4.2.1 or something like that.
CPU: 2.53 GHz Intel Core 2 Duo

Does anything come to mind?

Thanks again for your help,


On 1/20/15 1:21 AM, Richard Biener wrote:
On Tue, Jan 20, 2015 at 4:57 AM, Ricardo Telichevesky wrote:

     I have a strange problem with extremely large procedures when generating
64-bit code. I am using gcc 4.9.2 on RHEL 6.3 on a 64-thread, 4-socket Xeon
E7 4820 with 256GB of memory. No AVX extensions; using the SSE option when
building the compiler. This particular code is serial. I made measurements
with 32- and 64-bit builds, both debug (-g) and optimize (-O3), for two
different examples (this is a circuit simulator and each example is a
different circuit that uses different transistors).

     Example A is the one where the effect is most acute. At the bottom of
this e-mail I listed the 3 procedures that consume 90% of the execution time:

a) As a counter-example, the factor code listed is 300 lines of heavily
optimized, hand-written C++ that behaves as expected: 64-bit optimize is way
faster than any other build, up to 15x faster than 32-bit debug (btw, great
job on the compiler, it is really shining here).

b) evalTran has 18000 lines of auto-generated code and behaves very
counter-intuitively: 64-bit optimize code is 3x slower than 32-bit optimize.

c) evalTranRhs has 5000 lines and is even worse: 64-bit is 4x slower than
32-bit.

Notice that all the data structures in the 32-bit and 64-bit code are
identical and most variables are identical; in fact all integers used are
64-bit, and most operations are floating-point ops. Initially I thought the
64-bit code was a lot bigger than the 32-bit code and the cache was
overwhelmed. In fact the difference in code sizes is not even 10% (at least
in debug; notice I calculated the size of each procedure in bytes), so my
thrash-the-I-cache conjecture seems to be wrong. The overall execution time
is causing us a lot of problems: 64-bit optimize takes 16 seconds, even more
than 32-bit debug at 10 seconds and 32-bit optimize at 4.8 seconds.
Considering we only care about 64-bit optimize, we have a big problem here.
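
As a quick check of the code-size statement above, the following Python sketch computes how much larger each 64-bit procedure is than its 32-bit counterpart. The byte counts are copied from the bytes= fields of the debug tables at the bottom of this message; the dictionary layout and the name growth_percent are illustrative assumptions, not part of the original measurements.

```python
# Debug-build procedure sizes in bytes, copied from the "bytes=" fields
# of the timing tables: (32-bit, 64-bit).
SIZES = {
    ("A", "evalTran"): (231791, 252741),
    ("A", "evalTranRhs"): (57501, 58442),
    ("B", "evalTran"): (141478, 154542),
    ("B", "evalTranRhs"): (35871, 36487),
}


def growth_percent(size32, size64):
    """Return how much larger the 64-bit procedure is, in percent."""
    return 100.0 * (size64 - size32) / size32


GROWTH = {key: growth_percent(*pair) for key, pair in SIZES.items()}

for (example, proc), pct in GROWTH.items():
    print(f"example {example} {proc}: {pct:.1f}% larger in 64-bit debug")
```

With these figures every procedure grows by less than 10%, which is consistent with the statement above. Sizes for the optimize builds are not listed in the tables, so this check does not cover them.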

     Example B is not so bad, and in fact 64-bit code is slightly faster than
32-bit code. It would be nice if it went even faster, but if I got A to
behave like that I'd be pretty happy already.

     I tried to look at the wide array of optimization options for the code.
It is a dizzying task and I could not get any kind of intuition beyond -O3.
Would you have any suggestions for the proper flags for those ridiculously
large auto-generated procedures that might alleviate this 32-bit vs 64-bit
problem? Do you think the fact that this code is in a dynamically linked
library (-fPIC) plays a role?

It's hard to tell without a testcase, but GCC has various limits on the
code sizes its passes will deal with, so you might trip one of these, which
would effectively disable optimizations.  For example, loop dependence
analysis has a limit on the number of memory references it considers
(--param loop-max-datarefs-for-datadeps, default 1000).  Note that not
all such limits are controlled by --params.  We have
-Wdisabled-optimization, which should warn if you run into any such
case (but the warning is unfortunately not correctly implemented by
all passes having such limits).
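
One way to try such a limit systematically is to generate one compile command per candidate value and inspect each build for warnings and timing. The Python sketch below only builds and prints the command lines. It makes several assumptions: the source file name evalTran.cpp is hypothetical, the trial values are arbitrary, and the base flags are the ones shown for the 64-bit optimize build in this thread. The only --param used is the one named above.

```python
import shlex

# Flags taken from the 64-bit optimize build listed in this thread.
BASE_FLAGS = ["-O3", "-fPIC", "-Wall", "-Winvalid-pch", "-msse2"]


def sweep_commands(source, param, values, compiler="gcc"):
    """Build one compile command per candidate value of a --param limit."""
    commands = []
    for value in values:
        commands.append(
            [compiler, *BASE_FLAGS,
             "-Wdisabled-optimization",          # warn when a pass gives up
             "--param", f"{param}={value}",      # raise the limit under test
             "-c", source, "-o", f"{source}.{param}-{value}.o"]
        )
    return commands


# Hypothetical file name and arbitrary trial values; 1000 is the default
# quoted above for loop-max-datarefs-for-datadeps.
COMMANDS = sweep_commands("evalTran.cpp", "loop-max-datarefs-for-datadeps",
                          [1000, 10000, 100000])

for cmd in COMMANDS:
    print(shlex.join(cmd))
```

Each printed line can be run by hand or passed to subprocess.run. Keeping a separate object file per value makes it possible to time the resulting builds side by side.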


     Thanks very much for your help,

All times are wall-clock times in microseconds. The main timer was checked
against the reported UNIX time and is exact.

example A
evalTran has 18000 lines of C code   (two for loops around 99% of the code)
evalTranRhs has 5000 lines of C code (two for loops around 99% of the code)

32 bit debug -g -m32 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)       #calls      per call(us)  timer name @DN@
-----  -----------       ------      ------------  --------------
2.503  254536            8335        30            numerical TRAN factor
56.01  5695065           8335        683           evalTran    bytes=231791
35.41  3600646           13924       258           evalTranRhs bytes=57501
100    10168242          1           10168242      main @DT@

32 bit optimize -O3 -m32 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)       #calls      per call(us)  timer name @DN@
-----  -----------       ------      ------------  --------------
0.710  34442             8335        4             numerical TRAN factor
43.06  2087757           8335        250           evalTran
43.49  2108786           13925       151           evalTranRhs
100    4848520           1           4848520       main @DT@

64 bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)       #calls      per call(us)  timer name @DN@
-----  -----------       ------      ------------  --------------
0.973  205144            8335        24            numerical TRAN factor
46.43  9785920           8335        1174          evalTran    bytes=252741
49.72  10478888          13924       752           evalTranRhs bytes=58442
100    21077659          1           21077659      main @DT@

64 bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)       #calls      per call(us)  timer name @DN@
-----  -----------       ------      ------------  --------------
0.147  23819             8335        2             numerical TRAN factor
39.26  6360254           8335        763           evalTran
57.28  9279087           13924       666           evalTranRhs
100    16198762          1           16198762      main @DT@

example B
evalTran has 10000 lines of C code   (two for loops around 99% of the code)
evalTranRhs has 2500 lines of C code (two for loops around 99% of the code)

32-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)       #calls      per call(us)  timer name @DN@
-----  -----------       ------      ------------  --------------
6.55   989826            46612       21            numerical TRAN factor
63.17  9546694           46612       204           evalTran    bytes=141478
22.36  3379311           47626       70            evalTranRhs bytes=35871
100    15112540          1           15112540      main @DT@

32-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)       #calls      per call(us)  timer name @DN@
-----  -----------       ------      ------------  --------------
3.012  157060            46612       3             numerical TRAN factor
50.42  2629251           46612       56            evalTran
34.18  1782641           47626       37            evalTranRhs
100    5214827           1           5214827       main @DT@

64-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)       #calls      per call(us)  timer name @DN@
-----  -----------       ------      ------------  --------------
6.439  837743            46612       17            numerical TRAN factor
63.02  8199007           46612       175           evalTran    bytes=154542
22.21  2889893           47626       60            evalTranRhs bytes=36487
100    13011058          1           13011058      main @DT@

64-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)       #calls      per call(us)  timer name @DN@
-----  -----------       ------      ------------  --------------
2.389  103855            46612       2             numerical TRAN factor
53.52  2326715           46612       49            evalTran
33.1   1438995           47626       30            evalTranRhs
100    4347691           1           4347691       main @DT@
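
To make the 64-bit versus 32-bit comparison easier to read, the Python sketch below divides the 64-bit elapsed time by the 32-bit elapsed time for each timer in the optimize (-O3) tables. The numbers are copied from the elapsed(us) columns above; the names ELAPSED_US and ratio_64_over_32 are illustrative assumptions. Note that evalTranRhs in example A was called 13925 times in the 32-bit run and 13924 times in the 64-bit run, a difference this sketch ignores.

```python
# elapsed(us) values copied from the optimize (-O3) tables:
# (32-bit, 64-bit) for each timer.
ELAPSED_US = {
    "A": {
        "numerical TRAN factor": (34442, 23819),
        "evalTran": (2087757, 6360254),
        "evalTranRhs": (2108786, 9279087),
        "main": (4848520, 16198762),
    },
    "B": {
        "numerical TRAN factor": (157060, 103855),
        "evalTran": (2629251, 2326715),
        "evalTranRhs": (1782641, 1438995),
        "main": (5214827, 4347691),
    },
}


def ratio_64_over_32(elapsed):
    """Map each timer to (64-bit time) / (32-bit time); >1 means slower."""
    return {name: t64 / t32 for name, (t32, t64) in elapsed.items()}


RATIOS = {ex: ratio_64_over_32(timers) for ex, timers in ELAPSED_US.items()}

for ex, timers in RATIOS.items():
    for name, r in timers.items():
        verdict = "slower" if r > 1 else "faster"
        print(f"example {ex} {name}: 64-bit/32-bit = {r:.2f} ({verdict})")
```

A ratio above 1 means the 64-bit build is slower. In example A only the hand-written factor code comes out below 1, while in example B every timer does.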
