This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: gcc compile-time performance
I am not so sure divides are what kill it. -Will
Jan Hubicka wrote:
>> From: Stan Shebs <shebs@apple.com>
>> Date: Fri, 17 May 2002 09:25:32 -0700
>>
>> That's my personal suspicion too, but no, I don't have any real
>> evidence. The lack of hot spots in profiling is a strong hint.
>> One oddball idea I've thought about is to functionize all the
>> tree and rtl macros, and run a profile on that to see what are
>> the most used/abused macros.
>>
>>I know that the subreg-byte changes added a lot of overhead
>>particularly via the subreg_regno_offset() function (which was
>>an inline macro in my original diffs).
>>
>
> Do you have some data? Perhaps we can replace the division by simple lookup
> table...
>
> Honza
>
>>The divisions are what kill it. That overhead could be eliminated
>>if all the mode sizes were powers of 2 and we had some
>>GET_MODE_SIZE_LOG2() interface. Then we just transform all the
>>divides there into shifts.
>>
>> Then there's the extreme approach of having maintainers only
>> accept patches that either remove code or make the compiler run
>> faster... :-)
>>
>>There is a better way, have maintainers work on approval of such
>>changes faster than approval of other changes :-)
>>
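The divide-to-shift idea quoted above can be sketched as follows. This is only an illustration of the arithmetic, not GCC code: the MODE_SIZE table, its entries, and the get_mode_size_log2 helper are hypothetical stand-ins for the proposed GET_MODE_SIZE_LOG2() interface. It shows that the two forms agree when every size is a power of 2; it says nothing about how much time the change would save.

```python
# Sketch of replacing division by a mode size with a shift.
# MODE_SIZE and get_mode_size_log2 are hypothetical, not taken from GCC.

MODE_SIZE = {"QI": 1, "HI": 2, "SI": 4, "DI": 8}  # bytes, illustrative only


def get_mode_size_log2(mode):
    """Return log2 of the mode size; reject sizes that are not powers of 2."""
    size = MODE_SIZE[mode]
    if size <= 0 or size & (size - 1):
        raise ValueError(f"size of {mode} is not a power of 2: {size}")
    return size.bit_length() - 1


def units_by_divide(byte_offset, mode):
    """Number of whole mode-sized units below byte_offset, using a divide."""
    return byte_offset // MODE_SIZE[mode]


def units_by_shift(byte_offset, mode):
    """Same quantity, using a shift by the precomputed log2 of the size."""
    return byte_offset >> get_mode_size_log2(mode)


if __name__ == "__main__":
    for mode in MODE_SIZE:
        for offset in range(64):
            assert units_by_divide(offset, mode) == units_by_shift(offset, mode)
    print("shift and divide agree for all power-of-2 mode sizes")
```

The precondition matters: a mode whose size is not a power of 2 (say, a 12-byte mode) cannot use the shift, which is why the helper raises an error in that case instead of returning a rounded value.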
Given the comment that divides were taking significant time, I
decided to get some data to see how much of a problem divides are. I
checked out gcc_3_1_release from gcc.gnu.org and configured it to
build a native compiler on Red Hat Linux 7.2 running on an Inspiron
4100 with a 1GHz mobile Pentium III processor, 256MB DRAM, and a 40GB
hard drive. I built the tool. I then went into the build/gcc directory
and ran "make clean; time make bootstrap" while oprofile took
measurements using the CPU_CLK_UNHALTED (counter 0) and DIV (counter
1) events.
Result of the "time make bootstrap" command:
real 25m10.672s
user 23m51.240s
sys 0m32.500s
The real time of 25m10s is 1510 seconds. There were 32108 samples for
divides on /home/wcohen/gcc31/native/gcc/stage1/cc1, with each sample
representing 4980 divide operations (both floating point and
integer). Since those samples are 43.3945% of the total, there were
73,991 samples in all, or 368x10^6 divides on the entire system.
Assuming the worst-case figure of 70 cycles per divide from the
Pentium 4 software optimization manual and a 1GHz clock, this would
be 25.8 seconds of runtime, about 1.7% of the run time for the entire
system. The 1.7% is a pretty pessimistic estimate of the time: the
70 cycles is the latency of the idiv, while throughput on the Pentium 4
is 23 cycles.
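For anyone who wants to redo the arithmetic, here is a small sketch of the estimate. The constant and function names are mine; the numbers are the ones quoted in this message, and the 70- and 23-cycle figures are the Pentium 4 values applied to a Pentium III, so treat the result as a rough bound rather than a measurement.

```python
# Back-of-the-envelope check of the divide overhead estimate.
# Names are illustrative; the numbers come from the message and op_time output.

STAGE1_DIV_SAMPLES = 32108      # DIV samples attributed to stage1/cc1
STAGE1_SHARE = 0.433945         # stage1/cc1 share of all DIV samples
DIVIDES_PER_SAMPLE = 4980       # divides represented by one DIV sample
CLOCK_HZ = 1_000_000_000        # 1GHz processor
ELAPSED_SECONDS = 25 * 60 + 10  # "real" time, taken as 1510 s


def estimate(cycles_per_divide):
    """Return (total_samples, total_divides, seconds, fraction_of_run)."""
    total_samples = round(STAGE1_DIV_SAMPLES / STAGE1_SHARE)
    total_divides = total_samples * DIVIDES_PER_SAMPLE
    seconds = total_divides * cycles_per_divide / CLOCK_HZ
    return total_samples, total_divides, seconds, seconds / ELAPSED_SECONDS


if __name__ == "__main__":
    for label, cycles in (("latency", 70), ("throughput", 23)):
        samples, divides, seconds, fraction = estimate(cycles)
        print(f"{label:10s} {cycles:2d} cycles/div: {samples} samples, "
              f"{divides / 1e6:.0f}x10^6 divides, {seconds:.1f} s, "
              f"{fraction:.2%} of run")
```

With the 23-cycle throughput figure in place of the 70-cycle latency, the same calculation gives roughly 8.5 seconds, or about 0.6% of the run.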
Of course this data assumes that the bootstrap process represents
typical program behavior and is exercising the same parts of gcc that
other programs are.
-Will
[root@litespeed root]# op_time -r -c 1|more /* DIV 4980 per sample */
32108 43.3945 0.0000 /home/wcohen/gcc31/native/gcc/stage1/cc1
20852 28.1818 0.0000 /home/wcohen/gcc31/native/gcc/stage2/cc1
10893 14.7221 0.0000 /home/wcohen/gcc31/native.install/lib/gcc-lib/i686-pc-linux-gnu/3.1/cc1
5886 7.9550 0.0000 /usr/bin/as
2366 3.1977 0.0000 /lib/modules/2.4.9-21custom/build/vmlinux
731 0.9880 0.0000 /usr/bin/ld
259 0.3500 0.0000 /lib/ext3.o
233 0.3149 0.0000 /usr/bin/make
123 0.1662 0.0000 /home/wcohen/gcc31/native/gcc/cc1
...
[root@litespeed root]# op_time -r -c 0 |more /* CPU_CLK_UNHALTED 498000 per sample */
1287676 51.8822 0.0000 /home/wcohen/gcc31/native/gcc/stage1/cc1
680858 27.4327 0.0000 /home/wcohen/gcc31/native/gcc/stage2/cc1
239265 9.6403 0.0000 /home/wcohen/gcc31/native.install/lib/gcc-lib/i686-pc-linux-gnu/3.1/cc1
92502 3.7270 0.0000 /usr/bin/as
63091 2.5420 0.0000 /lib/modules/2.4.9-21custom/build/vmlinux
34034 1.3713 0.0000 /home/wcohen/gcc31/native/gcc/genattrtab
28366 1.1429 0.0000 /home/wcohen/gcc31/native/gcc/fixinc/fixincl
9777 0.3939 0.0000 /usr/bin/ld
9127 0.3677 0.0000 /home/wcohen/gcc31/native/gcc/cc1
7339 0.2957 0.0000 /usr/lib/mozilla/mozilla-bin
....
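The two listings can also be combined per binary. The sketch below is my own cross-check, not part of the measurement: it takes the DIV and CPU_CLK_UNHALTED sample counts for a few images from the listings above and estimates what share of each image's cycles would go to divides if every divide cost the full 70 cycles. The short image labels and the function name are made up for the example.

```python
# Per-image share of cycles attributable to divides, from the op_time listings.
# Assumes every divide costs the worst-case 70 cycles and never overlaps
# other work, so the shares are upper bounds.

DIVIDES_PER_SAMPLE = 4980    # events per DIV sample
CYCLES_PER_SAMPLE = 498000   # events per CPU_CLK_UNHALTED sample
CYCLES_PER_DIVIDE = 70       # worst-case latency used in the estimate

# (label, DIV samples, CPU_CLK_UNHALTED samples), copied from the listings
SAMPLES = [
    ("stage1/cc1", 32108, 1287676),
    ("stage2/cc1", 20852, 680858),
    ("installed cc1", 10893, 239265),
    ("/usr/bin/as", 5886, 92502),
]


def divide_cycle_share(div_samples, clk_samples):
    """Fraction of an image's unhalted cycles spent in divides (upper bound)."""
    divides = div_samples * DIVIDES_PER_SAMPLE
    cycles = clk_samples * CYCLES_PER_SAMPLE
    return divides * CYCLES_PER_DIVIDE / cycles


if __name__ == "__main__":
    for label, div_samples, clk_samples in SAMPLES:
        share = divide_cycle_share(div_samples, clk_samples)
        print(f"{label:15s} {share:.2%}")
```

The per-image figures inherit the same caveat as the system-wide one: the 70-cycle cost is a Pentium 4 latency number, so these are pessimistic bounds.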