This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Large 3.1 performance anomalies on sparc


I'm going through a fair number of performance tests with 3.1 on sparc
in various configurations.  Some of these tests show slowdowns of up to
20 times; it would be good if gcc 3.1 could be ruled out as the culprit.
The anomalies do show up with gprof in my tests.

The results of the runtime tests are at

http://www.math.purdue.edu/~lucier/runtimes.html

The tests were run as follows.  The Gambit-C runtime was compiled with
the same options and linked into a shared library in my home directory.
My LD_LIBRARY_PATH is set to

/home/c/lucier/local/lib:/pkgs/gcc-3.1/lib:/usr/openwin/lib:/usr/lib:/home/c/lucier/local/gambit/lib

The Gambit runtime is in the last directory.
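For what it's worth, the dynamic linker searches these directories left to
right and takes the first match, so the order matters.  A quick way to see
the search order (the path value below is copied from above):

```shell
# Print each LD_LIBRARY_PATH entry on its own line, in the order the
# dynamic linker searches them (first match wins).
LD_LIBRARY_PATH=/home/c/lucier/local/lib:/pkgs/gcc-3.1/lib:/usr/openwin/lib:/usr/lib:/home/c/lucier/local/gambit/lib
echo "$LD_LIBRARY_PATH" | tr ':' '\n'
```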

For 32-bit builds, the tests show a consistent 5-10% speedup from 3.0 to
3.1 with -mcpu=supersparc and -mtune=ultrasparc, together with
consistently smaller binaries.  For 32-bit codes, -mcpu=ultrasparc yields
even more significant reductions in code size, but not always an increase
in speed.

However, for 64-bit ultrasparc builds, the speedups range from none to over
a factor of 20.  That is, in some tests the 64-bit code on ultrasparc is
> 20 times faster than the 32-bit code on ultrasparc.

I tried to analyze why this was, so I built a 32-bit profiled runtime library
and binary for fft, as one of the significant examples.  Here is the fft line
from the table:

fft  8.1  93044  15.4  7.6  74192  14.6  7.7  29848  14.0  2.3  40376  0.9   
                 ^^^^              ^^^^              ^^^^              ^^^
The marked columns are the runtimes; the first three are from 32-bit
builds, the last is the 64-bit result.
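Taking the marked runtimes at face value, the 64-bit build's advantage on
this row works out to roughly 16-17 times (a quick check, with the four
runtimes copied from the table line above):

```shell
# Ratio of each 32-bit fft runtime to the 64-bit runtime (0.9),
# using the marked columns from the table row.
for t32 in 15.4 14.6 14.0; do
  awk -v a="$t32" -v b=0.9 'BEGIN { printf "%.1fx\n", a / b }'
done
```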

The results were as one might expect:

banach-70% gcc -I/home/c/lucier/local/gambit/include -O1 -fschedule-insns2 -fno-strict-aliasing -fno-math-errno -mcpu=ultrasparc -mtune=ultrasparc -m32 -pg -o fft fft.c fft_.c /home/c/lucier/local/gambit/lib/libgambc.so -lm -ldl -lcurses -lsocket -lnsl -lresolv
banach-72% time ./fft
(time (run-bench name count run ok?))
    28957 ms real time
    28910 ms cpu time (28680 user, 230 system)
    3 collections accounting for 13 ms real time (10 user, 10 system)
    66768024 bytes allocated
    no minor faults
    no major faults
28.72u 0.27s 0:29.10 99.6%

and the gprof output told me nothing:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 76.76     12.22    12.22                             internal_mcount
 13.13     14.31     2.09                             ___H_four1
  9.99     15.90     1.59                             _mcount
  0.06     15.91     0.01                             ___H_main
  0.06     15.92     0.01                             ___H_run_2d_bench
  0.00     15.92     0.00        1     0.00     0.00  call___do_global_ctors_aux
  0.00     15.92     0.00        1     0.00     0.00  main
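Most of the samples land in the profiling machinery itself (internal_mcount
and _mcount), not in the benchmark code.  As a rough sanity check, the
self-seconds charged to the mcount routines can be totaled from the flat
profile (lines copied from the output above):

```shell
# Sum the "self seconds" column (field 3) for the profiling routines
# in the gprof flat profile; the function name is the last field.
awk '$NF ~ /mcount/ { total += $3 } END { printf "%.2f\n", total }' <<'EOF'
 76.76     12.22    12.22                             internal_mcount
 13.13     14.31     2.09                             ___H_four1
  9.99     15.90     1.59                             _mcount
EOF
```

That's 13.81 of the 15.92 profiled seconds, about 87%: the instrumentation
dominates the run when the runtime is in a shared library.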

However, when I took the same .o files and built a static runtime library,
there was a tremendous speedup:

banach-78% gcc -I/home/c/lucier/local/gambit/include -O1 -fschedule-insns2 -fno-strict-aliasing -fno-math-errno -mcpu=ultrasparc -mtune=ultrasparc -m32 -pg -o fft fft.c fft_.c /home/c/lucier/local/gambit/lib/libgambc.a -lm -ldl -lcurses -lsocket -lnsl -lresolv
banach-79%  rm gmon.out
rm: remove gmon.out (yes/no)? y
banach-80% time ./fft
(time (run-bench name count run ok?))
    1650 ms real time
    1650 ms cpu time (1420 user, 230 system)
    3 collections accounting for 13 ms real time (20 user, 0 system)
    66768024 bytes allocated
    no minor faults
    no major faults
1.69u 0.32s 0:02.06 97.5%

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 64.86      1.20     1.20    63530     0.02     0.02  ___H_four1
 25.95      1.68     0.48                             internal_mcount
...

This run (with the same object files, only linked statically) is
significantly faster than any of the 32-bit runs with a dynamically linked
library, and approaches the runtime of the 64-bit binary.

So, is there a problem with dynamically-loaded 32-bit libraries generated
by gcc-3.1 and Solaris as/ld?  One that doesn't show up with 64-bit
libraries and binaries?  Could there be an alignment problem?  Or are all
these questions just too naive to be useful;-)?

Brad

