This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



ATLAS degradation with -freduce-all-givs flag for PPC arch


Hello,

We are tuning several applications and numerical libraries (ATLAS,
METIS, BLAS, FFTW) for the new IBM PowerPC 970 processor used in
JS20 BladeCenter machines.

As you may know, the PPC970 is a 32-/64-bit processor based on the
POWER4, with added VMX (a.k.a. AltiVec) support.

We see a large performance drop when we compile the ATLAS libraries
with GCC's '-freduce-all-givs' flag enabled. For example, an SGEMM
operation (single-precision matrix multiplication) of size 800 x 800
drops from 6.5 GFLOPS to 4.5 GFLOPS when this flag is enabled. The
degradation may be related to the fact that the SGEMM kernel is written
in PPC assembler using AltiVec instructions rather than in C.

I have read (in man gcc) that you are very interested in hearing about
applications that run slower when this flag is enabled. So here I am ;).

I attach a tar.gz file containing all the source code and shell scripts
needed to reproduce the degradation of the SGEMM routine, with the flag
both enabled and disabled.

The package contains only the isolated SGEMM routine (single precision)
extracted from the ATLAS library v3.6.0
(http://math-atlas.sourceforge.net/). It also contains the Makefiles
needed to build the library and the tester with and without
'-freduce-all-givs'. Finally, there is a shell script that runs an
SGEMM benchmark (test-ATLAS) for both versions of the binaries and for
different matrix sizes.

The elapsed time of the SGEMM operation is measured in test-ATLAS using
the Time Base (TB) register of the PPC processor. The duration of a TB
tick differs from one PPC processor to another, so test-TB is provided
to find out how many nanoseconds each tick lasts. To obtain the GFLOPS
figure, we take the TB ticks reported by test-ATLAS and convert them to
elapsed time (in seconds) using the value from test-TB. We then divide
2N^3 (the number of floating-point operations in a matrix-matrix
multiplication of size N) by the total seconds required to perform the
SGEMM operation.
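
The conversion described above can be sketched in C as follows. This is
only an illustration, not code from the attached package: the function
names are made up, and it assumes the tick count covers exactly one
SGEMM call (if test-ATLAS reports ticks for all repetitions, divide by
the repetition count first).

```c
#include <assert.h>
#include <stdint.h>

/* Convert Time Base ticks to seconds.
 * ns_per_tick is the tick duration reported by test-TB (hypothetical
 * parameter name; the value depends on the processor). */
static double ticks_to_seconds(uint64_t tb_ticks, double ns_per_tick)
{
    assert(ns_per_tick > 0.0);
    return (double)tb_ticks * ns_per_tick * 1e-9;
}

/* GFLOPS for one N x N SGEMM, counted as 2*N^3 floating-point
 * operations, given the TB ticks that one call took. */
static double sgemm_gflops(unsigned n, uint64_t tb_ticks, double ns_per_tick)
{
    double flop = 2.0 * (double)n * (double)n * (double)n;
    double seconds = ticks_to_seconds(tb_ticks, ns_per_tick);

    assert(seconds > 0.0);
    return flop / seconds * 1e-9;   /* FLOP/s scaled to GFLOPS */
}
```

For instance, with a hypothetical tick of 100 ns, a call such as
sgemm_gflops(800, 1024000, 100.0) corresponds to 0.1024 seconds for
2 * 800^3 = 1.024e9 operations, i.e. about 10 GFLOPS.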

Also keep in mind these macros in the test-ATLAS.c code:

#define MAXFDIM  9000    /* Max memory 940 Mbytes = 3 * MAXFDIM*MAXFDIM*4 */
    --> Maximum matrix dimension for the memory available on the system.
#define NREP       10 /*100*/
    --> Number of repetitions, so that the one-time cost of loading the
        CPU caches does not dominate the measured time.
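
As a rough check of the MAXFDIM comment, the formula it quotes (three
square single-precision matrices) can be written as a small C helper.
The function name is hypothetical and the sketch assumes a 4-byte float,
as the "*4" in the comment suggests.

```c
#include <assert.h>
#include <stdint.h>

#define MAXFDIM 9000   /* same value as in test-ATLAS.c */

/* Bytes needed for the three dim x dim float matrices (A, B and C)
 * of one SGEMM: 3 * dim * dim * sizeof(float). */
static uint64_t sgemm_footprint_bytes(uint64_t dim)
{
    /* Guard against overflow of the 64-bit product. */
    assert(dim <= 1000000000u);
    return 3u * dim * dim * (uint64_t)sizeof(float);
}
```

For MAXFDIM = 9000 this gives 972,000,000 bytes, roughly 927 MiB, so the
940 Mbytes in the comment looks like an approximate figure. When
lowering MAXFDIM for a machine with less memory, the same helper shows
how much the three matrices will need.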

Finally, this test obviously has to be run on a PPC970-based machine,
such as a JS20 BladeCenter or a Power Mac G5, using GCC v3.3 or higher
with AltiVec support.


Please let me know if you need more information to reproduce this behavior.


Sorry if you already know about this behavior of the ATLAS libraries
when this flag is enabled and this information is no longer useful to
the GCC team ;).

Regards,
Bye!
--
     My current project: Tuning for IBM PowerPC 970-based Blades
                (http://www.ciri.upc.es/cela_pblade)
----------------------------------------------------------------------
  Copyright protects software. Patents protect software monopolies.
          -- NO to software patents (http://swpat.ffii.org)
----------------------------------------------------------------------
   o//o     Raúl de la Cruz Martínez   E-mail: delacruz at ac upc es
  o//o      CEPBA-IBM Research Institute (CIRI)  www.cepba.upc.es/ciri
 o//o       http://people.ac.upc.es/delacruz     Tel: +34 93-401 16 49
o//o  CIRI  C/Jordi Girona 1-3, Despatx C6-S103. BCN, Catalonia, SPAIN
----------------------------------------------------------------------

Attachment: atlas-degradation.tar.gz
Description: GNU Zip compressed data

