This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: [tree-ssa] performance with loops

From: Toon Moene <toon at moene dot indiv dot nluug dot nl>
To: Steven Bosscher <s dot bosscher at student dot tudelft dot nl>
Cc: dnovillo at redhat dot com, gcc at gcc dot gnu dot org
Date: Fri, 04 Jul 2003 21:56:00 +0200
Subject: Re: [tree-ssa] performance with loops
Organization: Moene Computational Physics, Maartensdijk, The Netherlands
References: <1057331717.3640.33.camel@steven.lr-s.tudelft.nl>

Steven Bosscher wrote:

Diego,

A C++ fluid dynamics code I am working with performs really badly when
compiled with tree-ssa compared to mainline.  It takes about 3 times as
long for tree-ssa (that's the difference between running a simulation
overnight or having to wait a whole day and not being able to use your
computer...).

I have tried to narrow down the code to a small test case, and I have
found an example that shows ~66% slowdown; the code I used for these
timings is attached.

Timings for tree-ssa:
real    0m6.852s	0m6.846s	0m6.882s
user    0m6.680s	0m6.690s	0m6.690s
sys     0m0.170s	0m0.160s	0m0.190s

Timings for mainline:
real    0m4.090s	0m4.086s	0m4.087s
user    0m3.910s	0m3.880s	0m3.870s
sys     0m0.180s	0m0.200s	0m0.220s

Ratio (tree-ssa / mainline, real time): 1.67 1.68 1.68

(The machine is an Athlon XP2000 with 256 MB of RAM.)

Maybe something like this can also explain the slowdown in 183.equake
(which is the only SPECfp2000 benchmark that slows down with
tree-ssa)???

Gr.
Steven


---------------------------------------
#include <stdlib.h>

#define L 1000
#define W 200
#define H 200

float ***data1, ***data2, ***data3;

void __attribute__((noinline))
foo (void)
{
  int i, j, k;

  for (i = 0; i < L; i++)
    for (j = 0; j < W; j++)
      for (k = 0; k < H; k++)
	{
	  data1[i][j][k] = data2[i][j][k] * data3[i][j][k];
	}
}

int
main (void)
{
  float ***x;
  int i, j;

  x = (float ***) malloc (L*sizeof(float**));
  for (i = 0; i < L; i++)
    {
      x[i] = (float **) malloc (W*sizeof(float*));
      for (j = 0; j < W; j++)
	x[i][j] = (float *) malloc (H*sizeof(float));
    }
  data1 = data2 = data3 = x;

  /* Call foo repeatedly to spread the
     overhead of the malloc.  Note that
     this loop runs 9 times, not 10.  */
  for (i = 1; i < 10; i++)
    foo ();
  free (data1);
  return 0;
}

Hmm, how about initializing the data you use? malloc only allocates the space and leaves its contents indeterminate, so without initialization you could run into NaNs, which would distort the timing picture completely.
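
A minimal sketch of that suggestion, assuming the same pointer-to-pointer layout as the test case: each array is allocated separately and every element is set to a finite value before any timing is done. The helper names xmalloc, alloc3d and free3d, and the fill value, are hypothetical and not part of the original test case.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: malloc that exits instead of returning NULL.  */
static void *
xmalloc (size_t n)
{
  void *p = malloc (n);

  if (p == NULL)
    {
      fputs ("out of memory\n", stderr);
      exit (EXIT_FAILURE);
    }
  return p;
}

/* Hypothetical helper: allocate an l x w x h array of float with the
   same layout as the test case and set every element to VALUE, so
   that no indeterminate data is ever read.  */
float ***
alloc3d (int l, int w, int h, float value)
{
  int i, j, k;
  float ***x = xmalloc (l * sizeof (float **));

  for (i = 0; i < l; i++)
    {
      x[i] = xmalloc (w * sizeof (float *));
      for (j = 0; j < w; j++)
        {
          x[i][j] = xmalloc (h * sizeof (float));
          for (k = 0; k < h; k++)
            x[i][j][k] = value;
        }
    }
  return x;
}

/* Release every level allocated by alloc3d.  */
void
free3d (float ***x, int l, int w)
{
  int i, j;

  for (i = 0; i < l; i++)
    {
      for (j = 0; j < w; j++)
        free (x[i][j]);
      free (x[i]);
    }
  free (x);
}
```

In main, data1, data2 and data3 could then each be set from their own alloc3d (L, W, H, ...) call, with values chosen so that the products stay finite, and released with free3d (..., L, W) after the timing loop.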

BTW, what's wrong with:

PROGRAM TEST
REAL, ALLOCATABLE :: A(:,:,:), B(:,:,:), C(:,:,:)
READ*,L,M,N
ALLOCATE(A(L,M,N),B(L,M,N),C(L,M,N))
A=1.0;B=2.0;C=A+B
PRINT*,SUM(C)
END PROGRAM TEST
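
For comparison, a rough C analogue of the Fortran program above, offered only as an illustration: Fortran allocatable arrays occupy one contiguous block each, so the sketch uses a single malloc per array instead of the pointer-to-pointer layout of the test case. The names idx, alloc_filled and add_and_sum are hypothetical, and the index function uses C's usual row-major order (last index varies fastest), whereas Fortran stores the first index fastest.

```c
#include <stddef.h>
#include <stdlib.h>

/* Offset of element (i, j, k) in a contiguous block whose second and
   third extents are W and H; the last index varies fastest.  */
size_t
idx (size_t i, size_t j, size_t k, size_t w, size_t h)
{
  return (i * w + j) * h + k;
}

/* Allocate N floats in one block and set each of them to VALUE.
   Returns NULL if the allocation fails.  */
float *
alloc_filled (size_t n, float value)
{
  size_t i;
  float *p = malloc (n * sizeof (float));

  if (p == NULL)
    return NULL;
  for (i = 0; i < n; i++)
    p[i] = value;
  return p;
}

/* Compute c = a + b element by element over N elements and return
   the sum of c, mirroring C=A+B followed by SUM(C).  */
double
add_and_sum (float *c, const float *a, const float *b, size_t n)
{
  size_t i;
  double sum = 0.0;

  for (i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
      sum += c[i];
    }
  return sum;
}
```

With this layout the whole operation is a single flat loop over l * w * h elements, and idx is only needed when an individual element (i, j, k) has to be addressed.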

--
Toon Moene - mailto:toon@moene.indiv.nluug.nl - phoneto: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Maintainer, GNU Fortran 77: http://gcc.gnu.org/onlinedocs/g77_news.html
GNU Fortran 95: http://gcc-g95.sourceforge.net/ (under construction)

Follow-Ups:
- Re: [tree-ssa] performance with loops
  - From: Steven Bosscher

References:
- [tree-ssa] performance with loops
  - From: Steven Bosscher
