Bug 29543 - Poor code generated for arrays with variable bounds
Summary: Poor code generated for arrays with variable bounds
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: ada
Version: 4.2.0
Importance: P3 enhancement
Target Milestone: 4.7.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-10-22 02:00 UTC by jeff
Modified: 2012-01-22 19:38 UTC (History)
3 users

See Also:
Host:
Target:
Build: i686-pc-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2006-10-22 06:45:13


Attachments
Source code and very simple build script for Ada and FORTRAN code capable of showing poor tree and runtime performance. (1.42 KB, application/octet-stream)
2006-10-22 02:02 UTC, jeff

Description jeff 2006-10-22 02:00:22 UTC
I understand that comparing very small benchmarks like this can be misleading, but I believe I've looked at this enough to have a sense that it is demonstrating a basic truth and not a narrow performance issue.

The attached test case contains a FORTRAN and an Ada program that are equivalent (within their matrix-multiply loop). The Ada one runs about 2x slower, with about 3x the number of machine instructions in the inner loop. (Note: this is with Ada run-time checks disabled.)

I dumped the optimized trees (the original tree of the Ada version was difficult to read because the pretty printer did not know the Ada-specific node types). The Ada tree is certainly a mess compared to the FORTRAN version.

The core of the FORTRAN code looks like

   do I = 1,N
      do J = 1,N
         sum = 0.0
         do R = 1,N
            sum = sum + A(I,R)*B(R,J)
         end do
         C(I,J) = sum
      end do
   end do


With the resulting optimized tree fragment (of the inner most loop) being

<L25>:;
  sum = MEM[base: (real4 *) ivtmp.97] * MEM[base: (real4 *) pretmp.81, index: (real4 *) ivtmp.161 + (real4 *) ivtmp.94, step: 4B, offset: 4B] + sum;
  ivtmp.94 = ivtmp.94 + 1;
  ivtmp.97 = ivtmp.97 + ivtmp.157;
  if (ivtmp.94 == (<unnamed type>) D.1273) goto <L29>; else goto <L25>;

While the core of the Ada code looks like:

   for I in A'range(1) loop
      for J in A'range(2) loop
         Sum := 0.0;
         for R in A'range(2) loop
            Sum := Sum + A(I,R)*B(R,J);
         end loop;
         C(I,J) := Sum;
      end loop;
   end loop;

With the resulting optimized tree fragment of the inner most loop being :

<L15>:;
  D.2370 = (*D.2277)[pretmp.627]{lb: tst_array__L_3__T16b___L sz: pretmp.709 * 4}[(<unnamed type>) r]{lb: tst_array__L_4__T17b___L sz: 4};

<bb 51>:
  temp.721 = D.2344->LB0;

<bb 52>:
  temp.720 = D.2344->UB1;

<bb 53>:
  temp.719 = D.2344->LB1;

<bb 54>:
  j.73 = (<unnamed type>) j;
  D.2373 = (*D.2298)[(<unnamed type>) r]{lb: temp.721 sz: MAX_EXPR <(temp.720 + 1 - temp.719) * 4, 0> + 3 & -4}[j.73]{lb: temp.719 sz: 4};

<bb 55>:
  D.2374 = D.2370 * D.2373;

<bb 56>:
  sum = D.2374 + sum;

<bb 57>:
  if (r == tst_array__L_4__T17b___U) goto <L17>; else goto <L16>;

<L16>:;
  r = r + 1;
  goto <bb 50> (<L15>);

Now, I'll be the first to admit that I know very little about the innards of compiler technology but that tree looks like a horrible mess. It is no wonder the resulting assembly is such a mess.

I am attaching a tar file that has the complete source for the Ada and the FORTRAN version.
Comment 1 jeff 2006-10-22 02:02:38 UTC
Created attachment 12473 [details]
Source code and very simple build script for Ada and FORTRAN code capable of showing poor tree and runtime performance.
Comment 2 Andrew Pinski 2006-10-22 02:20:25 UTC
Well, actually the Fortran front end flattens the arrays, which causes the difference.  In fact the Fortran front end is incorrect in flattening them, for debugging reasons.
Comment 3 Andrew Pinski 2006-10-22 02:23:42 UTC
But the real problem is that IV-opts does not know how to handle non-constant array sizes and bounds.
Comment 4 Eric Botcazou 2006-10-22 06:45:13 UTC
> The test case that has been attached shows a FORTRAN and Ada program that are
> equivalent (within their matrix multiply loop). The Ada one runs about 2x
> slower with about 3x the number of machine instructions in the inner loop.

Why do you go through an indirection in Ada?  Just write:

   type Real_Matrix is array(Integer range <>,Integer range <>) of Float;
   N : Positive := Positive'Value (Argument (1));
   G : Ada.Numerics.Float_Random.Generator;

   A,B,C : Real_Matrix (1..N, 1..N);
   Start, Finish : Ada.Calendar.Time;
   Sum : Float := 0.0;

and you'll probably significantly narrow the gap.
Comment 5 Laurent GUERBY 2006-10-22 12:04:08 UTC
Eric, if the type is Long_Float your solution will be much worse, since the stack alignment of the matrices will be 4 bytes, so there is a 50% chance the code will be N times slower (when not 8-byte aligned).

So in practice you have to go through malloc in order to get proper alignment for Long_Float, and there is no reason the proposed code should perform badly.
Comment 6 Eric Botcazou 2006-10-22 12:42:05 UTC
> Eric, if the type is Long_Float your solution will be much worse since the
> alignment on the stack of the matrices will be 4 bytes so you have a 50% chance
> the code will be N times slower (when not 8-byte aligned).

You're over-generalizing.  Choose a saner architecture or use -malign-double.

> So in practice you have to go malloc in order to get proper alignment for
> Long_Float, and there is no reason the proposed code should perform badly.

No, but there is an obvious one why the Fortran version trounces the Ada
version.  Let's not compare apples with oranges.
Comment 7 Eric Botcazou 2006-10-22 15:03:04 UTC
> No, but there is an obvious one why the Fortran version trounces the Ada
> version.  Let's not compare apples with oranges.

The adverse effect of the indirection can be alleviated though:

   N : Positive := Positive'Value (Argument (1));
   G : Ada.Numerics.Float_Random.Generator;

   type Real_Matrix is array (1..N, 1..N) of Float;
   type Matrix_Access is access Real_Matrix;

   A,B,C : Matrix_Access;
   Start, Finish : Ada.Calendar.Time;
   Sum : Float := 0.0;
begin
   A := new Real_Matrix;
   B := new Real_Matrix;
   C := new Real_Matrix;
Comment 8 jeff 2006-10-22 15:50:02 UTC
(In reply to comment #7)
> > No, but there is an obvious one why the Fortran version trounces the Ada
> > version.  Let's not compare apples with oranges.
> 
> The adverse effect of the indirection can be alleviated though:
> 
>    N : Positive := Positive'Value (Argument (1));
>    G : Ada.Numerics.Float_Random.Generator;
> 
>    type Real_Matrix is array (1..N, 1..N) of Float;
>    type Matrix_Access is access Real_Matrix;
> 
>    A,B,C : Matrix_Access;
>    Start, Finish : Ada.Calendar.Time;
>    Sum : Float := 0.0;
> begin
>    A := new Real_Matrix;
>    B := new Real_Matrix;
>    C := new Real_Matrix;
> 


That is a good approach and it significantly narrows the gap, though I would argue that a general user would not see the original version as "different" from the FORTRAN version: both of them dynamically allocate the arrays at run time based on an input parameter. The FORTRAN arrays could indeed be passed to other procedures (such as matmul) that take variable-sized arrays, while this modified version (in its current form) could not.

So, from an understanding of the internals of the compiler, I think you have a reasonable argument that the original Ada and FORTRAN were somewhat different, but I think this modified version is actually further from the FORTRAN. (Though I must admit I know very little about FORTRAN 95.)
Comment 9 Laurent GUERBY 2006-10-22 16:13:35 UTC
Eric, the build is "i686-pc-linux-gnu", so I don't understand why you speak of over-generalizing :).

Anyway, I changed the code to match more likely use, that is by using a subprogram:

   procedure Compute (A, B : in Real_Matrix; C : out Real_Matrix) is
      Sum : Float := 0.0;
   begin
      for I in A'range(1) loop
         for J in A'range(2) loop
            Sum := 0.0;
            for R in A'range(2) loop
               Sum := Sum + A(I,R)*B(R,J);
            end loop;
            C(I,J) := Sum;
         end loop;
      end loop;
   end Compute;

...

   Start := Ada.Calendar.Clock;
   Compute (A.all, B.all, C.all);
   Finish := Ada.Calendar.Clock;

On my x86_64 box at -O2 (-gnatp), Ada is still three times slower:

cd ada ; ./tst_array 600
 1.65129E+02 1.47839E+02
 1.57425E+02 1.40635E+02
Time:           1.541291000

cd fortran ; ./tst_array 600
   146.0253       145.0025       149.3322       150.8924
 Time:   0.5400340

I don't know how to write the equivalent Fortran code, but it's likely better to concentrate on the subprogram version for optimization purposes.
Comment 10 Eric Botcazou 2006-10-22 18:33:54 UTC
> I don't know how to write the equivalent fortran code, but it's likely better
> to concentrate on the subprogram version for optimization purpose.

Sorry, I don't see your point.
Comment 11 Laurent GUERBY 2006-10-22 20:55:25 UTC
My point is that the subprogram I wrote is probably the way most Ada programmers will write it when they want to multiply matrices (if not using BLAS/Lapack).
Comment 12 Eric Botcazou 2006-10-22 21:27:04 UTC
> My point is that the subprogram I wrote is probably the way most Ada
> programmers will write when they want to multiply matrices (if not using
> BLAS/Lapack).

OK, but you have not simplified the problem in doing so. :-)
Comment 13 Laurent GUERBY 2006-10-22 21:55:43 UTC
Indeed :)
Comment 14 jeff 2006-10-23 23:50:36 UTC
(In reply to comment #1)
> Created an attachment (id=12473) [edit]
> Source code and very simple build script for Ada and FORTRAN code capable of
> showing poor tree and runtime performance.
> 

Note: while it does not fully close the gap, I can confirm that a large portion of this poor performance is a regression when compared to GCC 4.0.3.

GCC 4.0.3 is about 1/3 faster than 4.2.0, which also corresponds to a significantly reduced number of generated instructions in the inner loop.



Comment 15 Eric Botcazou 2012-01-22 19:38:00 UTC
Run-time performance should be comparable now.