I understand that comparing very small benchmarks like this can be misleading, but I believe I've looked at this enough to have a sense that it is demonstrating a basic truth and not a narrow performance issue. The attached test case contains a FORTRAN and an Ada program that are equivalent (within their matrix multiply loop). The Ada one runs about 2x slower, with about 3x the number of machine instructions in the inner loop. (Note that this is with Ada run-time checks disabled.) I dumped the optimized trees (as the original tree of the Ada version was difficult to read, because the node types are not known to the pretty printer). The Ada tree is certainly a mess compared to the FORTRAN version.

The core of the FORTRAN code looks like:

      do I = 1,N
         do J = 1,N
            sum = 0.0
            do R = 1,N
               sum = sum + A(I,R)*B(R,J)
            end do
            C(I,J) = sum
         end do
      end do

with the resulting optimized tree fragment (of the innermost loop) being:

  <L25>:;
  sum = MEM[base: (real4 *) ivtmp.97] * MEM[base: (real4 *) pretmp.81, index: (real4 *) ivtmp.161 + (real4 *) ivtmp.94, step: 4B, offset: 4B] + sum;
  ivtmp.94 = ivtmp.94 + 1;
  ivtmp.97 = ivtmp.97 + ivtmp.157;
  if (ivtmp.94 == (<unnamed type>) D.1273) goto <L29>; else goto <L25>;

while the core of the Ada code looks like:

   for I in A'range(1) loop
      for J in A'range(2) loop
         Sum := 0.0;
         for R in A'range(2) loop
            Sum := Sum + A(I,R)*B(R,J);
         end loop;
         C(I,J) := Sum;
      end loop;
   end loop;

with the resulting optimized tree fragment of the innermost loop being:

  <L15>:;
  D.2370 = (*D.2277)[pretmp.627]{lb: tst_array__L_3__T16b___L sz: pretmp.709 * 4}[(<unnamed type>) r]{lb: tst_array__L_4__T17b___L sz: 4};
<bb 51>:
  temp.721 = D.2344->LB0;
<bb 52>:
  temp.720 = D.2344->UB1;
<bb 53>:
  temp.719 = D.2344->LB1;
<bb 54>:
  j.73 = (<unnamed type>) j;
  D.2373 = (*D.2298)[(<unnamed type>) r]{lb: temp.721 sz: MAX_EXPR <(temp.720 + 1 - temp.719) * 4, 0> + 3 & -4}[j.73]{lb: temp.719 sz: 4};
<bb 55>:
  D.2374 = D.2370 * D.2373;
<bb 56>:
  sum = D.2374 + sum;
<bb 57>:
  if (r == tst_array__L_4__T17b___U) goto <L17>; else goto <L16>;
<L16>:;
  r = r + 1;
  goto <bb 50> (<L15>);

Now, I'll be the first to admit that I know very little about the innards of compiler technology, but that tree looks like a horrible mess. It is no wonder the resulting assembly is such a mess. I am attaching a tar file that has the complete source for the Ada and the FORTRAN versions.
Created attachment 12473 [details] Source code and very simple build script for Ada and FORTRAN code capable of showing poor tree and runtime performance.
Well, actually the Fortran front-end flattens the arrays, which causes the difference. In fact, the Fortran front-end is incorrect in flattening them, for reasons of debugging.
But the real problem is that IV-opts (the induction variable optimization pass) does not know how to handle non-constant array sizes and bounds.
> The test case that has been attached shows a FORTRAN and Ada program that are
> equivalent (within their matrix multiply loop). The Ada one runs about 2x
> slower with about 3x the number of machine instructions in the inner loop.

Why do you go through an indirection in Ada? Just write:

   type Real_Matrix is array (Integer range <>, Integer range <>) of Float;

   N : Positive := Positive'Value (Argument (1));
   G : Ada.Numerics.Float_Random.Generator;

   A, B, C : Real_Matrix (1..N, 1..N);
   Start, Finish : Ada.Calendar.Time;
   Sum : Float := 0.0;

and you'll probably significantly narrow the gap.
Eric, if the type is Long_Float your solution will be much worse, since the alignment of the matrices on the stack will be 4 bytes, so you have a 50% chance that the code will be N times slower (when not 8-byte aligned). So in practice you have to go through malloc in order to get proper alignment for Long_Float, and there is no reason the proposed code should perform badly.
> Eric, if the type is Long_Float your solution will be much worse since the
> alignement on the stack of the matrices will be 4-bytes so you have 50% chance
> the code will be N times slower (when not 8-bytes aligned).

You're over-generalizing. Choose a saner architecture or use -malign-double.

> So in practice you have to go malloc in order to get proper alignment for
> Long_Float, and there is no reason the proposed code should perform badly.

No, but there is an obvious one why the Fortran version trounces the Ada version. Let's not compare apples with oranges.
> No, but there is an obvious one why the Fortran version trounces the Ada
> version. Let's not compare apples with oranges.

The adverse effect of the indirection can be alleviated though:

   N : Positive := Positive'Value (Argument (1));
   G : Ada.Numerics.Float_Random.Generator;

   type Real_Matrix is array (1..N, 1..N) of Float;
   type Matrix_Access is access Real_Matrix;

   A, B, C : Matrix_Access;
   Start, Finish : Ada.Calendar.Time;
   Sum : Float := 0.0;
begin
   A := new Real_Matrix;
   B := new Real_Matrix;
   C := new Real_Matrix;
(In reply to comment #7)
> > No, but there is an obvious one why the Fortran version trounces the Ada
> > version. Let's not compare apples with oranges.
>
> The adverse effect of the indirection can be alleviated though:
>
>    N : Positive := Positive'Value (Argument (1));
>    G : Ada.Numerics.Float_Random.Generator;
>
>    type Real_Matrix is array (1..N, 1..N) of Float;
>    type Matrix_Access is access Real_Matrix;
>
>    A,B,C : Matrix_Access;
>    Start, Finish : Ada.Calendar.Time;
>    Sum : Float := 0.0;
> begin
>    A := new Real_Matrix;
>    B := new Real_Matrix;
>    C := new Real_Matrix;

That is a good approach, and it significantly narrows the gap, though I would argue that a general user would not see the original version as "different" from the FORTRAN version: both of them dynamically allocate the size at run time based on an input parameter. The FORTRAN arrays could indeed be passed to other procedures (such as matmul) that take variable-sized arrays, while this modified version (in its current form) could not. So, from an understanding of the internals of the compiler, I think you have a reasonable argument that the original Ada and FORTRAN were somewhat different, but I think this modified version is actually further from the FORTRAN. (Though I must admit I know very little about Fortran 95.)
Eric, the build is "i686-pc-linux-gnu"; I don't understand why you speak of generalization :). Anyway, I changed the code to match a more likely use, that is, by using a subprogram:

   procedure Compute (A, B : in Real_Matrix; C : out Real_Matrix) is
      Sum : Float := 0.0;
   begin
      for I in A'range(1) loop
         for J in A'range(2) loop
            Sum := 0.0;
            for R in A'range(2) loop
               Sum := Sum + A(I,R)*B(R,J);
            end loop;
            C(I,J) := Sum;
         end loop;
      end loop;
   end Compute;

   ...

   Start := Ada.Calendar.Clock;
   Compute (A.all, B.all, C.all);
   Finish := Ada.Calendar.Clock;

On my x86_64 box at -O2 (-gnatp), Ada is still three times slower:

   cd ada ; ./tst_array 600
    1.65129E+02 1.47839E+02 1.57425E+02 1.40635E+02
   Time: 1.541291000

   cd fortran ; ./tst_array 600
    146.0253 145.0025 149.3322 150.8924
   Time: 0.5400340

I don't know how to write the equivalent Fortran code, but it's likely better to concentrate on the subprogram version for optimization purposes.
> I don't know how to write the equivalent fortran code, but it's likely better
> to concentrate on the subprogram version for optimization purpose.

Sorry, I don't see your point.
My point is that the subprogram I wrote is probably the way most Ada programmers would write it when they want to multiply matrices (if not using BLAS/LAPACK).
> My point is that the subprogram I wrote is probably the way most Ada
> programmers will write when they want to multiply matrices (if not using
> BLAS/Lapack).

OK, but you have not simplified the problem in doing so. :-)
Indeed :)
(In reply to comment #1)
> Created an attachment (id=12473) [edit]
> Source code and very simple build script for Ada and FORTRAN code capable of
> showing poor tree and runtime performance.

Note, while it does not fully close the gap, I can confirm that a large portion of this poor performance is a regression when compared to GCC 4.0.3. The 4.0.3 code is about 1/3 faster than the 4.2.0 code, which also corresponds to a significantly reduced number of generated instructions for the inner loop.
Run-time performance should be comparable now.