This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Re: patches for increased performance of matmul, dotprod, transpose


Tobias Schlüter wrote:

Hi Tim,

Tim Prince wrote:


I ran the complete build and test on i686-pc-cygwin with gcc 4.0.2, and also tested with 4.1.0 and on ia64 and x86-64 Linux. Since the other work mentioned on this list will no doubt have to be completed first, followed by determining whether any of these patches are still relevant, I expect any ChangeLog entries would be obsolete before the patches could be considered. I haven't heard of any checks on the status of my paperwork for several years.
The additional unrolling is intended only to be enough to approach a full pipeline on the simpler x86 CPUs, which can issue more than one floating-point multiply instruction within the latency of an addition. This reasoning doesn't apply to integer add. For matmul, memory accesses should be cut by nearly a factor of 2. Problems that are large with respect to cache size aren't addressed, except that the boundary of "large" is pushed up somewhat. Likewise, transpose is taken in an order more favorable to machines with write-combine buffering.
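
To make the unrolling argument concrete, here is a minimal sketch in C. It is not the submitted patch, and the names `dot_two_sums` and `transpose_contiguous_stores` are hypothetical. The first function keeps two independent partial sums, so a second multiply-add can start while the first addition is still in flight. The second walks the destination with unit stride, on the assumption that contiguous stores are the order that suits write-combine buffers.

```c
#include <stddef.h>

/* Dot product with two independent accumulators.  The two sums do not
   depend on each other, so their additions can overlap.  */
double
dot_two_sums (const double *x, const double *y, size_t n)
{
  double s0 = 0.0, s1 = 0.0;
  size_t i;

  for (i = 0; i + 1 < n; i += 2)
    {
      s0 += x[i] * y[i];
      s1 += x[i + 1] * y[i + 1];
    }

  /* Handle the last element when N is odd.  */
  if (i < n)
    s0 += x[i] * y[i];

  return s0 + s1;
}

/* Transpose the ROWS x COLS row-major matrix SRC into DST, which is
   COLS x ROWS.  The inner loop stores to consecutive addresses and
   reads with stride COLS.  */
void
transpose_contiguous_stores (size_t rows, size_t cols,
                             const double *src, double *dst)
{
  for (size_t j = 0; j < cols; j++)
    for (size_t i = 0; i < rows; i++)
      dst[j * rows + i] = src[i * cols + j];
}
```

Splitting the sum changes the order of the floating-point additions, so the result can differ in the last bits from a single-accumulator loop. That is one reason a compiler will not make this transformation on its own under default floating-point rules.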



If the manual loop unrolling is really helpful, I think an optimizer bug
should be filed. On IRC our optimizer guys told me that at least the
modification to matmul should already be done automatically, so I'm wondering
whether you have any benchmark numbers supporting this modification. Can't the
same effect be obtained by building libgfortran with -funroll-loops?


-funroll-loops unrolls only the inner loop. It doesn't look for opportunities to unroll multiple levels of loops, combined with loop interchanges. I assumed there was no plan for gcc to perform multiple-level loop unrolling and related optimizations.
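
As an illustration of what unrolling multiple loop levels means here (again a sketch, not the code in the patch, with hypothetical function names), the version below unrolls the two outer loops of a matrix multiply by 2 and fuses the resulting copies of the inner loop. Each pass through the fused inner loop performs 4 loads for 4 multiply-adds, where the plain version needs 8 loads for the same work. Unrolling only the inner loop cannot produce this reuse.

```c
#include <stddef.h>

/* Dot product of row I of A (M x N) with column J of B (N x P).
   All matrices are stored row-major.  */
static double
dot_row_col (size_t n, size_t p, const double *a, const double *b,
             size_t i, size_t j)
{
  double s = 0.0;
  for (size_t k = 0; k < n; k++)
    s += a[i * n + k] * b[k * p + j];
  return s;
}

/* Reference version: C = A * B, one output element at a time.  */
void
matmul_plain (size_t m, size_t n, size_t p,
              const double *a, const double *b, double *c)
{
  for (size_t i = 0; i < m; i++)
    for (size_t j = 0; j < p; j++)
      c[i * p + j] = dot_row_col (n, p, a, b, i, j);
}

/* Outer loops unrolled by 2 and jammed: a 2 x 2 block of C is
   accumulated in registers, reusing each loaded value twice.  */
void
matmul_jam (size_t m, size_t n, size_t p,
            const double *a, const double *b, double *c)
{
  size_t m2 = m - m % 2;
  size_t p2 = p - p % 2;

  for (size_t i = 0; i < m2; i += 2)
    for (size_t j = 0; j < p2; j += 2)
      {
        double s00 = 0.0, s01 = 0.0, s10 = 0.0, s11 = 0.0;
        for (size_t k = 0; k < n; k++)
          {
            double a0 = a[i * n + k];
            double a1 = a[(i + 1) * n + k];
            double b0 = b[k * p + j];
            double b1 = b[k * p + j + 1];
            s00 += a0 * b0;
            s01 += a0 * b1;
            s10 += a1 * b0;
            s11 += a1 * b1;
          }
        c[i * p + j] = s00;
        c[i * p + j + 1] = s01;
        c[(i + 1) * p + j] = s10;
        c[(i + 1) * p + j + 1] = s11;
      }

  /* Clean-up: the last column when P is odd, for every row.  */
  if (p2 < p)
    for (size_t i = 0; i < m; i++)
      c[i * p + p2] = dot_row_col (n, p, a, b, i, p2);

  /* Clean-up: the last row when M is odd, for the paired columns.  */
  if (m2 < m)
    for (size_t j = 0; j < p2; j++)
      c[m2 * p + j] = dot_row_col (n, p, a, b, m2, j);
}
```

Each output element is still summed over k in the original order, so this variant adds reuse without changing the floating-point summation order.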

For library code such as this, it seems much more direct to write such optimizations in by hand, if they apply to most architectures of interest. I already mentioned the concern that they may not be desired for complex arithmetic on 32-bit platforms.

Another question is whether the build process provides a way to turn on -funroll-loops for specific library builds, such as libgfortran. The option does have some value. I agree that there is no point in manual unrolling that merely duplicates the kind of unrolling -funroll-loops performs automatically.

A procedural point: your patch doesn't conform to the GNU coding style
(comments should begin with a capital letter and end in "punctuation + two
spaces + */", always put blanks before and after operators, indent comments to
code); also if you find you really need to hand-optimize code, please add a
comment explaining why, adding "FIXME" and a PR number if you think you're
working around an optimizer bug.
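
For reference, a short function formatted along the lines described above might look like the following sketch. The function `scale` is hypothetical, and "PR NNNNN" is a placeholder, not a real bug number.

```c
/* Scale the N elements of X by FACTOR in place.  */
void
scale (double *x, int n, double factor)
{
  int i;

  /* FIXME: The loop is unrolled by hand to work around a suspected
     optimizer problem (PR NNNNN; the number is a placeholder).  */
  for (i = 0; i + 1 < n; i += 2)
    {
      x[i] = x[i] * factor;
      x[i + 1] = x[i + 1] * factor;
    }

  /* Handle the last element when N is odd.  */
  if (i < n)
    x[i] = x[i] * factor;
}
```

Each comment is a capitalized sentence ending in a period and two spaces before the closing `*/`, operators have a blank on both sides, and comments are indented to match the code they describe.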



I will fix up the comments. I tried to fix the style to conform with what was already in use. The existing code seems short of useful comments.


WRT your mail:  paragraphs make reading and also responding to emails much
easier.  ChangeLogs are also useful for referencing parts when discussing the
patch.



As my past experience is that gcc patches don't get considered for many months, and we are already on notice of intended changes which will require further adjustments, I thought changelogs might be premature.

I haven't been able to locate the specific versions of the benchmarks mentioned on the mailing list. I am going in part by performance on the Livermore Fortran Kernels (LFK), but others have chosen a different f90 translation of LFK, which I can't find publicly available.

Thanks.

