This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
optimization question
- From: VandeVondele Joost <vondele at pci dot uzh dot ch>
- To: gcc at gcc dot gnu dot org
- Date: Sat, 16 May 2009 10:28:27 +0200 (MEST)
- Subject: optimization question
The attached code (see contract_pppp_sparse) is a kernel which I hope gets
optimized well. Unfortunately, compiling it (on Opteron or Core2) as
gfortran -O3 -march=native -ffast-math -funroll-loops
-ffree-line-length-200 test.f90
./a.out
Sparse: time[s] 0.66804099
New: time[s] 0.20801300
speedup 3.2115347
Glfops 3.1151900
Error: 1.11022302462515654E-016
shows that the hand-optimized version (see contract_pppp_test) is about 3x
faster. I played around with options, but couldn't get gcc to generate
fast code for the original source. I think this would involve unrolling
a loop and scalarizing the scratch arrays buffer1 and buffer2 (as done in
the hand-optimized version). So, is there any combination of options that
achieves that effect?
Second question: even the code generated for the hand-optimized version is
not quite ideal. The asm of the inner loop appears (like the source) to
contain about 4*81 multiplies. However, a 'smarter' way to do the
calculation would be to compute the constants multiplying work(i) by
retaining common subexpressions: all values of sa_i * sb_j * sc_k
* sd_l * work(n) can be computed in 9+9+81+81 multiplies instead of the
current scheme's 4*81. That could bring another factor-of-2
speedup. Is there a chance to have gcc see this, or does this need to be
done at the source level?
If considered useful, I can file a PR in Bugzilla with the testcase.
Joost
Attachment:
test.f90
Description: Text document