


optimization question


The attached code (see contract_pppp_sparse) is a kernel that I hope gets optimized well. Unfortunately, compiling it (on Opteron or Core2) as

gfortran -O3 -march=native -ffast-math -funroll-loops -ffree-line-length-200 test.f90

./a.out
 Sparse: time[s]   0.66804099
 New: time[s]   0.20801300
     speedup    3.2115347
      Gflops    3.1151900
 Error:   1.11022302462515654E-016

shows that the hand-optimized version (see contract_pppp_test) is about 3x faster. I played around with options but couldn't get gcc to generate fast code for the original source. I think doing so would require unrolling a loop and scalarizing the scratch arrays buffer1 and buffer2, as is done in the hand-optimized version (a sketch of what I mean follows below). So, is there any combination of options that gets this effect?
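
To make the first question concrete, here is a rough sketch of the transformation I apply by hand. The loop shapes and names (apart from buffer1) are made up for illustration and are not the real contract_pppp_sparse; the point is only that the small scratch array becomes a set of scalars once the length-3 loop is unrolled:

! Illustrative only: a tiny kernel with the same pattern of a small
! scratch array.  The real contract_pppp_sparse is different; this just
! shows the transformation I would like gfortran to do automatically.
subroutine kernel_arrays(s, work, res, n)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: s(3), work(n)
  real(8), intent(inout) :: res(3)
  real(8) :: buffer1(3)
  integer :: i, k
  do k = 1, n
     do i = 1, 3
        buffer1(i) = s(i) * work(k)   ! scratch array lives in memory
     end do
     do i = 1, 3
        res(i) = res(i) + buffer1(i)
     end do
  end do
end subroutine kernel_arrays

! Hand-optimized version: unroll the length-3 loops and replace the
! scratch array with scalars, so everything stays in registers.
subroutine kernel_scalars(s, work, res, n)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: s(3), work(n)
  real(8), intent(inout) :: res(3)
  real(8) :: b1, b2, b3
  integer :: k
  do k = 1, n
     b1 = s(1) * work(k)
     b2 = s(2) * work(k)
     b3 = s(3) * work(k)
     res(1) = res(1) + b1
     res(2) = res(2) + b2
     res(3) = res(3) + b3
  end do
end subroutine kernel_scalars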

Second question: even the code generated for the hand-optimized version is not quite ideal. The asm of the inner loop appears (like the source) to contain about 4*81 multiplies. However, a smarter way to do the calculation would be to compute the constants that multiply work(n) by retaining common subexpressions: all values of sa_i * sb_j * sc_k * sd_l * work(n) can be computed with 9+9+81+81 = 180 multiplies instead of the 4*81 = 324 of the current scheme (see the sketch below). That could bring roughly another factor of 2 speedup. Is there a chance to have gcc see this, or does it need to be done at the source level?
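
Again to make this concrete, here is roughly the factoring I have in mind. The names sab and scd are hypothetical intermediates (not from test.f90), and the mapping of (i,j,k,l) to n is just one possible ordering:

! Sketch of the factored product, assuming length-3 coefficient vectors
! sa, sb, sc, sd and an 81-element work array.
subroutine factored_products(sa, sb, sc, sd, work, out)
  implicit none
  real(8), intent(in)  :: sa(3), sb(3), sc(3), sd(3), work(81)
  real(8), intent(out) :: out(81)
  real(8) :: sab(3,3), scd(3,3)
  integer :: i, j, k, l, n

  ! 9 + 9 multiplies for the pair products
  do j = 1, 3
     do i = 1, 3
        sab(i,j) = sa(i) * sb(j)
        scd(i,j) = sc(i) * sd(j)
     end do
  end do

  ! 81 multiplies to combine the pairs plus 81 to apply work(n):
  ! 9+9+81+81 = 180 in total, instead of 4*81 = 324 for the direct product.
  n = 0
  do l = 1, 3
     do k = 1, 3
        do j = 1, 3
           do i = 1, 3
              n = n + 1
              out(n) = (sab(i,j) * scd(k,l)) * work(n)
           end do
        end do
     end do
  end do
end subroutine factored_products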

If this is considered useful, I can file a PR in bugzilla with the testcase.

Joost

Attachment: test.f90
Description: Text document

