This is the mail archive of the gcc@gcc.gnu.org
mailing list for the GCC project.
Re: optimization question
- From: Richard Guenther <richard dot guenther at gmail dot com>
- To: VandeVondele Joost <vondele at pci dot uzh dot ch>
- Cc: gcc at gcc dot gnu dot org
- Date: Sat, 16 May 2009 11:24:58 +0200
- Subject: Re: optimization question
- References: <Pine.A41.email@example.com>
On Sat, May 16, 2009 at 10:28 AM, VandeVondele Joost <firstname.lastname@example.org> wrote:
> the attached code (see contract_pppp_sparse) is a kernel which I hope gets
> optimized well. Unfortunately, compiling (on opteron or core2) it as
> gfortran -O3 -march=native -ffast-math -funroll-loops -ffree-line-length-200
> Sparse: time[s]   0.66804099
> New: time[s]      0.20801300
>     speedup       3.2115347
>      Gflops       3.1151900
> Error:   1.11022302462515654E-016
> shows that the hand-optimized version (see contract_pppp_test) is about 3x
> faster. I played around with options, but couldn't get gcc to generate fast
> code for the original source. I think that this would involve unrolling a
> loop and scalarizing the scratch arrays buffer1 and buffer2 (as done in the
> hand-optimized version). So, is there any combination of options to get that?
> Second question, even the code generated for the hand-optimized version is
> not quite ideal. The asm of the inner loop appears (like the source) to
> contain about 4*81 multiplies. However, a 'smarter' way to do the
> calculation would be to compute the constants used for multiplying work(i)
> by retaining common subexpressions (i.e. all values of sa_i * sb_j * sc_k *
> sd_l * work[n] can be computed in 9+9+81+81 multiplies instead of the
> current scheme, which has 4*81). That could bring another factor of 2
> speedup. Is there a chance to have gcc see this, or does this need to be
> done on the source level ?
> If considered useful, I can add a PR to bugzilla with the testcase.
I think it would be useful to have a bugzilla entry for this.
The loop bodies are too big to be unrolled early, and the temporary array
is unfortunately addressable (and thus not scalarized early) because of

  # D.2754_658 = PHI <D.2754_130(11)>
  __builtin_memset (&buffer2, 0, 648);

repeated all over the place. Final unrolling unrolls some of the loops,
but that is way too late for follow-up optimizations. The factoring you
mention should already be done by reassoc/FRE - it would be interesting
to see why it does not work.
I tested 4.4; which version did you test?