This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug libfortran/51119] MATMUL slow for large matrices
- From: "jb at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Tue, 24 Nov 2015 12:22:46 +0000
- Subject: [Bug libfortran/51119] MATMUL slow for large matrices
- Auto-submitted: auto-generated
- References: <bug-51119-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #25 from Janne Blomqvist <jb at gcc dot gnu.org> ---
(In reply to Jerry DeLisle from comment #24)
> (In reply to Jerry DeLisle from comment #16)
> > For what it's worth:
> >
> > $ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native
> > $ ./a.out
> > Time, MATMUL: 21.2483196 21.254449646000001 1.5055670945599979
> >
> > Time, dgemm: 33.2441711 33.243087289000002 .96260614189671445
> >
>
> Running a sample matrix multiply program on this same platform using the
> default OpenCL (Mesa on Fedora 22) the machine is achieving:
>
> 64 x 64 2.76 Gflops
> 1000 x 1000 14.10
> 2000 x 2000 24.4
But that is not particularly impressive, is it? I don't know about current low-end
graphics adapters, but at least the high-end GPU cards (Tesla) are capable of
several Tflop/s. Of course, there is a non-trivial threshold size below which the
cost of moving data to/from the GPU is not amortized.
With the test program from comment #12, using OpenBLAS (which BTW should be
available in Fedora 22 as well), I get 337 Gflop/s, or 25 Gflop/s if I restrict
it to a single core with the OMP_NUM_THREADS=1 environment variable. This is on
a machine with 20 Ivy Bridge cores at 2.8 GHz.
I'm not against using GPUs per se, but I think there's a lot of low-hanging
fruit to be had just by making it easier for users to link against a high
performance BLAS implementation.
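
As a sketch of what that looks like today: gfortran already has an
-fexternal-blas option that dispatches MATMUL on sufficiently large matrices
to the external *GEMM routines, with -fblas-matmul-limit controlling the
crossover size. Assuming OpenBLAS is installed under its usual library name,
the test program could be built and timed roughly like this (the file name
follows the command quoted in comment #16; the limit value is illustrative):

```shell
# Let MATMUL call external dgemm for matrices larger than 32x32,
# and link against OpenBLAS instead of the reference BLAS.
gfortran -Ofast -march=native -fexternal-blas -fblas-matmul-limit=32 \
    pr51119.f90 -lopenblas -o a.out

# Threaded OpenBLAS run:
./a.out

# Restrict OpenBLAS to a single core for per-core numbers:
OMP_NUM_THREADS=1 ./a.out
```

The point is that the only user-visible changes versus the plain build are
one compile flag and the -lopenblas link line.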