"float complex" arithmetic performance much slower than expected
Tim Prince
n8tm@aol.com
Wed Mar 6 17:08:00 GMT 2013
On 3/6/2013 11:37 AM, Michele Martone wrote:
> Hi,
>
> I have four functions written in C99: FD,FS,FC,FZ.
>
> All implement the same algorithm.
>
> This algorithm reads a handful of integer and floating point numeric
> arrays, and updates one floating point numeric array with the results
> of the computation.
> In the case of FD the floating point numeric array is of type "double"; in
> the case of FZ it is of type "double complex"; FS uses "float"; FC uses
> "float complex".
> Apart from the floating point arrays (and some scalar arguments), the
> rest of the code is identical.
>
> Assume providing the same input to FD/FS/FC/FZ, except for the type of
> the numerical arrays of course.
> I compile two versions of the code: one with gcc-4.7.2,
> and the other with Intel's icc 13.1.
>
> Now, on the same input:
> FD/gcc takes ~ the same time as FD/icc
> FS/gcc takes ~ the same time as FS/icc
> FZ/gcc takes ~ twice the time of FZ/icc
> FC/gcc takes ~ six times the time of FC/icc
>
> In other words, my experiments suggest that my "double complex" (or,
> "double _Complex") code is considerably slower when compiled with gcc.
> And the implementation for "float complex" seems even slower.
>
> Some additional details:
>
> . Executions here are 'single threaded'
> . performed on an Intel Sandy Bridge CPU
> . The {FD,FZ,FS,FC} share the same source file
> . CFLAGS for gcc:
> "-O3 -pipe -march=native -mtune=native -mavx -std=c99 -fno-unroll-loops"
> . CFLAGS for icc: "-O3 -xAVX -restrict -unroll=0"
> . together, these functions are some 160 lines long (so, short)
> . I'm using loop unrolling in the code
> . argument arrays are specified as e.g.: "double complex * restrict x"
> . if I were to run ~3-4 instances of any of the above routines in parallel,
> the memory bandwidth of the CPU would be saturated.
>
> Now, one may argue about the "optimality" of my implementation of the
> four above routines. Regarding this, I also benchmarked an
> implementation of the same algorithm from Intel's MKL library.
> One may assume that MKL is "highly optimized":
>
> So, with regards to Intel's implementation:
> FD/icc and FZ/icc are ~20% slower than the MKL counterpart
> FS/icc and FC/icc are ~35% slower than the MKL counterpart
>
> But the gcc-compiled one:
> FD/gcc is ~20% slower than the MKL counterpart
> FS/gcc is ~35% slower than the MKL counterpart
> FZ/gcc is ~60% slower than the MKL counterpart (!)
> FC/gcc is ~90% slower than the MKL counterpart (!!)
>
> So it seems like the "float complex" compiled code is much slower with
> gcc than with icc, while this is not so for the other numerical types.
>
>
> Do you find this consistent with your experience with "complex" and gcc,
> or might I be ignoring some basic rule in using gcc?
>
In the absence of -fcx-limited-range, gcc may protect divide and sqrt by
using library functions, where icc would simply widen to double. You
would see any such library function usage if you profiled with gprof, at
least when the library is statically linked. Also, the library functions
used by gcc aren't vectorized, while icc would go further toward
promoting vectorization by in-lining code or calling vector math
functions. Vectorization reports for both compilers would shed light on
this question.
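
As a rough illustration of the point above, here is a minimal sketch,
assuming a simple y[i] += a * x[i] update. This is not the poster's
algorithm, which was not shown; the function names caxpy_operator and
caxpy_expanded are made up for this example. The first version relies on
the built-in complex multiply, which gcc without -fcx-limited-range may
back with a libgcc helper such as __mulsc3 to handle NaN/Inf corner
cases. The second writes the multiply out on real and imaginary parts,
which is roughly the arithmetic that -fcx-limited-range permits the
compiler to emit inline.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical kernel, not taken from the original post: y[i] += a * x[i].
 * The multiply uses the built-in complex operator, so its code generation
 * depends on flags such as -fcx-limited-range. */
void caxpy_operator(size_t n, float complex a,
                    const float complex *restrict x,
                    float complex *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Same update with the complex product expanded by hand:
 * (ar + i*ai) * (xr + i*xi) = (ar*xr - ai*xi) + i*(ar*xi + ai*xr).
 * No NaN/Inf recovery is attempted, so results can differ from the
 * operator version for non-finite or overflowing inputs. */
void caxpy_expanded(size_t n, float complex a,
                    const float complex *restrict x,
                    float complex *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    const float ar = crealf(a);
    const float ai = cimagf(a);
    for (size_t i = 0; i < n; ++i) {
        const float xr = crealf(x[i]);
        const float xi = cimagf(x[i]);
        const float re = crealf(y[i]) + (ar * xr - ai * xi);
        const float im = cimagf(y[i]) + (ar * xi + ai * xr);
        y[i] = re + im * I;   /* reassemble the complex result */
    }
}
```

Building both versions with and without -fcx-limited-range and comparing
the vectorizer output (for instance -ftree-vectorizer-verbose=2 with gcc
4.7, or -vec-report2 with icc 13) should show whether the helper call is
what blocks vectorization in a given case.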
--
Tim Prince