This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
On Thu, Mar 02, 2017 at 11:45:59AM +0100, Thomas Koenig wrote:
> Here's the updated version, which just uses FMA for AVX2.
>
> OK for trunk?
>
> Regards
>
> Thomas
>
> 2017-03-01 Thomas Koenig <tkoenig@gcc.gnu.org>
>
> PR fortran/78379
> * m4/matmul.m4 (matmul_'rtype_code`_avx2): Also generate for
> reals.  Add fma to target options.
> (matmul_'rtype_code`): Call AVX2 only if FMA is available.
> * generated/matmul_c10.c: Regenerated.
> * generated/matmul_c16.c: Regenerated.
> * generated/matmul_c4.c: Regenerated.
> * generated/matmul_c8.c: Regenerated.
> * generated/matmul_i1.c: Regenerated.
> * generated/matmul_i16.c: Regenerated.
> * generated/matmul_i2.c: Regenerated.
> * generated/matmul_i4.c: Regenerated.
> * generated/matmul_i8.c: Regenerated.
> * generated/matmul_r10.c: Regenerated.
> * generated/matmul_r16.c: Regenerated.
> * generated/matmul_r4.c: Regenerated.
> * generated/matmul_r8.c: Regenerated.
Actually, I see a problem, though not one related to this patch.
I bet e.g. tsan would complain heavily about the wrappers, because the code
is racy:
static void (*matmul_p) ('rtype` * const restrict retarray,
    'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
    int blas_limit, blas_call gemm) = NULL;

if (matmul_p == NULL)
  {
    matmul_p = matmul_'rtype_code`_vanilla;
    if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
      {
        /* Run down the available processors in order of preference.  */
#ifdef HAVE_AVX512F
        if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
          {
            matmul_p = matmul_'rtype_code`_avx512f;
            goto tailcall;
          }
#endif  /* HAVE_AVX512F */
        ...
      }

tailcall:
  (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
So, even assuming all `matmul_p =` stores are atomic: if you call
matmul from two or more threads at about the same time for the first time,
the first one may set matmul_p to vanilla, and another thread then reads
that value and runs the vanilla version (uselessly slow), etc.
As you don't care about the if (matmul_p == NULL) part being done in
multiple threads concurrently, I guess you could e.g. do:
static void (*matmul_p) ('rtype` * const restrict retarray,
    'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
    int blas_limit, blas_call gemm);  /* no need for a NULL initializer on a static var */

void (*matmul_fn) ('rtype` * const restrict retarray,
    'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
    int blas_limit, blas_call gemm);

matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_fn == NULL)
  {
    matmul_fn = matmul_'rtype_code`_vanilla;
    if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
      {
        /* Run down the available processors in order of preference.  */
#ifdef HAVE_AVX512F
        if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
          {
            matmul_fn = matmul_'rtype_code`_avx512f;
            goto finish;
          }
#endif  /* HAVE_AVX512F */
        ...
finish:
    __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
  }
(*matmul_fn) (retarray, a, b, try_blas, blas_limit, gemm);
(i.e. make sure you read matmul_p in each call exactly once and store at
most once per thread).
Jakub