This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul


On Thu, Mar 02, 2017 at 11:45:59AM +0100, Thomas Koenig wrote:
> Here's the updated version, which just uses FMA for AVX2.
> 
> OK for trunk?
> 
> Regards
> 
> 	Thomas
> 
> 2017-03-01  Thomas Koenig  <tkoenig@gcc.gnu.org>
> 
>         PR fortran/78379
>         * m4/matmul.m4 (matmul_'rtype_code`_avx2): Also generate for
>         reals.  Add fma to target options.
>         (matmul_'rtype_code`): Call AVX2 only if FMA is available.
>         * generated/matmul_c10.c: Regenerated.
>         * generated/matmul_c16.c: Regenerated.
>         * generated/matmul_c4.c: Regenerated.
>         * generated/matmul_c8.c: Regenerated.
>         * generated/matmul_i1.c: Regenerated.
>         * generated/matmul_i16.c: Regenerated.
>         * generated/matmul_i2.c: Regenerated.
>         * generated/matmul_i4.c: Regenerated.
>         * generated/matmul_i8.c: Regenerated.
>         * generated/matmul_r10.c: Regenerated.
>         * generated/matmul_r16.c: Regenerated.
>         * generated/matmul_r4.c: Regenerated.
>         * generated/matmul_r8.c: Regenerated.

Actually, I see a problem, though not one related to this patch.
I bet e.g. tsan would complain heavily about the wrappers, because the code
is racy:
  static void (*matmul_p) ('rtype` * const restrict retarray,
        'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
        int blas_limit, blas_call gemm) = NULL;

  if (matmul_p == NULL)
    {
      matmul_p = matmul_'rtype_code`_vanilla;
      if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
        {
          /* Run down the available processors in order of preference.  */
#ifdef HAVE_AVX512F
          if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
            {
              matmul_p = matmul_'rtype_code`_avx512f;
              goto tailcall;
            }
            
#endif  /* HAVE_AVX512F */
...
    }

tailcall:
   (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);

So, even assuming all the matmul_p stores are atomic: if matmul is called
from two or more threads at about the same time for the first time, the
first one could set matmul_p to the vanilla variant, and another thread
could then call through it (uselessly slow) before the preferred variant
is stored, etc.

Since you don't care whether the if (matmul_p == NULL) part is executed by
multiple threads concurrently, I guess you could e.g. do:
  static void (*matmul_p) ('rtype` * const restrict retarray,
        'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
        int blas_limit, blas_call gemm);  /* no need for a NULL initializer for a static var */
  void (*matmul_fn) ('rtype` * const restrict retarray,
        'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
        int blas_limit, blas_call gemm);

  matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
  if (matmul_fn == NULL)
    {
      matmul_fn = matmul_'rtype_code`_vanilla;
      if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
        {
          /* Run down the available processors in order of preference.  */
#ifdef HAVE_AVX512F
          if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
            {
              matmul_fn = matmul_'rtype_code`_avx512f;
              goto finish;
            }
            
#endif  /* HAVE_AVX512F */
...
  finish:
      __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
    }
  (*matmul_fn) (retarray, a, b, try_blas, blas_limit, gemm);

(i.e. make sure matmul_p is read exactly once per call and stored to at
most once per thread).
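For reference, the same fix as a standalone, compilable sketch (hypothetical
names and a stand-in feature flag instead of the __cpu_model check; it uses
the GCC __atomic builtins):

  #include <pthread.h>
  #include <stdio.h>

  static void impl_vanilla (void) { puts ("vanilla"); }
  static void impl_avx512f (void) { puts ("avx512f"); }

  static int have_avx512f = 1;  /* stand-in for the __cpu_model test */

  static void (*fn_p) (void);  /* static, hence zero-initialized */

  static void *
  call_it (void *arg)
  {
    /* Exactly one load per call ...  */
    void (*fn) (void) = __atomic_load_n (&fn_p, __ATOMIC_RELAXED);
    if (fn == NULL)
      {
        fn = impl_vanilla;
        if (have_avx512f)
          fn = impl_avx512f;
        /* ... and at most one store per thread.  Worst case, several
           threads compute the same value and store it redundantly,
           which is harmless.  */
        __atomic_store_n (&fn_p, fn, __ATOMIC_RELAXED);
      }
    /* fn is never NULL here, however the threads interleave.  */
    (*fn) ();
    return arg;
  }

  int
  main (void)
  {
    pthread_t t1, t2;
    pthread_create (&t1, NULL, call_it, NULL);
    pthread_create (&t2, NULL, call_it, NULL);
    pthread_join (t1, NULL);
    pthread_join (t2, NULL);
    return 0;
  }

tsan should be quiet on this version, since every access to fn_p is atomic.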

	Jakub

