Bug 79876 - [7 Regression] FAIL: libgomp.fortran/strassen.f90 -O execution test on x86_64-apple-darwin16
Summary: [7 Regression] FAIL: libgomp.fortran/strassen.f90 -O execution test on x86...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libgomp (show other bugs)
Version: 7.0.1
: P3 normal
Target Milestone: 7.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-05 14:39 UTC by Dominique d'Humieres
Modified: 2017-04-04 10:55 UTC (History)
5 users (show)

See Also:
Host: x86_64-apple-darwin16
Target: x86_64-apple-darwin16
Build: x86_64-apple-darwin16
Known to work:
Known to fail:
Last reconfirmed: 2017-03-05 00:00:00


Attachments
gcc7-pr79876.patch (1.48 KB, patch)
2017-04-01 09:12 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dominique d'Humieres 2017-03-05 14:39:24 UTC
Since r245745 I see the following failure on x86_64-apple-darwin16 with -m32/64

FAIL: libgomp.fortran/strassen.f90   -O  execution test 

This is a latent bug exposed by this revisions and which appeared between revisions r242391 (2016-11-14, OK) and r242872 (2016-11-25, Illegal instruction) when the test is compiled with -finline-matmul-limit<32 (I didn't test the 32 values, but only 0, 16, and 31). 

% /opt/gcc/gcc7p-242872p3/bin/gfortran -O2 strassen_red.f90 -finline-matmul-limit=31 -fopenmp
% ./a.out
Illegal instruction

The test succeeds with OMP_NUM_THREADS set to 1 on a machine with four cores, eight threads.

I don't see the failure on another machine with darwin10 and two cores/threads.

The test strassen_red.f90 is

program strassen_matmul
  use omp_lib
  integer, parameter :: N = 1024
  double precision, save :: A(N,N), B(N,N), C(N,N), D(N,N)
  double precision :: start, end

  call random_seed
  call random_number (A)

contains

  recursive subroutine strassen (A, B, C, N)
    integer, intent(in) :: N
    double precision, intent(in) :: A(N,N), B(N,N)
    double precision, intent(out) :: C(N,N)
    double precision :: T(N/2,N/2,7)
    integer :: K, L

    if (iand (N,1) .ne. 0 .or. N < 64) then
      C = matmul (A, B)
      return
    end if
    K = N / 2
    L = N / 2 + 1
!$omp task shared (A, B, T)
    call strassen (A(:K,:K) + A(L:,L:), B(:K,:K) + B(L:,L:), T(:,:,1), K)
!$omp end task
!$omp task shared (A, B, T)
    call strassen (A(L:,:K) + A(L:,L:), B(:K,:K), T(:,:,2), K)
!$omp end task
!$omp task shared (A, B, T)
    call strassen (A(:K,:K), B(:K,L:) - B(L:,L:), T(:,:,3), K)
!$omp end task
!$omp task shared (A, B, T)
    call strassen (A(L:,L:), B(L:,:K) - B(:K,:K), T(:,:,4), K)
!$omp end task
!$omp task shared (A, B, T)
    call strassen (A(:K,:K) + A(:K,L:), B(L:,L:), T(:,:,5), K)
!$omp end task
!$omp task shared (A, B, T)
    call strassen (A(L:,:K) - A(:K,:K), B(:K,:K) + B(:K,L:), T(:,:,6), K)
!$omp end task
!$omp task shared (A, B, T)
    call strassen (A(:K,L:) - A(L:,L:), B(L:,:K) + B(L:,L:), T(:,:,7), K)
!$omp end task
!$omp taskwait
    C(:K,:K) = T(:,:,1) + T(:,:,4) - T(:,:,5) + T(:,:,7)
    C(L:,:K) = T(:,:,2) + T(:,:,4)
    C(:K,L:) = T(:,:,3) + T(:,:,5)
    C(L:,L:) = T(:,:,1) - T(:,:,2) + T(:,:,3) + T(:,:,6)
  end subroutine strassen
end
Comment 1 Dominique d'Humieres 2017-03-05 20:09:22 UTC
Revisions r242462 and r242518 are within the range.

The failure triggers also if the tests are compiled with -O0 or with -fno-frontend-optimize.
Comment 2 Thomas Koenig 2017-03-05 23:47:09 UTC
(In reply to Dominique d'Humieres from comment #1)
> Revisions r242462 and r242518 are within the range.
> 
> The failure triggers also if the tests are compiled with -O0 or with
> -fno-frontend-optimize.

A few questions:

- Still happening after r245839 (which fixed a potential race condition)?

- What / where is the illegal instruction happening?  Is it in
  matmul_*_avx, matmul_*_avx2, ...?

- What CPU are you running?

- If you print __cpu_model.__cpu_vendor and __cpu_model.__cpu_features[0]
from a debugger, what is the output?

- What is the stack size for each thread?  Are you maybe running
  into a stack limitation?
Comment 3 Thomas Koenig 2017-03-06 15:55:25 UTC
On my Linux system, I can get a crash with

OMP_STACKSIZE=500k ./a.out

and successfull execution with

OMP_STACKSIZE=1M ./a.out

What happens if you try these commands?
Comment 4 Dominique d'Humieres 2017-03-06 22:35:23 UTC
> Le 6 mars 2017 à 16:55, tkoenig at gcc dot gnu.org <gcc-bugzilla@gcc.gnu.org> a écrit :
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79876
> 
> --- Comment #3 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
> On my Linux system, I can get a crash with
> 
> OMP_STACKSIZE=500k ./a.out
> 
> and successfull execution with
> 
> OMP_STACKSIZE=1M ./a.out
> 
> What happens if you try these commands?

I see the same behavior: crash up to 799k, successful execution above 800k.

Dominique

> 
> -- 
> You are receiving this mail because:
> You reported the bug.
Comment 5 Jakub Jelinek 2017-03-07 08:35:16 UTC
Does Darwin have so ridiculously low stack size limits by default, or is that just some setting you've done?
Comment 6 Dominique d'Humieres 2017-03-07 11:01:18 UTC
> Does Darwin have so ridiculously low stack size limits by default,
> or is that just some setting you've done?

On darwin1(0|6) ulimit gives

stack size              (kbytes, -s) 65532

and the environment variable OMP_STACKSIZE is empty. 

Note that on darwin10 (revision r245900) the reduced test succeeds with OMP_STACKSIZE set to 80k and for smaller values I get an error

libgomp: Stack size larger than system limit

at run time.
Comment 7 Jakub Jelinek 2017-03-07 11:22:16 UTC
stack size              (kbytes, -s) 65532
is 64M, that should be more than enough.  Of course unless darwin pthread_create does something dumb, like using far smaller stack sizes for threads by default.
In that case, the question is if we shouldn't override that on darwin to something larger, or just use
! { dg-set-target-env-var OMP_STACKSIZE "8M" }
on the test.  Note that dg-set-target-env-var doesn't (yet?) support target specification as 3rd argument, so I'd prefer not to add that.
Or we could add -finline-matmul-limit=64 or whatever works.
Comment 8 Thomas Koenig 2017-03-07 16:39:15 UTC
I have read people complaining about very low OMP stack sizes
on OSX.

Should we set this to a more reasonable default value in libgfortran?
Less than 800k is quite ridiculous.

Alternatively, is there a way to detect stack overrun in a case like
this, and abort with something more helpful?
Comment 9 Dominique d'Humieres 2017-03-07 17:44:27 UTC
> I have read people complaining about very low OMP stack sizes
> on OSX.

What is setting the limit?

> Should we set this to a more reasonable default value in libgfortran?
> Less than 800k is quite ridiculous.

If possible, it would be a great idea.

> Alternatively, is there a way to detect stack overrun in a case like
> this, and abort with something more helpful?

Indeed, but is it worth the pain?

Final note, it would be interesting to understand why r242391 is OK, but not r242872.
Comment 10 Jakub Jelinek 2017-03-30 09:29:06 UTC
(In reply to Dominique d'Humieres from comment #9)
> > I have read people complaining about very low OMP stack sizes
> > on OSX.
> 
> What is setting the limit?

Probably OSX thread library?

Can you try something like:
#define _GNU_SOURCE
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <pthread.h>

static void *
tf (void *arg)
{
#ifdef __APPLE__
  fprintf (stderr, "other thread pthread_get_stacksize_np %lld\n",
           pthread_get_stacksize_np (pthread_self ()));
#else
  pthread_attr_t at;
  pthread_getattr_np (pthread_self (), &at);
  size_t sz;
  pthread_attr_getstacksize (&at, &sz);
  fprintf (stderr, "other thread pthread_getattr_np + "
           "pthread_attr_getstacksize %zd\n", sz);
  pthread_attr_destroy (&at);
#endif
  return NULL;
}

int
main ()
{
  struct rlimit rl;
  if (getrlimit (RLIMIT_STACK, &rl))
    fprintf (stderr, "err1\n");
  else
    fprintf (stderr, "RLIMIT_STACK cur %lld max %lld\n",
             (long long) rl.rlim_cur, (long long) rl.rlim_max);
  fprintf (stderr, "PTHREAD_STACK_MIN %lld\n", (long long) PTHREAD_STACK_MIN);
#ifdef __APPLE__
  fprintf (stderr, "main thread pthread_get_stacksize_np %lld\n",
           pthread_get_stacksize_np (pthread_self ()));
#else
  pthread_attr_t at;
  pthread_getattr_np (pthread_self (), &at);
  size_t sz;
  pthread_attr_getstacksize (&at, &sz);
  fprintf (stderr, "main thread pthread_getattr_np + "
           "pthread_attr_getstacksize %zd\n", sz);
  pthread_attr_destroy (&at);
#endif
  pthread_t th;
  pthread_create (&th, NULL, tf, NULL);
  pthread_join (th, NULL);
  pthread_attr_t at2;
  pthread_attr_init (&at2);
  pthread_attr_setstacksize (&at2, 2048 * 1024);
  pthread_create (&th, &at2, tf, NULL);
  pthread_join (th, NULL);
  pthread_attr_destroy (&at2);
  return 0;
}

and see what it prints?
Comment 11 Dominique d'Humieres 2017-03-30 13:33:56 UTC
> and see what it prints?

RLIMIT_STACK cur 67104768 max 67104768
PTHREAD_STACK_MIN 8192
main thread pthread_get_stacksize_np 67104768
other thread pthread_get_stacksize_np 524288
other thread pthread_get_stacksize_np 2097152

and it does not depend on OMP_STACKSIZE.
Comment 12 Jakub Jelinek 2017-03-30 14:06:48 UTC
Of course, the testcase was not using libgomp at all and OMP_STACKSIZE is an env var used by libgomp only.
So, this means whatever darwin libpthread are using is using extremely small default for the pthread_create stack size (.5MB) unless overridden (i.e. when
pthread_create is called with NULL as attribute; maybe also when called with
pthread_attr_init created attr without pthread_attr_setstacksize - you could test this by commenting out
  pthread_attr_setstacksize (&at2, 2048 * 1024);
line, recompile/relink/rerun the testcase.
Can you try some other darwin versions if you have access to them?
Comment 13 Jakub Jelinek 2017-03-30 14:13:19 UTC
Seems libgomp has a bug that without {,G}OMP_STACKSIZE and with OMP_DISPLAY_ENV it displays random (uninitialized) value for the stack size, so I need to test something like:
2017-03-30  Jakub Jelinek  <jakub@redhat.com>

	* env.c (initialize_env): Initialize stacksize to 0.

--- libgomp/env.c.jj	2017-01-01 12:45:52.000000000 +0100
+++ libgomp/env.c	2017-03-30 16:07:29.800247913 +0200
@@ -1187,7 +1187,7 @@ handle_omp_display_env (unsigned long st
 static void __attribute__((constructor))
 initialize_env (void)
 {
-  unsigned long thread_limit_var, stacksize;
+  unsigned long thread_limit_var, stacksize = 0;
   int wait_policy;
 
   /* Do a compile time check that mkomp_h.pl did good job.  */

If we figure out what Darwin versions have so insanely low defaults and agree on some more reasonable one (1M or 2M?), then we could change this to something like:
  stacksize = GOMP_DEFAULT_STACKSIZE;
and later on where it parse_stacksize add || stacksize to the condition last,
then define GOMP_DEFAULT_STACKSIZE 0 by default and let some config/darwin/ header override it.
Comment 14 Jakub Jelinek 2017-04-01 09:12:20 UTC
Created attachment 41098 [details]
gcc7-pr79876.patch

Found https://developer.apple.com/library/content/qa/qa1419/_index.html that documents the default to be 512KB which is way too low (though it is unclear if it is the same on all Darwin versions).
Anyway, here is an untested patch to bump the default on Darwin only.
Comment 15 Dominique d'Humieres 2017-04-01 12:34:53 UTC
> Anyway, here is an untested patch to bump the default on Darwin only.

The patch fixes the issue on darwin16. Testing on darwin10 in progress.
Comment 16 Jakub Jelinek 2017-04-04 10:42:05 UTC
Author: jakub
Date: Tue Apr  4 10:41:33 2017
New Revision: 246675

URL: https://gcc.gnu.org/viewcvs?rev=246675&root=gcc&view=rev
Log:
	PR libgomp/79876
	* config/posix/thread-stacksize.h: New file.
	* config/darwin/thread-stacksize.h: New file.
	* config/nvptx/thread-stacksize.h: New file.
	* env.c: Include thread-stacksize.h.
	(initialize_env): Initialize stacksize to GOMP_DEFAULT_STACKSIZE
	instead of 0.  Call pthread_attr_setstacksize even if
	GOMP_DEFAULT_STACKSIZE is non-zero.

Added:
    trunk/libgomp/config/darwin/thread-stacksize.h
    trunk/libgomp/config/nvptx/thread-stacksize.h
    trunk/libgomp/config/posix/thread-stacksize.h
Modified:
    trunk/libgomp/ChangeLog
    trunk/libgomp/env.c
Comment 17 Jakub Jelinek 2017-04-04 10:55:21 UTC
Fixed.