Bug 103985 - segfault in finalize_transfer (fbuf_destroy) on (parallel) writing into character / string function argument
Summary: segfault in finalize_transfer (fbuf_destroy) on (parallel) writing into character / string function argument
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: libfortran
Version: 10.3.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: openmp
Depends on:
Blocks:
 
Reported: 2022-01-11 22:45 UTC by Dr. Thomas Orgis
Modified: 2022-01-17 13:57 UTC
CC List: 0 users

Description Dr. Thomas Orgis 2022-01-11 22:45:51 UTC
I noticed an infrequent crash of an OpenMP-parallelized scientific application that is observable when running hundreds of instances. A recent reproduction attempt on 250 Haswell nodes on Linux 4.14.259 with a loop over simulation setup and some short computation showed 4 cases of the crash vs. 3176 successful runs. Each run used 16 threads on a dual-socket node.

That low a failure rate means getting a reproducer is tricky. The code involved is a rather trivial part of a rather elaborate codebase (which has triggered a number of Fortran compiler bugs before, regardless of compiler vendor), so it might be extracted, but the odds suggest that certain structures in memory are necessary to really trip up the gfortran runtime.

So I'll report the backtrace and some code fragments here. Maybe this already points in a direction where one can home in on the underlying corruption.

This is the crash (each line is prefixed with the node name and the approximate seconds since program start):

node292 @ 14.482751:  model_setup: Done.
node292 @ 14.482965:  memory/kiB  peak=5924012 size=5924012 resp=3262324 ress=3225348
node292 @ 14.504973: 
node292 @ 14.505052: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
node292 @ 14.505071: 
node292 @ 14.505083: Backtrace for this error:
node292 @ 15.149332: #0  0x15523d2cd3ff in ???
node292 @ 15.149365: #1  0x15523d3174a9 in ???
node292 @ 15.149376: #2  0x15523d31820d in ???
node292 @ 15.150486: #3  0x15523e46f469 in finalize_transfer
node292 @ 15.150499:    at ../../../gcc-10.3.0/libgfortran/io/transfer.c:4228
node292 @ 15.150586: #4  0x430f6d in __stringtools_MOD_str_integer
node292 @ 15.150599:    at .preprocessed/misc/stringtools.f90:147
node292 @ 15.150789: #5  0x546b3f in find_worst_speeds
node292 @ 15.150800:    at .preprocessed/dgmodel/cfl.f90:309
node292 @ 15.150808: #6  0x546b3f in __cfl_MOD_cfl_timestep._omp_fn.0
node292 @ 15.150817:    at .preprocessed/dgmodel/cfl.f90:198
node292 @ 15.150939: #7  0x15523d8de4d1 in GOMP_parallel
node292 @ 15.150951:    at ../../../gcc-10.3.0/libgomp/parallel.c:171
node292 @ 15.150961: #8  0x547404 in __cfl_MOD_cfl_timestep
node292 @ 15.150969:    at .preprocessed/dgmodel/cfl.f90:195
node292 @ 15.151289: #9  0x548f59 in find_new_step
node292 @ 15.151300:    at .preprocessed/dgmodel/time_bdf.f90:635
node292 @ 15.151310: #10  0x54d98b in __time_bdf_MOD_time_bdf_explicit_part
node292 @ 15.151318:    at .preprocessed/dgmodel/time_bdf.f90:399
node292 @ 15.151326: #11  0x54d98b in __time_bdf_MOD_time_bdf_step
node292 @ 15.151334:    at .preprocessed/dgmodel/time_bdf.f90:489
node292 @ 15.151442: #12  0x526acc in __time_integration_MOD_timeint_step
node292 @ 15.151454:    at .preprocessed/dgmodel/time_integration.f90:141
node292 @ 15.151743: #13  0x40b058 in dgmodel
node292 @ 15.151754:    at .preprocessed/programs/dgmodel.f90:371
node292 @ 15.151764: #14  0x40760c in main
node292 @ 15.151772:    at .preprocessed/programs/dgmodel.f90:46
node292 @ 17.984833: 85.74user 6.77system 0:16.76elapsed 551%CPU (0avgtext+0avgdata 3262324maxresident)k
node292 @ 17.984892: 64680inputs+8737984outputs (18major+1123633minor)pagefaults 0swaps

The memory stats before/after the crash might be useful, too … I reckon that some significant memory activity is necessary to observe the issue.

The relevant code is inside an OpenMP section (source line numbers prefixed):

186 !$omp parallel num_threads(cfl%threadcount) private(loopcount, ti)
187    loopcount = size(state%setup%grid%elements)
188    ti = ompapi_threadnum()
189 
190 
191    cfl%maxei(:,ti) = -1
192    cfl%max_e_speeds(:,ti) = 0._dp
193    cfl%speedratio(:,ti) = m_big
194 
195 !$omp do
196    do ei=1,loopcount
197       call find_worst_speeds(state%setup%grid%elements(ei), state, mytype, cfl%maxei(:,ti), cfl%max_e_speeds(:,ti), cfl%speedratio&
198 &(:,ti), directions)
199    end do
200 !$omp end do
201 !$omp end parallel

(The loopcount business is a workaround for an earlier compiler bug with Sun Studio that messed up parallel loop boundaries, AFAIR.)

The routine find_worst_speeds does some computations on the passed-in structures and then does this:

308 
309    if(err_follow_line('dgmodel/cfl.F90', 265, 'Trouble in element ' // trim(adjustl(str_number(element%id))))) return
310 end subroutine

The function err_follow_line checks a global error counter; if an error has been raised before, it prints the constructed string that was passed in and returns true. So, a simple sort of exception handling.

148 function err_follow_line(file, line, cmessage) result(lstatus)
149    character(*), intent(in) :: file
150    integer, intent(in) :: line
151    character(*), intent(in) :: cmessage
152    logical :: lstatus
153 
154    if( .not. err_count == 0 ) then
155 
156 
157       if( (par_err_firstonly == 0 .or. backtrace) &
158          .and. (par_err_maxfollow < 1 .or. followed < par_err_maxfollow) ) then
159          followed = followed + (1)
160          write(0,'(A,A,A,A,A)') err_marker, ' error(follow), ', &
161             trim(err_filestring(file, line)), ' ', trim(cmessage)
162       end if
163       lstatus = .true.
164    else
165       lstatus = .false.
166    end if
167 end function err_follow_line

The construction of this string argument is where the trouble occurs. This is str_number, in this case the integer implementation behind the common interface:

 53 integer, parameter :: str_bufferlen = 100

143 function str_integer(number) result(out)
144    integer, intent(in) :: number
145    character(str_bufferlen) :: out
146 
147    write(out, *) number
148 end function

And here, the write to the character variable invokes the relevant section of libgfortran. To keep the context together, this is the block where the crash occurs; the call to fbuf_destroy() seems a plausible place for such a crash. The position in the backtrace is stable, always pointing here.

4220  done:
4221 
4222   if (dtp->u.p.unit_is_internal)
4223     {
4224       /* The unit structure may be reused later so clear the
4225          internal unit kind.  */
4226       dtp->u.p.current_unit->internal_unit_kind = 0;
4227 
4228       fbuf_destroy (dtp->u.p.current_unit);
4229       if (dtp->u.p.current_unit
4230           && (dtp->u.p.current_unit->child_dtio  == 0)
4231           && dtp->u.p.current_unit->s)
4232         {
4233           sclose (dtp->u.p.current_unit->s);
4234           dtp->u.p.current_unit->s = NULL;
4235         }
4236     }

I am not sure how much influence the parallel nature of the OpenMP section has, but I bet that it's causal. Hardware issues seem unlikely given that the crash is always at the same place and we'd notice failures in ECC memory.
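
For reference, the pattern boiled down to its bare essentials would look roughly like the following. This is a hypothetical minimal sketch of mine (program name and loop count invented), not a confirmed reproducer — given the odds above, it may well run millions of iterations without tripping anything:

   program internal_write_stress
      implicit none
      integer, parameter :: str_bufferlen = 100
      integer :: i, total

      total = 0
   !$omp parallel do reduction(+:total)
      do i = 1, 100000
         ! each iteration does an internal write, as str_integer does
         total = total + len_trim(str_integer(i))
      end do
   !$omp end parallel do
      print *, 'checksum:', total

   contains

      function str_integer(number) result(out)
         integer, intent(in) :: number
         character(str_bufferlen) :: out
         write(out, *) number   ! internal write into a character variable
      end function

   end program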
Comment 1 Andrew Pinski 2022-01-12 01:02:57 UTC
There could be a lock missing when it comes to the write. I remember that in the past there were some issues with writes not being thread-safe (maybe they still are not).
Comment 2 Dr. Thomas Orgis 2022-01-12 10:30:45 UTC
You mean every call to write(), even one to a buffer variable, should be a thread synchronization point? That would mean I should swiftly get rid of those in the performance-critical paths. I hoped it was no more elaborate than sprintf() (which is harmless, I really, really hope ;-).
Comment 3 anlauf 2022-01-12 11:07:17 UTC
Can you try gcc-11 or gcc-12?

There were some fixes for asynchronous I/O (pr100352) touching the locking mechanism.
Comment 4 Dr. Thomas Orgis 2022-01-12 22:20:03 UTC
I can try to roll a build and squeeze some test runs in. Naturally, this will take a few days to get some statistical confidence and it depends on cluster load, too.

So a snapshot of GCC 12 would be preferred?

But also: is it settled that there should even be any locking in the code path that basically does sprintf() via a write to a character variable?
Comment 5 Dr. Thomas Orgis 2022-01-17 13:57:33 UTC
OK, I tested with gcc-12-20220109. I repeated the test job and got 17102 successful runs and no crash. So one can hope that this particular bug has been fixed at some point. All good? Is the specific change easily identified for backporting to a bugfix release? I wonder whether it is really a regression; I did not try this many runs with earlier GCC versions and cannot really keep the cluster busy just for that.

That leaves my performance worry open: is it really the case that a write to a character variable needs a (global?) lock?
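
(Should such a lock turn out to be unavoidable, a workaround for hot paths would be to format the integer by hand instead of via an internal write. The following is only a sketch of mine, not something from the codebase above; since the call site wraps the result in trim(adjustl(...)), the different alignment compared to the list-directed write should not matter.)

   pure function str_integer_nolock(number) result(out)
      integer, parameter :: str_bufferlen = 100   ! same length as above
      integer, intent(in) :: number
      character(str_bufferlen) :: out
      integer :: n, pos

      out = ''
      pos = str_bufferlen
      n = abs(number)                ! caveat: overflows for -huge(0)-1
      do                             ! emit decimal digits right to left
         out(pos:pos) = achar(iachar('0') + mod(n, 10))
         n = n / 10
         pos = pos - 1
         if (n == 0) exit
      end do
      if (number < 0) then
         out(pos:pos) = '-'
         pos = pos - 1
      end if
      out = adjustl(out(pos+1:))     ! left-align; rest stays blank
   end function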


PS: I had trouble building netcdf 4.6.3 with this snapshot … rather basic test failures. I did not investigate further and just moved on, as the crash is not in the file storage parts. I hope this is just an artifact of my setup and not some very basic actual breakage.