[PING] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)

Thomas Schwinge thomas@codesourcery.com
Wed Jan 11 12:06:18 GMT 2023


Hi!

Ping -- the '-mframe-malloc-threshold' idea, at least.

Note that while this issue originally popped up for Fortran I/O, it is
likewise relevant for other functions that have large stack frames, for
example in newlib:

    libc/string/libc_a-memmem.o:.local .align 16 .b8 %frame_ar[2064];
    libc/string/libc_a-strcasestr.o:.local .align 16 .b8 %frame_ar[2064];
    libc/string/libc_a-strstr.o:.local .align 16 .b8 %frame_ar[2064];
    libm/math/libm_a-k_rem_pio2.o:.local .align 16 .b8 %frame_ar[560];

Therefore, a generic solution (or workaround, if you like) does seem
appropriate.


Regards
 Thomas


On 2022-12-23T15:08:06+0100, I wrote:
> Hi!
>
> On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote:
>> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>>> For example, for Fortran code like:
>>>
>>>     write (*,*) "Hello world"
>>>
>>> ..., 'gfortran' creates:
>>>
>>>     struct __st_parameter_dt dt_parm.0;
>>>
>>>     try
>>>       {
>>>         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>>         dt_parm.0.common.line = 29;
>>>         dt_parm.0.common.flags = 128;
>>>         dt_parm.0.common.unit = 6;
>>>         _gfortran_st_write (&dt_parm.0);
>>>         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>>         _gfortran_st_write_done (&dt_parm.0);
>>>       }
>>>     finally
>>>       {
>>>         dt_parm.0 = {CLOBBER(eol)};
>>>       }
>>>
>>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>>> really! -- there's a lot of state in Fortran I/O apparently).  That's a
>>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>>> have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>>> "Use custom stacks instead of local memory for automatic storage".)
>>>
>>> Now, the Nvidia Driver tries to accommodate such largish stack usage,
>>> and dynamically increases the per-thread stack as necessary (thereby
>>> potentially reducing parallelism) -- if it manages to understand the call
>>> graph.  In the case of libgfortran I/O, it evidently doesn't.  Not being
>>> able to disprove the existence of recursion is the common problem, as
>>> I've read.
>>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>>
>>>     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
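>>>
>>> For reference, that info log can be obtained through the CUDA Driver
>>> API roughly as follows -- an untested sketch, where 'ptx_image' stands
>>> for whatever PTX code is being loaded and a CUDA context is assumed to
>>> be current:
>>>
>>>     #include <stdint.h>
>>>     #include <stdio.h>
>>>     #include <cuda.h>
>>>
>>>     /* Load a PTX image and dump the PTX JIT's info log, which is where
>>>        warnings such as the one quoted above end up.  */
>>>     static CUmodule
>>>     load_ptx_with_info_log (const void *ptx_image)
>>>     {
>>>       static char info_log[16384];
>>>       CUjit_option opts[] = { CU_JIT_INFO_LOG_BUFFER,
>>>                               CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES };
>>>       void *vals[] = { info_log, (void *) (uintptr_t) sizeof info_log };
>>>       CUmodule mod;
>>>       if (cuModuleLoadDataEx (&mod, ptx_image, 2, opts, vals)
>>>           != CUDA_SUCCESS)
>>>         return NULL;
>>>       fprintf (stderr, "%s\n", info_log);
>>>       return mod;
>>>     }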
>>>
>>> That warning by itself is still not an actual problem, as long as the
>>> GPU kernel's stack usage does fit into 1 KiB.  Very often it does, but
>>> if, as happens in libgfortran I/O handling, another such 'dt_parm' is
>>> put onto the stack, the stack then overflows: device-side SIGSEGV.
>>>
>>> (There is, by the way, some similar analysis by Tom de Vries in
>>> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
>>> Recursive tests may fail due to thread stack limit".)
>>>
>>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>>> do like their occasional "'printf' debugging", so we ought to make that
>>> work (... without pessimizing any "normal" code).
>>>
>>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>>> scope.
>>>
>>> There is a way to manually set a per-thread stack size, but it's not
>>> obvious which size to set: that size needs to work for the whole GPU
>>> kernel, and should be as low as possible (to maximize parallelism).
>>> I assume that even if GCC did an accurate call graph analysis of the GPU
>>> kernel's maximum stack usage, that still wouldn't help: that's before the
>>> PTX JIT does its own code transformations, including stack spilling.
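>>>
>>> For completeness, the CUDA Driver API call to set that per-thread stack
>>> size is 'cuCtxSetLimit'.  An untested sketch ('set_gpu_stack_size' is
>>> just a made-up wrapper name):
>>>
>>>     #include <stddef.h>
>>>     #include <cuda.h>
>>>
>>>     /* Set the per-GPU-thread stack size of the current CUDA context to
>>>        'bytes'.  Assumes a CUDA context is already current; the hard
>>>        part, as noted above, is choosing 'bytes'.  */
>>>     static int
>>>     set_gpu_stack_size (size_t bytes)
>>>     {
>>>       return cuCtxSetLimit (CU_LIMIT_STACK_SIZE, bytes) == CUDA_SUCCESS;
>>>     }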
>>>
>>> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
>>> (-dlto) for device code".  This might help, assuming that it manages to
>>> simplify the libgfortran I/O code such that the PTX JIT then understands
>>> the call graph.  But: that's available only starting with recent
>>> CUDA 11.4, so not a general solution -- if it works at all, which I've
>>> not tested.
>>>
>>> Similarly, we could enable GCC's LTO for device code generation -- but
>>> that's a big project, out of scope at this time.  And again, we don't
>>> know if that at all helps this case.
>>>
>>> I see a few options:
>>>
>>> (a) Figure out what it is in the libgfortran I/O implementation that
>>> causes "Stack size [...] cannot be statically determined", and re-work
>>> that code to avoid that, or even disable certain things for nvptx, if
>>> feasible.
>
>> Shrink st_parameter_dt (it's part of the ABI though, kind of).  Lots of the
>> bloat is from things that are unused for simpler I/O cases (so some
>> "inheritance" could help), and lots of the bloat is from using
>> char * + size_t string/length pairs for what looks like it could be
>> encoded a lot more efficiently.
>>
>> There's probably not much low-hanging fruit.
>
> (Similar comments in Janne's email.)
>
>
> Well, as was to be expected, libgfortran I/O is really just one example,
> but the underlying problem may also be triggered in other ways (via other
> newlib/libc functions, for example).
>
> So, really a generic solution seems to be called for.
>
>>> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
>>> I don't really want to do that, however: it does introduce a bit of
>>> complexity into all the generated device code, as well as run-time
>>> overhead, which we generally would like to avoid.
>
> Directly using '-msoft-stack' isn't actually possible: it does implement
> "one stack per 32-thread warp", but for OpenACC we need "one stack per
> thread of a warp" (that is, one for each OpenACC 'vector' independently),
> and pre-allocating all those stacks (which may be a lot!) from device
> memory would, I foresee, really negatively impact overall performance.
>
>>> (c) I'm contemplating a tweak/compiler pass for transforming such large
>>> stack objects into heap allocation (during nvptx offloading compilation).
>>> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
>>> code paths this is to affect.  (Might also add some compile-time
>>> diagnostic, of course.)  Could maybe even limit this to only be used
>>> during libgfortran compilation?  This is then conceptually a bit similar
>>> to (b), but localized to relevant parts only.  Has such a thing been done
>>> before in GCC, that I could build upon?
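>>>
>>> To illustrate the idea with a hand-written sketch (this is not the
>>> actual implementation; 'use' just stands for arbitrary uses of the
>>> object, and the two function names are only for the illustration), a
>>> function whose frame object exceeds the threshold:
>>>
>>>     #include <stdlib.h>
>>>
>>>     extern void use (char *);
>>>
>>>     /* Before: 2 KiB frame object -- too big for a 1 KiB per-thread
>>>        stack.  */
>>>     void
>>>     f_stack (void)
>>>     {
>>>       char buf[2048];
>>>       use (buf);
>>>     }
>>>
>>> ...would effectively be turned into:
>>>
>>>     /* After: the object lives on the (slow, but for these code paths
>>>        acceptable) heap; it must be released on every exit path so as
>>>        not to leak.  */
>>>     void
>>>     f_heap (void)
>>>     {
>>>       char *buf = malloc (2048);
>>>       use (buf);
>>>       free (buf);
>>>     }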
>>>
>>> Any other clever ideas?
>
>> Converting to heap allocation is difficult outside of the frontend and you
>> have to be very careful with memleaks.
>
> Heh, in fact it seems to be pretty simple!  (Famous last words?)  See
> "[WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
> attached.  What do people think about such a thing?
>
> Still to be discussed are '-Wframe-malloc-threshold' (default-on vs.
> '-Wextra'; or '-fopt-info' 'missed: [...]' or 'note: [...]' instead?),
> the default value for '-mframe-malloc-threshold=[...]' (potentially
> different for the GCC/nvptx target library build vs. user-compiled
> code?), etc.
>
>
>> The library is written in C and
>> I see heap allocated temporaries there but in at least one
>> place a stack one is used:
>>
>> void
>> st_endfile (st_parameter_filepos *fpp)
>> {
>> ...
>>       if (u->current_record)
>>         {
>>           st_parameter_dt dtp;
>>           dtp.common = fpp->common;
>>           memset (&dtp.u.p, 0, sizeof (dtp.u.p));
>>           dtp.u.p.current_unit = u;
>>           next_record (&dtp, 1);
>>
>> that might be a mistake though - maybe it's enough to change that
>> to a heap allocation?  It might also be totally superfluous since
>> only 'u' should matter here ... (not sure if the above is the case
>> you are running into).
>
> (Have not yet looked into that; won't solve the general issue.)
>
>
> Regards
>  Thomas


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-WIP-nvptx-mframe-malloc-threshold-Wframe-malloc-thre.patch
Type: text/x-diff
Size: 16586 bytes
Desc: not available
URL: <https://gcc.gnu.org/pipermail/gcc-patches/attachments/20230111/bdd80ea9/attachment-0001.bin>

