Bug 101544 - [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"
Summary: [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsi...
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 12.0
: P3 normal
Target Milestone: ---
Assignee: Thomas Schwinge
URL:
Keywords: openacc, openmp, wrong-code
Depends on: 92713
Blocks:
  Show dependency treegraph
 
Reported: 2021-07-21 06:40 UTC by Tobias Burnus
Modified: 2023-06-01 19:27 UTC (History)
7 users (show)

See Also:
Host:
Target: nvptx-none,amdgcn-amdhsa
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-07-26 00:00:00


Attachments
Testcase declare_target_base_class.cpp (683 bytes, text/x-csrc)
2021-07-21 06:40 UTC, Tobias Burnus
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Tobias Burnus 2021-07-21 06:40:07 UTC
Created attachment 51184 [details]
Testcase declare_target_base_class.cpp

From https://github.com/SOLLVE/sollve_vv/pull/364
which is for https://github.com/SOLLVE/sollve_vv/issues/105

The code compiles with LLVM >= 12 but fails with LLVM < 12 according to the sollve_vv issue.


The testcase – allegedly OpenMP 4.5 – fails to link with:

unresolved symbol _Znwm
collect2: error: ld returned 1 exit status
mkoffload: fatal error: x86_64-none-linux-gnu-accel-nvptx-none-gcc returned 1 exit status

where  "_Znwm" = "operator new(unsigned long)"


The code has:

+#pragma omp declare target
+class S {
...
+#pragma omp target map(ptr)
+      ptr = new S();
...
+#pragma omp end declare target
Comment 1 Tobias Burnus 2021-07-21 07:58:39 UTC
Looking at it more closely, the problem seems to be that
  libstdc++{.so,.a}
does not exist for neither nvptx nor amdgcn – and the ME handles this
correctly.

The question is what's needed to get better libstdc++ support – either full support or like with libgfortran for nvptx stripped-down minimal version.
Comment 2 Jakub Jelinek 2021-07-21 08:05:53 UTC
Yeah.  For this exact case even just libsupc++.a would be enough, but I bet for other cases we can't away without libstdc++.  Would be nice if it could be configured in some embed-ish way to make it smaller, stuff like std::filesystem on the offloading target is unlikely to be useful etc.
Exceptions aren't supported either, are they.  And I think at least in OpenMP 5.2 that is allowed, throwing an exception in offloading region may be treated like #pragma omp error at(runtime) severity(fatal).
Comment 3 Andrew Stubbs 2021-07-21 09:06:22 UTC
The standalone amdgcn configuration does not support C++. There are a number of technical reasons why it doesn't Just Work, but basically it comes down to no-one ever working on it. Our customers were primarily interested in Fortran with C second.

C++ offloading works fine provided that there are no library calls or exceptions.

Ignoring unsupported C++ language features, for now, I don't think there's any reason why libstdc++ would need to be cut down. We already build the full libgfortran for amdgcn. System calls that make no sense on the GPU were implemented as stubs in Newlib (mostly returning some reasonable errno value), and it would be straight-forward to implement more the same way.

I believe static constructors work (libgfortran uses some), but exception handling does not. I'm not sure what other exotica C++ might need?

As for exceptions, set-jump-long-jump is not implemented because there was no call for it and I didn't know how to handle the GCN register files properly. Not only are they variable-sized, they're also potentially very large: ranging from ~6KB up to ~65KB, I think (102 32-bit scalar, and 256 2048-bit vector registers, for single-threaded mode, but only 80 scalar and 24 vector registers in maximum occupancy mode, in which case per-thread stack space is also quite limited). I'm not sure now the other exception implementations work.
Comment 4 Jonathan Wakely 2021-07-21 09:33:28 UTC
(In reply to Andrew Stubbs from comment #3)
> C++ offloading works fine provided that there are no library calls or
> exceptions.

There's no reason std::pair, std::tuple, std::optional and types like that shouldn't work.

Just making it possible to compile with -fno-rtti -fno-exceptions would be a start, and would avoid the need for exception handling. Libstdc++ headers already work fine with those options, and it should be possible to build the library itself that way (or it's a bug that can be fixed).

> Ignoring unsupported C++ language features, for now, I don't think there's
> any reason why libstdc++ would need to be cut down. We already build the
> full libgfortran for amdgcn. System calls that make no sense on the GPU were
> implemented as stubs in Newlib (mostly returning some reasonable errno
> value), and it would be straight-forward to implement more the same way.

But it's a waste of space in the .so to build lots of symbols that use the stubs.

There are other reasons it might be nice to be able to configure libstdc++ for something in between a full hosted environment and a minimal freestanding one.

> I believe static constructors work (libgfortran uses some), but exception
> handling does not. I'm not sure what other exotica C++ might need?

Ideally, __cxa_atexit and __cxa_thread_atexit for static and thread-local destructors, but we can survive without them (and have not-fully-conforming destruction ordering).
Comment 5 Andrew Stubbs 2021-07-21 10:03:32 UTC
[Note: all of my comments refer to the amdgcn case. nvptx has somewhat different support in this area.]

(In reply to Jonathan Wakely from comment #4)
> But it's a waste of space in the .so to build lots of symbols that use the
> stubs.

DSOs are not supported. This is strictly for static linking only.

> There are other reasons it might be nice to be able to configure libstdc++
> for something in between a full hosted environment and a minimal
> freestanding one.

If it isn't a horrible hack, like libgfortran minimal mode, then fine.

> > I believe static constructors work (libgfortran uses some), but exception
> > handling does not. I'm not sure what other exotica C++ might need?
> 
> Ideally, __cxa_atexit and __cxa_thread_atexit for static and thread-local
> destructors, but we can survive without them (and have not-fully-conforming
> destruction ordering).

Offload kernels are just fragments of programs, so this is tricky in those cases. Libgomp explicitly calls _init_array and _fini_array as single-threaded kernel launches. Actually, it's not clear that deconstruction is in any way interesting, given that code running on the GPU has no external access and the resources are all released when the host program exits.

Similarly, C++ threads are not interesting in the GPU-offload case. There are a fixed number or threads launched on entry and they are managed by libgomp. In theory it would be possible to code gthreads/libstdc++ to use them in standalone mode, but really that mode only exists to facilitate compiler testing.
Comment 6 Richard Biener 2021-07-21 13:27:08 UTC
IIRC libstdc++ had a freestanding mode that could serve as base.
Comment 7 Jonathan Wakely 2021-07-21 13:37:13 UTC
Yes, --disable-libstdcxx-hosted will build the freestanding version of libstdc++
Comment 8 Thomas Schwinge 2022-07-22 13:32:17 UTC
I looked into this, with the goal of estimating the effort necessary for what you might call "GPU support for minimal C++ library".  That is, allow use of standard C++ for host code (in particular, no '-fno-exceptions' required for host code, etc.), and in offloaded regions support a "reasonable subset of C++" (not exactly defined at this time).

For my experimenting, I default GCN, nvptx to '-fno-rtti', '-fno-exceptions', address a few bugs and offload target misconfigurations, small libsupc++ changes, configure GCN, nvptx libstdc++ "freestanding" (manually specifying '--disable-hosted-libstdcxx'; thus just 'libsupc++.a'), hard-code '-foffload-options=nvptx-none=-mptx=6.3\ -malias' to work around GCC PR105018 "[nvptx] Need better alias support", and with some more manual hand-holding etc., I do have the code working that originated this issue, which uses C++ 'new' in OpenMP 'target'.  (Yay.)  Resolving all these things properly, I'm estimating to easily turn into a multi-week effort -- but hey, it's C++, so...  ;-)

---

However...  As soon as you start doing a little bit more C++ in offloaded regions (like, defining a simple 'std::vector<int> v(100);', how dare you...), it becomes apparent that the configuration mismatch between host and offload targets is problematic: GCC/C++ synthesizes/emits constructs that the offload targets are not prepared for, leading to a good mixture ICEs, 'sorry's, undefined symbols.  While certainly there are genuine bugs to be addressed, I'm not convinced anymore that such a setup is going to work easily.

Building full libstdc++ for the offload targets runs into similar ICEs, 'sorry's, still with default '-fno-rtti', '-fno-exceptions'.  We'd need to resolve these issues, by improving the offload target back ends etc., and/or GCC/C++ front end down to libstdc++, as appropriate.  This sounds doable, but not trivial, but we'd still have the issue of, for example, mismatching '-fexceptions' in host code (thus, exceptions codes may appear in the offloading code stream), and '-fno-exceptions' configuration for offload targets.

I'm toying with the idea of looking into "sanitizing" the offloading code stream (easly pass in offload compilation pipeline), so that it's more "amenable" for offload compilation, but again I don't know whether that's going to work out, and how much effort that requires.

---

So, this is just a brain-dump, to report where we are, and/or in case anyone has any great ideas.  :-)
Comment 9 Thomas Schwinge 2022-07-22 14:04:30 UTC
One concrete question, on example of:

(In reply to myself from comment #8)
> hard-code '-foffload-options=nvptx-none=-mptx=6.3\ -malias' to work around GCC PR105018 "[nvptx] Need better alias support"

... this.

Simplified, the GCC/nvptx back end currently doesn't support symbol aliases.

Now, I found that there is a misconfiguration in the GCC/nvptx back end, so that GCC proper thinks that it actually does support these even if the experimental/limited support is disabled (details to be shared later).  This means: with that GCC/nvptx back end misconfiguration fixed, I get libsupc++ built, and the GCC/C++ front end for nvptx target (real target, not offload target!) works fine with the '!TARGET_SUPPORTS_ALIASES' configuration (just like for other "embedded" targets, etc., I suppose).  For example, see 'gcc/cp/optimize.cc:can_alias_cdtor':

    /* Returns true iff we can make the base and complete [cd]tor aliases of
       the same symbol rather than separate functions.  */
    
    static bool
    can_alias_cdtor (tree fn)
    {
      /* If aliases aren't supported by the assembler, fail.  */
      if (!TARGET_SUPPORTS_ALIASES)
        return false;
    [...]

That's fine for GCC/C++ for nvptx target.  However, in a nvptx offload configuration, the GCC/C++ front end runs on the host (x86_64-pc-linux-gnu or whatever), and thus has a 'TARGET_SUPPORTS_ALIASES' configuration, and GCC/C++ front end happily generates symbol aliases -- which the nvptx offload compilation then later falls over.

This problem (and I suspect a few other similar ones) could be avoided if the GCC/C++ front end didn't look up target properties (such as 'TARGET_SUPPORTS_ALIASES'), and instead left such things in some "defer" state, and they get resolved/lowered per actual target properties once the offloading code stream has been spawned off.  (We've similarly changed a few other items, where the lowering was "too early", as far as I remember.)  At that point then, the host would decide for 'TARGET_SUPPORTS_ALIASES', and the nvptx offload target would decide for '!TARGET_SUPPORTS_ALIASES', and lower the "defer" construct as appropriate.  This of course is more difficult if the target property causes more compilcated decisions in the GCC/C++ front end.

But, generally (and of course I understand it eventually has to be decided case by case), is it acceptable to defer such early target property lookup until later?  Of course, next problem then is potential missed optimizations due to later lowering, diagnostics changes, and all these things, uh...

(Like, poison all "early" target macro etc. usage, and then selectively re-enable when appropriate (data types have to match between host and offload targets, etc.), and otherwise adjust code to resolve "late".  Sounds like a fun multi-year project, doesn't it...)

Thoughts, on this specific example as well as generally?
Comment 10 Richard Biener 2022-07-25 11:20:16 UTC
Things like TARGET_SUPPORTS_ALIASES (or comdat or ...) really are core features of the callgraph which is built very early.  Some of the actual transforms and
optimizations the C++ frontend does _might_ be deferable (like aliases could
just become clones if later on we discover they are not available).  I'm not
sure though that you will not run into properties that are too hard to change.

After all we already require quite some matching of hoast to offload target.

So what I'd probably try to do is assume TARGET_SUPPORTS_ALIASES early and
"undo" that at IPA transform time by duplicating bodies (or emitting
thunks).  But as you said, it's just the first knob to tackle.

Honza?