Created attachment 51184 [details] Testcase declare_target_base_class.cpp From https://github.com/SOLLVE/sollve_vv/pull/364 which is for https://github.com/SOLLVE/sollve_vv/issues/105 The code compiles with LLVM >= 12 but fails with LLVM < 12 according to the sollve_vv issue. The testcase – allegedly OpenMP 4.5 – fails to link with: unresolved symbol _Znwm collect2: error: ld returned 1 exit status mkoffload: fatal error: x86_64-none-linux-gnu-accel-nvptx-none-gcc returned 1 exit status where "_Znwm" = "operator new(unsigned long)" The code has: +#pragma omp declare target +class S { ... +#pragma omp target map(ptr) + ptr = new S(); ... +#pragma omp end declare target
Looking at it more closely, the problem seems to be that libstdc++{.so,.a} does not exist for neither nvptx nor amdgcn – and the ME handles this correctly. The question is what's needed to get better libstdc++ support – either full support or like with libgfortran for nvptx stripped-down minimal version.
Yeah. For this exact case even just libsupc++.a would be enough, but I bet for other cases we can't away without libstdc++. Would be nice if it could be configured in some embed-ish way to make it smaller, stuff like std::filesystem on the offloading target is unlikely to be useful etc. Exceptions aren't supported either, are they. And I think at least in OpenMP 5.2 that is allowed, throwing an exception in offloading region may be treated like #pragma omp error at(runtime) severity(fatal).
The standalone amdgcn configuration does not support C++. There are a number of technical reasons why it doesn't Just Work, but basically it comes down to no-one ever working on it. Our customers were primarily interested in Fortran with C second. C++ offloading works fine provided that there are no library calls or exceptions. Ignoring unsupported C++ language features, for now, I don't think there's any reason why libstdc++ would need to be cut down. We already build the full libgfortran for amdgcn. System calls that make no sense on the GPU were implemented as stubs in Newlib (mostly returning some reasonable errno value), and it would be straight-forward to implement more the same way. I believe static constructors work (libgfortran uses some), but exception handling does not. I'm not sure what other exotica C++ might need? As for exceptions, set-jump-long-jump is not implemented because there was no call for it and I didn't know how to handle the GCN register files properly. Not only are they variable-sized, they're also potentially very large: ranging from ~6KB up to ~65KB, I think (102 32-bit scalar, and 256 2048-bit vector registers, for single-threaded mode, but only 80 scalar and 24 vector registers in maximum occupancy mode, in which case per-thread stack space is also quite limited). I'm not sure now the other exception implementations work.
(In reply to Andrew Stubbs from comment #3) > C++ offloading works fine provided that there are no library calls or > exceptions. There's no reason std::pair, std::tuple, std::optional and types like that shouldn't work. Just making it possible to compile with -fno-rtti -fno-exceptions would be a start, and would avoid the need for exception handling. Libstdc++ headers already work fine with those options, and it should be possible to build the library itself that way (or it's a bug that can be fixed). > Ignoring unsupported C++ language features, for now, I don't think there's > any reason why libstdc++ would need to be cut down. We already build the > full libgfortran for amdgcn. System calls that make no sense on the GPU were > implemented as stubs in Newlib (mostly returning some reasonable errno > value), and it would be straight-forward to implement more the same way. But it's a waste of space in the .so to build lots of symbols that use the stubs. There are other reasons it might be nice to be able to configure libstdc++ for something in between a full hosted environment and a minimal freestanding one. > I believe static constructors work (libgfortran uses some), but exception > handling does not. I'm not sure what other exotica C++ might need? Ideally, __cxa_atexit and __cxa_thread_atexit for static and thread-local destructors, but we can survive without them (and have not-fully-conforming destruction ordering).
[Note: all of my comments refer to the amdgcn case. nvptx has somewhat different support in this area.] (In reply to Jonathan Wakely from comment #4) > But it's a waste of space in the .so to build lots of symbols that use the > stubs. DSOs are not supported. This is strictly for static linking only. > There are other reasons it might be nice to be able to configure libstdc++ > for something in between a full hosted environment and a minimal > freestanding one. If it isn't a horrible hack, like libgfortran minimal mode, then fine. > > I believe static constructors work (libgfortran uses some), but exception > > handling does not. I'm not sure what other exotica C++ might need? > > Ideally, __cxa_atexit and __cxa_thread_atexit for static and thread-local > destructors, but we can survive without them (and have not-fully-conforming > destruction ordering). Offload kernels are just fragments of programs, so this is tricky in those cases. Libgomp explicitly calls _init_array and _fini_array as single-threaded kernel launches. Actually, it's not clear that deconstruction is in any way interesting, given that code running on the GPU has no external access and the resources are all released when the host program exits. Similarly, C++ threads are not interesting in the GPU-offload case. There are a fixed number or threads launched on entry and they are managed by libgomp. In theory it would be possible to code gthreads/libstdc++ to use them in standalone mode, but really that mode only exists to facilitate compiler testing.
IIRC libstdc++ had a freestanding mode that could serve as base.
Yes, --disable-libstdcxx-hosted will build the freestanding version of libstdc++
I looked into this, with the goal of estimating the effort necessary for what you might call "GPU support for minimal C++ library". That is, allow use of standard C++ for host code (in particular, no '-fno-exceptions' required for host code, etc.), and in offloaded regions support a "reasonable subset of C++" (not exactly defined at this time). For my experimenting, I default GCN, nvptx to '-fno-rtti', '-fno-exceptions', address a few bugs and offload target misconfigurations, small libsupc++ changes, configure GCN, nvptx libstdc++ "freestanding" (manually specifying '--disable-hosted-libstdcxx'; thus just 'libsupc++.a'), hard-code '-foffload-options=nvptx-none=-mptx=6.3\ -malias' to work around GCC PR105018 "[nvptx] Need better alias support", and with some more manual hand-holding etc., I do have the code working that originated this issue, which uses C++ 'new' in OpenMP 'target'. (Yay.) Resolving all these things properly, I'm estimating to easily turn into a multi-week effort -- but hey, it's C++, so... ;-) --- However... As soon as you start doing a little bit more C++ in offloaded regions (like, defining a simple 'std::vector<int> v(100);', how dare you...), it becomes apparent that the configuration mismatch between host and offload targets is problematic: GCC/C++ synthesizes/emits constructs that the offload targets are not prepared for, leading to a good mixture ICEs, 'sorry's, undefined symbols. While certainly there are genuine bugs to be addressed, I'm not convinced anymore that such a setup is going to work easily. Building full libstdc++ for the offload targets runs into similar ICEs, 'sorry's, still with default '-fno-rtti', '-fno-exceptions'. We'd need to resolve these issues, by improving the offload target back ends etc., and/or GCC/C++ front end down to libstdc++, as appropriate. This sounds doable, but not trivial, but we'd still have the issue of, for example, mismatching '-fexceptions' in host code (thus, exceptions codes may appear in the offloading code stream), and '-fno-exceptions' configuration for offload targets. I'm toying with the idea of looking into "sanitizing" the offloading code stream (easly pass in offload compilation pipeline), so that it's more "amenable" for offload compilation, but again I don't know whether that's going to work out, and how much effort that requires. --- So, this is just a brain-dump, to report where we are, and/or in case anyone has any great ideas. :-)
One concrete question, on example of: (In reply to myself from comment #8) > hard-code '-foffload-options=nvptx-none=-mptx=6.3\ -malias' to work around GCC PR105018 "[nvptx] Need better alias support" ... this. Simplified, the GCC/nvptx back end currently doesn't support symbol aliases. Now, I found that there is a misconfiguration in the GCC/nvptx back end, so that GCC proper thinks that it actually does support these even if the experimental/limited support is disabled (details to be shared later). This means: with that GCC/nvptx back end misconfiguration fixed, I get libsupc++ built, and the GCC/C++ front end for nvptx target (real target, not offload target!) works fine with the '!TARGET_SUPPORTS_ALIASES' configuration (just like for other "embedded" targets, etc., I suppose). For example, see 'gcc/cp/optimize.cc:can_alias_cdtor': /* Returns true iff we can make the base and complete [cd]tor aliases of the same symbol rather than separate functions. */ static bool can_alias_cdtor (tree fn) { /* If aliases aren't supported by the assembler, fail. */ if (!TARGET_SUPPORTS_ALIASES) return false; [...] That's fine for GCC/C++ for nvptx target. However, in a nvptx offload configuration, the GCC/C++ front end runs on the host (x86_64-pc-linux-gnu or whatever), and thus has a 'TARGET_SUPPORTS_ALIASES' configuration, and GCC/C++ front end happily generates symbol aliases -- which the nvptx offload compilation then later falls over. This problem (and I suspect a few other similar ones) could be avoided if the GCC/C++ front end didn't look up target properties (such as 'TARGET_SUPPORTS_ALIASES'), and instead left such things in some "defer" state, and they get resolved/lowered per actual target properties once the offloading code stream has been spawned off. (We've similarly changed a few other items, where the lowering was "too early", as far as I remember.) At that point then, the host would decide for 'TARGET_SUPPORTS_ALIASES', and the nvptx offload target would decide for '!TARGET_SUPPORTS_ALIASES', and lower the "defer" construct as appropriate. This of course is more difficult if the target property causes more compilcated decisions in the GCC/C++ front end. But, generally (and of course I understand it eventually has to be decided case by case), is it acceptable to defer such early target property lookup until later? Of course, next problem then is potential missed optimizations due to later lowering, diagnostics changes, and all these things, uh... (Like, poison all "early" target macro etc. usage, and then selectively re-enable when appropriate (data types have to match between host and offload targets, etc.), and otherwise adjust code to resolve "late". Sounds like a fun multi-year project, doesn't it...) Thoughts, on this specific example as well as generally?
Things like TARGET_SUPPORTS_ALIASES (or comdat or ...) really are core features of the callgraph which is built very early. Some of the actual transforms and optimizations the C++ frontend does _might_ be deferable (like aliases could just become clones if later on we discover they are not available). I'm not sure though that you will not run into properties that are too hard to change. After all we already require quite some matching of hoast to offload target. So what I'd probably try to do is assume TARGET_SUPPORTS_ALIASES early and "undo" that at IPA transform time by duplicating bodies (or emitting thunks). But as you said, it's just the first knob to tackle. Honza?