Bug 96265 - offloading to nvptx-none from aarch64-linux-gnu (and riscv*-linux-gnu) does not work
Summary: offloading to nvptx-none from aarch64-linux-gnu (and riscv*-linux-gnu) does n...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: ipa
Version: 11.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: ice-on-valid-code, lto, openacc, openmp
Duplicates: 114174
Depends on:
Blocks:
 
Reported: 2020-07-21 11:55 UTC by Matthias Klose
Modified: 2024-10-08 07:54 UTC

See Also:
Host: aarch64-linux
Target: nvptx
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-04-09 00:00:00


Attachments
Verbose compile output after rebuilding the latest GCC snapshot with the patch applied (2.07 KB, text/plain)
2024-08-08 15:01 UTC, Jan André Reuter

Description Matthias Klose 2020-07-21 11:55:22 UTC
The nvptx target is usually built on x86_64-linux-gnu, but a web search shows that these GPUs are also used in aarch64-linux-gnu and powerpc64le-linux-gnu systems.

Building the nvptx offload compiler on powerpc64le gives reasonable test results for libgomp, and I see at least one powerpc64le-related commit:

2015-10-09  James Norris  <jnorris@codesourcery.com>

        * config/rs6000/rs6000.c (rs6000_offload_options): New.
        (TARGET_OFFLOAD_OPTIONS): New.

However, adding an aarch64_offload_options hook is not enough on AArch64; two types of issues are still triggered in the testsuite:

FAIL: libgomp.c/../libgomp.c-c++-common/for-11.c (test for excess errors)
Excess errors:
lto1: fatal error: nvptx-none - 0-bit integer numbers unsupported (mode 'SI')


and:

UNRESOLVED: libgomp.c/../libgomp.c-c++-common/for-9.c compilation failed to produce executable
UNSUPPORTED: libgomp.c/../libgomp.c-c++-common/function-not-offloaded-aux.c

spawn -ignore SIGHUP /home/ubuntu/gcc-10-10.1.0/build/./gcc/xgcc -B/home/ubuntu/gcc-10-10.1.0/build/./gcc/ -B/usr/aarch64-linux-gnu/b
in/ -B/usr/aarch64-linux-gnu/lib/ -isystem /usr/aarch64-linux-gnu/include -isystem /usr/aarch64-linux-gnu/sys-include -isystem /home/
ubuntu/gcc-10-10.1.0/build/sys-include -fchecking=1 offload_device_nonshared_as411951.c -B/home/ubuntu/gcc-10-10.1.0/build/aarch64-li
nux-gnu/./libgomp/ -B/home/ubuntu/gcc-10-10.1.0/build/aarch64-linux-gnu/./libgomp/.libs -I/home/ubuntu/gcc-10-10.1.0/build/aarch64-li
nux-gnu/./libgomp -I../../../../src/libgomp/testsuite/../../include -I../../../../src/libgomp/testsuite/.. -Lno -fmessage-length=0 -f
no-diagnostics-show-caret -Wno-hsa -fdiagnostics-color=never -B/home/ubuntu/gcc-10-10.1.0/debian/tmp-nvptx/usr/libexec/gcc/aarch64-li
nux-gnu/10 -B/home/ubuntu/gcc-10-10.1.0/debian/tmp-nvptx/usr/bin -fopenmp -L/home/ubuntu/gcc-10-10.1.0/build/aarch64-linux-gnu/./libg
omp/.libs -lm -o offload_device_nonshared_as411951.exe
lto1: internal compiler error: bytecode stream: string too long for the string table
0x62559f string_for_index
        ../../src-nvptx/gcc/data-streamer-in.c:53
0x62559f bp_unpack_indexed_string(data_in*, bitpack_d*, unsigned int*)
        ../../src-nvptx/gcc/data-streamer-in.c:97
0x87a39b lto_input_mode_table(lto_file_decl_data*)
        ../../src-nvptx/gcc/lto-streamer-in.c:1685
0x5a076f lto_file_finalize
        ../../src-nvptx/gcc/lto/lto-common.c:2217
0x5a076f lto_create_files_from_ids
        ../../src-nvptx/gcc/lto/lto-common.c:2240
0x5a076f lto_file_read
        ../../src-nvptx/gcc/lto/lto-common.c:2295
0x5a076f read_cgraph_and_symbols(unsigned int, char const**)
        ../../src-nvptx/gcc/lto/lto-common.c:2747
0x58f523 lto_main()
        ../../src-nvptx/gcc/lto/lto.c:625
Please submit a full bug report,
with preprocessed source if appropriate.
Comment 1 Matthias Klose 2020-07-23 13:58:57 UTC
patch for the target hook posted at
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550534.html
Comment 2 Tobias Burnus 2020-11-13 09:12:18 UTC
(Remove powerpc64le-linux-gnu from the summary as this PR is only about aarch64-linux and GCC is known to work on powerpc64le-linux-gnu.)

(In reply to Matthias Klose from comment #1)
> patch for the target hook posted at
> https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550534.html

This patch has been committed on Fri Jul 24 16:17:44 2020 +0200 as
https://gcc.gnu.org/g:29a14a1a907947fe9e43bce62d3468559f17da97
Comment 3 Xiao Ma 2023-10-23 14:30:38 UTC
I think this issue and #111937 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111937) have the same root cause: 
aarch64 also sets NUM_POLY_INT_COEFFS to 2, which makes it incompatible with the default value for nvptx (which is 1).
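
A toy model of how such a mismatch could corrupt a stream, assuming values are written as a flat run of coefficients. `ToyPoly`, `write_all` and `read_naive` are hypothetical names for illustration only, not GCC functions, and the real LTO stream format is more involved than this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a polynomial integer with N coefficients. N plays the
// role of NUM_POLY_INT_COEFFS (2 on the aarch64 host, 1 on nvptx).
template <std::size_t N>
struct ToyPoly {
  std::int64_t coeffs[N];
};

// Writer: emit every coefficient of every value into one flat stream.
template <std::size_t N>
std::vector<std::int64_t> write_all(const std::vector<ToyPoly<N>> &values) {
  std::vector<std::int64_t> stream;
  for (const ToyPoly<N> &v : values)
    for (std::size_t i = 0; i < N; ++i)
      stream.push_back(v.coeffs[i]);
  return stream;
}

// Reader that assumes the stream was written with its own N, which is
// only correct when writer and reader agree on the coefficient count.
template <std::size_t N>
std::vector<ToyPoly<N>> read_naive(const std::vector<std::int64_t> &stream,
                                   std::size_t count) {
  assert(count * N <= stream.size());
  std::vector<ToyPoly<N>> out;
  std::size_t pos = 0;
  for (std::size_t k = 0; k < count; ++k) {
    ToyPoly<N> v{};
    for (std::size_t i = 0; i < N; ++i)
      v.coeffs[i] = stream[pos++];
    out.push_back(v);
  }
  return out;
}
```

Writing the two values 16 and 32 with `N = 2` yields the stream `16, 0, 32, 0`. Reading two values back with `N = 1` yields 16 and then 0 rather than 32, because the reader consumes the host's second coefficient as if it were the next value.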
Comment 4 Andrew Pinski 2024-02-29 17:50:26 UTC
*** Bug 114174 has been marked as a duplicate of this bug. ***
Comment 5 Andrew Pinski 2024-04-09 01:14:25 UTC
Confirmed.
Comment 6 Andrew Pinski 2024-04-09 01:14:50 UTC
*** Bug 111937 has been marked as a duplicate of this bug. ***
Comment 7 Jan André Reuter 2024-07-11 15:36:37 UTC
I ran into the same issue both with GCC 12.3.0 and 14.1.0 on a GH200 system. However, the message with 14.1.0 is a bit different:

Segmentation fault
0xafc473 crash_signal
    ../.././gcc/toplev.cc:319
0x145ce3b pp_quoted_string
    ../.././gcc/pretty-print.cc:2284
0x145e333 pp_format(pretty_printer*, text_info*, urlifier const*)
    ../.././gcc/pretty-print.cc:1634
0x144b003 diagnostic_context::report_diagnostic(diagnostic_info*)
    ../.././gcc/diagnostic.cc:1611
0x144b3cf diagnostic_impl
    ../.././gcc/diagnostic.cc:1774
0x144d4b7 fatal_error(unsigned int, char const*, ...)
    ../.././gcc/diagnostic.cc:2217
0x9ad95f lto_input_mode_table(lto_file_decl_data*)
    ../.././gcc/lto-streamer-in.cc:2121
0x67f2bf lto_file_finalize
    ../.././gcc/lto/lto-common.cc:2278
0x67f2bf lto_create_files_from_ids
    ../.././gcc/lto/lto-common.cc:2302
0x67f2bf lto_file_read
    ../.././gcc/lto/lto-common.cc:2357
0x67f2bf read_cgraph_and_symbols(unsigned int, char const**)
    ../.././gcc/lto/lto-common.cc:2805
0x66adff lto_main()
    ../.././gcc/lto/lto.cc:656
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
nvptx mkoffload: fatal error: aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /p/usersoftware/[...]/easybuild/jedi/software/GCCcore/14.1.0/libexec/gcc/aarch64-unknown-linux-gnu/14.1.0//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/p/usersoftware/[...]/easybuild/jedi/software/binutils/2.42-GCCcore-14.1.0/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
Comment 8 GCC Commits 2024-08-07 18:21:39 UTC
The master branch has been updated by Prathamesh Kulkarni <prathamesh3492@gcc.gnu.org>:

https://gcc.gnu.org/g:38900247f3880d6eca2e364a000e5898f8deae64

commit r15-2801-g38900247f3880d6eca2e364a000e5898f8deae64
Author: Prathamesh Kulkarni <prathameshk@nvidia.com>
Date:   Wed Aug 7 23:45:38 2024 +0530

    Partially support streaming of poly_int for offloading.
    
    When offloading is enabled, the patch streams out host
    NUM_POLY_INT_COEFFS, and changes streaming in as follows:
    
    if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS)
    {
      for (i = 0; i < host_num_poly_int_coeffs; i++)
        poly_int.coeffs[i] = stream_in coeff;
      for (; i < NUM_POLY_INT_COEFFS; i++)
        poly_int.coeffs[i] = 0;
    }
    else
    {
      for (i = 0; i < NUM_POLY_INT_COEFFS; i++)
        poly_int.coeffs[i] = stream_in coeff;
    
      /* Ensure that degree of poly_int <= accel NUM_POLY_INT_COEFFS.  */
      for (; i < host_num_poly_int_coeffs; i++)
        {
          val = stream_in coeff;
          if (val != 0)
            error ();
        }
    }
    
    gcc/ChangeLog:
            PR ipa/96265
            PR ipa/111937
            * data-streamer-in.cc (streamer_read_poly_uint64): Remove code for
            streaming, and call poly_int_read_common instead.
            (streamer_read_poly_int64): Likewise.
            * data-streamer.cc (host_num_poly_int_coeffs): Conditionally define
            new variable if ACCEL_COMPILER is defined.
            * data-streamer.h (host_num_poly_int_coeffs): Declare.
            (poly_int_read_common): New function template.
            (bp_unpack_poly_value): Remove code for streaming and call
            poly_int_read_common instead.
            * lto-streamer-in.cc (lto_input_mode_table): Stream-in host
            NUM_POLY_INT_COEFFS into host_num_poly_int_coeffs if ACCEL_COMPILER
            is defined.
            * lto-streamer-out.cc (lto_write_mode_table): Stream out
            NUM_POLY_INT_COEFFS if offloading is enabled.
            * poly-int.h (MAX_NUM_POLY_INT_COEFFS_BITS): New macro.
            * tree-streamer-in.cc (lto_input_ts_poly_tree_pointers): Adjust
            streaming-in of poly_int.
    
    Signed-off-by: Prathamesh Kulkarni <prathameshk@nvidia.com>
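
The pseudocode in the commit message can be turned into a small runnable sketch. This is a simplified model under stated assumptions, not the code from the patch: `AccelPoly`, `read_poly` and `kAccelCoeffs` are hypothetical names, the stream is modelled as a plain vector of integers, and the `error ()` call is modelled as a C++ exception.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the accelerator's NUM_POLY_INT_COEFFS.
constexpr std::size_t kAccelCoeffs = 1;

struct AccelPoly {
  std::int64_t coeffs[kAccelCoeffs];
};

// Read one value that the host wrote with host_coeffs coefficients,
// starting at stream[pos]. pos is advanced past all host coefficients,
// so the next read starts at the right place.
AccelPoly read_poly(const std::vector<std::int64_t> &stream, std::size_t &pos,
                    std::size_t host_coeffs) {
  assert(pos + host_coeffs <= stream.size());
  AccelPoly result{};
  std::size_t i = 0;
  if (host_coeffs <= kAccelCoeffs) {
    // Host has no more coefficients than we do: copy, then zero-fill.
    for (; i < host_coeffs; ++i)
      result.coeffs[i] = stream[pos++];
    for (; i < kAccelCoeffs; ++i)
      result.coeffs[i] = 0;
  } else {
    // Host has more coefficients: keep ours, require the rest to be zero.
    for (; i < kAccelCoeffs; ++i)
      result.coeffs[i] = stream[pos++];
    for (; i < host_coeffs; ++i)
      if (stream[pos++] != 0)
        throw std::domain_error("value not representable on accelerator");
  }
  return result;
}
```

With the stream `16, 0, 32, 0` and `host_coeffs = 2`, two successive calls return 16 and 32 and leave `pos` at 4. A stream such as `16, 4` throws, since the second coefficient is non-zero and cannot be represented with a single coefficient.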
Comment 9 Jan André Reuter 2024-08-08 14:58:15 UTC
Thanks a lot for the patch Prathamesh Kulkarni. There seems to be some progress, which is great to see!

I've tried your patch. I applied it to the latest snapshot and also to GCC 14.2.0 and GCC 14.1.0 to see what happens. In general, all three versions get a bit further towards working offloading. The GCC 15 snapshot gets furthest, but now fails with an unrecognized command-line option error. In all cases, I built GCC with the EasyBlock of EasyBuild, though I'm not sure whether that is why the flag is passed.

GCC 14.2.0 (built with EasyBuild, applied patch):
====

```console
$ gcc -fopenmp -foffload=nvptx-none test.c
lto1: internal compiler error: in lto_read_decls, at lto/lto-common.cc:1970
0x68110f lto_read_decls
	../.././gcc/lto/lto-common.cc:1970
0x68110f lto_file_finalize
	../.././gcc/lto/lto-common.cc:2292
0x68110f lto_create_files_from_ids
	../.././gcc/lto/lto-common.cc:2302
0x68110f lto_file_read
	../.././gcc/lto/lto-common.cc:2357
0x68110f read_cgraph_and_symbols(unsigned int, char const**)
	../.././gcc/lto/lto-common.cc:2805
0x66b13f lto_main()
	../.././gcc/lto/lto.cc:656
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
nvptx mkoffload: fatal error: aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /p/usersoftware/cstpa/reuter1/EasyBuild/easybuild/jedi/software/GCCcore/14.2.0/libexec/gcc/aarch64-unknown-linux-gnu/14.2.0//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
```

---

GCC 14.1.0 (built with EasyBuild, applied patch):
====

```console
$ gcc -fopenmp -foffload=nvptx-none test.c
lto1: internal compiler error: in lto_read_decls, at lto/lto-common.cc:1970
0x680eaf lto_read_decls
        ../.././gcc/lto/lto-common.cc:1970
0x680eaf lto_file_finalize
        ../.././gcc/lto/lto-common.cc:2292
0x680eaf lto_create_files_from_ids
        ../.././gcc/lto/lto-common.cc:2302
0x680eaf lto_file_read
        ../.././gcc/lto/lto-common.cc:2357
0x680eaf read_cgraph_and_symbols(unsigned int, char const**)
        ../.././gcc/lto/lto-common.cc:2805
0x66aebf lto_main()
        ../.././gcc/lto/lto.cc:656
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
nvptx mkoffload: fatal error: aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /p/usersoftware/cstpa/reuter1/EasyBuild/easybuild/jedi/software/GCCcore/14.1.0/libexec/gcc/aarch64-unknown-linux-gnu/14.1.0//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

```

The compiler no longer segfaults, but compilation still fails in LTO. The behaviour is the same for 14.1.0 and 14.2.0.

---

GCC 15.0.0 (gcc-15-20240804, built with EasyBuild using adapted GCC 14.2.0 EasyConfig and the patch applied):
====

```console
$ gcc -fopenmp -foffload=nvptx-none test.c
gcc: error: unrecognized command-line option ‘-m64’
nvptx mkoffload: fatal error: gcc returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /p/usersoftware/cstpa/reuter1/EasyBuild/easybuild/jedi/software/GCCcore/15.0.0/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
$ gcc --version
gcc (GCC) 15.0.0 20240804 (experimental)
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
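
The `test.c` used above is not included in this report. For anyone wanting to reproduce, a minimal stand-in might look like the following sketch. It is an assumption, not the reporter's file, and `offload_sum` is a hypothetical name. It is written so that it also gives the right answer when the pragma is ignored or when execution falls back to the host.

```cpp
#include <cassert>

// Sum 1..n inside an OpenMP target region. If no offload device is
// available, or the code is built without -fopenmp, the loop simply
// runs on the host and produces the same result.
long long offload_sum(int n) {
  assert(n >= 0);
  long long sum = 0;
#pragma omp target teams distribute parallel for reduction(+ : sum) map(tofrom : sum)
  for (int i = 1; i <= n; ++i)
    sum += i;
  return sum;
}
```

Calling `offload_sum(100)` from `main` and checking for 5050, then building with `-fopenmp -foffload=nvptx-none`, should exercise the same lto1/mkoffload path as the commands above.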
Comment 10 Jan André Reuter 2024-08-08 15:01:11 UTC
Created attachment 58875 [details]
Verbose compile output after rebuilding the latest GCC snapshot with the patch applied

New error message when trying to build OpenMP offload code on aarch64 with the latest GCC snapshot and the patch applied.

The build mainly failed due to 'unrecognized command-line option ‘-m64’'.
Comment 11 prathamesh3492 2024-08-08 15:04:37 UTC
Hi,
Yes, those two errors are expected.
I posted RFC discussion about AArch64/nvptx offloading issues here:
https://gcc.gnu.org/pipermail/gcc/2024-July/244466.html

For the unrecognized command line -m64 option, I have a WIP patch posted upstream:
https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659866.html

Thanks,
Prathamesh
Comment 12 Jan André Reuter 2024-08-09 11:50:04 UTC
> Hi,
> Yes, those two errors are expected.
> I posted RFC discussion about AArch64/nvptx offloading issues here:
> https://gcc.gnu.org/pipermail/gcc/2024-July/244466.html
> 
> For the unrecognized command line -m64 option, I have a WIP patch posted upstream:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659866.html
> 
> Thanks,
> Prathamesh

Thanks a lot for the update and your work towards resolving these issues. It's really appreciated.

I was not aware of the ongoing discussion and the WIP patch. I'll keep an eye on those two and continue to test new patches when they're pushed to master. There's not much more I can do unfortunately, as I'm not familiar with GCC internals at all.
Comment 13 Thomas Schwinge 2024-09-10 14:39:48 UTC
commit r15-3034-gdb2e9a2a46f64b037494e8300c46f2d90a9fa55c
Author: Prathamesh Kulkarni <prathameshk@nvidia.com>
Date:   Tue Aug 20 12:54:02 2024 +0530

    [optc-save-gen.awk] Fix streaming of command line options for offloading.

    The patch modifies optc-save-gen.awk to generate if (!lto_stream_offload_p)
    check before streaming out target-specific opt in cl_optimization_stream_out,
    when offloading is enabled.

    Also, it modifies cl_optimization_stream_in to issue an error during build time
    if accelerator backend defines a target-specific Optimization option. This
    restriction currently is in place to maintain consistency for streaming of
    Optimization options between host and accelerator. A proper fix would be
    to merge target-specific Optimization options for host and accelerators
    enabled for offloading.

    gcc/ChangeLog:
	    * optc-save-gen.awk: New array var_target_opt. Use it to generate
	    if (!lto_stream_offload_p) check in cl_optimization_stream_out,
	    and generate a diagnostic with #error if accelerator backend uses
	    Optimization for target-specific options in cl_optimization_stream_in.

    Signed-off-by: Prathamesh Kulkarni <prathameshk@nvidia.com>
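
A rough sketch of the guard described in this commit, assuming a much simpler option model than the generated code really uses. `ToyOption`, `stream_out` and `stream_in` are hypothetical names and the option names in the usage note are made up. Only the idea of skipping host-only options when streaming for an offload compiler is taken from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical option record. target_specific marks options that only
// the host back end defines.
struct ToyOption {
  std::string name;
  std::int64_t value;
  bool target_specific;
};

// Host side: write option values, skipping target-specific ones when
// the stream is destined for an offload compiler.
std::vector<std::int64_t> stream_out(const std::vector<ToyOption> &opts,
                                     bool stream_offload_p) {
  std::vector<std::int64_t> out;
  for (const ToyOption &o : opts) {
    if (stream_offload_p && o.target_specific)
      continue;
    out.push_back(o.value);
  }
  return out;
}

// Accelerator side: it knows only the generic options, so it expects
// exactly one value per generic option, in the same order.
std::vector<std::int64_t> stream_in(const std::vector<std::int64_t> &stream,
                                    std::size_t generic_count) {
  assert(stream.size() == generic_count);
  return stream;
}
```

With two generic options and one host-only option, `stream_out(opts, true)` emits two values, which is what `stream_in` expects. Without the guard the host would emit three values and the reader would be out of step.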
Comment 14 Thomas Schwinge 2024-09-10 14:40:17 UTC
commit r15-3093-g792adb8d222d0d1d16b182871e105f47823b8e72
Author: Prathamesh Kulkarni <prathameshk@nvidia.com>
Date:   Thu Aug 22 19:25:20 2024 +0530

    Recompute TYPE_MODE and DECL_MODE for aggregate type for accelerator.

    The patch streams out VOIDmode for aggregate types with offloading enabled,
    and recomputes appropriate TYPE_MODE and DECL_MODE while streaming-in on accel
    side. The rationale for this change is to avoid streaming out host-specific
    modes that may be used for aggregate types, which may not be representable on
    the accelerator. For eg, AArch64 uses OImode for ARRAY_TYPE whose size is 256-bits,
    and nvptx doesn't have OImode, and thus ends up emitting an error from
    lto_input_mode_table.

    gcc/ChangeLog:
	    * lto-streamer-in.cc: (lto_read_tree_1): Set DECL_MODE (expr) to
	    TREE_TYPE (TYPE_MODE (expr)) if TREE_TYPE (expr) is aggregate type and
	    offloading is enabled.
	    * stor-layout.cc (layout_type): Move computation of mode for
	    ARRAY_TYPE from ...
	    (compute_array_mode): ... to here.
	    * stor-layout.h (compute_array_mode): Declare.
	    * tree-streamer-in.cc: Include stor-layout.h.
	    (unpack_ts_common_value_fields): Call compute_array_mode if offloading
	    is enabled.
	    * tree-streamer-out.cc (pack_ts_fixed_cst_value_fields): Stream out
	    VOIDmode if decl has aggregate type and offloading is enabled.
	    (pack_ts_type_common_value_fields): Stream out VOIDmode for aggregate
	    type if offloading is enabled.

    Signed-off-by: Prathamesh Kulkarni <prathameshk@nvidia.com>
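
The idea of streaming a placeholder mode and recomputing it on the accelerator can be modelled with a toy example. Everything here is an illustrative assumption: `ToyMode`, `ToyType`, `mode_to_stream`, `accel_mode_for_size` and `read_type` are hypothetical names, and the rule used to pick an accelerator mode (integer modes up to 64 bits, otherwise a block mode) is invented for the sketch and is not a description of what `compute_array_mode` or nvptx actually does.

```cpp
#include <cassert>
#include <cstdint>

// Toy machine modes. Void is the placeholder, Blk means "no scalar mode".
enum class ToyMode { Void, SI, DI, OI, Blk };

struct ToyType {
  bool aggregate;
  std::uint32_t size_bits;
  ToyMode mode;
};

// Host side: for aggregates, stream the placeholder instead of a mode
// that the accelerator may not have.
ToyMode mode_to_stream(const ToyType &t, bool offloading) {
  return (offloading && t.aggregate) ? ToyMode::Void : t.mode;
}

// Toy accelerator that only has 32-bit and 64-bit integer modes.
ToyMode accel_mode_for_size(std::uint32_t size_bits) {
  switch (size_bits) {
    case 32: return ToyMode::SI;
    case 64: return ToyMode::DI;
    default: return ToyMode::Blk;
  }
}

// Accelerator side: recompute the mode of an aggregate locally.
ToyType read_type(bool aggregate, std::uint32_t size_bits, ToyMode streamed) {
  // In this model only aggregates may arrive with the placeholder mode.
  assert(aggregate || streamed != ToyMode::Void);
  ToyType t{aggregate, size_bits, streamed};
  if (aggregate && streamed == ToyMode::Void)
    t.mode = accel_mode_for_size(size_bits);
  return t;
}
```

A 256-bit array that has the `OI` mode on the host is streamed as `Void` and comes out as `Blk` on the toy accelerator, so the reader never sees a mode it cannot represent.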
Comment 15 Thomas Schwinge 2024-09-10 14:40:56 UTC
commit r15-3488-gae88e91938af364ef5613e5461b12b484b578bc5
Author: Prathamesh Kulkarni <prathameshk@nvidia.com>
Date:   Thu Sep 5 18:52:53 2024 +0530

    Avoid ICE when passing VLA vector to accelerator.

    gcc/ChangeLog:
	    * gimplify.cc (omp_add_variable): Check if decl size is not poly_int_tree_p.
	    (gimplify_adjust_omp_clauses): Likewise.
	    * omp-low.cc (scan_sharing_clauses): Likewise.
	    (lower_omp_target): Likewise.

    Signed-off-by: Prathamesh Kulkarni <prathameshk@nvidia.com>
Comment 16 GCC Commits 2024-09-10 15:44:01 UTC
The master branch has been updated by Prathamesh Kulkarni <prathamesh3492@gcc.gnu.org>:

https://gcc.gnu.org/g:e783a4a683762487cb003ae48235f3d44875de1b

commit r15-3571-ge783a4a683762487cb003ae48235f3d44875de1b
Author: Prathamesh Kulkarni <prathameshk@nvidia.com>
Date:   Tue Sep 10 21:01:58 2024 +0530

    Pass host specific ABI opts from mkoffload.
    
    The patch adds an option -foffload-abi-host-opts, which
    is set by host in TARGET_OFFLOAD_OPTIONS, and mkoffload then passes its value
    to host_compiler.
    
    gcc/ChangeLog:
            PR target/96265
            * common.opt (foffload-abi-host-opts): New option.
            * config/aarch64/aarch64.cc (aarch64_offload_options): Pass
            -foffload-abi-host-opts.
            * config/i386/i386-options.cc (ix86_offload_options): Likewise.
            * config/rs6000/rs6000.cc (rs6000_offload_options): Likewise.
            * config/nvptx/mkoffload.cc (offload_abi_host_opts): Define.
            (compile_native): Append offload_abi_host_opts to argv_obstack.
            (main): Handle option -foffload-abi-host-opts.
            * config/gcn/mkoffload.cc (offload_abi_host_opts): Define.
            (compile_native): Append offload_abi_host_opts to argv_obstack.
            (main): Handle option -foffload-abi-host-opts.
            * lto-wrapper.cc (merge_and_complain): Handle
            -foffload-abi-host-opts.
            (append_compiler_options): Likewise.
            * opts.cc (common_handle_option): Likewise.
    
    Signed-off-by: Prathamesh Kulkarni <prathameshk@nvidia.com>
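
A minimal sketch of the forwarding step, assuming the option reaches the tool in the joined form `-foffload-abi-host-opts=VALUE` with a single token as its value. `build_host_argv` is a hypothetical helper, not a function in mkoffload, and the real tools build their argument vector with obstacks rather than `std::vector`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build the argument vector for the host compiler from the arguments a
// mkoffload-like tool received. Only the host ABI options are forwarded
// here; every other argument is ignored in this sketch.
std::vector<std::string> build_host_argv(
    const std::string &host_compiler,
    const std::vector<std::string> &tool_args) {
  assert(!host_compiler.empty());
  const std::string prefix = "-foffload-abi-host-opts=";
  std::vector<std::string> argv = {host_compiler};
  for (const std::string &arg : tool_args)
    if (arg.compare(0, prefix.size(), prefix) == 0)
      argv.push_back(arg.substr(prefix.size()));
  return argv;
}
```

The point of the design is that the host back end, not the offload tool, decides which ABI flags its own compiler understands. Given the illustrative arguments `-foffload-abi=lp64` and `-foffload-abi-host-opts=-mabi=lp64`, the sketch produces `gcc -mabi=lp64` instead of a hard-coded flag that the host compiler might reject, as happened with `-m64` in comment 9.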
Comment 17 Jan André Reuter 2024-09-11 11:03:08 UTC
Good news! I built GCC trunk on a GH200 system via EasyBuild and tried a few examples. A very basic example (more or less a "Hello World") worked just fine.

Afterwards, I tried a few of our offload examples used in our internal CI. Those worked fine as well, both for a single GPU (our login node) and on four GPUs (one of our compute nodes).

This is just a small sample size, but a huge step towards offloading support on aarch64.
Comment 18 GCC Commits 2024-10-08 07:54:08 UTC
The master branch has been updated by Prathamesh Kulkarni <prathamesh3492@gcc.gnu.org>:

https://gcc.gnu.org/g:ae88da5e070659d37b3c3daa4b880531769183bf

commit r15-4133-gae88da5e070659d37b3c3daa4b880531769183bf
Author: Prathamesh Kulkarni <prathameshk@nvidia.com>
Date:   Tue Oct 8 12:38:31 2024 +0530

    Recompute TYPE_MODE and DECL_MODE for vector_type for accelerator.
    
    gcc/ChangeLog:
            PR ipa/96265
            * lto-streamer-in.cc (lto_read_tree_1): Set TYPE_MODE and DECL_MODE
            for vector_type if offloading is enabled.
            (lto_input_mode_table): Remove handling of vector modes.
            * tree-streamer-out.cc (pack_ts_decl_common_value_fields): Stream out
            VOIDmode for vector_type if offloading is enabled.
            (pack_ts_decl_common_value_fields): Likewise.
    
    Signed-off-by: Prathamesh Kulkarni <prathameshk@nvidia.com>