Offloading Support in GCC

Note: This page needs further updates for OpenACC.


Host compiler — a regular compiler. Not to be confused with build/host/target configure terms.

Accel compiler — a compiler that reads intermediate representation from the special LTO sections, and generates code for the accelerator device. Also called the "offload compiler".

OpenMP — open multi-processing, supporting vector, thread and offloading directives/pragmas

OpenACC — open accelerators, supporting offloading directives/pragmas

Building host and accel compilers

The host and offload compilers need to be able to find each other. This is achieved by installing the offload compiler into special locations, and informing each about the presence of the other. All available offload compilers must first be configured with "--enable-as-accelerator-for=host-triplet", and installed into the same prefix as the host compiler. Then the host compiler is built with "--enable-offload-targets=target1,target2,..." which identifies the offload compilers that have already been built and installed.

The install locations for the offload compilers differ from those of a normal cross toolchain, by the following mapping:







It may be necessary to compile offload compilers with a sysroot, since otherwise install locations for libgomp could clash (maybe that library needs to move into lib/gcc/..?)

A target needs to provide a mkoffload tool if it wishes to be usable as an accelerator. It is installed as one of EXTRA_PROGRAMS, and the host lto-wrapper knows how to find it from the paths described above. mkoffload will invoke the offload compiler in LTO mode to produce an offload binary from the host object files, then post-process this to produce a new object file that can be linked in with the host executable. It can find the host compiler by examining the COLLECT_GCC environment variable, and it must take care to clear this and certain other environment variables when executing the offload compiler so as to not confuse it.
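The environment handling described above can be sketched as follows, assuming a POSIX system. The helper name prepare_offload_env is hypothetical, and only COLLECT_GCC is named in the text; the other variables in the list are assumptions chosen for illustration, not the exact set that any particular mkoffload clears.

```c
#define _POSIX_C_SOURCE 200809L   /* for unsetenv */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Variables to clear before launching the offload compiler.  Only
   COLLECT_GCC comes from the text above; the rest are assumptions.  */
static const char *const vars_to_clear[] = {
  "COLLECT_GCC",
  "COLLECT_GCC_OPTIONS",   /* assumption */
  "COMPILER_PATH",         /* assumption */
  "LIBRARY_PATH",          /* assumption */
};

/* Copy the host compiler path out of COLLECT_GCC into BUF, then clear the
   variables listed above.  Returns 0 on success, -1 if COLLECT_GCC is
   unset or does not fit into BUF.  */
int
prepare_offload_env (char *buf, size_t buflen)
{
  assert (buf != NULL && buflen > 0);

  const char *host_gcc = getenv ("COLLECT_GCC");
  if (host_gcc == NULL || strlen (host_gcc) >= buflen)
    return -1;
  strcpy (buf, host_gcc);

  for (size_t i = 0; i < sizeof vars_to_clear / sizeof vars_to_clear[0]; i++)
    unsetenv (vars_to_clear[i]);
  return 0;
}
```

The host compiler path is saved before the variables are cleared, because the post-processing step still needs the host compiler after the offload compiler has run.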

Compilation process

The host compiler performs the following actions:

  1. After #pragma omp target lowering and expansion, a new outlined function with the attribute "omp declare target" emerges — it will be later compiled both by host and accel compilers to produce two versions (or N+1 versions in case of N different accel targets).
    The decls for all global variables marked with "omp declare target" attribute, as well as decls for outlined target regions, are inserted into offload_vars and offload_funcs arrays.

  2. The expansion phase replaces pragmas with corresponding calls to the runtime library libgomp (GOMP_target, GOMP_target_data + GOMP_target_end_data, GOMP_target_update). These calls are preceded by initialization of special structures, containing arguments for outlined functions (.omp_data_arr.*, .omp_data_sizes.*, .omp_data_kinds.*).

  3. During the ipa_write_summaries pass the intermediate representation of outlined functions is streamed out into the .gnu.offload_lto_* sections of the "fat" object file. This object file also may contain .gnu.lto_* sections for the regular link-time optimizations.
    Also the decls from offload_funcs and offload_vars are streamed out into the .gnu.offload_lto_.offload_table section. Later an accel compiler will read this section to produce target's mapping table.

  4. In omp_finish_file function the addresses from offload_funcs and offload_vars are written into the .gnu.offload_funcs and .gnu.offload_vars sections correspondingly.
    Optionally, if -flto is present, the decls from offload_funcs and offload_vars are streamed out into the .gnu.lto_.offload_table section. Later the host compiler in LTO mode will use them to produce the final host's table with addresses.

  5. When all source files are compiled, the pre-linker driver collect2 is invoked. There are four scenarios, depending on whether the -flto option is present and whether a linker plugin is supported. These scenarios are described in detail below. If a linker plugin is available, collect2 runs the linker, which loads the LTO plugin, which in turn runs lto-wrapper. If no linker plugin is available, collect2 runs lto-wrapper directly.

  6. lto-wrapper runs mkoffload for each accel target, specified during the configuration.

  7. mkoffload runs accel compiler, which reads IR from the .gnu.offload_lto_* sections and compiles it for the accel target. Then mkoffload packs this target code (image) into the special section of a new host's object file. The object file produced with mkoffload should contain a constructor that calls GOMP_offload_register to identify itself at run-time. Arguments to that function are a symbol called __OFFLOAD_TABLE__ (provided by libgcc and unique per shared object), a target identifier, and some other data needed for a particular target (a pointer to the image, a table with information about mappings between host and offload functions and variables).

  8. The linker adds the new object files produced by mkoffload to the list of the host's input object files.
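As a source-level illustration of what enters this pipeline, here is a minimal sketch of a target region in C. The function name vec_add is hypothetical. The body of the target region is the kind of code that step 1 outlines into a separate function, and the map clauses describe the data passed through the argument structures of step 2. When the file is compiled without -fopenmp, the pragma is ignored and the loop runs on the host.

```c
#include <assert.h>

/* Add two vectors.  With -fopenmp and a configured offload target, the
   body of the target region is outlined into its own function and
   compiled for both the host and the accelerator.  */
void
vec_add (const float *a, const float *b, float *c, int n)
{
  assert (n >= 0);

  /* a and b are copied to the device, c is copied back afterwards.  */
#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
```

The result is the same whether the region runs on an accelerator or in host fallback mode, which makes a function like this a convenient first test of an offloading toolchain.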

Compilation without -flto using linker plugin

Compilation without -flto and without linker plugin

Compilation with -flto using linker plugin

Compilation with -flto without linker plugin

Compilation options

There are two offload-related options implemented in GCC.

  1. -foffload=<targets>=<options>
    By default, GCC will build offload images for all offload targets specified in configure with non-target-specific options passed to host compiler. This option is used to control offload targets and options for them. It can be used in a few ways:

    • -foffload=disable
      Tells GCC to disable offload support. Target regions will be run in host fallback mode.

    • -foffload=<targets>
      Tells GCC to build offload images for <targets>. They will be built with non-target-specific options passed to host compiler.

    • -foffload=<options>
      Tells GCC to build offload images for all targets specified in configure. They will be built with non-target-specific options passed to host compiler plus <options>.

    • -foffload=<targets>=<options>
      Tells GCC to build offload images for <targets>. They will be built with non-target-specific options passed to host compiler plus <options>.

    Options specified by -foffload are appended to the end of the option set, so they take precedence when options conflict.

  2. -foffload-abi=[lp64|ilp32]
    This option tells mkoffload (and the offload compiler) which ABI is used in the streamed GIMPLE. It is needed because the host and offload compilers must use the same ABI. The host compiler generates this option automatically; users should not specify it.
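The four forms of -foffload listed above can be told apart with a few string checks. The sketch below is a model, not the driver's actual code: the struct, its field names and the function parse_foffload are hypothetical, and it assumes that an argument beginning with '-' is an option list rather than a target name.

```c
#include <assert.h>
#include <string.h>

/* Result of splitting the text after "-foffload=".  */
struct foffload_arg
{
  int disabled;        /* 1 for -foffload=disable */
  char targets[128];   /* comma-separated targets, "" means all configured */
  char options[128];   /* extra options, "" means none */
};

/* Split ARG into targets and options.  Returns 0 on success, -1 if a
   part does not fit into the buffers.  */
int
parse_foffload (const char *arg, struct foffload_arg *out)
{
  assert (arg != NULL && out != NULL);
  memset (out, 0, sizeof *out);

  if (strcmp (arg, "disable") == 0)
    {
      out->disabled = 1;
      return 0;
    }

  const char *opts = NULL;
  size_t targets_len = 0;

  if (arg[0] == '-')
    opts = arg;                              /* <options> only */
  else
    {
      const char *eq = strchr (arg, '=');
      if (eq != NULL)
        {
          targets_len = (size_t) (eq - arg); /* <targets>=<options> */
          opts = eq + 1;
        }
      else
        targets_len = strlen (arg);          /* <targets> only */
    }

  if (targets_len >= sizeof out->targets
      || (opts != NULL && strlen (opts) >= sizeof out->options))
    return -1;

  memcpy (out->targets, arg, targets_len);
  if (opts != NULL)
    strcpy (out->options, opts);
  return 0;
}
```

For example, the argument "nvptx-none=-O3" yields the target list "nvptx-none" and the option string "-O3", while "-O3" alone leaves the target list empty, meaning all configured targets.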


Runtime support in libgomp

libgomp plugins

libgomp is designed to be independent of the accelerator type it works with. To make this possible, plugins are used: libgomp itself contains only a generic interface and callbacks into the plugin for invoking target-dependent functionality. Plugins are shared objects that implement a set of routines, such as GOMP_OFFLOAD_init_device and GOMP_OFFLOAD_run.


libgomp gets the list of offload targets from configure (specified by --enable-offload-targets=target1,target2,...). During offload initialization, it tries to load plugins named libgomp-plugin-<target>.so.1 from the standard dynamic linker paths. The plugins can use third-party target-dependent libraries to perform low-level interaction with the accel devices. For example, the plugin for Intel MIC devices uses liboffloadmic to implement the libgomp callbacks, and the plugin for Nvidia PTX devices uses the CUDA driver library.
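The plugin naming scheme can be sketched as a small helper that walks a comma-separated target list and forms one file name per entry. The function name plugin_name_at is hypothetical, and the short target names in the usage note are assumptions for illustration. A real runtime would pass the resulting name to the dynamic loader; that step is left out here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the plugin file name for the INDEX-th entry of the
   comma-separated TARGETS list into BUF.  Returns 0 on success, -1 if
   there is no such entry or the name does not fit.  */
int
plugin_name_at (const char *targets, int index, char *buf, size_t buflen)
{
  assert (targets != NULL && buf != NULL);

  const char *start = targets;
  for (;;)
    {
      size_t len = strcspn (start, ",");   /* length of current entry */
      if (index == 0)
        {
          if (len == 0)
            return -1;
          int n = snprintf (buf, buflen, "libgomp-plugin-%.*s.so.1",
                            (int) len, start);
          return (n >= 0 && (size_t) n < buflen) ? 0 : -1;
        }
      if (start[len] == '\0')
        return -1;                         /* ran past the last entry */
      start += len + 1;
      index--;
    }
}
```

With the list "intelmic,nvptx", index 1 produces the name libgomp-plugin-nvptx.so.1.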

Address translation

When #pragma omp target is expanded, the host_addr of the outlined function is passed to GOMP_target. If the target device is not available, libgomp simply performs host fallback using host_addr. To run the function on the target, however, it needs to translate host_addr into the corresponding target_addr. The idea is to have [ host_addr, size ] arrays in the .gnu.offload_{funcs,vars} sections that are ordered exactly the same as the corresponding [ target_addr ] arrays inside the target images (size is needed only for vars).

To keep this host_addr -> target_addr mapping at runtime, each device descriptor gomp_device_descr contains a splay tree. When gomp_init_device performs initialization, it walks the whole array and in each iteration picks n-th host pair host_start/host_end plus corresponding n-th target pair tgt_start/tgt_end, and inserts it into the splay tree.
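The pairing and lookup can be modelled in a few lines of C. This sketch deliberately swaps the splay tree for a sorted array with binary search, which is simpler to show and gives the same translation result for non-overlapping ranges. All type and function names are hypothetical, and functions are assumed to be entered with a size of 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct host_entry { uintptr_t addr; size_t size; };
struct tgt_entry { uintptr_t start, end; };

/* One mapping: a host address range and the matching target range.  */
struct addr_map
{
  uintptr_t host_start, host_end;
  uintptr_t tgt_start, tgt_end;
};

/* Order entries by host start address.  */
static int
cmp_map (const void *a, const void *b)
{
  const struct addr_map *x = a, *y = b;
  return (x->host_start > y->host_start) - (x->host_start < y->host_start);
}

/* Pair the n-th host entry with the n-th target entry, relying on both
   tables being in the same order, then sort by host address.  */
void
build_map (const struct host_entry *host, const struct tgt_entry *tgt,
           size_t n, struct addr_map *out)
{
  assert (out != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    {
      out[i].host_start = host[i].addr;
      out[i].host_end = host[i].addr + host[i].size;
      out[i].tgt_start = tgt[i].start;
      out[i].tgt_end = tgt[i].end;
    }
  qsort (out, n, sizeof *out, cmp_map);
}

/* Translate HOST_ADDR to a target address; returns 0 if not mapped.  */
uintptr_t
translate (const struct addr_map *map, size_t n, uintptr_t host_addr)
{
  size_t lo = 0, hi = n;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (host_addr < map[mid].host_start)
        hi = mid;
      else if (host_addr >= map[mid].host_end)
        lo = mid + 1;
      else
        return map[mid].tgt_start + (host_addr - map[mid].host_start);
    }
  return 0;
}
```

The important property is that no names are exchanged between host and device: the correspondence rests entirely on both sides emitting their tables in the same order.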

Execution process

When an executable or dynamic shared object is loaded, it calls GOMP_offload_register N times, where N is the number of accel images embedded into this executable or DSO. This function stores the pointers to the images, and other data needed by the accel plugin, into offload_images.
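The load-time registration can be modelled with a constructor function, here using the GCC constructor attribute. This is only a sketch: model_offload_register, the struct layout, the fixed-size table and the target id value are hypothetical stand-ins, not the libgomp definitions.

```c
#include <assert.h>
#include <stddef.h>

/* One record per registered accel image.  */
struct offload_image
{
  const void *host_table;    /* per-object table of host addresses */
  int target_type;           /* which accel target the image is for */
  const void *target_data;   /* image pointer and mapping information */
};

enum { MAX_IMAGES = 16 };
static struct offload_image offload_images[MAX_IMAGES];
static int num_offload_images;

/* Model of the registration call: remember the image, but do not touch
   any device yet.  Returns 0 on success, -1 if the table is full.  */
int
model_offload_register (const void *host_table, int target_type,
                        const void *target_data)
{
  assert (target_data != NULL);
  if (num_offload_images == MAX_IMAGES)
    return -1;
  struct offload_image *img = &offload_images[num_offload_images++];
  img->host_table = host_table;
  img->target_type = target_type;
  img->target_data = target_data;
  return 0;
}

/* Stand-ins for what an object produced by mkoffload carries.  */
static const char fake_image[] = "accel code";
static const int fake_host_table[1] = { 0 };

/* Runs when the executable or shared object is loaded, before main.  */
static void __attribute__ ((constructor))
register_fake_image (void)
{
  model_offload_register (fake_host_table, 1 /* hypothetical target id */,
                          fake_image);
}
```

Registration only records pointers, so loading an object with embedded images stays cheap even if no target region is ever executed.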

The first call to GOMP_target, GOMP_target_data or GOMP_target_update performs the corresponding device initialization: it calls GOMP_OFFLOAD_init_device from the plugin, and then stores the address mapping table in the splay tree.

In case of Intel MIC, GOMP_OFFLOAD_init_device creates a new process on the device, and then offloads the accel images with the type == OFFLOAD_TARGET_TYPE_INTEL_MIC. All accel images, even inside the executable, represent dynamic shared objects, which are loaded into the newly created process.

GOMP_target looks up the host_addr passed to it in the splay tree and passes the corresponding target_addr to the plugin's GOMP_OFFLOAD_run function.
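The control flow described above (initialize on first use, translate, run, otherwise fall back to the host) can be sketched as a small model. Everything here is hypothetical: the device descriptor, the linear search standing in for the splay tree lookup, and the function pointers standing in for target addresses. As a simplification, this model also falls back to the host when a function is not found in the table.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*region_fn) (void *data);

/* Model device: presence, initialization state, and a tiny table that
   maps host functions to their target versions.  */
struct model_device
{
  int available;
  int initialized;
  size_t num_funcs;
  region_fn host_fn[4];
  region_fn tgt_fn[4];
};

/* Returns 1 if the region ran through the device path, 0 if it ran in
   host fallback mode.  */
int
model_target (struct model_device *dev, region_fn host_fn, void *data)
{
  assert (host_fn != NULL);

  if (dev != NULL && dev->available)
    {
      if (!dev->initialized)
        dev->initialized = 1;   /* real code would call into the plugin */
      for (size_t i = 0; i < dev->num_funcs; i++)
        if (dev->host_fn[i] == host_fn)
          {
            dev->tgt_fn[i] (data);   /* stands in for the plugin's run */
            return 1;
          }
    }
  host_fn (data);   /* host fallback */
  return 0;
}

/* Two versions of the same outlined region, for demonstration.  */
void region_host (void *data) { *(int *) data = 1; }
void region_tgt (void *data) { *(int *) data = 2; }
```

Calling model_target with a device whose available flag is 0, or with a null device, runs region_host directly, which mirrors the host fallback path.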

How to try offloading-enabled GCC

Patches enabling OpenMP 4.0 offloading to Intel MIC have been merged to trunk. They include general infrastructure changes, the mkoffload tool, the libgomp plugin, the Intel MIC runtime offload library liboffloadmic, and an emulator. The emulator sits beneath liboffloadmic and reproduces the behavior of the MIC hardware and software stack, which makes it possible to run offloaded code in a separate address space on the host machine. It consists of 4 shared libraries that replace the COI and MYO libraries from the Intel Manycore Platform Software Stack (MPSS). For real offloading, the user is expected to add the path to the MPSS libraries to LD_LIBRARY_PATH, so that they override the emulator libraries at run time.

1. Building accel compiler:

For Intel MIC:

../configure --build=x86_64-intelmicemul-linux-gnu --host=x86_64-intelmicemul-linux-gnu --target=x86_64-intelmicemul-linux-gnu --enable-as-accelerator-for=x86_64-pc-linux-gnu
make install DESTDIR=/install

For Nvidia PTX:

../configure --target=nvptx-none --enable-as-accelerator-for=x86_64-pc-linux-gnu --with-build-time-tools=[install-nvptx-tools]/nvptx-none/bin --disable-sjlj-exceptions --enable-newlib-io-long-long
make install DESTDIR=/install

2. Building host compiler:

../configure --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --enable-offload-targets=x86_64-intelmicemul-linux-gnu=/install/prefix,nvptx-none=/install/prefix
make install DESTDIR=/install

If you install both compilers without DESTDIR, then there is no need to specify the paths to accel install trees in the --enable-offload-targets option.

3. Building an application:

/install/bin/gcc -fopenmp test.c
/install/bin/gcc -fopenacc test.c

4. Running an application using the emulator:

export LD_LIBRARY_PATH="/install/lib64/"

For Intel MIC offloading emulation, the debugger can be attached to the target process by:

OFFLOAD_EMUL_RUN=gdb ./a.out

..., and multiple devices can be emulated by:


Running 'make check'

make check-target-libgomp

Known issues

nvptx Offloading

For nvptx offloading, the following issues still need to be resolved: