Differences between revisions 8 and 57 (spanning 49 versions)
Revision 8 as of 2014-08-04 19:16:32
Size: 12472
Editor: IlyaVerbin
Comment: Describe address translation
Revision 57 as of 2019-02-08 18:42:03
Size: 34547
Comment: Updates to "See Also" section
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
= OpenMP 4.0 and OpenACC offloading support in GCC = = Offloading Support in GCC =

GCC 5 and later support two offloading configurations:

 * [[OpenMP]] to Intel MIC targets (upcoming Intel Xeon Phi products codenamed KNL) as well as MIC emulation on host.
 * [[OpenACC]] to [[nvptx|Nvidia PTX]] targets.

GCC 7 and later supports further:

 * [[OpenMP]] to [[nvptx|Nvidia PTX]] targets.
 * [[OpenMP]] to [[hsa|AMD HSAIL]] targets.

Planned for GCC 10:

 * [[OpenMP]] and [[OpenACC]] to AMD GCN targets (non-offloading back-end introduced in GCC 9).

/!\ This needs to be updated some more for OpenACC.
Line 9: Line 25:
Accel compiler — a compiler that reads intermediate representation from the special LTO sections, and generates code for the accelerator device. Also called the "offload compiler". Accel compiler — a compiler that reads intermediate representation from the special LTO sections, and generates code for the accelerator device. Also called the "offload compiler" or "target compiler".

[[openmp|OpenMP]] — open multi-processing, supporting vector, thread and offloading directives/pragmas.

[[OpenACC]] — open accelerators, supporting offloading directives/pragmas.
Line 26: Line 46:
''Currently offloading works only with `-flto`.''
Line 31: Line 49:
 The decls for all functions and global variables marked with `"omp declare target"` attribute are inserted into `offload_funcs` and `offload_vars` arrays.
 2. The expansion phase replaces pragmas with corresponding calls to the runtime library `libgomp` (`GOMP_target`, `GOMP_target_data` + `GOMP_target_end_data`, `GOMP_target_update`). These calls are preceded by initialization of special structures, containing arguments for outlined functions (`.omp_data_arr.*`, `.omp_data_sizes.*`, `.omp_data_kinds.*`).
 3. During the `ipa_write_summaries` pass the intermediate representation of outlined functions is streamed out into the `.gnu.target_lto_*` sections of the "fat" object file. This object file also may contain `.gnu.lto_*` sections for the regular link-time optimizations.<<BR>>
 Also the decls from `offload_funcs` and `offload_vars` are streamed out into the `.gnu.target_lto_.offload_table` section. Later an accel compiler will read this section to produce target's mapping table.<<BR>>
 Optionally, if `-flto` is present, the same decls are streamed out into the `.gnu.lto_.offload_table` section. Later the host compiler will use them to produce the final host's table with addresses.
 4. In `omp_finish_file` the addresses from `offload_funcs` and `offload_vars` are written into the `__gnu_offload_funcs` and `__gnu_offload_vars` sections correspondingly.
 5. When all source files are compiled, pre-
linker driver `collect2` is invoked. If linker plugin is available, `collect2` runs the linker, which loads `liblto_plugin.so`, which runs `lto-wrapper`. In case if linker plugin is not available, `collect2` runs `lto-wrapper` directly.
 The decls for all global variables marked with `"omp declare target"` attribute, as well as decls for outlined target regions, are inserted into `offload_vars` and `offload_funcs` arrays.
 2. The expansion phase replaces pragmas with corresponding calls to the runtime library `libgomp` (`GOMP_target{,_ext}`, `GOMP_target_data{,_ext}` + `GOMP_target_end_data`, `GOMP_target_update{,_ext}`, `GOMP_target_enter_exit_data`). These calls are preceded by initialization of special structures, containing arguments for outlined functions (`.omp_data_arr.*`, `.omp_data_sizes.*`, `.omp_data_kinds.*`).
 3. During the `ipa_write_summaries` pass the intermediate representation of outlined functions is streamed out into the `.gnu.offload_lto_*` sections of the "fat" object file. This object file also may contain `.gnu.lto_*` sections for the regular link-time optimizations.<<BR>>
 Also the decls from `offload_funcs` and `offload_vars` are streamed out into the `.gnu.offload_lto_.offload_table` section. Later an accel compiler will read this section to produce target's mapping table.
 4. In `omp_finish_file` function the addresses from `offload_funcs` and `offload_vars` are written into the `.gnu.offload_funcs` and `.gnu.offload_vars` sections correspondingly
.<<BR>>
 Optionally, if `-flto` is present, the decls from `offload_funcs` and `offload_vars` are streamed out into the `.gnu.lto_.offload_table` section. Later the host compiler in LTO mode will use them to produce the final host's table with addresses.
 5. When all source files are compiled, pre-linker driver `collect2` is invoked. It runs the linker, which loads linker plugin `liblto_plugin.so`, which runs `lto-wrapper`. Without offloading the `lto-wrapper` is called for link-time recompilation if at least one object file contains `.gnu.lto_*` sections. If some files contain offloading, then linker plugin will execute `lto-wrapper` even if there are no `.gnu.lto_*` sections. Offloading without linker plugin is not supported.
Line 39: Line 57:
 7. `mkoffload` runs accel compiler, which reads IR from the `.gnu.target_lto_*` sections and compiles it for the accel target. Then `mkoffload` packs this target code (image) into the special section of a new host's object file. The object file produced with `mkoffload` should contain a constructor that calls `GOMP_offload_register` to identify itself at run-time. Arguments to that function are a symbol called `__OPENMP_TARGET__` (provided by libgcc and unique per shared object), a target identifier, and some other data needed for a particular target (a pointer to the image, a table with information about mappings between host and offload functions and variables).  7. `mkoffload` runs accel compiler, which reads IR from the `.gnu.offload_lto_*` sections and compiles it for the accel target. Then `mkoffload` packs this target code (image) into the special section of a new host's object file. The object file produced with `mkoffload` should contain a constructor that calls `GOMP_offload_register{,_ver}` to identify itself at run-time. Arguments to that function are a symbol called `__OFFLOAD_TABLE__` (provided by libgcc and unique per shared object), a target identifier, and some other data needed for a particular target (a pointer to the image, a table with information about mappings between host and offload functions and variables).
Line 42: Line 60:
=== Compilation with offloading using linker plugin: === === Address mapping tables ===

This example shows how the tables with addresses are created. It consists of 3 source files: `apple.c`, `banana.c` and `citron.c`. Each of them contains 2 outlined target regions `*._omp_fn.{0,1}`. Global variables are handled in a similar manner. There are 3 different ways of compilation, which are described in detail below.

||Section name||Content||
||<style="background-color: #1569C7; color: #FFFFFF;"> `.gnu.lto_.offload_table`||Contains IR of the decls for the host compiler||
||<style="background-color: #437C17; color: #FFFFFF;"> `.gnu.offload_lto_.offload_table`||Contains IR of the decls for the accel compiler||
||<#C2DFFF> `.gnu.offload_funcs`||Contains addresses in the host binary||
||<#C3FDB8> `<Target section>`||Contains addresses in the target image<<BR>>* For Intel MIC targets the addresses are stored in ELF binary similar to host addresses<<BR>>* For Nvidia PTX targets the addresses are stored in PTX assembly||

==== All files without -flto ====

First,
{{{
gcc -c -fopenmp apple.c banana.c citron.c
}}}
produces 3 object files with the following sections:

||||`apple.o`||
||Section name||Content||
||<style="background-color: #437C17; color: #FFFFFF;"> `.gnu.offload_lto_.offload_table`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`||
||<#C2DFFF> `.gnu.offload_funcs`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`||

||||`banana.o`||
||Section name||Content||
||<style="background-color: #437C17; color: #FFFFFF;"> `.gnu.offload_lto_.offload_table`||`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`||
||<#C2DFFF> `.gnu.offload_funcs`||`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`||

||||`citron.o`||
||Section name||Content||
||<style="background-color: #437C17; color: #FFFFFF;"> `.gnu.offload_lto_.offload_table`||`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||
||<#C2DFFF> `.gnu.offload_funcs`||`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||

Next,
{{{
gcc -fopenmp apple.o banana.o citron.o
}}}
runs an accel compiler, which reads IR from `.gnu.offload_lto_.offload_table` and produces the final target table:

||||Target image||
||Section name||Content||
||<#C3FDB8> `<Target section>`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`<<BR>>`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`<<BR>>`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||

Finally, the host linker joins these 3 objects and therefore `.gnu.offload_funcs` sections into the host binary:

||||Host binary||
||Section name||Content||
||<#C2DFFF> `.gnu.offload_funcs`||`<__offload_func_table>`<<BR>>`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`<<BR>>`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`<<BR>>`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`<<BR>>`<__offload_funcs_end>`||

`__offload_func_table` and `__offload_funcs_end` are special symbols, defined in `crtoffloadbegin.o` and `crtoffloadend.o` respectively.

==== All files with -flto ====

First,
{{{
gcc -c -fopenmp -flto apple.c banana.c citron.c
}}}
produces 3 object files with the following sections:

||||`apple.o`||
||Section name||Content||
||<style="background-color: #1569C7; color: #FFFFFF;"> `.gnu.lto_.offload_table`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`||
||<style="background-color: #437C17; color: #FFFFFF;"> `.gnu.offload_lto_.offload_table`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`||

||||`banana.o`||
||Section name||Content||
||<style="background-color: #1569C7; color: #FFFFFF;"> `.gnu.lto_.offload_table`||`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`||
||<style="background-color: #437C17; color: #FFFFFF;"> `.gnu.offload_lto_.offload_table`||`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`||

||||`citron.o`||
||Section name||Content||
||<style="background-color: #1569C7; color: #FFFFFF;"> `.gnu.lto_.offload_table`||`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||
||<style="background-color: #437C17; color: #FFFFFF;"> `.gnu.offload_lto_.offload_table`||`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||

Next,
{{{
gcc -fopenmp apple.o banana.o citron.o
}}}
runs an accel compiler, which produces the final target table, like in the previous case:

||||Target image||
||Section name||Content||
||<#C3FDB8> `<Target section>`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`<<BR>>`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`<<BR>>`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||

Next, host compiler is executed in LTO WPA mode, i.e. it reads IR from `.gnu.lto_.offload_table` from `apple.o`, `banana.o`, `citron.o`, and writes the joint table into `.gnu.lto_.offload_table` in the temporary object `ccXXXXXX.ltrans0.o`:

||||`ccXXXXXX.ltrans0.o`||
||Section name||Content||
||<style="background-color: #1569C7; color: #FFFFFF;"> `.gnu.lto_.offload_table`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`<<BR>>`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`<<BR>>`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||

In case of multiple partitions the joint table is written into the first partition only.

Next, host compiler is executed in LTO LTRANS mode. It reads the temporary table from `.gnu.lto_.offload_table` and writes the final table into the final object `ccXXXXXX.ltrans0.ltrans.o`:

||||`ccXXXXXX.ltrans0.ltrans.o`||
||Section name||Content||
||<#C2DFFF> `.gnu.offload_funcs`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`<<BR>>`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`<<BR>>`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||

Finally, the host linker joins `crtoffloadbegin.o`, `ccXXXXXX.ltrans0.ltrans.o` and `crtoffloadend.o`:

||||Host binary||
||Section name||Content||
||<#C2DFFF> `.gnu.offload_funcs`||`<__offload_func_table>`<<BR>>`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`<<BR>>`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`<<BR>>`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`<<BR>>`<__offload_funcs_end>`||

==== Some files with and some without -flto ====

First,
{{{
gcc -c -fopenmp banana.c
gcc -c -fopenmp -flto apple.c citron.c
}}}
produces 3 object files with the following sections:

||||`apple.o`||
||Section name||Content||
||<style="background-color: #1569C7; color: #FFFFFF;"> `.gnu.lto_.offload_table`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`||
||<style="background-color: #437C17; color: #FFFFFF;"> `.gnu.offload_lto_.offload_table`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`||

||||`banana.o`||
||Section name||Content||
||<style="background-color: #437C17; color: #FFFFFF;"> `.gnu.offload_lto_.offload_table`||`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`||
||<#C2DFFF> `.gnu.offload_funcs`||`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`||

||||`citron.o`||
||Section name||Content||
||<style="background-color: #1569C7; color: #FFFFFF;"> `.gnu.lto_.offload_table`||`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||
||<style="background-color: #437C17; color: #FFFFFF;"> `.gnu.offload_lto_.offload_table`||`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||

Next, while running
{{{
gcc -fopenmp apple.o banana.o citron.o
}}}
the linker plugin creates a list of objects with offload sections and passes it to `lto-wrapper`. The order must be exactly the same as the final order after recompilation and linking. In this example it is: `apple.o`, `citron.o` and `banana.o`. Therefore, the accel compiler will produce the following target table:

||||Target image||
||Section name||Content||
||<#C3FDB8> `<Target section>`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`<<BR>>`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`<<BR>>`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`||

Next, host compiler will recompile LTO objects (`apple.o` and `citron.o`) into `ccXXXXXX.ltrans0.ltrans.o` with the following table:

||||`ccXXXXXX.ltrans0.ltrans.o`||
||Section name||Content||
||<#C2DFFF> `.gnu.offload_funcs`||`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`<<BR>>`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`||

Finally, the host linker joins all objects in this order: `crtoffloadbegin.o`, `ccXXXXXX.ltrans0.ltrans.o`, `banana.o`, `crtoffloadend.o`; with the following host table:

||||Host binary||
||Section name||Content||
||<#C2DFFF> `.gnu.offload_funcs`||`<__offload_func_table>`<<BR>>`apple._omp_fn.0`<<BR>>`apple._omp_fn.1`<<BR>>`citron._omp_fn.0`<<BR>>`citron._omp_fn.1`<<BR>>`banana._omp_fn.0`<<BR>>`banana._omp_fn.1`<<BR>>`<__offload_funcs_end>`||

=== Compilation without -flto ===

Offloading-related steps are marked '''in bold'''.

 * `gcc`
  * `cc1` # Compile first source file into plain asm '''+ intermediate representation for accel'''
  * `as` # Assemble this asm + IR into object file
  * `...` # Compile and assemble all remaining source files
  * `collect2` # Pre-linker driver
   * `collect-ld` # Simple wrapper over `ld`
    * `ld` with `liblto_plugin.so` # Perform linking
     * '''`lto-wrapper` # Is called from `liblto_plugin.so`'''
      * '''`intelmic/mkoffload` # Prepare offload image for Intel MIC devices'''
       * '''`accel_gcc` # Read target IR from all objects and produce target DSO'''
       * '''`objcopy` # Save target DSO in a special section in a new host's object file'''
      * '''`.../mkoffload` # Prepare images for other targets'''
       * '''`...`'''

=== Compilation with -flto ===
Line 64: Line 250:
=== Compilation with offloading without linker plugin: ===

 * `gcc`
  * `cc1` # Compile first source file into plain asm + intermediate representation '''+ IR for accel'''
  * `as` # Assemble this asm + IR into temporary object file
  * `...` # Compile and assemble all remaining source files
  * `collect2` # Pre-linker driver
   * `lto-wrapper` # Run `lto-wrapper` directly from `collect2`
    * `gcc`
     * `lto1` # Perform whole program analysis and split into new partitions
    * `gcc`
     * `lto1` # Perform local transformations in the first object file
     * `as` # Assemble into final object code
    * `...` # Perform local transformations in each partitioned object file
    * '''`intelmic/mkoffload` # Prepare offload image for Intel MIC devices'''
     * '''`accel_gcc` # Read target IR from all partitions and produce target DSO'''
     * '''`objcopy` # Save target DSO in a special section in a new host's object file'''
    * '''`.../mkoffload` # Prepare images for other targets'''
     * '''`...`'''
   * `collect-ld` # Simple wrapper over `ld`
    * `ld` # Perform final linking
== Compilation options ==

The main option to control offloading is:

 1. `-foffload=<targets>=<options>`<<BR>>
 By default, GCC will build offload images for all offload targets specified in configure with non-target-specific options passed to host compiler. This option is used to control offload targets and options for them. It can be used in a few ways:
  * `-foffload=disable`<<BR>>
  Tells GCC to disable offload support. Target regions will be run in host fallback mode.
  * `-foffload=<targets>`<<BR>>
  Tells GCC to build offload images for `<targets>`. They will be built with non-target-specific options passed to host compiler.
  * `-foffload=<options>`<<BR>>
  Tells GCC to build offload images for all targets specified in configure. They will be built with non-target-specific options passed to host compiler plus `<options>`.
  * `-foffload=<targets>=<options>`<<BR>>
  Tells GCC to build offload images for `<targets>`. They will be built with non-target-specific options passed to host compiler plus `<options>`.
 `<targets>` are separated by commas.
 Several `<options>` can be specified by separating them by spaces.
 Options specified by `-foffload` are appended to the end of option set, so in case of option conflicts they have more priority.
 The `-foffload` flag can be specified several times, and you have to do that to specify different `<options>` for different `<targets>`.

Also there are several internal options, which should not be specified by user:

 1. `-foffload-abi=[lp64|ilp32]`<<BR>>
 The option is generated by the host compiler. It is supposed to tell mkoffload (and offload compiler) which ABI is used in streamed GIMPLE, because host and offload compilers must have the same ABI.

 1. `-foffload-objects=/tmp/ccxxxha`<<BR>>
 This option is generated by linker plugin. It is used to pass the list of object files with offloading to `lto-wrapper`.

=== Examples ===
 * {{{
gcc -fopenmp -c -O2 test1.c
gcc -fopenmp -c -O1 -msse -foffload=-mavx test2.c
gcc -fopenmp -foffload="-O3 -v" test1.o test2.o
}}}
 In this example the offload images will be built with the following options: `"-O2 -mavx -O3 -v"` for targets specified in configure.

 * {{{
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c
}}}
 In this example 2 offload images will be built: for MIC with `"-O2 -mavx2 -O3"` and for PTX with `"-O2 -O3"`.
Line 90: Line 295:
`libgomp` is designed to be independent of accelerator type it work with. In order to make it possible, plugins are used, while the `libgomp` itself contains only a generic interface and callbacks to the plugin for invoking target-dependent functionality. Plugins are shared object, implementing a set of routines:
{{{
get_type
get_num_devices
offload_register
device_init
device_get_table
device_alloc
device_free
device_dev2host
device_host2dev
device_run
}}}
When required, `libgomp` performs a search in the plugins directory for the plugins named `libgomp-plugin-*.so.1` and for each found file checks whether it is a proper plugin. The plugins use target-dependent libraries and perform low-level interaction with the accel devices. E.g., the plugin for Intel MIC devices uses `liboffloadmic` for implementing `libgomp` callbacks.
`libgomp` is designed to be independent of accelerator type it work with. In order to make it possible, plugins are used, while the `libgomp` itself contains only a generic interface and callbacks to the plugin for invoking target-dependent functionality. Plugins are shared object, implementing a set of routines listed below.

Common for OpenMP and OpenACC:
{{{
GOMP_OFFLOAD_get_name
GOMP_OFFLOAD_get_caps
GOMP_OFFLOAD_get_type
GOMP_OFFLOAD_get_num_devices
GOMP_OFFLOAD_init_device
GOMP_OFFLOAD_fini_device
GOMP_OFFLOAD_version
GOMP_OFFLOAD_load_image
GOMP_OFFLOAD_unload_image
GOMP_OFFLOAD_alloc
GOMP_OFFLOAD_free
GOMP_OFFLOAD_dev2host
GOMP_OFFLOAD_host2dev
}}}

OpenMP specific:
{{{
GOMP_OFFLOAD_run
GOMP_OFFLOAD_async_run
GOMP_OFFLOAD_dev2dev
}}}

OpenACC specific:
{{{
GOMP_OFFLOAD_openacc_parallel
GOMP_OFFLOAD_openacc_register_async_cleanup
GOMP_OFFLOAD_openacc_async_test
GOMP_OFFLOAD_openacc_async_test_all
GOMP_OFFLOAD_openacc_async_wait
GOMP_OFFLOAD_openacc_async_wait_async
GOMP_OFFLOAD_openacc_async_wait_all
GOMP_OFFLOAD_openacc_async_wait_all_async
GOMP_OFFLOAD_openacc_async_set_async
GOMP_OFFLOAD_openacc_create_thread_data
GOMP_OFFLOAD_openacc_destroy_thread_data
GOMP_OFFLOAD_openacc_get_current_cuda_device
GOMP_OFFLOAD_openacc_get_current_cuda_context
GOMP_OFFLOAD_openacc_get_cuda_stream
GOMP_OFFLOAD_openacc_set_cuda_stream
}}}
`libgomp` gets the list of offload targets from the configure (specified by `--enable-offload-targets=target1,target2,...`). During the offload initialization, it tries to load plugins named `libgomp-plugin-<target>.so.1` from standard dynamic linker paths. The plugins can use third-party target-dependent libraries to perform low-level interaction with the accel devices. E.g., the plugin for Intel MIC devices uses `liboffloadmic.so` for implementing `libgomp` callbacks, and the plugin for Nvidia PTX devices uses `libcuda.so`.
Line 107: Line 343:
When `#pragma omp target` is expanded, the `host_addr` of outlined function is passed to `GOMP_target`. If target device is not available, `libgomp` just performs host fallback using `host_addr`. But to run the function on target, it needs to translate `host_addr` into the corresponding `target_addr`. The idea is to have `[ host_addr, size ]` arrays in `__gnu_offload_funcs`/`vars` sections which are ordered exactly the same as corresponding `[ target_addr ]` arrays inside the target images (`size` is needed only for vars). When `#pragma omp target` is expanded, the `host_addr` of outlined function is passed to `GOMP_target{,_ext}`. If target device is not available, `libgomp` just performs host fallback using `host_addr`. But to run the function on target, it needs to translate `host_addr` into the corresponding `target_addr`. The idea is to have `[ host_addr, size ]` arrays in `.gnu.offload_{funcs,vars}` sections which are ordered exactly the same as corresponding `[ target_addr ]` arrays inside the target images (`size` is needed only for vars).
Line 113: Line 349:
When an executable or dynamic shared object is loaded, it calls `GOMP_offload_register` N times, where N is number of accel images, embedded into this exec/dso. This function stores the pointers to the images and other data needed by accel plugin into `offload_images`.

The first call to `GOMP_target`, `GOMP_target_data` or `GOMP_target_update` performs corresponding device initialization: it calls `device_init` from the plugin, and then stores address mapping table in the splay tree.

In case of Intel MIC, `device_init` creates a new process on the device, and then offloads the accel images with the `type == TARGET_TYPE_INTEL_MIC`. All accel images, even inside the executable, represent dynamic shared objects, which are loaded into the newly created process.

`GOMP_target` looks up the `host_addr` passed to it in the splay tree and passes corresponding `target_addr` to plugin's `device_run` function.
When an executable or dynamic shared object is loaded, it calls `GOMP_offload_register{,_ver}` N times, where N is number of accel images, embedded into this exec/dso. This function stores the pointers to the images and other data needed by accel plugin into `offload_images`.

The first call to `GOMP_target{,_ext}`, `GOMP_target_data{,_ext}`, `GOMP_target_update{,_ext}` or `GOMP_target_enter_exit_data` performs corresponding device initialization: it calls `GOMP_OFFLOAD_init_device` from the plugin, and then stores address mapping table in the splay tree.

In case of Intel MIC, `GOMP_OFFLOAD_init_device` creates a new process on the device, and then offloads the accel images with the `type == OFFLOAD_TARGET_TYPE_INTEL_MIC`. All accel images, even inside the executable, represent dynamic shared objects, which are loaded into the newly created process.

`GOMP_target{,_ext}` looks up the `host_addr` passed to it in the splay tree and passes corresponding `target_addr` to plugin's `GOMP_OFFLOAD_run` function.

== Partial Offloading ==

''Partial offloading'' means that for some of the potentially offloadable
regions, offloadable code is not created.
For example:
 1. Parts of an application are compiled with offloading enabled, but other
 parts with offloading disabled.
 1. Usage of constructs in the offloading region that cannot be supported:
  * [[nvptx]] doesn't support `setjmp`/`longjmp`, exceptions (?), `alloca`,
  computed `goto`, non-local `goto`, for example;
  * hsa offloading fails if the compiler can't "gridify" certain loops.
 1. The compiler determines that offloading is not feasible.
 For example, if no parallelism is usable in an offloading region,
 single-threaded offloading execution will typically be slower than
 host-fallback execution because of hardware characteristics.
 Also, on a non-shared memory system, offloading incurs data copy penalties.

In shared memory offloading configurations, the run-time system can just use
host-fallback.
If not expected by a user, this may incur a performance regression, but the
program semantics will not be affected
(unless in the offloading region the program makes use of any program
constructs that exhibit different behavior when executing in offloaded
vs. host-fallback mode).
Doing host-fallback in non-shared memory offloading configurations however may
lead to hard-to-find problems,
if a user expects that all offloading regions are executed on the device,
but in fact some of them are silently executed on the host with different data
environment.

If offloaded code is expected to be run on an accelerator,
but that code is not in fact available,
the run-time system will (silently) resort to host-fallback execution.

Therefore it is important in such cases to emit compile-time diagnostics.

OpenMP, for example, doesn't guarantee that all target regions must be executed on the device,
but in this case a user can't be sure that some library function always will
offload (because the library might be replaced by fallback version),
and they will have to write something like:
{{{
map_data_to_target ();
some_library1_fn_with_offload ();
get_data_from_target (); /* ! */
send_data_to_target (); /* ! */
some_library2_fn_with_offload ();
get_data_from_target (); /* ! */
send_data_to_target (); /* ! */
some_library3_fn_with_offload ();
unmap_data_from_target ();
}}}

It may be worth discussing whether there should be a way to allow the run-time
system to deduce what data needs to be resynced on target region entries/exits
in presence of fallback execution; explicit copying via `map(from/to:...)` is a
too big hammer for that.

In non-shared memory offloading configurations, it is user error if compiling
parts of an application with offloading enabled, but other parts with
offloading disabled.
The compiler/run-time system are not expected to "fix up" any possible
conflicts in data management.

Currently, the compilation process (host compiler) will stop if there is an
error in any offload compilation.
It is under discussion to change this (at least depending on some option):
either downgrade all errors in the offloading compiler into warnings that just
result in the offloading image for the particular accelerator not being
created, or
issue errors, but still allow the linking.
Line 122: Line 430:
Currently all necessary changes could be found in a git branch named [[https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/kyukhin/gomp4-offload|kyukhin/gomp4-offload]]. It contains set of patches from `gomp-4_0-branch`, some non-committed patches, `libgomp` plugin, Intel MIC runtime offload library `liboffloadmic` and an emulator. This emulator lies under `liboffloadmic` and reproduces MIC's HW and SW stack behavior allowing to run offloaded code in a separate address space using the host machine. Patches enabling OpenMP 4.0 offloading to Intel MIC are merged to trunk. They include general infrastructure changes, `mkoffload` tool, `libgomp` plugin, Intel MIC runtime offload library `liboffloadmic` and an emulator. This emulator lies under `liboffloadmic` and reproduces MIC's HW and SW stack behavior allowing to run offloaded code in a separate address space using the host machine. The emulator consists of 4 shared libraries which replace COI and MYO libraries from [[https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss|Intel Manycore Platform Software Stack (MPSS)]]. In case of real offloading, user is supposed to specify path to MPSS libraries in LD_LIBRARY_PATH, this will overload emulator libraries on runtime.

[[tschwinge]] is using the following build scripts: [[attachment:trunk-offload-big.tar.bz2]] (bootstrap, all languages), [[attachment:trunk-offload-light.tar.bz2]] (no bootstrap, only C, C++, Fortran). Unpack, populate `[...]/source-{gcc,newlib,nvptx-tools}` (for example, using `git-new-workdir`, or symlinks to existing source trees), and then invoke the `RUN` script.

In the following instructions, note that DESTDIR specifies where the toolchain is to be installed. In the steps below, DESTDIR is set to /install, although any directory with sufficient write permissions should work so long as DESTDIR is set to an absolute path. Furthermore, during install DESTDIR may be will be populated with a usr/local/ subdirectories. If your system creates a DESTDIR/usr/local, and assuming that DESTDIR is /install as with the examples below, be sure to replace /install/bin with /install/usr/local/bin and set LD_LIBRARY_PATH to /install/usr/local/lib64 when you follow steps 3 and 4 below.
Line 125: Line 437:
{{{
../configure --build=x86_64-intelmic-linux-gnu --host=x86_64-intelmic-linux-gnu --target=x86_64-intelmic-linux-gnu --enable-as-accelerator-for=x86_64-pc-linux-gnu --prefix=/install target_configargs="--enable-version-specific-runtime-libs"
For Intel MIC:

{{{
../configure --build=x86_64-intelmicemul-linux-gnu --host=x86_64-intelmicemul-linux-gnu --target=x86_64-intelmicemul-linux-gnu --enable-as-accelerator-for=x86_64-pc-linux-gnu
Line 128: Line 442:
make install
}}}
make install DESTDIR=/install
}}}

For Nvidia PTX (also see https://gcc.gnu.org/install/specific.html#nvptx-x-none):

First set up nvptx-tools. Note that ptxas must be in your PATH:

{{{
  ${NVPTX_TOOLS_SRC}/configure
  make
  make install DESTDIR=/install
}}}

Next insert a symbolic to nvptx-newlib's newlib directory into the directory containing the gcc sources. Then proceed to build the nvptx offloading gcc. Note that INSTDIR/usr/local/bin needs to be in your PATH:

{{{
../configure --target=nvptx-none --enable-as-accelerator-for=x86_64-pc-linux-gnu --with-build-time-tools=[install-nvptx-tools]/nvptx-none/bin --disable-sjlj-exceptions --enable-newlib-io-long-long
make
make install DESTDIR=/install
}}}

Finally, remove the newlib symlink from the gcc sources directory.
Line 132: Line 467:
../configure --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --enable-offload-targets=x86_64-intelmic-linux-gnu --prefix=/install ../configure --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --enable-offload-targets=x86_64-intelmicemul-linux-gnu=/install/prefix,nvptx-none=/install/usr/local/nvptx-none --with-cuda-driver=[cuda_install_path]
Line 134: Line 469:
make install
}}}
make install DESTDIR=/install
}}}
If you install both compilers without `DESTDIR`, then there is no need to specify the paths to accel install trees in the `--enable-offload-targets` option.
Line 138: Line 474:
/install/bin/gcc -fopenmp -flto test.c
}}}
=== 4. Running an application using the emulator: ===
{{{
export LIBGOMP_PLUGIN_PATH="/install/lib64/"
/install/bin/gcc -fopenmp test.c
/install/bin/gcc
-fopenacc test.c
}}}
=== 4. Running an application using the Intel MIC emulator: ===
{{{
Line 144: Line 480:
export MIC_LD_LIBRARY_PATH="/install/lib/gcc/x86_64-intelmic-linux-gnu/4.10.0/"
Line 147: Line 482:
=== Running 'make check' (run tests using the emulator): ===
{{{
configure, make and *install* accel compiler (see #1)
configure and make host compiler (see #2)
export COMPILER_PATH="/install/libexec/gcc/x86_64-pc-linux-gnu/4.10.0"
export LIBGOMP_PLUGIN_PATH="/gcc_build_host/x86_64-pc-linux-gnu/liboffloadmic/plugin/.libs"
export LD_LIBRARY_PATH="/gcc_build_host/x86_64-pc-linux-gnu/liboffloadmic/.libs"
export MIC_LD_LIBRARY_PATH="/install/lib/gcc/x86_64-intelmic-linux-gnu/4.10.0/"
cd gcc_build_host
This creates 2 processes on host: the a.out process and "target" process.

KNL instructions can be emulated by running target process under [[https://software.intel.com/en-us/articles/intel-software-development-emulator|Intel Software Development Emulator (SDE)]]:
{{{
export LD_LIBRARY_PATH="/install/lib64/"
/install/bin/gcc -fopenmp -Ofast -foffload="-march=knl" test.c
OFFLOAD_EMUL_RUN="sde -knl --" ./a.out
}}}
The debugger can be attached to the target process by:
{{{
OFFLOAD_EMUL_RUN=gdb ./a.out
}}}
..., and multiple devices can be emulated by:
{{{
OFFLOAD_EMUL_KNC_NUM=2 ./a.out # For GCC 5
OFFLOAD_EMUL_NUM=2 ./a.out # For GCC 6
}}}
=== Running 'make check' ===
 * configure, make and '''install''' accel compiler (see #1)
 * configure and make host compiler (see #2)
 * From the host gcc build directory run:
{{{
Line 160: Line 508:
 * In-tree testing is not supported yet with the scheme described above. I (Bernd) think that what should happen is that the host compiler configure step copies the already installed accel/ subdirectory from the install location to its build directory. (Ilya) In-tree testing works with `COMPILER_PATH` set to `/install/libexec/gcc/host/ver`, this helps `lto-wrapper` to find `mkoffload`.  * In-tree testing is not supported yet when an accel compiler is not installed. [[https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00050.html|RFC patch]].
 * If something goes wrong during the offloading compilation, the host binary is not created. However it's possible to continue compilation in such cases. [[https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00951.html|Patch is here]].
 * If someone builds an accel compiler without `--enable-languages` or with `--enable-languages` other than `c,c++,fortran,lto`, then `bin` directory will contain redundant drivers. [[https://gcc.gnu.org/ml/gcc-patches/2015-02/msg01536.html|Fix is here]].
 * For OpenACC offloading, `-foffload=disable` [[http://mid.mail-archive.com/87oakeegx8.fsf@kepler.schwinge.homeip.net|does not do the right thing]].
 * We should get rid of the (only) handful of ENABLE_OFFLOADING and ACCEL_COMPILER preprocessor conditionals, [[http://mid.mail-archive.com/874mh43i7q.fsf@kepler.schwinge.homeip.net]].
 * Offloading compilation is slow, [[http://mid.mail-archive.com/87shzfa6z1.fsf@hertz.schwinge.homeip.net]].
 Supposedly, because of having to invoke several tools (LTO streaming -> mkoffload -> offload compilers, assemblers, linkers -> combine the resulting images; but we have not done a detailed analysis on that).

==== Debugging offload compiler invocations ====
 * [[http://mid.mail-archive.com/87h9efwh7u.fsf@kepler.schwinge.homeip.net]].

==== nvptx offloading ====
For nvptx offloading, the following issues still need to be resolved:
 * Add support for OpenMP offloading.

==== Intel MIC offloading ====
 * Intel MIC does not require special sysroot or build-time-tools, therefore the accel compiler should be configured as native (with same target in `--build`, `--host` and `--target` options). Probably it's better to configure it as cross compiler.
 * The host GCC build references/depends on the Intel MIC offloading compiler's installation directory (which thus has to be built and installed earlier), [[http://mid.mail-archive.com/878uaq68fn.fsf@kepler.schwinge.homeip.net]].
 * Add support for OpenACC offloading.

== See also ==

 * '''Accelerator BoF''' (GNU Tools Cauldron 2013) [[http://gcc.gnu.org/ml/gcc/2013-07/msg00428.html|summary]], [[http://www.youtube.ca/watch?v=EkCzBHdsGDU&list=PLsgS8fWwKJZhrjVEN7tsQyj2nLb5z0n70&index=1|video]]
 * '''Accelerator BoF''' (GNU Tools Cauldron 2014) [[http://www.youtube.ca/watch?v=y-oCc2XbY-k&list=PLOGTP9W1DX5U53pPxtWdKkyiqe3GAn6Nd|video]]
 * '''OpenMP 4 Offloading Features implementation in GCC''' (Kirill Yukhin, GNU Tools Cauldron 2015) [[https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=OpenMP_4_Offloading_in_GCC.Yukhin.pdf|slides]], [[https://www.youtube.ca/watch?v=c5OA8T0HWAo|video]]
 * '''Compiling for HSA accelerators with GCC''' (Martin Jambor, GNU Tools Cauldron 2015) [[https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=mjambor-hsa-slides.pdf|slides]], [[https://www.youtube.ca/watch?v=3yJdAUrfC0g|video]]
 * '''OpenACC & PTX''' (Nathan Sidwell, GNU Tools Cauldron 2015) [[https://www.youtube.ca/watch?v=SBX6_K1AD7s|video]]
 * '''Accelerator BoF''' (GNU Tools Cauldron 2015) [[https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=accelerator-bof.pdf|slides1]], [[https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=amonakov-accelerator-bof.pdf|slides2]], [[https://www.youtube.ca/watch?v=9NWJlbU5Gwg|video]]
 * '''Improving OpenACC kernels support in GCC''' (Thomas Schwinge, GNU Tools Cauldron 2017) [[https://gcc.gnu.org/wiki/cauldron2017?action=AttachFile&do=view&target=OpenACC+kernels.pdf|slides]]
 * '''Future Direction of OpenACC''' (Cesar Philippidis, GNU Tools Cauldron 2018) [[https://gcc.gnu.org/wiki/cauldron2018?action=AttachFile&do=view&target=cjp-openacc-gnu-cauldron2018.pdf|slides]] [[https://ia800707.us.archive.org/10/items/gnu_tools_cauldron_2018/Future%20Direction%20of%20OpenACC%20-%20GNU%20Tools%20Cauldron%202018-rHHN3bbNWnY.mp4|video]]

Offloading Support in GCC

GCC 5 and later support two offloading configurations:

  • OpenMP to Intel MIC targets (upcoming Intel Xeon Phi products codenamed KNL) as well as MIC emulation on host.

  • OpenACC to Nvidia PTX targets.

GCC 7 and later supports further:

Planned for GCC 10:

  • OpenMP and OpenACC to AMD GCN targets (non-offloading back-end introduced in GCC 9).

/!\ This needs to be updated some more for OpenACC.

Terminology

Host compiler — a regular compiler. Not to be confused with build/host/target configure terms.

Accel compiler — a compiler that reads intermediate representation from the special LTO sections, and generates code for the accelerator device. Also called the "offload compiler" or "target compiler".

OpenMP — open multi-processing, supporting vector, thread and offloading directives/pragmas.

OpenACC — open accelerators, supporting offloading directives/pragmas.

Building host and accel compilers

The host and offload compilers need to be able to find each other. This is achieved by installing the offload compiler into special locations, and informing each about the presence of the other. All available offload compilers must first be configured with "--enable-as-accelerator-for=host-triplet", and installed into the same prefix as the host compiler. Then the host compiler is built with "--enable-offload-targets=target1,target2,..." which identifies the offload compilers that have already been built and installed.

The install locations for the offload compilers differ from those of a normal cross toolchain, by the following mapping:

bin/$target-gcc

->

bin/$host-accel-$target-gcc

lib/gcc/$target/$ver/

->

lib/gcc/$host/$ver/accel/$target

It may be necessary to compile offload compilers with a sysroot, since otherwise install locations for libgomp could clash (maybe that library needs to move into lib/gcc/..?)

A target needs to provide a mkoffload tool if it wishes to be usable as an accelerator. It is installed as one of EXTRA_PROGRAMS, and the host lto-wrapper knows how to find it from the paths described above. mkoffload will invoke the offload compiler in LTO mode to produce an offload binary from the host object files, then post-process this to produce a new object file that can be linked in with the host executable. It can find the host compiler by examining the COLLECT_GCC environment variable, and it must take care to clear this and certain other environment variables when executing the offload compiler so as to not confuse it.

Compilation process

Host compiler performs the following actions:

  1. After #pragma omp target lowering and expansion, a new outlined function with the attribute "omp declare target" emerges — it will be later compiled both by host and accel compilers to produce two versions (or N+1 versions in case of N different accel targets).
    The decls for all global variables marked with "omp declare target" attribute, as well as decls for outlined target regions, are inserted into offload_vars and offload_funcs arrays.

  2. The expansion phase replaces pragmas with corresponding calls to the runtime library libgomp (GOMP_target{,_ext}, GOMP_target_data{,_ext} + GOMP_target_end_data, GOMP_target_update{,_ext}, GOMP_target_enter_exit_data). These calls are preceded by initialization of special structures, containing arguments for outlined functions (.omp_data_arr.*, .omp_data_sizes.*, .omp_data_kinds.*).

  3. During the ipa_write_summaries pass the intermediate representation of outlined functions is streamed out into the .gnu.offload_lto_* sections of the "fat" object file. This object file also may contain .gnu.lto_* sections for the regular link-time optimizations.
    Also the decls from offload_funcs and offload_vars are streamed out into the .gnu.offload_lto_.offload_table section. Later an accel compiler will read this section to produce target's mapping table.

  4. In omp_finish_file function the addresses from offload_funcs and offload_vars are written into the .gnu.offload_funcs and .gnu.offload_vars sections correspondingly.
    Optionally, if -flto is present, the decls from offload_funcs and offload_vars are streamed out into the .gnu.lto_.offload_table section. Later the host compiler in LTO mode will use them to produce the final host's table with addresses.

  5. When all source files are compiled, pre-linker driver collect2 is invoked. It runs the linker, which loads linker plugin liblto_plugin.so, which runs lto-wrapper. Without offloading the lto-wrapper is called for link-time recompilation if at least one object file contains .gnu.lto_* sections. If some files contain offloading, then linker plugin will execute lto-wrapper even if there are no .gnu.lto_* sections. Offloading without linker plugin is not supported.

  6. lto-wrapper runs mkoffload for each accel target, specified during the configuration.

  7. mkoffload runs accel compiler, which reads IR from the .gnu.offload_lto_* sections and compiles it for the accel target. Then mkoffload packs this target code (image) into the special section of a new host's object file. The object file produced with mkoffload should contain a constructor that calls GOMP_offload_register{,_ver} to identify itself at run-time. Arguments to that function are a symbol called __OFFLOAD_TABLE__ (provided by libgcc and unique per shared object), a target identifier, and some other data needed for a particular target (a pointer to the image, a table with information about mappings between host and offload functions and variables).

  8. Linker adds new object files, produced by mkoffloads, to the list of host's input object files.

Address mapping tables

This example shows how the tables with addresses are created. It consists of 3 source files: apple.c, banana.c and citron.c. Each of them contains 2 outlined target regions *._omp_fn.{0,1}. Global variables are handled in a similar manner. There are 3 different ways of compilation, which are described in detail below.

Section name

Content

.gnu.lto_.offload_table

Contains IR of the decls for the host compiler

.gnu.offload_lto_.offload_table

Contains IR of the decls for the accel compiler

.gnu.offload_funcs

Contains addresses in the host binary

<Target section>

Contains addresses in the target image
* For Intel MIC targets the addresses are stored in ELF binary similar to host addresses
* For Nvidia PTX targets the addresses are stored in PTX assembly

All files without -flto

First,

gcc -c -fopenmp apple.c banana.c citron.c

produces 3 object files with the following sections:

apple.o

Section name

Content

.gnu.offload_lto_.offload_table

apple._omp_fn.0
apple._omp_fn.1

.gnu.offload_funcs

apple._omp_fn.0
apple._omp_fn.1

banana.o

Section name

Content

.gnu.offload_lto_.offload_table

banana._omp_fn.0
banana._omp_fn.1

.gnu.offload_funcs

banana._omp_fn.0
banana._omp_fn.1

citron.o

Section name

Content

.gnu.offload_lto_.offload_table

citron._omp_fn.0
citron._omp_fn.1

.gnu.offload_funcs

citron._omp_fn.0
citron._omp_fn.1

Next,

gcc -fopenmp apple.o banana.o citron.o

runs an accel compiler, which reads IR from .gnu.offload_lto_.offload_table and produces the final target table:

Target image

Section name

Content

<Target section>

apple._omp_fn.0
apple._omp_fn.1
banana._omp_fn.0
banana._omp_fn.1
citron._omp_fn.0
citron._omp_fn.1

Finally, the host linker joins these 3 objects and therefore .gnu.offload_funcs sections into the host binary:

Host binary

Section name

Content

.gnu.offload_funcs

<__offload_func_table>
apple._omp_fn.0
apple._omp_fn.1
banana._omp_fn.0
banana._omp_fn.1
citron._omp_fn.0
citron._omp_fn.1
<__offload_funcs_end>

__offload_func_table and __offload_funcs_end are special symbols, defined in crtoffloadbegin.o and crtoffloadend.o respectively.

All files with -flto

First,

gcc -c -fopenmp -flto apple.c banana.c citron.c

produces 3 object files with the following sections:

apple.o

Section name

Content

.gnu.lto_.offload_table

apple._omp_fn.0
apple._omp_fn.1

.gnu.offload_lto_.offload_table

apple._omp_fn.0
apple._omp_fn.1

banana.o

Section name

Content

.gnu.lto_.offload_table

banana._omp_fn.0
banana._omp_fn.1

.gnu.offload_lto_.offload_table

banana._omp_fn.0
banana._omp_fn.1

citron.o

Section name

Content

.gnu.lto_.offload_table

citron._omp_fn.0
citron._omp_fn.1

.gnu.offload_lto_.offload_table

citron._omp_fn.0
citron._omp_fn.1

Next,

gcc -fopenmp apple.o banana.o citron.o

runs an accel compiler, which produces the final target table, like in the previous case:

Target image

Section name

Content

<Target section>

apple._omp_fn.0
apple._omp_fn.1
banana._omp_fn.0
banana._omp_fn.1
citron._omp_fn.0
citron._omp_fn.1

Next, host compiler is executed in LTO WPA mode, i.e. it reads IR from .gnu.lto_.offload_table from apple.o, banana.o, citron.o, and writes the joint table into .gnu.lto_.offload_table in the temporary object ccXXXXXX.ltrans0.o:

ccXXXXXX.ltrans0.o

Section name

Content

.gnu.lto_.offload_table

apple._omp_fn.0
apple._omp_fn.1
banana._omp_fn.0
banana._omp_fn.1
citron._omp_fn.0
citron._omp_fn.1

In case of multiple partitions the joint table is written into the first partition only.

Next, host compiler is executed in LTO LTRANS mode. It reads the temporary table from .gnu.lto_.offload_table and writes the final table into the final object ccXXXXXX.ltrans0.ltrans.o:

ccXXXXXX.ltrans0.ltrans.o

Section name

Content

.gnu.offload_funcs

apple._omp_fn.0
apple._omp_fn.1
banana._omp_fn.0
banana._omp_fn.1
citron._omp_fn.0
citron._omp_fn.1

Finally, the host linker joins crtoffloadbegin.o, ccXXXXXX.ltrans0.ltrans.o and crtoffloadend.o:

Host binary

Section name

Content

.gnu.offload_funcs

<__offload_func_table>
apple._omp_fn.0
apple._omp_fn.1
banana._omp_fn.0
banana._omp_fn.1
citron._omp_fn.0
citron._omp_fn.1
<__offload_funcs_end>

Some files with and some without -flto

First,

gcc -c -fopenmp banana.c
gcc -c -fopenmp -flto apple.c citron.c

produces 3 object files with the following sections:

apple.o

Section name

Content

.gnu.lto_.offload_table

apple._omp_fn.0
apple._omp_fn.1

.gnu.offload_lto_.offload_table

apple._omp_fn.0
apple._omp_fn.1

banana.o

Section name

Content

.gnu.offload_lto_.offload_table

banana._omp_fn.0
banana._omp_fn.1

.gnu.offload_funcs

banana._omp_fn.0
banana._omp_fn.1

citron.o

Section name

Content

.gnu.lto_.offload_table

citron._omp_fn.0
citron._omp_fn.1

.gnu.offload_lto_.offload_table

citron._omp_fn.0
citron._omp_fn.1

Next, while running

gcc -fopenmp apple.o banana.o citron.o

the linker plugin creates a list of objects with offload sections and passes it to lto-wrapper. The order must be exactly the same as the final order after recompilation and linking. In this example it is: apple.o, citron.o and banana.o. Therefore, the accel compiler will produce the following target table:

Target image

Section name

Content

<Target section>

apple._omp_fn.0
apple._omp_fn.1
citron._omp_fn.0
citron._omp_fn.1
banana._omp_fn.0
banana._omp_fn.1

Next, host compiler will recompile LTO objects (apple.o and citron.o) into ccXXXXXX.ltrans0.ltrans.o with the following table:

ccXXXXXX.ltrans0.ltrans.o

Section name

Content

.gnu.offload_funcs

apple._omp_fn.0
apple._omp_fn.1
citron._omp_fn.0
citron._omp_fn.1

Finally, the host linker joins all objects in this order: crtoffloadbegin.o, ccXXXXXX.ltrans0.ltrans.o, banana.o, crtoffloadend.o; with the following host table:

Host binary

Section name

Content

.gnu.offload_funcs

<__offload_func_table>
apple._omp_fn.0
apple._omp_fn.1
citron._omp_fn.0
citron._omp_fn.1
banana._omp_fn.0
banana._omp_fn.1
<__offload_funcs_end>

Compilation without -flto

Offloading-related steps are marked in bold.

  • gcc

    • cc1 # Compile first source file into plain asm + intermediate representation for accel

    • as # Assemble this asm + IR into object file

    • ... # Compile and assemble all remaining source files

    • collect2 # Pre-linker driver

      • collect-ld # Simple wrapper over ld

        • ld with liblto_plugin.so # Perform linking

          • lto-wrapper # Is called from liblto_plugin.so

            • intelmic/mkoffload # Prepare offload image for Intel MIC devices

              • accel_gcc # Read target IR from all objects and produce target DSO

              • objcopy # Save target DSO in a special section in a new host's object file

            • .../mkoffload # Prepare images for other targets

              • ...

Compilation with -flto

  • gcc

    • cc1 # Compile first source file into plain asm + intermediate representation + IR for accel

    • as # Assemble this asm + IR into temporary object file

    • ... # Compile and assemble all remaining source files

    • collect2 # Pre-linker driver

      • collect-ld # Simple wrapper over ld

        • ld with liblto_plugin.so # Perform linking

          • lto-wrapper # Is called from liblto_plugin.so

            • gcc

              • lto1 # Perform whole program analysis and split into new partitions

            • gcc

              • lto1 # Perform local transformations in the first object file

              • as # Assemble into final object code

            • ... # Perform local transformations in each partitioned object file

            • intelmic/mkoffload # Prepare offload image for Intel MIC devices

              • accel_gcc # Read target IR from all partitions and produce target DSO

              • objcopy # Save target DSO in a special section in a new host's object file

            • .../mkoffload # Prepare images for other targets

              • ...

Compilation options

The main option to control offloading is:

  1. -foffload=<targets>=<options>
    By default, GCC will build offload images for all offload targets specified in configure with non-target-specific options passed to host compiler. This option is used to control offload targets and options for them. It can be used in a few ways:

    • -foffload=disable
      Tells GCC to disable offload support. Target regions will be run in host fallback mode.

    • -foffload=<targets>
      Tells GCC to build offload images for <targets>. They will be built with non-target-specific options passed to host compiler.

    • -foffload=<options>
      Tells GCC to build offload images for all targets specified in configure. They will be built with non-target-specific options passed to host compiler plus <options>.

    • -foffload=<targets>=<options>
      Tells GCC to build offload images for <targets>. They will be built with non-target-specific options passed to host compiler plus <options>.

    <targets> are separated by commas. Several <options> can be specified by separating them by spaces. Options specified by -foffload are appended to the end of option set, so in case of option conflicts they have more priority. The -foffload flag can be specified several times, and you have to do that to specify different <options> for different <targets>.

Also there are several internal options, which should not be specified by user:

  1. -foffload-abi=[lp64|ilp32]
    The option is generated by the host compiler. It is supposed to tell mkoffload (and offload compiler) which ABI is used in streamed GIMPLE, because host and offload compilers must have the same ABI.

  2. -foffload-objects=/tmp/ccxxxha
    This option is generated by linker plugin. It is used to pass the list of object files with offloading to lto-wrapper.

Examples

  • gcc -fopenmp -c -O2 test1.c
    gcc -fopenmp -c -O1 -msse -foffload=-mavx test2.c
    gcc -fopenmp -foffload="-O3 -v" test1.o test2.o

    In this example the offload images will be built with the following options: "-O2 -mavx -O3 -v" for targets specified in configure.

  • gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c

    In this example 2 offload images will be built: for MIC with "-O2 -mavx2 -O3" and for PTX with "-O2 -O3".

Runtime support in libgomp

libgomp plugins

libgomp is designed to be independent of accelerator type it work with. In order to make it possible, plugins are used, while the libgomp itself contains only a generic interface and callbacks to the plugin for invoking target-dependent functionality. Plugins are shared object, implementing a set of routines listed below.

Common for OpenMP and OpenACC:

GOMP_OFFLOAD_get_name
GOMP_OFFLOAD_get_caps
GOMP_OFFLOAD_get_type
GOMP_OFFLOAD_get_num_devices
GOMP_OFFLOAD_init_device
GOMP_OFFLOAD_fini_device
GOMP_OFFLOAD_version
GOMP_OFFLOAD_load_image
GOMP_OFFLOAD_unload_image
GOMP_OFFLOAD_alloc
GOMP_OFFLOAD_free
GOMP_OFFLOAD_dev2host
GOMP_OFFLOAD_host2dev

OpenMP specific:

GOMP_OFFLOAD_run
GOMP_OFFLOAD_async_run
GOMP_OFFLOAD_dev2dev

OpenACC specific:

GOMP_OFFLOAD_openacc_parallel
GOMP_OFFLOAD_openacc_register_async_cleanup
GOMP_OFFLOAD_openacc_async_test
GOMP_OFFLOAD_openacc_async_test_all
GOMP_OFFLOAD_openacc_async_wait
GOMP_OFFLOAD_openacc_async_wait_async
GOMP_OFFLOAD_openacc_async_wait_all
GOMP_OFFLOAD_openacc_async_wait_all_async
GOMP_OFFLOAD_openacc_async_set_async
GOMP_OFFLOAD_openacc_create_thread_data
GOMP_OFFLOAD_openacc_destroy_thread_data
GOMP_OFFLOAD_openacc_get_current_cuda_device
GOMP_OFFLOAD_openacc_get_current_cuda_context
GOMP_OFFLOAD_openacc_get_cuda_stream
GOMP_OFFLOAD_openacc_set_cuda_stream

libgomp gets the list of offload targets from the configure (specified by --enable-offload-targets=target1,target2,...). During the offload initialization, it tries to load plugins named libgomp-plugin-<target>.so.1 from standard dynamic linker paths. The plugins can use third-party target-dependent libraries to perform low-level interaction with the accel devices. E.g., the plugin for Intel MIC devices uses liboffloadmic.so for implementing libgomp callbacks, and the plugin for Nvidia PTX devices uses libcuda.so.

Address translation

When #pragma omp target is expanded, the host_addr of outlined function is passed to GOMP_target{,_ext}. If target device is not available, libgomp just performs host fallback using host_addr. But to run the function on target, it needs to translate host_addr into the corresponding target_addr. The idea is to have [ host_addr, size ] arrays in .gnu.offload_{funcs,vars} sections which are ordered exactly the same as corresponding [ target_addr ] arrays inside the target images (size is needed only for vars).

To keep this host_addr -> target_addr mapping at runtime, each device descriptor gomp_device_descr contains a splay tree. When gomp_init_device performs initialization, it walks the whole array and in each iteration picks n-th host pair host_start/host_end plus corresponding n-th target pair tgt_start/tgt_end, and inserts it into the splay tree.

Execution process

When an executable or dynamic shared object is loaded, it calls GOMP_offload_register{,_ver} N times, where N is number of accel images, embedded into this exec/dso. This function stores the pointers to the images and other data needed by accel plugin into offload_images.

The first call to GOMP_target{,_ext}, GOMP_target_data{,_ext}, GOMP_target_update{,_ext} or GOMP_target_enter_exit_data performs corresponding device initialization: it calls GOMP_OFFLOAD_init_device from the plugin, and then stores address mapping table in the splay tree.

In case of Intel MIC, GOMP_OFFLOAD_init_device creates a new process on the device, and then offloads the accel images with the type == OFFLOAD_TARGET_TYPE_INTEL_MIC. All accel images, even inside the executable, represent dynamic shared objects, which are loaded into the newly created process.

GOMP_target{,_ext} looks up the host_addr passed to it in the splay tree and passes corresponding target_addr to plugin's GOMP_OFFLOAD_run function.

Partial Offloading

Partial offloading means that for some of the potentially offloadable regions, offloadable code is not created. For example:

  1. Parts of an application are compiled with offloading enabled, but other parts with offloading disabled.
  2. Usage of constructs in the offloading region that cannot be supported:
    • nvptx doesn't support setjmp/longjmp, exceptions (?), alloca, computed goto, non-local goto, for example;

    • hsa offloading fails if the compiler can't "gridify" certain loops.
  3. The compiler determines that offloading is not feasible. For example, if no parallelism is usable in an offloading region, single-threaded offloading execution will typically be slower than host-fallback execution because of hardware characteristics. Also, on a non-shared memory system, offloading incurs data copy penalties.

In shared memory offloading configurations, the run-time system can just use host-fallback. If not expected by a user, this may incur a performance regression, but the program semantics will not be affected (unless in the offloading region the program makes use of any program constructs that exhibit different behavior when executing in offloaded vs. host-fallback mode). Doing host-fallback in non-shared memory offloading configurations however may lead to hard-to-find problems, if a user expects that all offloading regions are executed on the device, but in fact some of them are silently executed on the host with different data environment.

If offloaded code is expected to be run on an accelerator, but that code is not in fact available, the run-time system will (silently) resort to host-fallback execution.

Therefore it is important in such cases to emit compile-time diagnostics.

OpenMP, for example, doesn't guarantee that all target regions must be executed on the device, but in this case a user can't be sure that some library function always will offload (because the library might be replaced by fallback version), and they will have to write something like:

map_data_to_target ();
some_library1_fn_with_offload ();
get_data_from_target ();   /* ! */
send_data_to_target ();    /* ! */
some_library2_fn_with_offload ();
get_data_from_target ();   /* ! */
send_data_to_target ();    /* ! */
some_library3_fn_with_offload ();
unmap_data_from_target ();

It may be worth discussing whether there should be a way to allow the run-time system to deduce what data needs to be resynced on target region entries/exits in presence of fallback execution; explicit copying via map(from/to:...) is a too big hammer for that.

In non-shared memory offloading configurations, it is user error if compiling parts of an application with offloading enabled, but other parts with offloading disabled. The compiler/run-time system are not expected to "fix up" any possible conflicts in data management.

Currently, the compilation process (host compiler) will stop if there is an error in any offload compilation. It is under discussion to change this (at least depending on some option): either downgrade all errors in the offloading compiler into warnings that just result in the offloading image for the particular accelerator not being created, or issue errors, but still allow the linking.

How to try offloading enabled GCC

Patches enabling OpenMP 4.0 offloading to Intel MIC are merged to trunk. They include general infrastructure changes, mkoffload tool, libgomp plugin, Intel MIC runtime offload library liboffloadmic and an emulator. This emulator lies under liboffloadmic and reproduces MIC's HW and SW stack behavior allowing to run offloaded code in a separate address space using the host machine. The emulator consists of 4 shared libraries which replace COI and MYO libraries from Intel Manycore Platform Software Stack (MPSS). In case of real offloading, user is supposed to specify path to MPSS libraries in LD_LIBRARY_PATH, this will overload emulator libraries on runtime.

tschwinge is using the following build scripts: trunk-offload-big.tar.bz2 (bootstrap, all languages), trunk-offload-light.tar.bz2 (no bootstrap, only C, C++, Fortran). Unpack, populate [...]/source-{gcc,newlib,nvptx-tools} (for example, using git-new-workdir, or symlinks to existing source trees), and then invoke the RUN script.

In the following instructions, note that DESTDIR specifies where the toolchain is to be installed. In the steps below, DESTDIR is set to /install, although any directory with sufficient write permissions should work so long as DESTDIR is set to an absolute path. Furthermore, during install DESTDIR may be will be populated with a usr/local/ subdirectories. If your system creates a DESTDIR/usr/local, and assuming that DESTDIR is /install as with the examples below, be sure to replace /install/bin with /install/usr/local/bin and set LD_LIBRARY_PATH to /install/usr/local/lib64 when you follow steps 3 and 4 below.

1. Building accel compiler:

For Intel MIC:

../configure --build=x86_64-intelmicemul-linux-gnu --host=x86_64-intelmicemul-linux-gnu --target=x86_64-intelmicemul-linux-gnu --enable-as-accelerator-for=x86_64-pc-linux-gnu
make
make install DESTDIR=/install

For Nvidia PTX (also see https://gcc.gnu.org/install/specific.html#nvptx-x-none):

First set up nvptx-tools. Note that ptxas must be in your PATH:

  ${NVPTX_TOOLS_SRC}/configure
  make
  make install DESTDIR=/install

Next insert a symbolic to nvptx-newlib's newlib directory into the directory containing the gcc sources. Then proceed to build the nvptx offloading gcc. Note that INSTDIR/usr/local/bin needs to be in your PATH:

../configure --target=nvptx-none --enable-as-accelerator-for=x86_64-pc-linux-gnu --with-build-time-tools=[install-nvptx-tools]/nvptx-none/bin --disable-sjlj-exceptions --enable-newlib-io-long-long
make
make install DESTDIR=/install

Finally, remove the newlib symlink from the gcc sources directory.

2. Building host compiler:

../configure --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --enable-offload-targets=x86_64-intelmicemul-linux-gnu=/install/prefix,nvptx-none=/install/usr/local/nvptx-none --with-cuda-driver=[cuda_install_path]
make
make install DESTDIR=/install

If you install both compilers without DESTDIR, then there is no need to specify the paths to accel install trees in the --enable-offload-targets option.

3. Building an application:

/install/bin/gcc -fopenmp test.c
/install/bin/gcc -fopenacc test.c

4. Running an application using the Intel MIC emulator:

export LD_LIBRARY_PATH="/install/lib64/"
./a.out

This creates 2 processes on host: the a.out process and "target" process.

KNL instructions can be emulated by running target process under Intel Software Development Emulator (SDE):

export LD_LIBRARY_PATH="/install/lib64/"
/install/bin/gcc -fopenmp -Ofast -foffload="-march=knl" test.c
OFFLOAD_EMUL_RUN="sde -knl --" ./a.out

The debugger can be attached to the target process by:

OFFLOAD_EMUL_RUN=gdb ./a.out

..., and multiple devices can be emulated by:

OFFLOAD_EMUL_KNC_NUM=2 ./a.out # For GCC 5
OFFLOAD_EMUL_NUM=2 ./a.out     # For GCC 6

Running 'make check'

  • configure, make and install accel compiler (see #1)

  • configure and make host compiler (see #2)
  • From the host gcc build directory run:

make check-target-libgomp

Known issues

  • In-tree testing is not supported yet when an accel compiler is not installed. RFC patch.

  • If something goes wrong during the offloading compilation, the host binary is not created. However it's possible to continue compilation in such cases. Patch is here.

  • If someone builds an accel compiler without --enable-languages or with --enable-languages other than c,c++,fortran,lto, then bin directory will contain redundant drivers. Fix is here.

  • For OpenACC offloading, -foffload=disable does not do the right thing.

  • We should get rid of the (only) handful of ENABLE_OFFLOADING and ACCEL_COMPILER preprocessor conditionals, http://mid.mail-archive.com/874mh43i7q.fsf@kepler.schwinge.homeip.net.

  • Offloading compilation is slow, http://mid.mail-archive.com/87shzfa6z1.fsf@hertz.schwinge.homeip.net. Supposedly, because of having to invoke several tools (LTO streaming -> mkoffload -> offload compilers, assemblers, linkers -> combine the resulting images; but we have not done a detailed analysis on that).

Debugging offload compiler invocations

nvptx offloading

For nvptx offloading, the following issues still need to be resolved:

  • Add support for OpenMP offloading.

Intel MIC offloading

  • Intel MIC does not require special sysroot or build-time-tools, therefore the accel compiler should be configured as native (with same target in --build, --host and --target options). Probably it's better to configure it as cross compiler.

  • The host GCC build references/depends on the Intel MIC offloading compiler's installation directory (which thus has to be built and installed earlier), http://mid.mail-archive.com/878uaq68fn.fsf@kepler.schwinge.homeip.net.

  • Add support for OpenACC offloading.

See also

  • Accelerator BoF (GNU Tools Cauldron 2013) summary, video

  • Accelerator BoF (GNU Tools Cauldron 2014) video

  • OpenMP 4 Offloading Features implementation in GCC (Kirill Yukhin, GNU Tools Cauldron 2015) slides, video

  • Compiling for HSA accelerators with GCC (Martin Jambor, GNU Tools Cauldron 2015) slides, video

  • OpenACC & PTX (Nathan Sidwell, GNU Tools Cauldron 2015) video

  • Accelerator BoF (GNU Tools Cauldron 2015) slides1, slides2, video

  • Improving OpenACC kernels support in GCC (Thomas Schwinge, GNU Tools Cauldron 2017) slides

  • Future Direction of OpenACC (Cesar Philippidis, GNU Tools Cauldron 2018) slides video

None: Offloading (last edited 2019-11-05 16:06:30 by tschwinge)