[RFC] Offloading Support in libgomp

Michael Zolotukhin michael.v.zolotukhin@gmail.com
Fri Sep 13 09:35:00 GMT 2013

Hi Jakub et al.,
We prepared a draft for design document for offloading support in GCC - could
you please take a look?  It is intended to give a common comprehension of what
is going on in this part.

We might publish it to a GCC wiki, if it is ok.  And later we could fill it with
more details if needed.

Here it is:

1.  High level view on the compilation process with openmp plugins
1.1.  Compilation
1.2.  Linking
1.3.  Execution
2.  Linker plugins infrastructure
2.1.  Overview
2.2.  Multi-target support
3.  OpenMP pragma target handling in middle-end
4.  Runtime support in libGOMP
4.1.  General interface for offloading
4.2.  Maintaining info about mapped regions
4.3.  Preparing arguments for offloaded calls
4.4.  Plugins, usage of device-specific interfaces


1.1.  Compilation

When host version of GCC compiles a file, the following stages happen:
  * After OpenMP pragmas lowering and expanding a new outlined function with
'target'-attribute emerges - it later will be compiled both by host and target
GCC to produce two versions (or N+1 versions in case of N different targets).
  * Expanding replaces pragmas with corresponding calls to runtime library
(libgomp).  These calls are preceded by initialization of special structures,
containing arguments for outlined routines - that is done similar to 'pragma
omp parallel' processing.
  * Gimple for routines with 'target' attribute is streamed into a special
section of the assembler (similar to LTO-sections).
  * Usual compilation continues, producing host-side assembler.
  * Assembler generates a FAT-object, containing host-side code and Gimple IR
for the outlined functions (they were marked with 'target' attribute).

TODO: add something about routines and variables inside 'pragma declare target'.

1.2.  Linking

When all source files are compiled, a linker is invoked.  The linker is passed
a special option to invoke openmp-plugin.  The plugin is responsible for
producing target-side executables - for each target it calls the corresponding
target compiler and linker.
The target-side GCC is invoked to load Gimple IR from .gnu.target_lto sections
of the FAT-object and compile it to target-side objects which later will be
used by target-side linker.

The host-side linker needs libgomp along side with standard libraries like
libc/libm to successfully resolve symbols, generated by the host compiler.  The
target-side linker needs CRT.O, containing main-routine for target-side
executable and target-specific versions of standard libraries.

As a result of the work, the plugin produces N target executables and exits,
allowing the host linker to continue its work and produce host-side executable.

TBD: Should the main routine always contain a message-waiting loop (like in COI
implementation) or other options are also possible?
TBD: probably, it's better to have a separate plugin for each target, that a
single openmp plugin.

1.3.  Execution

Host-side executable contains calls to libgomp library, which interfaces all
interactions with target-devices.
On loading, the executable calls GOMP_target_init from libgomp.so, which will
load the target executables onto target-devices and start them.  Since this
moment, the devices are ready to execute requested code and interact with the
main host-process.

When a host-side program calls libgomp functions related to the offloading,
libgomp decides, whether it's profitable to offload, and which device to choose
for that.  In order to do that, libgomp calls available plugins and checks
which devices are ready to execute offloaded code.  Available plugins should be
located in a specified folder and should implement a certain interface.

Another important function of libgomp is host-target memory mapping and keeping
information about mapped regions and their types.

TBD: probably, it's better to call GOMP_target_init on the first attempt to
offload something to the given device.
TBD: probably, it's better to 'hard-code' available plugin during build of
libgomp (e.g., at configure step).


2.1.  Overview

When -flto or -fopenmp option is given to the GCC driver, linker plugin
invocation is triggered.  The plugin claims the input files containing
.gnu.lto* or .gnu.target_lto* sections for further processing and creates
resolutions file.
After this preliminary work, LTO-wrapper is called.  It is responsible for
sequential calls of GCC.

The first call is needed to run WPA, which performs usual LTO partitioning as
well as partitioning of OpenMP-target sections.  WPA reads bytecode of:
  1) all functions and variables with "omp declare target" attribute;
  2) the outlined bodies of #pragma omp target turned into '*.ompfn' functions;
  3) all the types, symtab etc. needed for that;
from .gnu.target_lto* sections and stores them into an extra partition.

The second call invokes GCC on the partitioned ltrans-files and produces
LTO-optimized host-side executable.

The third call invokes target-side GCC (which in turn would call a linker,
which could start LTO for target-side code) and produces target-side
executable.  GCC-target uses lto1 frontend to read bytecode from OpenMP-target
partition, produced during WPA stage.  Further it generates optimized code for
target and links it with ld-target.
This target-executable is added to host-linker input files and is placed into a
.rodata section of host-side executable.

2.2.  Multi-target support

If several different targets are used for offloading, .gnu.target_lto code must
be compiled for each of them.  In order to do that, several target-side
compilers need to be called.
LTO-wrapper scans a specified folder and runs every version of GCC located
there, assuming that these are the target-side compilers.

TBD: This scheme might need to be reconsidered.


Middle end work is done in two omp passes.  Specifically, omp-lower pass:
  * Creates outlined function with no body
  * Adds #pragma omp return in the end of the region
  * Creates empty struct args_data
  * For each var referenced in clauses  (e.g. int i):
    -  Adds entry to data_arr, data_sizes and data_kind arrays describing this
       variable, its size and mapping type
    -  Adds assignment before call to outlined function : args_data.i = &i
    -  Replace uses of i with uses of args_data->i inside the region

Then, omp-expand pass:
  * Moves the region to the outlined function
  * Adds a call to libGOMP to maybe offload this function:
    GOMP_target (condition /* evaluated expression from IF clause */,
		 device_no /* a number from DEVICE clause */,
		 foo, .foo.,
		 data_arr, data_size, data_kinds, 1 /* size of arrays */);

Finally, gimple of the outlined function and needed parts of symtab are
streamed to .gnu.target_lto* sections.


4.1.  General interface for offloading

LibGOMP implements generic routines, such as GOMP_target, GOMP_target_data,
GOMP_target_data_end, GOMP_target_update and others.  The compiler replaces
'pragma target' with calls to these routines, surrounded by a code with
arguments preparation.

All of these routines expects three arrays as a parameters: these arrays
contains info about variables mapping.  The first array contains host addresses
of the variables, the second - sizes of mapped regions, and the third - type of
mapping (TO, TOFROM, FROM, ALLOC).  We prefer three separate arrays to one
array of corresponding structures due to the following reasons:
  * The first array could be used as an argument list for outlined calls
  * The second and third arrays are expected to often contain only constants
and thus might be stored in .rodata section and not be allocated dynamically.

GOMP_target routine takes additional arguments:
  * Address of the host version of outlined function.  It is used when runtime
decides to perform host fallback instead of offloading to an accelerator.
  * Name of the target version of outlined function.  This is used when runtime
decides to offload.  It cannot directly call a function on a target device, so
it calls the corresponding plugin and gives it a function name to invoke.

GOMP_target, GOMP_target_data, GOMP_target_data_end, GOMP_target_update routines
performs maintaining of a global structure describing current mapping, which
will be covered in the next section, and an actual data marshalling:
  * GOMP_target copies regions with kind TO or TOFROM to device before
offloading and copies regions with kind FROM or TOFROM from the device when the
offloading is finished.  In case of host fallback no copying is performed.
  * GOMP_target_data copies regions with kind TO or TOFROM to the device.
  * GOMP_target_data_end copies regions with kind FROM or TOFROM from the
  * GOMP_target_data_update copies all regions according to their types to and
from the device.

4.2.  Maintaining info about mapped regions

Every libGOMP routine dealing with mapped memory regions is responsible for an
accurate maintaining of a global data structure describing this mapping.  This
data structure is a binary search tree containing structures 
    void *host_address;
    void *target_address;
    size_t region_size;
    enum {TO, TOFROM, FROM, ALLOC} region_type;
with host addresses used as a key.

The data structure allows to check whether a given host address is mapped, or
not.  In order to do that, on every request it needs to find out whether the
requested interval is covered with already mapped ones and check if all of them
have a corresponding type.

TBD: We could think of usage of some other data structure here.
TBD: Describe, when we need to emit errors, and when we might remap a region
(e.g. if we request to map a region with another type of mapping).

4.3.  Preparing arguments for offloaded calls

GOMP_target reuses functionality of GOMP_parallel for outlining and preparing
arguments for calls of a host-side version of the function.  In fact, host
fallback of GOMP_target behaves almost exactly as GOMP_parallel, except that
GOMP_target does not create any threads for its execution.
The other part of GOMP_target, which is responsible for performing offloading,
looks quite different.
Firstly, it is needed to map/allocate needed regions - that was described in
Once the mapping is finished and the data is copied to the device, it is needed
to process arguments list for the offloaded routine.  The goal of this
processing is to translate host addresses to the corresponding target addresses.
As the arguments structure is always actually an array of pointers, it is enough
to know only its size (or the number of the arguments) to correctly translate
all contained addresses.  The translated addresses rewrite the host addresses in
the arguments structure, making it ready for passing to the target-side version
of the offloaded routine.

4.4.  Plugins, usage of device-specific interfaces

LibGOMP is designed to be independent of devices type it work with.  In order to
make it possible, plugins are used: while the libGOMP itself contains only a
generic interface and callbacks to the plugin for invoking target-dependent
Plugins are shared object located in a specified folder, implementing a certain
set of routines.
TODO: list this set of necessary routines.
TBD: should we use all .SO located in a plugins folder or should we use
predefined set of plugins (e.g. defined during libgomp build)?

When required, libGOMP performs a search in the plugins folder for the plugins
and for each found .so file checks whether it is a proper plugin and if so
invokes a needed routine.  For instance, when libGOMP chooses a device to
offload to, it could iterate through every .SO from the plugins folder calling
gomp_is_device_available routine from it.

The plugins uses target-dependent libraries and perform low-level interaction
with the device.  For example, a plugin for Intel MIC could use COI interface
for implementing libgomp callbacks.

More information about the Gcc mailing list