[RFC] Offloading Support in libgomp

Jakub Jelinek jakub@redhat.com
Mon Sep 16 09:35:00 GMT 2013


On Fri, Sep 13, 2013 at 01:34:43PM +0400, Michael Zolotukhin wrote:
> 1.2.  Linking
> 
> When all source files are compiled, a linker is invoked.  The linker is passed
> a special option to invoke openmp-plugin.  The plugin is responsible for
> producing target-side executables - for each target it calls the corresponding
> target compiler and linker.

I thought the idea was to just use the LTO plugin for that, teach it to
handle the .gnu.target_lto* sections specially.

> The target-side GCC is invoked to load Gimple IR from .gnu.target_lto sections
> of the FAT-object and compile it to target-side objects which later will be
> used by target-side linker.
> 
> The host-side linker needs libgomp along side with standard libraries like
> libc/libm to successfully resolve symbols, generated by the host compiler.  The
> target-side linker needs CRT.O, containing main-routine for target-side
> executable and target-specific versions of standard libraries.

I'd say main shouldn't be linked into the shared libraries (well, for MIC)
that you put into the binaries resp. shared libraries, but into special
section inside of libgomp.so from which the runtime would upload it into the
binary.  Because, if you have main in every shared library, what do you do
if you have multiple shared libraries with offloading code in it?
What the plugin will need to do is for each of the shared libraries in the
link, extract from the special sections the embedded target shared libraries
(or executables) into temporary files and pass that to the target linker, so
that if you have say
#pragma omp declare target
extern int foo;
extern void bar (void);
#pragma omp declare target
in one shared library and the definitions thereof in a different one, you
can link that together.

> 1.3.  Execution
> 
> Host-side executable contains calls to libgomp library, which interfaces all
> interactions with target-devices.
> On loading, the executable calls GOMP_target_init from libgomp.so, which will

No.  The first call to omp_get_num_devices, GOMP_target, GOMP_target_data
or GOMP_target_update (using pthread_once) scans for the available target
devices, there is no GOMP_target_init, and the first GOMP_target,
GOMP_target_data or GOMP_target_update from a particular shared library or
binary (all of them will have __OPENMP_TARGET__ weak hidden symbol as one of
the arguments) offloads the embedded shared library into target (resp.
compiles HSAIL/PTX and uploads or whatever).

> load the target executables onto target-devices and start them.  Since this
> moment, the devices are ready to execute requested code and interact with the
> main host-process.
> 
> When a host-side program calls libgomp functions related to the offloading,
> libgomp decides, whether it's profitable to offload, and which device to choose
> for that.  In order to do that, libgomp calls available plugins and checks
> which devices are ready to execute offloaded code.  Available plugins should be
> located in a specified folder and should implement a certain interface.
> 
> Another important function of libgomp is host-target memory mapping and keeping
> information about mapped regions and their types.

The only "type" info needed is the copy_from flag, all the rest happens on
mapping memory to the device, so don't need to be tracked afterwards.

> TBD: probably, it's better to 'hard-code' available plugin during build of
> libgomp (e.g., at configure step).

That is certainly doable.

> 2.  LINKER PLUGINS INFRASTRUCTURE
> 
> 2.1.  Overview
> 
> When -flto or -fopenmp option is given to the GCC driver, linker plugin
> invocation is triggered.  The plugin claims the input files containing
> .gnu.lto* or .gnu.target_lto* sections for further processing and creates
> resolutions file.
> After this preliminary work, LTO-wrapper is called.  It is responsible for
> sequential calls of GCC.
> 
> The first call is needed to run WPA, which performs usual LTO partitioning as
> well as partitioning of OpenMP-target sections.  WPA reads bytecode of:
>   1) all functions and variables with "omp declare target" attribute;
>   2) the outlined bodies of #pragma omp target turned into '*.ompfn' functions;

1) and 2) is basically the same, because omp expansion adds "omp declare
target" attribute to the outlined bodies too.

I don't see why you want a WPA phase, at least when not also -flto.
IMNSHO you want to compile each .gnu.target_lto* set of input sections
individually, using one target compiler driver invocation, that will
generate object files and you just link them together.

> 3.  OPENMP PRAGMA TARGET HANDLING IN MIDDLE-END
> 
> Middle end work is done in two omp passes.  Specifically, omp-lower pass:
>   * Creates outlined function with no body
>   * Adds #pragma omp return in the end of the region
>   * Creates empty struct args_data
>   * For each var referenced in clauses  (e.g. int i):
>     -  Adds entry to data_arr, data_sizes and data_kind arrays describing this
>        variable, its size and mapping type
>     -  Adds assignment before call to outlined function : args_data.i = &i
>     -  Replace uses of i with uses of args_data->i inside the region
> 
> Then, omp-expand pass:
>   * Moves the region to the outlined function
>   * Adds a call to libGOMP to maybe offload this function:
>     GOMP_target (condition /* evaluated expression from IF clause */,
> 		 device_no /* a number from DEVICE clause */,
> 		 foo, .foo.,
> 		 data_arr, data_size, data_kinds, 1 /* size of arrays */);

Except for the #pragma omp declare target arrays creation all this is
implemented, the GOMP_target arguments have different order and some aren't
present, just the current fnname argument will be in fact __OPENMP_TARGET__,
i.e. either NULL, or address of a section which contains some info on what
offloading targets are supported in the current binary resp. shared library,
where to find them, where to find their mapping etc.

> GOMP_target routine takes additional arguments:
>   * Address of the host version of outlined function.  It is used when runtime
> decides to perform host fallback instead of offloading to an accelerator.
>   * Name of the target version of outlined function.  This is used when runtime
> decides to offload.  It cannot directly call a function on a target device, so
> it calls the corresponding plugin and gives it a function name to invoke.

See above and my earlier mail why a name is a bad idea.  You will look up
the { fnaddr, fnaddr + 1 } address range in the target mapping structure
instead.

> GOMP_target, GOMP_target_data, GOMP_target_data_end, GOMP_target_update routines
> performs maintaining of a global structure describing current mapping, which
> will be covered in the next section, and an actual data marshalling:
>   * GOMP_target copies regions with kind TO or TOFROM to device before
> offloading and copies regions with kind FROM or TOFROM from the device when the
> offloading is finished.  In case of host fallback no copying is performed.
>   * GOMP_target_data copies regions with kind TO or TOFROM to the device.
>   * GOMP_target_data_end copies regions with kind FROM or TOFROM from the
> device.
>   * GOMP_target_data_update copies all regions according to their types to and
> from the device.

All this is implemented, just it will need to be changed to use device hooks
to perform the actual target allocation/deallocation/copyto/copyfrom
operations.
> 
> 4.2.  Maintaining info about mapped regions
> 
> Every libGOMP routine dealing with mapped memory regions is responsible for an
> accurate maintaining of a global data structure describing this mapping.  This
> data structure is a binary search tree containing structures 
> struct
>   {
>     void *host_address;
>     void *target_address;
>     size_t region_size;
>     enum {TO, TOFROM, FROM, ALLOC} region_type;
>   }
> with host addresses used as a key.
> 
> The data structure allows to check whether a given host address is mapped, or
> not.  In order to do that, on every request it needs to find out whether the
> requested interval is covered with already mapped ones and check if all of them
> have a corresponding type.

This is again already implemented, just the splay tree and lock will need
moving from a global variable into the device descriptor.

	Jakub



More information about the Gcc mailing list