Differences between revisions 1 and 8 (spanning 7 versions)
Revision 1 as of 2014-07-21 19:00:17
Size: 8817
Editor: IlyaVerbin
Comment:
Revision 8 as of 2014-08-04 19:16:32
Size: 12472
Editor: IlyaVerbin
Comment: Describe address translation
Deletions are marked like this. Additions are marked like this.
Line 2: Line 2:

<<TableOfContents()>>
Line 7: Line 9:
Accel compiler — a compiler that reads intermediate representation from the special LTO sections, and generates code for the accelerator device. Accel compiler — a compiler that reads intermediate representation from the special LTO sections, and generates code for the accelerator device. Also called the "offload compiler".
Line 11: Line 13:
'''TODO:''' Describe `--enable-as-accelerator-for`, `--enable-offload-targets`, etc. The host and offload compilers need to be able to find each other. This is achieved by installing the offload compiler into special locations, and informing each about the presence of the other. All available offload compilers must first be configured with "--enable-as-accelerator-for=host-triplet", and installed into the same prefix as the host compiler. Then the host compiler is built with "--enable-offload-targets=target1,target2,..." which identifies the offload compilers that have already been built and installed.

The install locations for the offload compilers differ from those of a normal cross toolchain, by the following mapping:

|| bin/$target-gcc || `->` || bin/$host-accel-$target-gcc ||
|| lib/gcc/$target/$ver/ || `->` || lib/gcc/$host/$ver/accel/$target ||

It may be necessary to compile offload compilers with a sysroot, since otherwise install locations for libgomp could clash (maybe that library needs to move into lib/gcc/..?)

A target needs to provide a `mkoffload` tool if it wishes to be usable as an accelerator. It is installed as one of EXTRA_PROGRAMS, and the host `lto-wrapper` knows how to find it from the paths described above. `mkoffload` will invoke the offload compiler in LTO mode to produce an offload binary from the host object files, then post-process this to produce a new object file that can be linked in with the host executable. It can find the host compiler by examining the `COLLECT_GCC` environment variable, and it must take care to clear this and certain other environment variables when executing the offload compiler so as to not confuse it.
Line 19: Line 30:
 1. After `#pragma omp target` lowering and expansion, a new outlined function with the attribute `"omp declare target"` emerges — it will be later compiled both by host and accel compilers to produce two versions (or N+1 versions in case of N different accel targets).
 2. The expansion phase replaces pragmas with corresponding calls to the runtime library `libgomp`. These calls are preceded by initialization of special structures, containing arguments for outlined functions.
 3. During the `ipa_write_summaries` pass the intermediate representation of outlined functions is streamed out into the `.gnu.target_lto_*` sections of the "fat" object file. This object file also may contain `.gnu.lto_*` sections for the regular link-time optimizations.
 4. When all source files are compiled, pre-linker driver `collect2` is invoked. If linker plugin is available, `collect2` runs the linker, which loads `liblto_plugin.so`, which runs `lto-wrapper`. In case if linker plugin is not available, `collect2` runs `lto-wrapper` directly.
 5. `lto-wrapper` runs `mkoffload` for each accel target, specified during the configuration.
 6. `mkoffload` runs accel compiler, which reads IR from the `.gnu.target_lto_*` sections and compiles it for the accel target. Then `mkoffload` packs this target code (image) into the special section of a new host's object file. Also this object file has a constructor, which calls `GOMP_offload_register` from `libgomp`. The arguments are: type of the accelerator, pointers to the tables with addresses, and some other data needed by this target, like a pointer to the image.
 7. Linker adds new object files, produced by `mkoffload`s, to the list of host's input object files.
 1. After `#pragma omp target` lowering and expansion, a new outlined function with the attribute `"omp declare target"` emerges — it will be later compiled both by host and accel compilers to produce two versions (or N+1 versions in case of N different accel targets).<<BR>>
 The decls for all functions and global variables marked with `"omp declare target"` attribute are inserted into `offload_funcs` and `offload_vars` arrays.
 2. The expansion phase replaces pragmas with corresponding calls to the runtime library `libgomp` (`GOMP_target`, `GOMP_target_data` + `GOMP_target_end_data`, `GOMP_target_update`). These calls are preceded by initialization of special structures, containing arguments for outlined functions (`.omp_data_arr.*`, `.omp_data_sizes.*`, `.omp_data_kinds.*`).
 3. During the `ipa_write_summaries` pass the intermediate representation of outlined functions is streamed out into the `.gnu.target_lto_*` sections of the "fat" object file. This object file also may contain `.gnu.lto_*` sections for the regular link-time optimizations.<<BR>>
 Also the decls from `offload_funcs` and `offload_vars` are streamed out into the `.gnu.target_lto_.offload_table` section. Later an accel compiler will read this section to produce target's mapping table.<<BR>>
 Optionally, if `-flto` is present, the same decls are streamed out into the `.gnu.lto_.offload_table` section. Later the host compiler will use them to produce the final host's table with addresses.
 4. In `omp_finish_file` the addresses from `offload_funcs` and `offload_vars` are written into the `__gnu_offload_funcs` and `__gnu_offload_vars` sections correspondingly.
 5. When all source files are compiled, pre-linker driver `collect2` is invoked. If linker plugin is available, `collect2` runs the linker, which loads `liblto_plugin.so`, which runs `lto-wrapper`. In case if linker plugin is not available, `collect2` runs `lto-wrapper` directly.
 6. `lto-wrapper` runs `mkoffload` for each accel target, specified during the configuration.
 7. `mkoffload` runs accel compiler, which reads IR from the `.gnu.target_lto_*` sections and compiles it for the accel target. Then `mkoffload` packs this target code (image) into the special section of a new host's object file. The object file produced with `mkoffload` should contain a constructor that calls `GOMP_offload_register` to identify itself at run-time. Arguments to that function are a symbol called `__OPENMP_TARGET__` (provided by libgcc and unique per shared object), a target identifier, and some other data needed for a particular target (a pointer to the image, a table with information about mappings between host and offload functions and variables).
 8. Linker adds new object files, produced by `mkoffload`s, to the list of host's input object files.
Line 33: Line 48:
  * `collect2` # Pre-linker driver    * `collect2` # Pre-linker driver
Line 35: Line 50:
    * `ld` with `liblto_plugin.so` # Perform linking      * `ld` with `liblto_plugin.so` # Perform linking
Line 92: Line 107:
'''TODO:''' Describe address translation. When `#pragma omp target` is expanded, the `host_addr` of outlined function is passed to `GOMP_target`. If target device is not available, `libgomp` just performs host fallback using `host_addr`. But to run the function on target, it needs to translate `host_addr` into the corresponding `target_addr`. The idea is to have `[ host_addr, size ]` arrays in `__gnu_offload_funcs`/`vars` sections which are ordered exactly the same as corresponding `[ target_addr ]` arrays inside the target images (`size` is needed only for vars).

To keep this `host_addr -> target_addr` mapping at runtime, each device descriptor `gomp_device_descr` contains a splay tree. When `gomp_init_device` performs initialization, it walks the whole array and in each iteration picks n-th host pair `host_start`/`host_end` plus corresponding n-th target pair `tgt_start`/`tgt_end`, and inserts it into the splay tree.
Line 102: Line 119:
`GOMP_target` looks up the `host_addr` passed to it in the splay tree and passes corresponding `target_addr` to plugin's `device_run` function.
Line 108: Line 126:
/gcc_src/configure --build=x86_64-intelmic-linux-gnu --host=x86_64-intelmic-linux-gnu --target=x86_64-intelmic-linux-gnu --enable-as-accelerator-for=x86_64-pc-linux-gnu --enable-liboffloadmic=target --prefix=/install target_configargs="--enable-version-specific-runtime-libs" ../configure --build=x86_64-intelmic-linux-gnu --host=x86_64-intelmic-linux-gnu --target=x86_64-intelmic-linux-gnu --enable-as-accelerator-for=x86_64-pc-linux-gnu --prefix=/install target_configargs="--enable-version-specific-runtime-libs"
Line 114: Line 132:
export LIBOFFLOAD_TARGET_PATH="/install/lib/gcc/x86_64-intelmic-linux-gnu/4.10.0/"
/gcc_src
/configure --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --enable-offload-targets=x86_64-intelmic-linux-gnu --enable-liboffloadmic=host --prefix=/install
../configure --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --enable-offload-targets=x86_64-intelmic-linux-gnu --prefix=/install
Line 125: Line 142:
export LIBGOMP_PLUGIN_PATH="/gcc_build_host/x86_64-pc-linux-gnu/libgomp/plugins/intelmic" export LIBGOMP_PLUGIN_PATH="/install/lib64/"
Line 135: Line 152:
export LIBGOMP_PLUGIN_PATH="/gcc_build_host/x86_64-pc-linux-gnu/libgomp/plugins/intelmic" export LIBGOMP_PLUGIN_PATH="/gcc_build_host/x86_64-pc-linux-gnu/liboffloadmic/plugin/.libs"
Line 143: Line 160:
 * Do not delete build dir of the host compiler, since `make install` for `libgomp` plugin is not working yet.
 * Also there might be an issue with building `libgomp` plugin when using relative path to the `configure`.
 * In-tree testing is not supported yet with the scheme described above. I (Bernd) think that what should happen is that the host compiler configure step copies the already installed accel/ subdirectory from the install location to its build directory. (Ilya) In-tree testing works with `COMPILER_PATH` set to `/install/libexec/gcc/host/ver`, this helps `lto-wrapper` to find `mkoffload`.

OpenMP 4.0 and OpenACC offloading support in GCC

Terminology

Host compiler — a regular compiler. Not to be confused with build/host/target configure terms.

Accel compiler — a compiler that reads intermediate representation from the special LTO sections, and generates code for the accelerator device. Also called the "offload compiler".

Building host and accel compilers

The host and offload compilers need to be able to find each other. This is achieved by installing the offload compiler into special locations, and informing each about the presence of the other. All available offload compilers must first be configured with "--enable-as-accelerator-for=host-triplet", and installed into the same prefix as the host compiler. Then the host compiler is built with "--enable-offload-targets=target1,target2,..." which identifies the offload compilers that have already been built and installed.

The install locations for the offload compilers differ from those of a normal cross toolchain, by the following mapping:

bin/$target-gcc

->

bin/$host-accel-$target-gcc

lib/gcc/$target/$ver/

->

lib/gcc/$host/$ver/accel/$target

It may be necessary to compile offload compilers with a sysroot, since otherwise install locations for libgomp could clash (maybe that library needs to move into lib/gcc/..?)

A target needs to provide a mkoffload tool if it wishes to be usable as an accelerator. It is installed as one of EXTRA_PROGRAMS, and the host lto-wrapper knows how to find it from the paths described above. mkoffload will invoke the offload compiler in LTO mode to produce an offload binary from the host object files, then post-process this to produce a new object file that can be linked in with the host executable. It can find the host compiler by examining the COLLECT_GCC environment variable, and it must take care to clear this and certain other environment variables when executing the offload compiler so as to not confuse it.

Compilation process

Currently offloading works only with -flto.

Host compiler performs the following actions:

  1. After #pragma omp target lowering and expansion, a new outlined function with the attribute "omp declare target" emerges — it will be later compiled both by host and accel compilers to produce two versions (or N+1 versions in case of N different accel targets).
    The decls for all functions and global variables marked with "omp declare target" attribute are inserted into offload_funcs and offload_vars arrays.

  2. The expansion phase replaces pragmas with corresponding calls to the runtime library libgomp (GOMP_target, GOMP_target_data + GOMP_target_end_data, GOMP_target_update). These calls are preceded by initialization of special structures, containing arguments for outlined functions (.omp_data_arr.*, .omp_data_sizes.*, .omp_data_kinds.*).

  3. During the ipa_write_summaries pass the intermediate representation of outlined functions is streamed out into the .gnu.target_lto_* sections of the "fat" object file. This object file also may contain .gnu.lto_* sections for the regular link-time optimizations.
    Also the decls from offload_funcs and offload_vars are streamed out into the .gnu.target_lto_.offload_table section. Later an accel compiler will read this section to produce target's mapping table.
    Optionally, if -flto is present, the same decls are streamed out into the .gnu.lto_.offload_table section. Later the host compiler will use them to produce the final host's table with addresses.

  4. In omp_finish_file the addresses from offload_funcs and offload_vars are written into the __gnu_offload_funcs and __gnu_offload_vars sections correspondingly.

  5. When all source files are compiled, pre-linker driver collect2 is invoked. If linker plugin is available, collect2 runs the linker, which loads liblto_plugin.so, which runs lto-wrapper. In case if linker plugin is not available, collect2 runs lto-wrapper directly.

  6. lto-wrapper runs mkoffload for each accel target, specified during the configuration.

  7. mkoffload runs accel compiler, which reads IR from the .gnu.target_lto_* sections and compiles it for the accel target. Then mkoffload packs this target code (image) into the special section of a new host's object file. The object file produced with mkoffload should contain a constructor that calls GOMP_offload_register to identify itself at run-time. Arguments to that function are a symbol called __OPENMP_TARGET__ (provided by libgcc and unique per shared object), a target identifier, and some other data needed for a particular target (a pointer to the image, a table with information about mappings between host and offload functions and variables).

  8. Linker adds new object files, produced by mkoffloads, to the list of host's input object files.

Compilation with offloading using linker plugin:

  • gcc

    • cc1 # Compile first source file into plain asm + intermediate representation + IR for accel

    • as # Assemble this asm + IR into temporary object file

    • ... # Compile and assemble all remaining source files

    • collect2 # Pre-linker driver

      • collect-ld # Simple wrapper over ld

        • ld with liblto_plugin.so # Perform linking

          • lto-wrapper # Is called from liblto_plugin.so

            • gcc

              • lto1 # Perform whole program analysis and split into new partitions

            • gcc

              • lto1 # Perform local transformations in the first object file

              • as # Assemble into final object code

            • ... # Perform local transformations in each partitioned object file

            • intelmic/mkoffload # Prepare offload image for Intel MIC devices

              • accel_gcc # Read target IR from all partitions and produce target DSO

              • objcopy # Save target DSO in a special section in a new host's object file

            • .../mkoffload # Prepare images for other targets

              • ...

Compilation with offloading without linker plugin:

  • gcc

    • cc1 # Compile first source file into plain asm + intermediate representation + IR for accel

    • as # Assemble this asm + IR into temporary object file

    • ... # Compile and assemble all remaining source files

    • collect2 # Pre-linker driver

      • lto-wrapper # Run lto-wrapper directly from collect2

        • gcc

          • lto1 # Perform whole program analysis and split into new partitions

        • gcc

          • lto1 # Perform local transformations in the first object file

          • as # Assemble into final object code

        • ... # Perform local transformations in each partitioned object file

        • intelmic/mkoffload # Prepare offload image for Intel MIC devices

          • accel_gcc # Read target IR from all partitions and produce target DSO

          • objcopy # Save target DSO in a special section in a new host's object file

        • .../mkoffload # Prepare images for other targets

          • ...

      • collect-ld # Simple wrapper over ld

        • ld # Perform final linking

Runtime support in libgomp

libgomp plugins

libgomp is designed to be independent of accelerator type it work with. In order to make it possible, plugins are used, while the libgomp itself contains only a generic interface and callbacks to the plugin for invoking target-dependent functionality. Plugins are shared object, implementing a set of routines:

get_type
get_num_devices
offload_register
device_init
device_get_table
device_alloc
device_free
device_dev2host
device_host2dev
device_run

When required, libgomp performs a search in the plugins directory for the plugins named libgomp-plugin-*.so.1 and for each found file checks whether it is a proper plugin. The plugins use target-dependent libraries and perform low-level interaction with the accel devices. E.g., the plugin for Intel MIC devices uses liboffloadmic for implementing libgomp callbacks.

Address translation

When #pragma omp target is expanded, the host_addr of outlined function is passed to GOMP_target. If target device is not available, libgomp just performs host fallback using host_addr. But to run the function on target, it needs to translate host_addr into the corresponding target_addr. The idea is to have [ host_addr, size ] arrays in __gnu_offload_funcs/vars sections which are ordered exactly the same as corresponding [ target_addr ] arrays inside the target images (size is needed only for vars).

To keep this host_addr -> target_addr mapping at runtime, each device descriptor gomp_device_descr contains a splay tree. When gomp_init_device performs initialization, it walks the whole array and in each iteration picks n-th host pair host_start/host_end plus corresponding n-th target pair tgt_start/tgt_end, and inserts it into the splay tree.

Execution process

When an executable or dynamic shared object is loaded, it calls GOMP_offload_register N times, where N is number of accel images, embedded into this exec/dso. This function stores the pointers to the images and other data needed by accel plugin into offload_images.

The first call to GOMP_target, GOMP_target_data or GOMP_target_update performs corresponding device initialization: it calls device_init from the plugin, and then stores address mapping table in the splay tree.

In case of Intel MIC, device_init creates a new process on the device, and then offloads the accel images with the type == TARGET_TYPE_INTEL_MIC. All accel images, even inside the executable, represent dynamic shared objects, which are loaded into the newly created process.

GOMP_target looks up the host_addr passed to it in the splay tree and passes corresponding target_addr to plugin's device_run function.

How to try offloading enabled GCC

Currently all necessary changes could be found in a git branch named kyukhin/gomp4-offload. It contains set of patches from gomp-4_0-branch, some non-committed patches, libgomp plugin, Intel MIC runtime offload library liboffloadmic and an emulator. This emulator lies under liboffloadmic and reproduces MIC's HW and SW stack behavior allowing to run offloaded code in a separate address space using the host machine.

1. Building accel compiler:

../configure --build=x86_64-intelmic-linux-gnu --host=x86_64-intelmic-linux-gnu --target=x86_64-intelmic-linux-gnu --enable-as-accelerator-for=x86_64-pc-linux-gnu --prefix=/install target_configargs="--enable-version-specific-runtime-libs"
make
make install

2. Building host compiler:

../configure --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --enable-offload-targets=x86_64-intelmic-linux-gnu --prefix=/install
make
make install

3. Building an application:

/install/bin/gcc -fopenmp -flto test.c

4. Running an application using the emulator:

export LIBGOMP_PLUGIN_PATH="/install/lib64/"
export LD_LIBRARY_PATH="/install/lib64/"
export MIC_LD_LIBRARY_PATH="/install/lib/gcc/x86_64-intelmic-linux-gnu/4.10.0/"
./a.out

Running 'make check' (run tests using the emulator):

configure, make and *install* accel compiler (see #1)
configure and make host compiler (see #2)
export COMPILER_PATH="/install/libexec/gcc/x86_64-pc-linux-gnu/4.10.0"
export LIBGOMP_PLUGIN_PATH="/gcc_build_host/x86_64-pc-linux-gnu/liboffloadmic/plugin/.libs"
export LD_LIBRARY_PATH="/gcc_build_host/x86_64-pc-linux-gnu/liboffloadmic/.libs"
export MIC_LD_LIBRARY_PATH="/install/lib/gcc/x86_64-intelmic-linux-gnu/4.10.0/"
cd gcc_build_host
make check-target-libgomp

Known issues

  • In-tree testing is not supported yet with the scheme described above. I (Bernd) think that what should happen is that the host compiler configure step copies the already installed accel/ subdirectory from the install location to its build directory. (Ilya) In-tree testing works with COMPILER_PATH set to /install/libexec/gcc/host/ver, this helps lto-wrapper to find mkoffload.

None: Offloading (last edited 2019-02-08 18:42:03 by CatherineMoore)