[PATCH 0/8] NVPTX offloading to NVPTX: backend patches

Jakub Jelinek jakub@redhat.com
Wed Oct 19 12:19:00 GMT 2016


On Tue, Oct 18, 2016 at 07:58:49PM +0300, Alexander Monakov wrote:
> On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> > The performance I saw was lower by a factor of 80 or so compared to their CUDA
> > version, and even lower than OpenMP on the host.
> 
> The currently published OpenMP version of LULESH simply doesn't use openmp-simd
> anywhere. This should make it obvious that it won't be anywhere near any
> reasonable CUDA implementation, and also bound to be below host performance.
> Besides, it's common for such benchmark suites to have very different levels of
> hand tuning for the native-CUDA implementation vs OpenMP implementation,
> sometimes to the point of significant algorithmic differences. So you're
> making an invalid comparison here.

This is related to the discussions about an independent clause/construct (or
whatever other name it ends up with).  The problem with LULESH's
#pragma omp distribute parallel for
rather than
#pragma omp distribute parallel for simd
is that the loop body usually calls (possibly inlined) functions, and
distribute parallel for, even with the implementation-defined default for the
schedule() clause, does not simply let the implementation pick whatever
distribution between teams/threads/simd it likes.  For loops that don't call
any functions we can scan the loop body and figure out whether it could,
e.g. through various omp_* calls, observe anything that would reveal how the
iterations are distributed among teams/threads/simd, but for loops that can
call other functions that is hard to do, especially as early as during OpenMP
lowering/expansion.
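
As an illustration (a sketch of mine, not taken from LULESH; helper, foo, a
and n are made-up names), the problematic shape is:

/* Compile with -fopenmp.  helper is opaque at OpenMP lowering time; it
   might legitimately call omp_get_thread_num () or omp_get_team_num ()
   and behave differently depending on how the iterations were assigned,
   so the compiler cannot freely redistribute them across
   teams/threads/simd lanes.  */
extern void helper (double *a, int i);

void
foo (double *a, int n)
{
#pragma omp target teams distribute parallel for
  for (int i = 0; i < n; i++)
    helper (a, i);
}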
OpenMP 5.0 is likely going to have some clause or construct that simply
asserts the loop iterations are completely independent, but until then the
programmer has to use the more prescriptive pragmas and be careful about
what exactly they ask for.
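
For now, the careful spelling is the fully combined construct including
simd, which is only usable when the body is simple enough, e.g. (again
just a sketch with made-up names):

void
bar (double *a, int n)
{
#pragma omp target teams distribute parallel for simd
  for (int i = 0; i < n; i++)
    a[i] *= 2.0;  /* No calls: nothing in the body can observe the
                     teams/threads/simd distribution, so the
                     implementation is free to choose it.  */
}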

But we certainly should collect some OpenMP/OpenACC offloading benchmarks,
or write our own, and use those to compare GCC against other compilers.

	Jakub


