This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [PATCH] omp-low.c split


On 12/13/2016 04:42 AM, Martin Jambor wrote:

>> And this as well.  But omp-grid.c is fine too.
> 
> ...I prefer omp-grid.c because I plan to use gridification also for
> GCN targets, though hopefully only as an optimization rather than a
> hard requirement ...and in fact I still think it is a good
> optimization of simple loops for execution on all CUDA-like
> environments with block/thread grids because it removes conditions
> which the run-time can handle better.

Regarding gridification, is your Cauldron talk from 2015 still current,
or have there been some significant changes?

When we first started with OpenACC, we used a lot of the existing
lower_omp_for* infrastructure to handle ACC LOOPs. But there were a
couple of problems with that. First, the chunk partitioning caused a
lot of overhead, and second, because of the OpenACC execution model it
made more sense to write our own functions (lower_oacc_head_tail /
lower_oacc_reductions). In fact, during lowering GCC only marks where
the loops are. All of those markers get replaced and the loops get
optimized during the oaccdevlow pass, which runs in the target compiler.
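
For reference, the kind of loop this applies to looks something like
the following (a minimal, made-up example, not code taken from the
compiler); during lowering GCC only records the loop and its clauses,
and the actual gang/worker/vector partitioning is decided later in
oaccdevlow:

  void
  vec_add (int n, float *restrict a, float *restrict b, float *restrict c)
  {
  #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }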

Right now one of the significant bottlenecks we're experiencing on nvptx
targets is I/O. First, prior to launching a PTX kernel, libgomp
transfers each data mapping individually and synchronously. I'm
debating whether it makes sense to transfer all of those data mappings
to the accelerator asynchronously before the PTX kernel launch, with an
explicit synchronization barrier just prior to launching the kernel.
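
Roughly, the idea is something like the sketch below, using the CUDA
driver API directly (the helper and its parameters are made up for
illustration, and error checking is omitted; this is not what libgomp
currently does):

  #include <cuda.h>

  /* Queue all host->device copies on one stream and synchronize once,
     just before the kernel launch, instead of doing one blocking
     cuMemcpyHtoD per mapping.  */
  static void
  upload_mappings_async (CUstream stream, CUdeviceptr *devaddrs,
                         void **hostaddrs, size_t *sizes, int n)
  {
    for (int i = 0; i < n; i++)
      cuMemcpyHtoDAsync (devaddrs[i], hostaddrs[i], sizes[i], stream);

    /* Single barrier covering all of the copies, right before
       cuLaunchKernel.  */
    cuStreamSynchronize (stream);
  }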

Another bottleneck involves firstprivate variables. Often, those
variables are 'scalars' and consequently shouldn't need explicit data
mappings. I noticed that Jakub introduced a special
GOMP_MAP_FIRSTPRIVATE_INT, which omits data mappings for integral types
whose precision is less than or equal to that of pointers. It would
probably be beneficial to extend this to reals.
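
The trick, as I understand it, is that a pointer-sized scalar can
travel in the slot that would otherwise hold its host address, so no
device allocation or copy is needed for it at all. Very roughly (a
simplification for illustration, not the exact libgomp interface):

  #include <stdint.h>

  /* Host side: instead of mapping &x and copying sizeof (x) bytes to
     the device, the value itself is passed in the pointer-sized slot.  */
  static void *
  pack_firstprivate_int (int x)
  {
    return (void *) (uintptr_t) x;
  }

  /* Device side: the offloaded code recovers the value from the slot.  */
  static int
  unpack_firstprivate_int (void *slot)
  {
    return (int) (uintptr_t) slot;
  }

Doing the same for a float would just mean passing its bit pattern
through the same pointer-sized slot.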

The last observation is that OpenMP code in general passes a struct with
all of the data mappings to the various OMP regions/offloaded code.
That's fine, but for offloading targets, particularly nvptx, it would
probably be slightly more efficient if those OMP regions took actual
function arguments instead of a single struct. At least on nvptx
targets, in order to pass that struct to the accelerator, the runtime
must first allocate device memory for it, then copy all of the struct
contents to the device each time prior to launching a PTX kernel. A lot
of this could be bypassed because cuLaunchKernel accepts a variable
number of kernel arguments. Obviously, those arguments need to be
transferred to the accelerator one way or another, so I'm not sure yet
how beneficial this optimization would end up being.
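
As a sketch of the alternative, cuLaunchKernel takes its arguments as
an array of pointers to the individual values, so no device-side
parameter struct has to be allocated or copied (launch geometry below
is made up, error handling omitted):

  #include <cuda.h>

  static CUresult
  launch_with_args (CUfunction fn, CUstream stream,
                    CUdeviceptr a, CUdeviceptr b, int n)
  {
    /* Pointers to the individual argument values.  */
    void *args[] = { &a, &b, &n };

    return cuLaunchKernel (fn, 1, 1, 1,      /* grid dims           */
                           32, 1, 1,         /* block dims          */
                           0, stream,        /* shared mem, stream  */
                           args, NULL);      /* kernelParams, extra */
  }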

To be clear, I'm not proposing any of these changes for gcc7. Any
changes to the above will go to gomp-4_0-branch first, then we'll port
them over to gcc8.

What type of performance problems are you experiencing with HSA?

Cesar

