
Re: Merge of HSA branch


On 11/06/2015 11:12 AM, Bernd Schmidt wrote:
> On 11/05/2015 10:51 PM, Martin Jambor wrote:
>> Individual changes are described in slightly more detail in their
>> respective messages.  If you are interested in how the HSAIL
>> generation works in general, I encourage you to have a look at my
>> Cauldron slides or presentation, only very few things have changed as
>> far as the general principles are concerned.  Let me just quickly stress
>> here that we do acceleration within a single compiler, as opposed to
>> LTO-ways of all the other accelerator teams.
> 
> Realistically we're probably not going to reject this work, but I still want to ask whether the approach was acked by the community before you started. I'm really not exactly thrilled about having two different classes of backends in the compiler, and two different ways of handling offloading.
> 
>> I also acknowledge that we should add HSA-specific tests to the GCC
>> testsuite but we are only now looking at how to do that and will
>> welcome any guidance in this regard.
> 
> Yeah, I was looking for any kind of new test, because...
> 
>> the class of OpenMP loops we can handle well is small,
> 
> I'd appreciate more information on what this means. Any examples or performance numbers?

Hello.

As Martin Jambor mentioned, this was explained in his talk at the Cauldron this year.
It can be illustrated with the following simple case:

#pragma omp target teams
#pragma omp distribute parallel for private(j)
   for (j=0; j<N; j++)
      c[j] = a[j];

This is a simple vector copy, which is going to be transformed to:

_4 = omp_data.i_1(D).D.5301;              // iteration space
_5 = __builtin_omp_get_num_threads ();
_6 = __builtin_omp_get_thread_num ();
_7 = calculate_chunk_start (_4, _5, _6);  // pseudocode
_8 = calculate_chunk_end (_4, _5, _6);    // pseudocode

for (i = _7; i < _8; i++)
  dest[i] = src[i];

and such a kernel is dispatched with the default grid size (in our case 64), so that every
work item handles a chunk of size N/64.
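
To make the chunking explicit, here is a minimal C sketch of what each of the 64
work items effectively runs; the chunk computation below only illustrates the
calculate_chunk_start/calculate_chunk_end pseudocode above and is an assumed static
split, not the exact formula libgomp uses:

static void
copy_chunk (float *dest, const float *src, int n,
            int num_threads, int thread_num)
{
  /* Split the iteration space into num_threads roughly equal chunks
     (assumed static split; the real schedule may differ).  */
  int chunk = (n + num_threads - 1) / num_threads;
  int start = thread_num * chunk;   /* calculate_chunk_start */
  int end = start + chunk;          /* calculate_chunk_end */
  if (end > n)
    end = n;

  for (int i = start; i < end; i++)
    dest[i] = src[i];
}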

On the other hand, gridification is going to transform it to:

_7 = __builtin_omp_get_thread_num ();
dest[_7] = src[_7];

and the kernel is offloaded like this:
HSA debug: GOMP_OFFLOAD_run called with grid size 10000000 and group size 0
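
In other words, the gridified kernel is launched with one work item per loop
iteration, so each work item copies a single element. A minimal C sketch of the
idea (the work_item_id parameter stands in for __builtin_omp_get_thread_num ();
this is an illustration, not the generated HSAIL):

static void
copy_one_element (float *dest, const float *src, int work_item_id)
{
  /* The grid size equals the iteration count (10000000 above),
     so this body runs exactly once per element.  */
  dest[work_item_id] = src[work_item_id];
}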

The performance numbers differ by about an order of magnitude and can be seen on slides 27-30 in [1].

Martin

[1] https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=get&target=mjambor-hsa-slides.pdf

> 
> 
> Bernd

