This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Offloading GSOC 2015

Hi All,

I submitted my proposal via the GSoC platform as a tiny-many project. On
the basis of Kirill's reply, I decided to work on the thread hierarchy
manager. The PDF version of the proposal can be found here: [1]. In
short, my proposal combines dynamic parallelism, creating extra threads
in advance, and kernel splitting when generating code for GPUs.
Comments and suggestions are welcome.


Güray Özen

2015-03-23 13:58 GMT+01:00 guray ozen <>:
> Hi Kirill,
> Thread hierarchy management and creation policy is a very interesting
> topic for me as well. I came across that paper a couple of weeks ago.
> Creating more threads at the start and applying something like a
> busy-waiting or if-master scheme generally works better than dynamic
> parallelism because of the overhead of DP. Moreover, the compiler might
> disable some optimizations when DP is enabled. The paper CUDA-NP [1] is
> also interesting with respect to managing threads, and its idea is very
> close: create more threads in advance instead of using dynamic
> parallelism. On the other hand, DP sometimes performs better, since it
> lets us create a new thread hierarchy.
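> To make the trade-off concrete, here is a minimal dynamic-parallelism
> sketch (hypothetical kernel names; it assumes compute capability >= 3.5
> and compilation with -rdc=true):
>
> ```cuda
> // Child kernel: one thread per inner iteration.
> __global__ void child(float *row, int m) {
>     int j = blockIdx.x * blockDim.x + threadIdx.x;
>     if (j < m)
>         row[j] *= 2.0f;
> }
>
> // Parent kernel: each thread launches a child grid whose size
> // depends on its own induction variable, so the inner thread
> // hierarchy is created at run time, on the device.
> __global__ void parent(float *a, int n, int m) {
>     int i = blockIdx.x * blockDim.x + threadIdx.x;
>     if (i < n)
>         child<<<(m + 255) / 256, 256>>>(&a[i * m], m);
> }
> ```
>
> Every such launch goes through the device-side runtime, which is where
> the DP overhead mentioned above comes from.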
> To clarify, I prepared two examples comparing dynamic parallelism
> with creating more threads in advance:
> *(1st example) dynamic parallelism gives the better result.
> *(2nd example) creating more threads in advance gives the better result.
> 1st example:
> *(prop0.c) has 4 nested loops.
> *(prop0.c:10) puts a small array into shared memory.
> *The iteration sizes of the first two loops are expressed explicitly.
> Even if they only become known at run time, the PTX/SPIR can be changed.
> *The iteration sizes of the last two loops are dynamic and depend on
> the first two loops' induction variables.
> *(prop0.c:24 - 28) the arrays are accessed in a very inefficient
> (non-coalesced) way.
> -If we put a #pragma omp parallel for at (prop0.c:21):
> -*it will create another kernel (
> -*the array access pattern will change ( - 52)
> Basically, the advantages of dynamic parallelism at this point are:
> 1- the array access pattern becomes coalesced;
> 2- we can get rid of the 3rd and 4th for loops, since we can create
> one thread per iteration (a small advantage in terms of thread
> divergence).
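> The access-pattern change can be sketched like this (hypothetical
> kernel names, assuming a row-major n x m array):
>
> ```cuda
> // Before splitting: thread i walks a whole row, so at each step
> // the threads of a warp touch addresses m elements apart
> // (non-coalesced).
> __global__ void before(float *a, int n, int m) {
>     int i = blockIdx.x * blockDim.x + threadIdx.x;
>     if (i < n)
>         for (int j = 0; j < m; ++j)
>             a[i * m + j] += 1.0f;
> }
>
> // After splitting into a child kernel: consecutive threads j touch
> // consecutive addresses of one row (coalesced), and the j-loop
> // disappears because one thread is created per iteration.
> __global__ void after(float *row, int m) {
>     int j = blockIdx.x * blockDim.x + threadIdx.x;
>     if (j < m)
>         row[j] += 1.0f;
> }
> ```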
> 2nd example:
> *has 2 nested loops.
> *The innermost loop has a reduction.
> *I put three possible generated CUDA code examples:
> *1 - : only CUDA-ize prop1.c:8 and don't take prop1.c:12
> into account.
> *2 - : create more threads for the innermost loop, do the
> reduction with the extra threads, and communicate through shared
> memory.
> *3 - : create a child kernel and communicate through global
> memory, but allocate the global memory in advance at
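> Variant 2 is essentially a block-level tree reduction. A minimal
> sketch (hypothetical names; one block per outer iteration, blockDim.x
> a power of two, launched with blockDim.x * sizeof(float) of dynamic
> shared memory):
>
> ```cuda
> __global__ void inner_reduce(const float *a, float *out, int m) {
>     extern __shared__ float buf[];   // the extra threads communicate here
>     int tid = threadIdx.x;
>     // Each extra thread accumulates a strided slice of the inner loop.
>     float sum = 0.0f;
>     for (int j = tid; j < m; j += blockDim.x)
>         sum += a[blockIdx.x * m + j];
>     buf[tid] = sum;
>     __syncthreads();
>     // Tree reduction in shared memory.
>     for (int s = blockDim.x / 2; s > 0; s >>= 1) {
>         if (tid < s)
>             buf[tid] += buf[tid + s];
>         __syncthreads();
>     }
>     if (tid == 0)
>         out[blockIdx.x] = buf[0];
> }
> ```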
> The full version of prop1 calculates n-body. I benchmarked it with my
> research compiler [2] and put the results here
> . As can be seen from that figure, the 2nd kernel has the best performance.
> Comparing these two examples, my rough idea on this issue is that it
> might be worthwhile to implement an inspector, based on compiler
> analysis, to decide whether or not dynamic parallelism should be used.
> That way we can also avoid the extra slowdown caused by the compiler
> disabling optimizations when DP is enabled. Besides, there are other
> cases where we can take advantage of DP, such as recursive algorithms.
> Moreover, streams are available even though concurrency is not
> guaranteed (and they also cause overhead). In addition, I can work on
> the if-master or busy-waiting logic.
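> The if-master logic amounts to launching the maximum thread count from
> the host up front and masking the sequential parts, instead of
> launching new threads later via DP. A sketch (hypothetical names):
>
> ```cuda
> // All threads already exist; only the "master" thread of each block
> // executes the sequential region, while the extra threads wait at
> // the barrier rather than being created by a device-side launch.
> __global__ void if_master(float *a, float *scale, int n) {
>     if (threadIdx.x == 0)
>         scale[blockIdx.x] = 2.0f;   // sequential part: master only
>     __syncthreads();                // the other threads wait here
>     int i = blockIdx.x * blockDim.x + threadIdx.x;
>     if (i < n)                      // parallel part: every thread
>         a[i] *= scale[blockIdx.x];
> }
> ```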
> I am really keen to work on thread hierarchy management and creation
> policy. If it is interesting for GCC, how can I make progress on this
> topic?
> By the way, I haven't worked on #pragma omp simd yet. It could be mapped
> to warps (if there are no dependences among the loop iterations). On
> the NVIDIA side, since threads in the same warp can read each other's
> data with __shfl, the data clauses could be used to enhance
> performance. (Not sure.)
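> For that warp mapping, a reduction within one warp needs no shared
> memory at all. A sketch (on CUDA 9+ the _sync variant replaces the
> plain __shfl intrinsics; assumes a full 32-lane warp):
>
> ```cuda
> // Warp-level reduction: each lane reads a neighbouring lane's
> // register directly with __shfl_down_sync, so no shared memory and
> // no __syncthreads() are needed within the warp.
> __device__ float warp_sum(float v) {
>     for (int offset = 16; offset > 0; offset >>= 1)
>         v += __shfl_down_sync(0xffffffffu, v, offset);
>     return v;   // lane 0 ends up holding the warp's total
> }
> ```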
> [1] -
> [2] -
> Güray Özen
> ~grypp
> 2015-03-20 15:47 GMT+01:00 Kirill Yukhin <>:
>> Hello Güray,
>> On 20 Mar 12:14, guray ozen wrote:
>>> I've started to prepare my gsoc proposal for gcc's openmp for gpus.
>> I think there is a wide range for exploration here. As you know, OpenMP 4
>> contains vectorization pragmas (`pragma omp simd') which do not perfectly
>> suit GPGPU.
>> Another problem is how to create threads dynamically on GPGPU. As far as
>> we understand it there're two possible solutions:
>>   1. Use dynamic parallelism available in recent API (launch new kernel from
>>   target)
>>   2. Estimate maximum thread number on host and start them all from host,
>>   making unused threads busy-waiting
>> There are papers which investigate both approaches [1], [2].
>>> However, I'm a little bit confused about which of the ideas I
>>> mentioned in my last mail I should propose, and which of them is
>>> interesting for GCC. I'm willing to work on the data clauses to
>>> enhance the performance of shared memory. Or maybe it might be
>>> interesting to work on the OpenMP 4.1 draft version. How do you think
>>> I should propose the idea?
>> We're going to work on OpenMP 4.1 offloading features.
>> [1] -
>> [2] -
>> --
>> Thanks, K
