This is the mail archive of the gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Offloading GSOC 2015
- From: guray ozen <guray dot ozen at gmail dot com>
- To: Kirill Yukhin <kirill dot yukhin at gmail dot com>
- Cc: Thomas Schwinge <thomas at codesourcery dot com>, tobias dot burnus at physik dot fu-berlin dot de, gcc at gcc dot gnu dot org, Jakub Jelinek <jakub at redhat dot com>, Ilya Verbin <iverbin at gmail dot com>
- Date: Sat, 28 Mar 2015 18:00:06 +0100
- Subject: Re: Offloading GSOC 2015
- Authentication-results: sourceware.org; auth=none
- References: <CA+ga0G7z+xsO8LB8oc0yv9VHFPpryaH1T2rHOudky-it3Wnu3Q at mail dot gmail dot com> <87wq2n66gj dot fsf at kepler dot schwinge dot homeip dot net> <CA+ga0G6Y60g5rhOoj310TPp4EZgUpvmv7AT1Y1UHWkKfTp-ZOQ at mail dot gmail dot com> <CA+ga0G5WuGOrAUQ_Sq_LBt6=C6uEfFt4+3c7eDwu8D5ruhMRQg at mail dot gmail dot com> <20150320144744 dot GA49928 at msticlxl57 dot ims dot intel dot com> <CA+ga0G4Xbj0MUxOBe3Eg0sYH5pd+GHtBc+1bghxV1-qZN7aiyQ at mail dot gmail dot com>
I submitted my proposal via the GSoC platform as a tiny-many project. On
the basis of Kirill's reply, I decided to work on the thread hierarchy manager.
The PDF version of the proposal can be found here :  . In short, my
proposal consists of combining dynamic parallelism, extra thread creation
in advance, and kernel splitting while generating code for GPUs. Any
comments and suggestions are welcome.
2015-03-23 13:58 GMT+01:00 guray ozen <email@example.com>:
> Hi Kirill,
> Thread hierarchy management and creation policy is a very interesting
> topic for me as well. I came across that paper a couple of weeks ago.
> Creating more threads at the beginning and applying something like a
> busy-waiting or if-master scheme generally works better than dynamic
> parallelism (DP) because of the overhead of DP. Moreover, the compiler
> may disable some optimizations when DP is enabled. The paper CUDA-NP is
> also interesting with respect to managing threads, and its idea is very
> close: create more threads in advance instead of using dynamic
> parallelism. On the other hand, DP sometimes gives better performance,
> since it allows creating a new thread hierarchy.
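> As a rough illustration of the if-master/busy-waiting idea, a minimal
> CUDA sketch (all names and sizes are illustrative, not taken from the
> proposal):

```cuda
// Sketch: the kernel is launched with the maximum thread count up front.
// Only thread 0 of each block runs the sequential (outer) part; the
// remaining threads wait at the barrier and then join for the inner
// parallel region, instead of being spawned later via dynamic parallelism.
__global__ void if_master_kernel(float *data, int m)
{
    __shared__ float tmp;

    if (threadIdx.x == 0)           // "if-master": sequential part
        tmp = data[blockIdx.x * m]; // e.g. the outer-loop body
    __syncthreads();                // extra threads wait here

    // Inner parallel region: all threads work on the innermost loop.
    for (int j = threadIdx.x; j < m; j += blockDim.x)
        data[blockIdx.x * m + j] += tmp;
}
```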
> To clarify, I prepared two examples, one using dynamic parallelism and
> one creating more threads in advance:
> *(1st example) Dynamic parallelism gives the better result.
> *(2nd example) Creating more threads in advance gives the better result.
> 1st example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop0
> *(prop0.c) has 4 nested loops
> *(prop0.c:10) puts a small array into shared memory
> *The iteration sizes of the first two loops are expressed explicitly;
> even if they only become known at run time, the PTX/SPIR can be changed
> *The iteration sizes of the last two loops are dynamic and depend on the
> first two loops' induction variables
> *(prop0.c:24 - 28) the arrays are accessed in a very inefficient way
> -If we put #pragma omp parallel for at (prop0.c:21):
> -*It will create another kernel (prop0_dynamic.cu:34)
> -*The array access pattern will change (prop0_dynamic.cu:48 - 52)
> Basically, the advantages of dynamic parallelism at this point are:
> 1- The array access pattern becomes coalesced
> 2- We can get rid of the 3rd and 4th for loops, since we can create as
> many threads as the iteration size. (little advantage in terms of thread
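> A minimal sketch of the child-kernel launch described above (names are
> hypothetical; it assumes compute capability 3.5+ and compilation with
> -rdc=true):

```cuda
// Sketch: the parent grid covers the first two loops; each parent thread
// launches a child grid sized to its dynamic inner trip count, replacing
// the 3rd/4th loops. Consecutive child threads touch consecutive
// elements, so the accesses become coalesced.
__global__ void child(float *a, int base, int len)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < len)
        a[base + k] *= 2.0f;        // coalesced: thread k touches a[base + k]
}

__global__ void parent(float *a, const int *inner_len, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int len = inner_len[i];         // trip count known only at run time
    if (len > 0)
        child<<<(len + 255) / 256, 256>>>(a, i * stride, len);
}
```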
> 2nd example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop1
> *(prop1.c) has 2 nested loops
> *The innermost loop has a reduction
> *I put 3 possible generated CUDA code examples:
> *1 - prop1_baseline.cu : only CUDA-ize prop1.c:8 and don't take account
> *2 - prop1_createMoreThread.cu : create more threads for the innermost
> loop; do the reduction with the extra threads and communicate via shared memory
> *3 - prop1_dynamic.cu : create a child kernel and communicate via
> global memory, but allocate the global memory in advance at
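> A minimal sketch of the prop1_createMoreThread.cu idea, with
> illustrative names (one block per outer iteration, the extra threads
> doing the inner reduction in shared memory):

```cuda
// Sketch: each block handles one outer iteration; the extra threads each
// accumulate a strided slice of the innermost loop, then combine their
// partial sums with a tree reduction in shared memory.
__global__ void reduce_in_block(const float *in, float *out, int m)
{
    __shared__ float sdata[256];    // blockDim.x assumed to be 256
    int i = blockIdx.x;             // one block per outer iteration

    float sum = 0.0f;
    for (int j = threadIdx.x; j < m; j += blockDim.x)
        sum += in[i * m + j];       // each thread sums a strided slice
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[i] = sdata[0];          // block master writes the result
}
```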
> The full version of prop1 calculates n-body. I benchmarked it with my
> research compiler and put the results here
> . As can be seen from that figure, the 2nd kernel has the best performance.
> When we compare these two examples, my rough idea on this issue is
> that it might be a good idea to implement an inspector, using compiler
> analysis algorithms, to decide whether dynamic parallelism should be
> used or not. That way it is also possible to avoid the extra slowdown
> caused by the compiler disabling optimizations when DP is enabled.
> Besides, there are some other cases where we can take advantage of DP,
> such as recursive algorithms. Moreover, streams can be used even though
> concurrency is not guaranteed (they also cause overhead). In addition
> to this, I can work on the if-master or busy-waiting logic.
> I am really willing to work on thread hierarchy management and
> creation policy. If it is interesting for GCC, how can I make progress
> on this topic?
> By the way, I haven't worked on #pragma omp simd. It could be mapped to
> warps (if there are no dependencies among loop iterations). On the
> NVIDIA side, since threads in the same warp can read each other's data
> with __shfl, the data clauses could be used to enhance performance. (Not sure.)
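> For instance, a warp-level reduction with __shfl might look like this
> sketch (assuming a 32-lane warp; on later CUDA versions the call is
> __shfl_down_sync with a lane mask):

```cuda
// Sketch: lanes of one warp exchange partial sums via register shuffles,
// so a simd-style reduction needs no shared memory at all.
__device__ float warp_reduce(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);   // lane i adds lane i+offset's value
    return val;                            // lane 0 ends up with the warp's sum
}
```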
>  - http://people.engr.ncsu.edu/hzhou/ppopp_14_1.pdf
>  - http://link.springer.com/chapter/10.1007%2F978-3-319-11454-5_16
> Güray Özen
> 2015-03-20 15:47 GMT+01:00 Kirill Yukhin <firstname.lastname@example.org>:
>> Hello Güray,
>> On 20 Mar 12:14, guray ozen wrote:
>>> I've started to prepare my gsoc proposal for gcc's openmp for gpus.
>> I think there is a wide range for exploration here. As you know, OpenMP 4
>> contains vectorization pragmas (`pragma omp simd') which do not perfectly
>> suit GPGPU.
>> Another problem is how to create threads dynamically on a GPGPU. As far as
>> we understand it, there are two possible solutions:
>> 1. Use the dynamic parallelism available in recent APIs (launch a new kernel from
>> 2. Estimate the maximum thread number on the host and start them all from
>> the host, making the unused threads busy-wait
>> There's a paper which investigates both approaches , .
>>> However, I'm a little bit confused about which of the ideas I mentioned
>>> in my last mail I should propose, or which one of them is interesting
>>> for GCC. I'm willing to work on data clauses to enhance the performance
>>> of shared memory. Or maybe it might be interesting to work on the
>>> OpenMP 4.1 draft version. Which idea do you think I should propose?
>> We're going to work on OpenMP 4.1 offloading features.
>>  - http://openmp.org/sc14/Booth-Sam-IBM.pdf
>>  - http://dl.acm.org/citation.cfm?id=2688364
>> Thanks, K