This is the mail archive of the mailing list for the GCC project.
Re: [PATCH,nvptx] Use CUDA driver API to select default runtime launch geometry
On 08/01/2018 09:11 PM, Cesar Philippidis wrote:
> On 08/01/2018 07:12 AM, Tom de Vries wrote:
>>>>> + gangs = grids * (blocks / warp_size);
>>>> So, we launch with gangs == grids * workers ? Is that intentional?
>>> Yes. At least that's what I've been using in og8. Setting num_gangs =
>>> grids alone caused significant slowdowns.
>> Well, what you're saying here is: increasing num_gangs increases
>> performance. You don't explain why you multiply by workers specifically.
> I set it that way because I think the occupancy calculator is
> determining the occupancy of a single multiprocessor unit, rather than
> the entire GPU. Looking at the og8 code again, I had
> num_gangs = 2 * threads_per_sm / warp_size * dev_size
> which corresponds to
> 2 * grids * blocks / warp_size
I've done an experiment using the CUDA sample simpleOccupancy. The kernel is
small, so the blocks value returned is the maximum: max_threads_per_block (1024).
The grids value returned is 10, which I tentatively interpret as num_dev *
(max_threads_per_multi_processor / blocks). [ Where num_dev == 5, and
max_threads_per_multi_processor == 2048. ]
Substituting that into the og8 code, and equating
max_threads_per_multi_processor with threads_per_sm, I indeed get
num_gangs = 2 * grids * blocks / warp_size.
So with this extra information I see how you got there.
But I still see no rationale for why blocks is used here, and I wonder
whether something like num_gangs = grids * 64 would give similar results.
Anyway, given that this is what is used on og8, I'm ok with using that,
so let's go with:
gangs = 2 * grids * (blocks / warp_size);
[ so, including the factor two you explicitly left out from the original
patch. Unless you see a pressing reason not to include it. ]
Can you repost after retesting? [ note: the updated patch I posted
earlier doesn't apply on trunk anymore due to the cuda-lib.def change. ]