[C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

Tue Jul 16 16:40:00 GMT 2019

Hi

 Â Â Â  I eventually spent much more time working on the inplace_merge 
performance bench.

 Â Â Â  And the results do not confirm the theory:

Before patch:
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 1 / 1 memoryÂ Â Â Â Â Â  Â Â Â  Â 243rÂ  227uÂ Â  
17sÂ Â Â Â Â  1216memÂ Â Â  5pf
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 1 / 4 memoryÂ Â Â Â Â Â  Â Â Â  Â 297r 278uÂ Â  
18sÂ Â Â Â Â Â  480memÂ Â Â  0pf
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 1 /64 memoryÂ Â Â Â Â Â  Â Â Â  373r 354uÂ Â  
18sÂ Â Â Â Â Â  480memÂ Â Â  0pf
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 0 / 1 memoryÂ Â Â Â Â Â  Â Â Â Â  12r 11uÂ Â Â Â  
0sÂ Â Â Â Â Â  480memÂ Â Â  0pf

After the patch to reduce memory allocation:
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 1 / 1 memoryÂ Â Â Â Â Â  Â Â Â  Â 245rÂ  227uÂ Â  
18sÂ Â Â Â Â  1216memÂ Â Â  0pf
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 1 / 4 memoryÂ Â Â Â Â Â  Â Â Â  Â 292r 273uÂ Â  
19sÂ Â Â Â Â Â  480memÂ Â Â  0pf
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 1 /64 memoryÂ Â Â Â Â Â  Â Â Â  373r 356uÂ Â  
17sÂ Â Â Â Â Â  480memÂ Â Â  0pf
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 0 / 1 memoryÂ Â Â Â Â Â  Â Â Â Â  11r 11uÂ Â Â Â  
0sÂ Â Â Â Â Â  480memÂ Â Â  0pf

With the __len1 > __len2 condition change:
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 1 / 1 memoryÂ Â Â Â Â Â  Â Â Â  Â 244rÂ  225uÂ Â  
20sÂ Â Â Â Â  1216memÂ Â Â  0pf
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 1 / 4 memoryÂ Â Â Â Â Â  Â Â Â  Â 379r 361uÂ Â  
17sÂ Â Â Â Â Â  480memÂ Â Â  0pf
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 1 /64 memoryÂ Â Â Â Â Â  Â Â Â  Â 640r 625uÂ Â  
16sÂ Â Â Â Â Â  480memÂ Â Â  0pf
inplace_merge.ccÂ Â Â Â Â Â Â Â  Â Â Â  bench 0 / 1 memoryÂ Â Â Â Â Â  Â Â Â  Â  11r 11uÂ Â Â  
0sÂ Â Â Â Â Â  480memÂ Â Â  0pf

When there is no memory limit the results are identical of course. 
Otherwise as soon as memory is limited performance starts to decrease 
with the condition change on __len1 vs __len2.

Could you give the bench you use to demonstrate the enhancement ? I also 
wonder why your patch doesn't change consistently the same condition in 
__merge_without_buffer ?

For the moment I'd like to propose the attached patch that is to say the 
reduction on the amount of allocated memory and the new/modified benches.

Note that as soon as I forbid any memory allocation I also reduce the 
size of the range to merge cause the implementation rely on recursivity 
and so could end-up in a stack overflow. Maybe I need to check for 
simulator targets like several other tests ? Unless simulators do not 
run the performance tests ?

Regarding this stack overflow issue, is there some routine to find out 
how many levels of function calls can be added before reaching the stack 
overflow ? I could perhaps call __builtin_alloca and check the result 
but that smells. If I could find out this we could fallback on an 
iterative approach to complete the merge.

 Â Â Â  PR libstdc++/83938
 Â Â Â  * include/bits/stl_algo.h:
 Â Â Â  (__inplace_merge): Take temporary buffer length from smallest range.
 Â Â Â  (__stable_sort): Limit temporary buffer length.
 Â Â Â  * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
 Â Â Â  computation in the loop.
 Â Â Â  * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
 Â Â Â  pivot positions.
 Â Â Â  * testsuite/performance/25_algorithms/inplace_merge.cc: New.
 Â Â Â  * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
 Â Â Â  execution with different memory constraints.

Ok to commit ?

FranÃ§ois

On 6/9/19 4:27 PM, FranÃ§ois Dumont wrote:
> On 12/21/18 9:57 PM, Jonathan Wakely wrote:
>> On 29/10/18 07:06 +0100, FranÃ§ois Dumont wrote:
>>> Hi
>>>
>>> Â Â Â  Some feedback regarding this patch ?
>>
>> Sorry this got missed, please resubmit during stage 1.
>>
>> You haven't CC'd the original patch author (chang jc) to give them a
>> chance to comment on your proposed changes to the patch.
>>
>> The attached PDF on PR libstdc++/83938 has extensive discussion of the
>> performance issue, but I don't see any for your version. Some
>> performance benchmarks for your version would be helpful.
>
> Here is this patch again.
>
> This time it is much closer to John's original one, I just kept my 
> change on the size of the temporary buffer which doesn't need to be as 
> large as it is currently allocated, especially with John's patch.
>
> The performance tests are showing the good improvements, attached are 
> the before/after. Surprisingly I do not see any change regarding 
> allocated memory, I guess measures are not accurate enough. However 
> working on them also show that current implementation is fragile. If 
> you reduce the allowed allocated memory too much you end up with a 
> stack overflow because of the recursive implementation.
>
> I already have a patch to keep on trying to allocate memory as long as 
> above a given threshold rather than 0 but I am also working on making 
> implementation non-recursive. We'll see if even a buffer of size 1 is 
> better than no buffer at all then.
>
> Â Â Â  PR libstdc++/83938
> Â Â Â  * include/bits/stl_algo.h:
> Â Â Â  (__merge_adaptive): When buffer too small consider smallest range 
> first
> Â Â Â  and cut based on buffer size.
> Â Â Â  (__inplace_merge): Take temporary buffer length from smallest range.
> Â Â Â  (__stable_sort): Limit temporary buffer length.
> Â Â Â  * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
> Â Â Â  to reduce requested buffer length on allocation failure.
> Â Â Â  * testsuite/performance/25_algorithms/inplace_merge.cc: New.
> Â Â Â  * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
> Â Â Â  execution with different level of memory.
>
> FranÃ§ois
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pr83938.patch
Type: text/x-patch
Size: 15992 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/libstdc++/attachments/20190716/6ffa3ec2/attachment.bin>