[C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

François Dumont frs.dumont@gmail.com
Tue Jul 16 16:40:00 GMT 2019


Hi

     I eventually spent much more time working on the inplace_merge 
performance bench.

     And the results do not confirm the theory:

Before patch:
inplace_merge.cc             bench 1 / 1 memory            243r  227u   
17s      1216mem    5pf
inplace_merge.cc             bench 1 / 4 memory            297r 278u   
18s       480mem    0pf
inplace_merge.cc             bench 1 /64 memory           373r 354u   
18s       480mem    0pf
inplace_merge.cc             bench 0 / 1 memory            12r 11u     
0s       480mem    0pf

After the patch to reduce memory allocation:
inplace_merge.cc             bench 1 / 1 memory            245r  227u   
18s      1216mem    0pf
inplace_merge.cc             bench 1 / 4 memory            292r 273u   
19s       480mem    0pf
inplace_merge.cc             bench 1 /64 memory           373r 356u   
17s       480mem    0pf
inplace_merge.cc             bench 0 / 1 memory            11r 11u     
0s       480mem    0pf

With the __len1 > __len2 condition change:
inplace_merge.cc             bench 1 / 1 memory            244r  225u   
20s      1216mem    0pf
inplace_merge.cc             bench 1 / 4 memory            379r 361u   
17s       480mem    0pf
inplace_merge.cc             bench 1 /64 memory            640r 625u   
16s       480mem    0pf
inplace_merge.cc             bench 0 / 1 memory             11r 11u    
0s       480mem    0pf

When there is no memory limit the results are identical of course. 
Otherwise as soon as memory is limited performance starts to decrease 
with the condition change on __len1 vs __len2.

Could you give the bench you use to demonstrate the enhancement ? I also 
wonder why your patch doesn't change consistently the same condition in 
__merge_without_buffer ?

For the moment I'd like to propose the attached patch that is to say the 
reduction on the amount of allocated memory and the new/modified benches.

Note that as soon as I forbid any memory allocation I also reduce the 
size of the range to merge cause the implementation rely on recursivity 
and so could end-up in a stack overflow. Maybe I need to check for 
simulator targets like several other tests ? Unless simulators do not 
run the performance tests ?

Regarding this stack overflow issue, is there some routine to find out 
how many levels of function calls can be added before reaching the stack 
overflow ? I could perhaps call __builtin_alloca and check the result 
but that smells. If I could find out this we could fallback on an 
iterative approach to complete the merge.

     PR libstdc++/83938
     * include/bits/stl_algo.h:
     (__inplace_merge): Take temporary buffer length from smallest range.
     (__stable_sort): Limit temporary buffer length.
     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
     computation in the loop.
     * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
     pivot positions.
     * testsuite/performance/25_algorithms/inplace_merge.cc: New.
     * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
     execution with different memory constraints.

Ok to commit ?

François

On 6/9/19 4:27 PM, François Dumont wrote:
> On 12/21/18 9:57 PM, Jonathan Wakely wrote:
>> On 29/10/18 07:06 +0100, François Dumont wrote:
>>> Hi
>>>
>>>     Some feedback regarding this patch ?
>>
>> Sorry this got missed, please resubmit during stage 1.
>>
>> You haven't CC'd the original patch author (chang jc) to give them a
>> chance to comment on your proposed changes to the patch.
>>
>> The attached PDF on PR libstdc++/83938 has extensive discussion of the
>> performance issue, but I don't see any for your version. Some
>> performance benchmarks for your version would be helpful.
>
> Here is this patch again.
>
> This time it is much closer to John's original one, I just kept my 
> change on the size of the temporary buffer which doesn't need to be as 
> large as it is currently allocated, especially with John's patch.
>
> The performance tests are showing the good improvements, attached are 
> the before/after. Surprisingly I do not see any change regarding 
> allocated memory, I guess measures are not accurate enough. However 
> working on them also show that current implementation is fragile. If 
> you reduce the allowed allocated memory too much you end up with a 
> stack overflow because of the recursive implementation.
>
> I already have a patch to keep on trying to allocate memory as long as 
> above a given threshold rather than 0 but I am also working on making 
> implementation non-recursive. We'll see if even a buffer of size 1 is 
> better than no buffer at all then.
>
>     PR libstdc++/83938
>     * include/bits/stl_algo.h:
>     (__merge_adaptive): When buffer too small consider smallest range 
> first
>     and cut based on buffer size.
>     (__inplace_merge): Take temporary buffer length from smallest range.
>     (__stable_sort): Limit temporary buffer length.
>     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
>     to reduce requested buffer length on allocation failure.
>     * testsuite/performance/25_algorithms/inplace_merge.cc: New.
>     * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
>     execution with different level of memory.
>
> François
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pr83938.patch
Type: text/x-patch
Size: 15992 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/libstdc++/attachments/20190716/6ffa3ec2/attachment.bin>


More information about the Libstdc++ mailing list