[C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)
François Dumont
frs.dumont@gmail.com
Tue Jul 16 16:40:00 GMT 2019
Hi
   I eventually spent much more time working on the inplace_merge
performance bench.
   And the results do not confirm the theory:
Before patch:
inplace_merge.cc           bench 1 / 1 memory          243r 227u Â
17s     1216mem   5pf
inplace_merge.cc           bench 1 / 4 memory          297r 278u Â
18s      480mem   0pf
inplace_merge.cc           bench 1 /64 memory         373r 354u Â
18s      480mem   0pf
inplace_merge.cc           bench 0 / 1 memory          12r 11u   Â
0s      480mem   0pf
After the patch to reduce memory allocation:
inplace_merge.cc           bench 1 / 1 memory          245r 227u Â
18s     1216mem   0pf
inplace_merge.cc           bench 1 / 4 memory          292r 273u Â
19s      480mem   0pf
inplace_merge.cc           bench 1 /64 memory         373r 356u Â
17s      480mem   0pf
inplace_merge.cc           bench 0 / 1 memory          11r 11u   Â
0s      480mem   0pf
With the __len1 > __len2 condition change:
inplace_merge.cc           bench 1 / 1 memory          244r 225u Â
20s     1216mem   0pf
inplace_merge.cc           bench 1 / 4 memory          379r 361u Â
17s      480mem   0pf
inplace_merge.cc           bench 1 /64 memory          640r 625u Â
16s      480mem   0pf
inplace_merge.cc           bench 0 / 1 memory          11r 11u  Â
0s      480mem   0pf
When there is no memory limit the results are identical of course.
Otherwise as soon as memory is limited performance starts to decrease
with the condition change on __len1 vs __len2.
Could you give the bench you use to demonstrate the enhancement ? I also
wonder why your patch doesn't change consistently the same condition in
__merge_without_buffer ?
For the moment I'd like to propose the attached patch that is to say the
reduction on the amount of allocated memory and the new/modified benches.
Note that as soon as I forbid any memory allocation I also reduce the
size of the range to merge cause the implementation rely on recursivity
and so could end-up in a stack overflow. Maybe I need to check for
simulator targets like several other tests ? Unless simulators do not
run the performance tests ?
Regarding this stack overflow issue, is there some routine to find out
how many levels of function calls can be added before reaching the stack
overflow ? I could perhaps call __builtin_alloca and check the result
but that smells. If I could find out this we could fallback on an
iterative approach to complete the merge.
   PR libstdc++/83938
   * include/bits/stl_algo.h:
   (__inplace_merge): Take temporary buffer length from smallest range.
   (__stable_sort): Limit temporary buffer length.
   * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
   computation in the loop.
   * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
   pivot positions.
   * testsuite/performance/25_algorithms/inplace_merge.cc: New.
   * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
   execution with different memory constraints.
Ok to commit ?
François
On 6/9/19 4:27 PM, François Dumont wrote:
> On 12/21/18 9:57 PM, Jonathan Wakely wrote:
>> On 29/10/18 07:06 +0100, François Dumont wrote:
>>> Hi
>>>
>>> Â Â Â Some feedback regarding this patch ?
>>
>> Sorry this got missed, please resubmit during stage 1.
>>
>> You haven't CC'd the original patch author (chang jc) to give them a
>> chance to comment on your proposed changes to the patch.
>>
>> The attached PDF on PR libstdc++/83938 has extensive discussion of the
>> performance issue, but I don't see any for your version. Some
>> performance benchmarks for your version would be helpful.
>
> Here is this patch again.
>
> This time it is much closer to John's original one, I just kept my
> change on the size of the temporary buffer which doesn't need to be as
> large as it is currently allocated, especially with John's patch.
>
> The performance tests are showing the good improvements, attached are
> the before/after. Surprisingly I do not see any change regarding
> allocated memory, I guess measures are not accurate enough. However
> working on them also show that current implementation is fragile. If
> you reduce the allowed allocated memory too much you end up with a
> stack overflow because of the recursive implementation.
>
> I already have a patch to keep on trying to allocate memory as long as
> above a given threshold rather than 0 but I am also working on making
> implementation non-recursive. We'll see if even a buffer of size 1 is
> better than no buffer at all then.
>
> Â Â Â PR libstdc++/83938
> Â Â Â * include/bits/stl_algo.h:
> Â Â Â (__merge_adaptive): When buffer too small consider smallest range
> first
> Â Â Â and cut based on buffer size.
> Â Â Â (__inplace_merge): Take temporary buffer length from smallest range.
> Â Â Â (__stable_sort): Limit temporary buffer length.
> Â Â Â * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
> Â Â Â to reduce requested buffer length on allocation failure.
> Â Â Â * testsuite/performance/25_algorithms/inplace_merge.cc: New.
> Â Â Â * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
> Â Â Â execution with different level of memory.
>
> François
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pr83938.patch
Type: text/x-patch
Size: 15992 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/libstdc++/attachments/20190716/6ffa3ec2/attachment.bin>
More information about the Libstdc++
mailing list