Bug 89120 - std::minmax_element 2.5 times slower than hand written loop
Summary: std::minmax_element 2.5 times slower than hand written loop
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++
Version: 9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-30 17:14 UTC by Antony Polukhin
Modified: 2021-05-17 17:39 UTC
CC List: 1 user

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Description Antony Polukhin 2019-01-30 17:14:05 UTC
std::minmax_element is slow when there's a lot of data and it does not fit into the CPU cache: http://quick-bench.com/Z0iRfbm2_S9KvQ1C92ydh8USF-8
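
(Not the author's quick-bench code; a minimal sketch of the kind of comparison being measured, assuming Google Benchmark and random int data. The names make_data and std_minmax are placeholders; naive_minmax mirrors the benchmark names in the results quoted in comment 2.)

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>
#include <benchmark/benchmark.h>

static std::vector<int> make_data(std::size_t n) {
  std::vector<int> v(n);
  std::mt19937 gen(42);
  std::uniform_int_distribution<int> dist;
  std::generate(v.begin(), v.end(), [&] { return dist(gen); });
  return v;
}

// Hand-written loop: 2 comparisons per element, both well predicted on random data.
static void naive_minmax(benchmark::State& state) {
  auto v = make_data(state.range(0));
  for (auto _ : state) {
    int min = v[0], max = v[0];
    for (int x : v) {
      if (x < min) min = x;
      if (x > max) max = x;
    }
    benchmark::DoNotOptimize(min);
    benchmark::DoNotOptimize(max);
  }
}
BENCHMARK(naive_minmax)->RangeMultiplier(2)->Range(2, 1 << 20);

// Library algorithm: 3n/2 comparisons, but one comparison per pair is ~50/50.
static void std_minmax(benchmark::State& state) {
  auto v = make_data(state.range(0));
  for (auto _ : state) {
    auto p = std::minmax_element(v.begin(), v.end());
    benchmark::DoNotOptimize(p);
  }
}
BENCHMARK(std_minmax)->RangeMultiplier(2)->Range(2, 1 << 20);

BENCHMARK_MAIN();
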
Comment 1 Marc Glisse 2019-01-30 18:51:09 UTC
First there is the issue of iterator vs value, as in your other PR.

The performance of minmax depends a lot on the element type and the distribution. The standard requires that we perform only 3n/2 comparisons (not 2n like your version), which essentially prescribes the implementation. However, minmax is probably the classic example used when teaching branch prediction: for a uniformly random permutation, the 2n comparisons are well predicted (only O(log n) are mispredicted), whereas if you do only 3n/2 comparisons, n/2 of those are unpredictable, so roughly n/4 are mispredicted, and that costs a lot.
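
(A rough sketch of that 3n/2-comparison scheme, for illustration only: this is not the libstdc++ source, and the function name pairwise_minmax and the odd-tail handling are assumptions.)

#include <utility>

// Sketch of the 3-comparisons-per-2-elements scheme the standard's
// complexity bound effectively prescribes.
template<typename FIter, typename Comp>
std::pair<FIter, FIter>
pairwise_minmax(FIter first, FIter last, Comp comp)
{
  FIter min_it = first, max_it = first;
  while (first != last)
    {
      FIter next = first;
      if (++next == last)                   // odd element left at the end
        {
          if (comp(*first, *min_it)) min_it = first;
          if (comp(*max_it, *first)) max_it = first;
          break;
        }
      if (comp(*next, *first))              // ~50/50 on random data: mispredicted
        {
          if (comp(*next, *min_it))  min_it = next;    // rarely taken
          if (comp(*max_it, *first)) max_it = first;   // rarely taken
        }
      else
        {
          if (comp(*first, *min_it)) min_it = first;
          if (comp(*max_it, *next))  max_it = next;
        }
      first = ++next;
    }
  return std::make_pair(min_it, max_it);
}
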
Comment 2 Antony Polukhin 2021-05-17 17:39:02 UTC
Long story short: I've found no way to improve the standard library code to always work faster. I'm in favor of closing this ticket as invalid/wont fix.

Long story:

I tried adding a specialization of the minmax_element algorithm for std::less comparators and arithmetic types. That specialization did more comparisons, but in a more predictable way. Performance improved on big datasets, but regressed on small ones.
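
(A hypothetical shape such a specialization could take, not the patch actually tried; the name __minmax_element_arith and the enable_if dispatch are assumptions for illustration.)

#include <iterator>
#include <type_traits>
#include <utility>

// Only for arithmetic value types: trade the 3n/2 bound for 2n comparisons
// that the branch predictor handles well. (The check that the comparator is
// the default std::less is omitted here for brevity.)
template<typename _FIter>
  typename std::enable_if<
    std::is_arithmetic<typename std::iterator_traits<_FIter>::value_type>::value,
    std::pair<_FIter, _FIter> >::type
  __minmax_element_arith(_FIter __first, _FIter __last)
  {
    _FIter __min = __first, __max = __first;
    for (_FIter __it = __first; __it != __last; ++__it)
      {
        if (*__it < *__min) __min = __it;   // rarely taken on random data
        if (*__max < *__it) __max = __it;   // rarely taken on random data
      }
    return std::make_pair(__min, __max);
  }
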


Then I tried another approach: if the comparison of __first with __next is barely predictable, just avoid branching on it.

Portable solution:

bool __b = __comp(__next, __first);
_ForwardIterator __pots[3] = {__first, __next, __first};
_ForwardIterator __pot_min = __pots[__b];
_ForwardIterator __pot_max = __pots[__b + 1];

Special case for random access iterators:

bool __b = __comp(__next, __first);
_ForwardIterator __pot_min = __first, __pot_max = __next;
__pot_min += __b;   // __next == __first + 1, so this swaps the pair
__pot_max -= __b;   // branchlessly when __b is true
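
For context, here is roughly where either variant would sit in the pairwise loop body (a sketch, not the actual patch): only the ordering of the pair becomes branchless; the updates of the running min/max stay as ordinary, well-predicted branches.

bool __b = __comp(__next, __first);
_ForwardIterator __pots[3] = {__first, __next, __first};
_ForwardIterator __pot_min = __pots[__b];            // __next when __b, else __first
_ForwardIterator __pot_max = __pots[__b + 1];        // __first when __b, else __next
if (__comp(__pot_min, __min)) __min = __pot_min;     // still branches, but
if (__comp(__max, __pot_max)) __max = __pot_max;     // rarely mispredicted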


Unfortunately, both of those approaches add some overhead for small datasets. Another disadvantage is that they behave in opposite ways on different compilers:

With GCC 9 the branchless variant is faster on big datasets
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
naive_minmax/2                     3 ns          3 ns  247522237
naive_minmax/8                     7 ns          7 ns  103044422
naive_minmax/262144          1715635 ns    1710406 ns        407
naive_minmax/1048576         6970755 ns    6947034 ns        101

branchless_minmax/2                8 ns          8 ns   81324904
branchless_minmax/8               30 ns         30 ns   23494608
branchless_minmax/262144      457287 ns     456412 ns       1529
branchless_minmax/1048576    4267914 ns    4219969 ns        363



With Clang 9 the branchless variant is slower on big datasets
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
naive_minmax/2                     2 ns          2 ns  380928404
naive_minmax/8                     7 ns          7 ns   92642970
naive_minmax/262144           262921 ns     262288 ns       2630
naive_minmax/1048576         1149407 ns    1147626 ns        618

branchless_minmax/2                2 ns          2 ns  307146020
branchless_minmax/8               10 ns         10 ns   74417142
branchless_minmax/262144      425880 ns     425241 ns       1637
branchless_minmax/1048576    1747785 ns    1745725 ns        397


Final attempt. Different compilers optimize the algorithm differently: Clang shows good performance on big datasets (>4k elements), GCC on medium-sized datasets (128-1k elements). Maybe providing more information about the branch probabilities could help both compilers produce better code, but it looks like the heuristics already deduce the probability to be close to 0.5; __builtin_expect_with_probability(__b, true, 0.5) changed nothing in the assembly: https://godbolt.org/z/PqWoaKfhW
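
(For reference, the hint goes on the pair comparison, roughly like this; the godbolt link above has the actual code.)

bool __b = __comp(__next, __first);
if (__builtin_expect_with_probability(__b, true, 0.5))  // GCC 9+ built-in
  { /* __next is the smaller of the pair */ }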