[PATCH 2/2][libstdc++]: Adjust probabilities of hashmap loop conditions
Tamar Christina
Tamar.Christina@arm.com
Tue Dec 17 20:24:02 GMT 2024
> On Fri, 13 Dec 2024 at 17:13, Tamar Christina <tamar.christina@arm.com> wrote:
> >
> > Hi All,
> >
> > We are currently generating a loop which has more comparisons than you'd
> > typically need as the probablities on the small size loop are such that it
> > assumes the likely case is that an element is not found.
> >
> > This again generates a pattern that's harder for branch predictors to follow,
> > but also just generates more instructions for the what one could say is the
> > typical case: That your hashtable contains the entry you are looking for.
> >
> > This patch adds a __builtin_expect to indicate that it's likely that you'd
> > find the element that's being searched for.
> >
> > The second change is in _M_find_before_node where at the moment the loop
> > is optimized for the case where we don't do any iterations.
> >
> > A simple testcase is:
> >
> > #include <stdbool.h>
> >
> > bool foo (int **a, int n, int val, int *tkn)
> > {
> > for (int i = 0; i < n; i++)
> > {
> > if (!a[i] || a[i]==tkn)
> > return false;
> >
> > if (*a[i] == val)
> > return true;
> > }
> > }
> >
> > which generataes:
> >
> > foo:
> > cmp w1, 0
> > ble .L1
> > add x1, x0, w1, uxtw 3
> > b .L4
> > .L9:
> > ldr w4, [x4]
> > cmp w4, w2
> > beq .L6
> > cmp x0, x1
> > beq .L1
> > .L4:
> > ldr x4, [x0]
> > add x0, x0, 8
> > cmp x4, 0
> > ccmp x4, x3, 4, ne
> > bne .L9
> > mov w0, 0
> > .L1:
> > ret
> > .L6:
> > mov w0, 1
> > ret
> >
> > i.e. BB rotation makes is generate an unconditional branch to a conditional
> > branch. However this method is only called when the size is above a certain
> > threshold, and so it's likely that we have to do that first iteration.
> >
> > Adding:
> >
> > #include <stdbool.h>
> >
> > bool foo (int **a, int n, int val, int *tkn)
> > {
> > for (int i = 0; i < n; i++)
> > {
> > if (__builtin_expect(!a[i] || a[i]==tkn, 0))
> > return false;
> >
> > if (*a[i] == val)
> > return true;
> > }
> > }
> >
> > to indicate that we will likely do an iteration more generates:
> >
> > foo:
> > cmp w1, 0
> > ble .L1
> > add x1, x0, w1, uxtw 3
> > .L4:
> > ldr x4, [x0]
> > add x0, x0, 8
> > cmp x4, 0
> > ccmp x4, x3, 4, ne
> > beq .L5
> > ldr w4, [x4]
> > cmp w4, w2
> > beq .L6
> > cmp x0, x1
> > bne .L4
> > .L1:
> > ret
> > .L5:
> > mov w0, 0
> > ret
> > .L6:
> > mov w0, 1
> > ret
> >
> > which results in ~0-20% extra on top of the previous patch.
> >
> > In table form:
> >
> > +-----------+--------------------+-------+----------+----------------+
> > | Group | Benchmark | Size | % Inline | % Unlikely all |
> > +-----------+--------------------+-------+----------+----------------+
> > | Find | unord<uint64_t | 11264 | 53.52% | 64.81% |
> > | Find | unord<uint64_t | 11264 | 47.98% | 62.21% |
> > | Find Many | unord<uint64_t | 11 | 47.62% | 45.32% |
> > | Find Many | unord<uint64_t | 11 | 40.90% | 41.99% |
> > | Find Many | unord<std::string | 11 | 44.94% | 41.25% |
> > | Find Many | unord<std::string | 11 | 44.89% | 41.19% |
> > | Find | unord<NonSSOString | 11 | 31.17% | 35.17% |
> > | Find | unord<NonSSOString | 11 | 31.80% | 33.77% |
> > | Find | unord<uint64_t | 352 | 28.27% | 30.79% |
> > | Find Many | unord<uint64_t | 352 | 30.57% | 30.57% |
> > | Find Many | unord<uint64_t | 11264 | 20.71% | 30.34% |
> > | Find Many | unord<uint64_t | 11264 | 21.33% | 29.87% |
> > | Find Many | unord<NonSSOString | 352 | 25.93% | 29.09% |
> > | Find Many | unord<NonSSOString | 352 | 26.59% | 28.07% |
> > | Find Many | unord<uint64_t | 352 | 26.80% | 26.67% |
> > | Find | unord<std::string | 11 | 25.66% | 24.44% |
> > | Find Many | unord<std::string | 352 | 23.12% | 24.05% |
> > | Find Many | unord<NonSSOString | 11 | 23.21% | 23.08% |
> > | Find Many | unord<std::string | 352 | 19.23% | 22.08% |
> > | Find Many | unord<NonSSOString | 11 | 22.04% | 22.04% |
> > | Find | unord<uint64_t | 352 | 15.43% | 19.37% |
> > | Find | unord<std::string | 11 | 20.36% | 17.48% |
> > | Find | unord<std::string | 352 | 18.59% | 14.94% |
> > | Find | unord<NonSSOString | 352 | 13.79% | 13.76% |
> > | Find | unord<NonSSOString | 11264 | 7.04% | 13.38% |
> > | Find | unord<NonSSOString | 11264 | 9.19% | 13.24% |
> > | Find | unord<std::string | 11264 | 11.80% | 12.73% |
> > | Find Many | unord<std::string | 11264 | 8.78% | 12.44% |
> > | Find Many | unord<std::string | 11264 | 5.30% | 10.86% |
> > | Find Many | unord<NonSSOString | 11264 | 6.15% | 8.85% |
> > | Find Many | unord<NonSSOString | 11264 | 5.90% | 8.49% |
> > | Find | unord<NonSSOString | 352 | 9.69% | 8.30% |
> > | Find | unord<std::string | 11264 | 9.97% | 6.30% |
> > | Find | vec<uint64_t | 352 | 6.20% | 6.13% |
> > +-----------+--------------------+-------+----------+----------------+
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master? and possible backports?
> >
> > Thanks,
> > Tamar
> >
> > libstdc++-v3/ChangeLog:
> >
> > * include/bits/hashtable.h (_M_locate): Make it likely that we find an
> > element.
> > (_M_find_before_node): Make it likely that the map has at least one
> > entry and so we do at least one iteration.
> >
> > ---
> > diff --git a/libstdc++-v3/include/bits/hashtable.h b/libstdc++-
> v3/include/bits/hashtable.h
> > index
> e791e52ec329277474f3218d8a44cd37ded14ac3..8101d868d0c5f7ac4f97931a
> ffcf71d826c88094 100644
> > --- a/libstdc++-v3/include/bits/hashtable.h
> > +++ b/libstdc++-v3/include/bits/hashtable.h
> > @@ -2171,7 +2171,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > if (this->_M_equals(__k, __code, *__p))
> > return __prev_p;
> >
> > - if (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt)
> > + if (__builtin_expect (!__p->_M_nxt || _M_bucket_index(*__p->_M_next())
> != __bkt, 0))
> > break;
> > __prev_p = __p;
> > }
> > @@ -2201,7 +2201,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > if (this->_M_equals_tr(__k, __code, *__p))
> > return __prev_p;
> >
> > - if (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt)
> > + if (__builtin_expect (!__p->_M_nxt || _M_bucket_index(*__p-
> >_M_next()) != __bkt, 0))
> > break;
> > __prev_p = __p;
> > }
> > @@ -2228,7 +2228,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > pointer_to(const_cast<__node_base&>(_M_before_begin));
> > while (__loc._M_before->_M_nxt)
> > {
> > - if (this->_M_key_equals(__k, *__loc._M_node()))
> > + if (__builtin_expect (this->_M_key_equals(__k, *__loc._M_node()), 1))
> > return __loc;
> > __loc._M_before = __loc._M_before->_M_nxt;
> > }
>
> The first two changes make sense to me: we're typically going to loop
> more than once when searching for an element, so the break is
> unlikely. The third change is a bit more surprising, because although
> it's likely we'll find an element *eventually* it is only likely once
> out of N iterations, not likely on every iteration. But if the numbers
> show it helps, then that just shows my intuition is probably wrong.
>
> If this is optimized for 'find' is it possible that this change
> pessimizes insertion (which also has to use _M_locate to find where to
> insert)?
You're right there seems to be a uniform slowdown for insertions in small sized
hashtables of ~10 entries. They're about 10-14% slower with that change.
As expected just the ones on _M_find_before_node does not cause any issues.
Since the insertions into small hashtables don't run long enough I've also increased
the number of iterations to ~1 million even out the score a bit more but there's still
a sizable 5 regressions.
I've kicked off a longer run removing the change on _M_locate and with some more
variable size finds/inserts.
So far it looks like the additional benefits gotten on _M_locate were
mainly on two tests. I'll look at the perf traces to figure out exactly why but I'll respin
the patch without the _M_locate change and send it tomorrow morning once
benchmarks finish.
Thanks!
Tamar
More information about the Libstdc++
mailing list