[PATCH] Optimize to_chars

Fri Aug 30 16:44:00 GMT 2019

On 30/08/19 17:01 +0100, Jonathan Wakely wrote:
>On 30/08/19 17:27 +0300, Antony Polukhin wrote:
>>Bunch of micro optimizations for std::to_chars:
>>* For base == 8 replacing the lookup in __digits table with arithmetic
>>computations leads to a same CPU cycles for a loop (exchanges two
>>movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
>>saves 129 bytes of data and totally avoids a chance of cache misses on
>>__digits.
>>* For base == 16 replacing the lookup in __digits table with
>>arithmetic computations leads to a few additional instructions, but
>>totally avoids a chance of cache misses on __digits (- ~9 cache misses
>>for worst case) and saves 513 bytes of const data.
>>* Replacing __first[pos] and __first[pos - 1] with __first[1] and
>>__first[0] on final iterations saves ~2% of code size.
>>* Removing trailing '\0' from arrays of digits allows the linker to
>>merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
>>"0123456789abcdef" could share the same address). This improves data
>>locality and reduces binary sizes.
>>* Using __detail::__to_chars_len_2 instead of a generic
>>__detail::__to_chars_len makes the operation O(1) instead of O(N). It
>>also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
>>.
>>
>>In sum: this significantly reduces the size of a binary (for about
>>4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
>>with latency (CPU cache misses) without changing the iterations count
>>and without adding costly instructions into the loops.
>
>This is great, thanks.
>
>Have you tried comparing the improved code to libc++'s implementation?
>I believe they use precomputed arrays of digits, but they use larger
>arrays that allow 4 bytes to be written at once, which is considerably
>faster (and those precomputed arrays libe in libc++.so not in the
>header). Would we be better off keeping the precomputed arrays and
>expanding them to do 4-byte writes?
>
>Since we don't have a patch to do that, I think I'll commit yours. We
>can always go back to precomputed arrays later if somebody does that
>work.
>
>My only comments are on the changelog:
>
>>Changelog:
>>   * include/std/charconv (__detail::__to_chars_8,
>>   __detail::__to_chars_16): Replace array of precomputed digits
>
>When the list of changed functions is split across lines it should be
>like this:
>
>   * include/std/charconv (__detail::__to_chars_8)
>   (__detail::__to_chars_16): Replace array of precomputed digits
>
>i.e close the parentheses before the line break, and reopen on the
>next line.
>
>>   with arithmetic operations to avoid CPU cache misses. Remove
>>   zero termination from array of digits to allow symbol merge with
>>   generic implementation of __detail::__to_chars. Replace final
>>   offsets with constants. Use __detail::__to_chars_len_2 instead
>>   of a generic __detail::__to_chars_len.
>>   * include/std/charconv (__detail::__to_chars): Remove
>
>Don't repeat the asterisk and filename for later changes in the same
>file, i.e.
>
>   (__detail::__to_chars): Remove zero termination from array of digits.
>   (__detail::__to_chars_2): Leading digit is always '1'.
>
>There's no changelog entry for the changes to __to_chars_len_8 and
>__to_chars_len_16.

Oh, there's no separate __to_chars_len_16 function anyway, and you did
mention it as "Use __detail::__to_chars_len_2 instead ..." - sorry!

I think we might as well inline __to_chars_len_8 into __to_chars_8,
there's not much benefit to having a separate function. I'll do that
as a follow up patch.