This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Optimize to_chars


On 30/08/19 17:08 +0100, Jonathan Wakely wrote:
On 30/08/19 17:01 +0100, Jonathan Wakely wrote:
On 30/08/19 17:27 +0300, Antony Polukhin wrote:
Bunch of micro optimizations for std::to_chars:
* For base == 8 replacing the lookup in __digits table with arithmetic
computations leads to a same CPU cycles for a loop (exchanges two
movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
saves 129 bytes of data and totally avoids a chance of cache misses on
__digits.
* For base == 16 replacing the lookup in __digits table with
arithmetic computations leads to a few additional instructions, but
totally avoids a chance of cache misses on __digits (- ~9 cache misses
for worst case) and saves 513 bytes of const data.
* Replacing __first[pos] and __first[pos - 1] with __first[1] and
__first[0] on final iterations saves ~2% of code size.
* Removing trailing '\0' from arrays of digits allows the linker to
merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
"0123456789abcdef" could share the same address). This improves data
locality and reduces binary sizes.
* Using __detail::__to_chars_len_2 instead of a generic
__detail::__to_chars_len makes the operation O(1) instead of O(N). It
also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
.

In sum: this significantly reduces the size of a binary (for about
4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
with latency (CPU cache misses) without changing the iterations count
and without adding costly instructions into the loops.

This is great, thanks.

Have you tried comparing the improved code to libc++'s implementation?
I believe they use precomputed arrays of digits, but they use larger
arrays that allow 4 bytes to be written at once, which is considerably
faster (and those precomputed arrays libe in libc++.so not in the
header). Would we be better off keeping the precomputed arrays and
expanding them to do 4-byte writes?

Since we don't have a patch to do that, I think I'll commit yours. We
can always go back to precomputed arrays later if somebody does that
work.

My only comments are on the changelog:

Changelog:
 * include/std/charconv (__detail::__to_chars_8,
 __detail::__to_chars_16): Replace array of precomputed digits

When the list of changed functions is split across lines it should be
like this:

 * include/std/charconv (__detail::__to_chars_8)
 (__detail::__to_chars_16): Replace array of precomputed digits

i.e close the parentheses before the line break, and reopen on the
next line.

 with arithmetic operations to avoid CPU cache misses. Remove
 zero termination from array of digits to allow symbol merge with
 generic implementation of __detail::__to_chars. Replace final
 offsets with constants. Use __detail::__to_chars_len_2 instead
 of a generic __detail::__to_chars_len.
 * include/std/charconv (__detail::__to_chars): Remove

Don't repeat the asterisk and filename for later changes in the same
file, i.e.

 (__detail::__to_chars): Remove zero termination from array of digits.
 (__detail::__to_chars_2): Leading digit is always '1'.

There's no changelog entry for the changes to __to_chars_len_8 and
__to_chars_len_16.

Oh, there's no separate __to_chars_len_16 function anyway, and you did
mention it as "Use __detail::__to_chars_len_2 instead ..." - sorry!

I think we might as well inline __to_chars_len_8 into __to_chars_8,
there's not much benefit to having a separate function. I'll do that
as a follow up patch.

I've committed this patch to simplify the code a bit.

Tested x86_64-linux, committed to trunk.

commit 8ea5798914e9cf5b823d245d85db4d76e0631d0e
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Fri Aug 30 17:27:48 2019 +0100

    Minor simplifications for std::to_chars implementation
    
            * include/std/charconv (__detail::__to_chars_2_len): Use std::log2p1.
            (__detail::__to_chars_8_len): Remove.
            (__detail::__to_chars_8): Inline length calculation here.
            (__detail::__from_chars_binary): Use numeric_limits instead of
            CHAR_BIT.

diff --git a/libstdc++-v3/include/std/charconv b/libstdc++-v3/include/std/charconv
index 4e94c39656d..ceefa3b6778 100644
--- a/libstdc++-v3/include/std/charconv
+++ b/libstdc++-v3/include/std/charconv
@@ -35,8 +35,9 @@
 
 #include <type_traits>
 #include <limits>
-#include <cctype>
-#include <bits/charconv.h> // for __to_chars_len, __to_chars_10_impl
+#include <bit>			// for __log2p1
+#include <cctype>		// for isdigit
+#include <bits/charconv.h>	// for __to_chars_len, __to_chars_10_impl
 #include <bits/error_constants.h> // for std::errc
 
 // Define when floating point is supported: #define __cpp_lib_to_chars 201611L
@@ -96,43 +97,7 @@ namespace __detail
   template<typename _Tp>
     constexpr unsigned
     __to_chars_len_2(_Tp __value) noexcept
-    {
-      static_assert(is_integral<_Tp>::value, "implementation bug");
-      static_assert(is_unsigned<_Tp>::value, "implementation bug");
-
-      constexpr size_t __nbits = __CHAR_BIT__ * sizeof(_Tp);
-
-      // N.B. __builtin_clzll is undefined if __value == 0, but std::to_chars
-      // handles zero values directly.
-
-      // For sizeof(_Tp) > 1 this is an order of magnitude faster than
-      // the generic __to_chars_len.
-      return __nbits
-	- (__builtin_clzll(__value)
-	    - ((__CHAR_BIT__ * sizeof(long long)) - __nbits));
-    }
-
-  template<typename _Tp>
-    constexpr unsigned
-    __to_chars_len_8(_Tp __value) noexcept
-    {
-      static_assert(is_integral<_Tp>::value, "implementation bug");
-      static_assert(is_unsigned<_Tp>::value, "implementation bug");
-
-      constexpr size_t __nbits = __CHAR_BIT__ * sizeof(_Tp);
-
-      if _GLIBCXX17_CONSTEXPR (__nbits <= 16)
-	{
-	  return __value > 077777u ? 6u
-	    : __value > 07777u ? 5u
-	    : __value > 0777u ? 4u
-	    : __value > 077u ? 3u
-	    : __value > 07u ? 2u
-	    : 1u;
-	}
-      else
-	return (__to_chars_len_2(__value) + 2) / 3;
-    }
+    { return std::__log2p1(__value); }
 
   // Generic implementation for arbitrary bases.
   template<typename _Tp>
@@ -255,8 +220,19 @@ namespace __detail
       static_assert(is_unsigned<_Tp>::value, "implementation bug");
 
       to_chars_result __res;
+      unsigned __len;
 
-      const unsigned __len = __to_chars_len_8(__val);
+      if _GLIBCXX17_CONSTEXPR (numeric_limits<_Tp>::digits <= 16)
+	{
+	  __len = __val > 077777u ? 6u
+	    : __val > 07777u ? 5u
+	    : __val > 0777u ? 4u
+	    : __val > 077u ? 3u
+	    : __val > 07u ? 2u
+	    : 1u;
+	}
+      else
+	__len = (__to_chars_len_2(__val) + 2) / 3;
 
       if (__builtin_expect((__last - __first) < __len, 0))
 	{
@@ -397,7 +373,7 @@ namespace __detail
 	  __i++;
 	}
       __first += __i;
-      return __i <= (sizeof(_Tp) * __CHAR_BIT__);
+      return __i <= numeric_limits<_Tp>::digits;
     }
 
   /// std::from_chars implementation for integers in bases 3 to 10.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]