This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [patch 1/N] std::regex refactoring - _BracketMatcher
- From: Jonathan Wakely <jwakely at redhat dot com>
- To: Tim Shen <timshen91 at gmail dot com>
- Cc: libstdc++ <libstdc++ at gcc dot gnu dot org>, gcc-patches <gcc-patches at gcc dot gnu dot org>
- Date: Mon, 28 Apr 2014 16:05:12 +0100
- Subject: Re: [patch 1/N] std::regex refactoring - _BracketMatcher
- Authentication-results: sourceware.org; auth=none
- References: <20140428124003 dot GW928 at redhat dot com> <CAPrifDkixPm8=RyGVdXHVQbp=s+cZeYL71wX8sQsm1koVedvKA at mail dot gmail dot com>
On 28/04/14 10:15 -0400, Tim Shen wrote:
On Mon, Apr 28, 2014 at 8:40 AM, Jonathan Wakely <jwakely@redhat.com> wrote:
I've been looking through the regex code and have a few ideas for
simplifications or optimisations that I'd like to share.
Thanks :)
This first patch is for _BracketMatcher. We only use std::bitset when
is_same<_CharT, char> so 8 * sizeof(_CharT) should be __CHAR_BIT__
instead. We also only user _UnsignedCharT when is_same<_CharT, char>
so that can just be simplified to unsigned char.
Yes, since _UnsignedCharT is just used as indexes, we can always use a
larger unsigned integer instead. Maybe "size_t" is a better choice?
There is a well-defined mapping from every unsigned char in the range
[0,255] to char and back, so conversions between char and unsigned
char are fine. If we used a larger type then we would get the wrong result
when char is signed, because (size_t)(char)-1 != (unsigned char)(char)-1
I'm not sure if we'll have a wchar_t cache (bitset<65536>) in the future ;)
sizeof(wchar_t) is 4 on unix-like systems, but a cache for char16_t
wouldn't be totally crazy.
If you prefer I will not remove the uses of _CharT and _UnsignedCharT
and won't assume the cache is only used for char. There are still some
simplifications we can make.
The contents of _BracketMatcher::_M_char_set are not sorted and can
contain duplicates in the current code. Making that a sorted, unique
list in _BracketMatcher::_M_ready() allows a binary search instead of
linear search. This improves worst case performance for pathological
regular expressions like std::wregex('['+std::wstring(1000, 'a')+"b]")
but I'm not sure if it helps in the common case.
Trust me, in common case, even it's a bit slower, it won't be
observable. So it's just OK.
Finally, in the non-char case the _CacheT member is an unused empty
object, so having that as the first member requires 7 bytes of
padding. Re-ordering the members reduces the size of a non-char
_BracketMatcher by 8 bytes (but it's still a whopping 96 bytes).
(For a char _BracketMatcher the bitset cache makes it 128 bytes,
this patch doesn't change that).
This is OK too.
Thanks, I'll clean the patch up and commit it soon.