Bug 78714 - std::get_time / std::time_get::get does not accept full month name in %b
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++
Version: 7.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Depends on:
Blocks: 86976
Reported: 2016-12-07 14:48 UTC by Sergey Zubkov
Modified: 2021-12-10 16:11 UTC

See Also:
Known to work:
Known to fail:
Last reconfirmed: 2017-07-13 00:00:00


Description Sergey Zubkov 2016-12-07 14:48:26 UTC
The standard says std::get_time calls std::time_get::get, which follows "ISO/IEC 9945 function strptime" ([locale.time.get.members]p8.4)

strptime defines the meaning of %b as "The month, using the locale's month names; either the abbreviated or full name may be specified.": http://pubs.opengroup.org/onlinepubs/9699919799/functions/strptime.html

However, it appears libstdc++ only accepts the abbreviated month name: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/locale_facets_nonio.tcc#L673-L678


#include <iostream>
#include <sstream>
#include <cassert>
#include <iomanip>
#include <time.h>
tm t;
int main()
{
    // POSIX strptime accepts both the abbreviated and the full month name for %b.
    assert(strptime("2011-Feb-18 23:12:34", "%Y-%b-%d %H:%M:%S", &t)); // pass
    assert(strptime("2011-February-18 23:12:34", "%Y-%b-%d %H:%M:%S", &t)); // pass
    {
      // Abbreviated month name through std::get_time.
      std::istringstream ss("2011-Feb-18 23:12:34");
      assert(ss >> std::get_time(&t, "%Y-%b-%d %H:%M:%S")); // pass
    }
    {
      // Full month name through std::get_time.
      std::istringstream ss("2011-February-18 23:12:34");
      assert(ss >> std::get_time(&t, "%Y-%b-%d %H:%M:%S")); // FAIL
    }
}

all asserts pass with clang's libc++ and with MSVC 2015
Comment 1 Howard Hinnant 2020-01-26 23:58:37 UTC
Here is a hopefully helpful algorithm for scanning a list of "keywords" for a match:


The keywords can be prefixes of one another, and the scan will match the longest possible keyword.  The scan is case-insensitive.
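
The code attached to this comment is not reproduced here. As an independent minimal sketch of the idea described (longest match, case-insensitive, keywords may be prefixes of one another), assuming the whole input is available as a string rather than through an input iterator, something like the following could work. The name scan_keyword and its signature are hypothetical and chosen only for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper, not the code attached to the comment.
// Returns the index of the longest keyword that is a case-insensitive
// prefix of `input`, or -1 if none matches. `consumed` receives the
// length of the matched keyword (0 when nothing matches).
inline int scan_keyword(std::string_view input,
                        const std::vector<std::string_view>& keywords,
                        std::size_t& consumed)
{
    auto lower = [](char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    };
    int best = -1;
    consumed = 0;
    for (std::size_t k = 0; k < keywords.size(); ++k) {
        const std::string_view kw = keywords[k];
        // Skip keywords that cannot improve on the current best match.
        if (kw.empty() || kw.size() > input.size() || kw.size() <= consumed)
            continue;
        bool match = true;
        for (std::size_t i = 0; i < kw.size() && match; ++i)
            match = lower(input[i]) == lower(kw[i]);
        if (match) {
            best = static_cast<int>(k);
            consumed = kw.size();
        }
    }
    return best;
}
```

With the keywords "Apr", "April" and "May", the input "aprIL 5" selects "April", while "Aprix" falls back to "Apr" because the full input can be inspected before deciding.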

Here is how it can be used:


Here is how this bug is impacting real-world code, and estimates on how the impact will increase with C++20:

Comment 2 CVS Commits 2021-12-10 16:04:06 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:


commit r12-5898-gc82e492616e343b6d6db218d2b498267bac899de
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Fri Dec 10 17:01:28 2021 +0100

    libstdc++: Some time_get fixes [PR78714]
    The following patch is an attempt to fix various time_get related issues.
    Sorry, it is long...
    One of them is PR78714.  It seems _M_extract_via_format has been written
    with how strftime behaves in mind rather than how strptime behaves.
    There is a significant difference between the two, for strftime %a and %A
    behave differently etc., one emits an abbreviated name, the other full name.
    For strptime both should behave the same and accept both the full or
    abbreviated names.  This needed large changes in _M_extract_name, which
    was assuming the names are unique and names aren't prefixes of other names.
    The _M_extract_name changes allow to deal with those cases.  As can be
    seen in the new testcase, e.g. for %b and english locales we need to
    accept both Apr and April.  If we see Apr in the input, the code looks
    at whether there is end right after those 3 chars or if the next
    character doesn't match characters in the longer names; in that case
    it accepts the abbreviated name.  Otherwise, if the input has Apri, it
    commits to a longer name and fails if it isn't April.  This behavior is
    different from strptime, which for %bix and Aprix accepts it, but for
    an input iterator I'm afraid we can't do better, we can't go back (peek
    more than the current character).
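
The single-pass behaviour described above (accept "Apr" when the next character cannot extend any name, otherwise commit to the longer name) can be illustrated with a small sketch. This is not the _M_extract_name implementation; extract_name is a hypothetical function, it compares case-sensitively for brevity, and it assumes narrow characters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical single-pass matcher; not the libstdc++ _M_extract_name code.
// Consumes characters from [it, end) only while they extend at least one
// name. Returns the index of the name equal to the consumed text, or -1.
template <class InputIt>
int extract_name(InputIt& it, InputIt end, const std::vector<std::string>& names)
{
    std::vector<std::size_t> alive;          // indices of names still matching
    for (std::size_t i = 0; i < names.size(); ++i)
        alive.push_back(i);

    std::size_t pos = 0;                     // characters consumed so far
    while (it != end) {
        const char c = *it;
        std::vector<std::size_t> next;
        for (std::size_t i : alive)
            if (pos < names[i].size() && names[i][pos] == c)
                next.push_back(i);
        if (next.empty())
            break;                           // c belongs to whatever follows
        alive = next;
        ++it;                                // commit: an input iterator cannot go back
        ++pos;
    }
    for (std::size_t i : alive)
        if (pos > 0 && names[i].size() == pos)   // consumed text is a complete name
            return static_cast<int>(i);
    return -1;
}
```

Given the names "Apr" and "April", the input "Apr-1" yields "Apr" with the iterator left on '-', whereas "Aprix" consumes "Apri", stops on 'x', and fails because no name has exactly four characters.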
    Another case is that %d and %e in strptime should work the same, while
    previously the code was hardcoding that %d would be 01 to 31 and %e
     1 to 31 (with leading 0 replaced by space).
    strptime POSIX 2009 documentation seems to suggest for numbers it should
    accept up to the specified number of digits rather than exactly that number
    of digits:
    The pattern "[x,y]" indicates that the value shall fall within the range
    given (both bounds being inclusive), and the maximum number of characters scanned
    shall be the maximum required to represent any value in the range without leading
    zeros.
    so by my reading "1:" is valid for "%H:".
    The glibc strptime implementation actually skips any amount of whitespace
    in all the cases where a number is read, my current patch skips a single
    space at the start of %d/%e but not the others, but doesn't subtract the
    space length from the len characters.
    One option would be to do the leading whitespace skipping in _M_extract_num
    but take it into account how many digits can be read.
    This matters for " 12:" and "%H:", but not for " 12:" and " %H:"
    as in the latter case the space in the format string results in all the
    whitespace at the start to be consumed.
    Note, the allowing of a single digit rather than 2 changes a behavior in
    other ways, e.g. when seeing 40 in a number for range [1, 31] we reject
    it as before, but previously we'd keep *ret == '4' because it was assuming
    it has to be 2 digits and 40 isn't valid, so we know error already on the
    4, but now we accept the 4 as value and fail iff the next format string
    doesn't match the 0.
    Also, previously it wasn't really checking the number was in the right
    range, it would accept 00 for [1, 31] numbers, or would accept 39.
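
A minimal sketch of the numeric behaviour described above: read up to a maximum number of digits, stop before a digit that would exceed the upper bound, and check the range at the end. This follows the prose of the commit message, not the actual _M_extract_num source; extract_num and its parameters are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper, not libstdc++'s _M_extract_num.
// Reads at most max_digits decimal digits from [it, end), stopping early
// when another digit would push the value above max. Returns true and
// stores the value if at least one digit was read and min <= value <= max.
template <class InputIt>
bool extract_num(InputIt& it, InputIt end, int min, int max,
                 std::size_t max_digits, int& out)
{
    int value = 0;
    std::size_t digits = 0;
    while (it != end && digits < max_digits) {
        const char c = *it;
        if (c < '0' || c > '9')
            break;
        const int next = value * 10 + (c - '0');
        if (next > max)
            break;                // leave this digit for the rest of the format
        value = next;
        ++it;
        ++digits;
    }
    if (digits == 0 || value < min || value > max)
        return false;
    out = value;
    return true;
}
```

Under these assumptions "1:" parses as hour 1 for the range [0, 23], "40" for the range [1, 31] yields 4 with the iterator left on '0', and "00" for [1, 31] is rejected by the final range check.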
    Another thing is that %I was parsing 12 as tm_hour 12 rather than as tm_hour 0
    like e.g. glibc does.
    Another thing is that %t was matching a single tab and %n a single newline,
    while strptime docs say it skips over whitespace (again, zero or more).
    Another thing is that %p wasn't handled at all, I think this was the main
    cause of
    FAIL: 22_locale/time_get/get_time/char/2.cc execution test
    FAIL: 22_locale/time_get/get_time/char/wrapped_env.cc execution test
    FAIL: 22_locale/time_get/get_time/char/wrapped_locale.cc execution test
    FAIL: 22_locale/time_get/get_time/wchar_t/2.cc execution test
    FAIL: 22_locale/time_get/get_time/wchar_t/wrapped_env.cc execution test
    FAIL: 22_locale/time_get/get_time/wchar_t/wrapped_locale.cc execution test
    before this patch, because en_HK* locales do use %I and %p in it.
    The patch handles %p only if it follows %I (i.e. when the hour is parsed
    first), which is the more usual case (in glibc):
    grep '%I' localedata/locales/* | grep '%I.*%p' | wc -l
    grep '%I' localedata/locales/* | grep -v '%I.*%p' | wc -l
    grep '%I' localedata/locales/* | grep -v '%p' | wc -l
    The last case use %P instead of %p in t_fmt_ampm, not sure if that one
    is never used by strptime because %P isn't handled by strptime.
    Anyway, the right thing to handle even %p%I would be to pass some state
    around through all the _M_extract_via_format calls like glibc passes
      struct __strptime_state
      {
        unsigned int have_I : 1;
        unsigned int have_wday : 1;
        unsigned int have_yday : 1;
        unsigned int have_mon : 1;
        unsigned int have_mday : 1;
        unsigned int have_uweek : 1;
        unsigned int have_wweek : 1;
        unsigned int is_pm : 1;
        unsigned int want_century : 1;
        unsigned int want_era : 1;
        unsigned int want_xday : 1;
        enum ptime_locale_status decided : 2;
        signed char week_no;
        signed char century;
        int era_cnt;
      } s;
    around.  That is for the %p case used like:
      if (s.have_I && s.is_pm)
        tm->tm_hour += 12;
    during finalization, but handles tons of other cases which it is unclear
    if libstdc++ needs or doesn't need to handle, e.g. strptime if one
    specifies year and yday computes wday/mon/day from it, etc. basically for
    the redundant fields computes them from other fields if those have been
    parsed and are sufficient to determine it.
    To do this we'd need to change ABI for the _M_extract_via_format,
    though sure, we could add a wrapper around the new one with the old
    arguments that would just use a dummy state.  And we'd need a new
    _M_whatever finalizer that would do those post parsing tweaks.
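
To make the state-plus-finalizer idea concrete, here is a minimal sketch that handles %I and %p in either order. It is only loosely modelled on the have_I and is_pm bits of the glibc struct quoted above; parse_state, on_hour12, on_ampm and finalize are hypothetical names and are not part of libstdc++ or glibc.

```cpp
#include <ctime>

// Hypothetical state carried across format specifiers.
struct parse_state {
    bool have_I = false;   // an hour was parsed with %I
    bool is_pm  = false;   // %p matched the PM designator
};

// %I: store 12 as 0 so that adding 12 for PM yields 12..23.
inline void on_hour12(int hour_1_to_12, std::tm& tm, parse_state& st)
{
    tm.tm_hour = hour_1_to_12 % 12;
    st.have_I = true;
}

// %p: only record the designator; tm_hour may not be known yet.
inline void on_ampm(bool pm, parse_state& st)
{
    st.is_pm = pm;
}

// Post-parse step: apply the PM offset regardless of specifier order.
inline void finalize(std::tm& tm, const parse_state& st)
{
    if (st.have_I && st.is_pm)
        tm.tm_hour += 12;
}
```

Because the offset is applied only in finalize, "PM 12" and "12 PM" both end with tm_hour 12, and "12 AM" ends with tm_hour 0.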
    Also, %% wasn't handled.
    For a whitespace in the strings there was inconsistent behavior,
    _M_extract_via_format would require exactly that whitespace char (say
    matching space, or matching tab), while the caller follows what
    https://eel.is/c++draft/locale.time.get#members-8.5 says, that
    when encountering whitespace it skips whitespace in the format and
    then whitespace in the input if any.  I've changed _M_extract_via_format
    to skip whitespace in the input (looping over format isn't IMHO necessary,
    because next iteration of the loop will handle that too).
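
The whitespace rule (a whitespace character in the format, or %t / %n, consumes zero or more whitespace characters of input) amounts to a small loop. The sketch below assumes narrow characters and the C locale's std::isspace, whereas libstdc++ would consult the ctype facet; skip_whitespace is a hypothetical name.

```cpp
#include <cctype>

// Hypothetical helper: advance `it` past zero or more whitespace characters.
template <class InputIt>
void skip_whitespace(InputIt& it, InputIt end)
{
    while (it != end && std::isspace(static_cast<unsigned char>(*it)))
        ++it;
}
```

Calling it on input with no leading whitespace leaves the iterator untouched, which is what makes the "zero or more" reading safe to apply unconditionally.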
    Tested on x86_64-linux by make check-target-libstdc++-v3, ok for trunk
    if it passes full bootstrap/regtest?
    For the new 3.cc testcases, I have included hopefully correctly
    corresponding C testcase using strptime in an attachment, and to the
    extent where it can be compared (e.g. strptime on failure just
    returns NULL, doesn't tell where it exactly stopped) I think the
    only difference is that
      str = "Novembur";
      format = "%bembur";
      ret = strptime (str, format, &time);
    case where strptime accepts it but there is no way to do it with input
    iterators, as explained above.
    I admit I don't have libc++ or other STL libraries around to be able to
    check how much the new 3.cc matches or disagrees with other implementations.
    Now, the things not handled by this patch but which should be fixed (I
    probably need to go back to compiler work) or at least looked at:
    1) seems %j, %r, %U, %w and %W aren't handled (not sure if all of them
       are already in POSIX 2009 or some are later)
    2) I haven't touched the %y/%Y/%C and year handling stuff, that is
       definitely not matching what POSIX 2009 says:
           C       All but the last two digits of the year {2}; leading zeros shall be permitted but shall not be required. A leading '+' or '-' character shall be permitted before
                   any leading zeros but shall not be required.
           y       The last two digits of the year. When format contains neither a C conversion specifier nor a Y conversion specifier, values in the range [69,99] shall refer to
                   years 1969 to 1999 inclusive and values in the range [00,68] shall refer to years 2000 to 2068 inclusive; leading zeros shall be permitted but shall not be
                   required. A leading '+' or '-' character shall be permitted before any leading zeros but shall not be required.
                   Note:     It is expected that in a future version of this standard the default century inferred from a 2-digit year will change. (This would apply to all commands
                             accepting a 2-digit year as input.)
           Y       The full year {4}; leading zeros shall be permitted but shall not be required. A leading '+' or '-' character shall be permitted before any leading zeros but
                   shall not be required.
       I've tried to avoid making changes to _M_extract_num for these as well
       to keep current status quo (the __len == 4 cases).  One thing is what
       to do for things with %C %y and/or %Y in the formats, another thing
       is what to do in the methods that directly perform _M_extract_num
       for year
    3) the above question what to do for leading whitespace of any numbers
       being parsed
    4) the %p%I issue mentioned above and generally what to do if we
       pass state and have finalizers at the end of parsing
    5) _M_extract_via_format is also inconsistent with its callers on handling
       the non-whitespace characters in between format specifiers, the caller
       follows https://eel.is/c++draft/locale.time.get#members-8.6 and does
       case insensitive comparison:
              // TODO real case-insensitive comparison
              else if (__ctype.tolower(*__s) == __ctype.tolower(*__fmt) ||
                       __ctype.toupper(*__s) == __ctype.toupper(*__fmt))
       while _M_extract_via_format only compares exact characters:
                  // Verify format and input match, extract and discard.
                  if (__format[__i] == *__beg)
       (another question is if there is a better way how to do real
       case-insensitive comparison of 2 characters and whether we e.g. need
       to handle the Turkish i/İ and ı/I which have different number of bytes
       in UTF-8)
    6) _M_extract_name does something weird for case-sensitivity,
          // NB: Some of the locale data is in the form of all lowercase
          // names, and some is in the form of initially-capitalized
          // names. Look for both.
          if (__beg != __end)
                if (__c == __names[__i1][0]
                    || __c == __ctype.toupper(__names[__i1][0]))
       for the first letter while just
            __name[__pos] == *__beg
       on all the following letters.  strptime says:
       In case a text string (such as the name of a day of the week or a month
       name) is to be matched, the comparison is case insensitive.
       so supposedly all the _M_extract_name comparisons should be case
       insensitive.
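
For item 2 in the list above, the POSIX default for %y when the format contains neither %C nor %Y is a fixed pivot: [69,99] maps to 1969 to 1999 and [00,68] maps to 2000 to 2068. A one-line sketch of that rule follows; full_year_from_yy is a hypothetical helper, and nothing here claims that libstdc++ implements the rule this way.

```cpp
// Hypothetical helper: map a %y value (0..99) to a full year using the
// POSIX default quoted in item 2, for formats with neither %C nor %Y.
constexpr int full_year_from_yy(int yy)
{
    return yy >= 69 ? 1900 + yy : 2000 + yy;
}
```

A caller filling a struct tm would then store the result minus 1900 in tm_year.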
    2021-12-10  Jakub Jelinek  <jakub@redhat.com>
            PR libstdc++/78714
            * include/bits/locale_facets_nonio.tcc (_M_extract_via_format):
            Mention in function comment it interprets strptime format string
            rather than strftime.  Handle %a and %A the same by accepting both
            full and abbreviated names.  Similarly handle %h, %b and %B the same.
            Handle %d and %e the same by accepting possibly optional single space
            and 1 or 2 digits.  For %I store tm_hour 0 instead of tm_hour 12.  For
            %t and %n skip any whitespace.  Handle %p and %%.  For whitespace in
            the string skip any whitespace.
            (_M_extract_num): For __len == 2 accept 1 or 2 digits rather than
            always 2.  Don't punt early if __value * __mult is larger than __max
            or smaller than __min - __mult, instead punt if __value > __max.
            At the end verify __value is in between __min and __max and punt
            if not.
            (_M_extract_name): Allow non-unique names or names which are prefixes
            of other names.  Don't recompute lengths of names for every character.
            * testsuite/22_locale/time_get/get/char/3.cc: New test.
            * testsuite/22_locale/time_get/get/wchar_t/3.cc: New test.
            * testsuite/22_locale/time_get/get_date/char/12791.cc (test01): Use
            62 instead of 60 and expect 6 to be accepted and thus *ret01 == '2'.
            * testsuite/22_locale/time_get/get_date/wchar_t/12791.cc (test01):
            Likewise.
            * testsuite/22_locale/time_get/get_time/char/2.cc (test02): Add " PM"
            to the string.
            * testsuite/22_locale/time_get/get_time/char/5.cc (test01): Expect
            tm_hour 1 rather than 0.
            * testsuite/22_locale/time_get/get_time/wchar_t/2.cc (test02): Add
            " PM" to the string.
            * testsuite/22_locale/time_get/get_time/wchar_t/5.cc (test01): Expect
            tm_hour 1 rather than 0.
Comment 3 Jakub Jelinek 2021-12-10 16:11:47 UTC
Should be fixed now.