Bug 81200 - regex: bogus treatment of collating symbols; grep -E works
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2017-06-25 11:58 UTC by Hubert Tong
Modified: 2017-06-26 04:29 UTC

Description Hubert Tong 2017-06-25 11:58:04 UTC
The source below attempts to compile "_[[.left-square-bracket.]]_" as an egrep regex pattern.
The collating symbol, [.left-square-bracket.], is only valid if "left-square-bracket" is a multi-character collating element in the locale.
"left-square-bracket" is most assuredly not a multi-character collating element in the "POSIX" locale.

Note: WG 21 document N1429 advocates the behaviour exhibited by the implementation; however, it appears from N1623 (relevant portion quoted below) that the committee made corrections:
"that is a bunch of portable names for characters, which are not the same as collating elements within the meaning of POSIX locales"

The <regex> implementation accepts the pattern (not expected), and the string "_[_" matches (not expected).
grep -E rejects the same pattern with "Invalid collation character", which is the expected behaviour.

Online compiler: https://wandbox.org/permlink/VdEOUrcdcBqnbjLB

### SOURCE (llregex3.cc):
#include <cstdio>
#include <locale>
#include <regex>

int main(void) {
  std::regex regex;
  regex.imbue(std::locale("POSIX"));

  try {
    regex.assign("_[[.left-square-bracket.]]_", std::regex_constants::egrep);
    printf("No error.\n");

    bool b;
    b = regex_match("_[_", regex);
    printf("%s _[_.\n", b ? "Matched" : "Did not match");
  }
  catch (const std::regex_error &e) {
    if (e.code() == std::regex_constants::error_collate) {
      printf("Got error_collate.\n");
    }
    else {
      printf("Got other error.\n");
    }
  }
}

### COMPILER INVOCATION:
g++ -std=c++11 llregex3.cc -o llregex3

### PROGRAM INVOCATION AND OUTPUT:
> ./llregex3
No error.
Matched _[_.
Return:  0x00:0

### EXPECTED PROGRAM OUTPUT:
Got error_collate.

### REFERENCE BEHAVIOUR (POSIX locale; grep -E):
> ( export LANG=POSIX; locale && grep -E '_[[.left-square-bracket.]]_' )
LANG=POSIX
LANGUAGE=en_US:en
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
grep: Invalid collation character
Return:  0x02:2

### COMPILER VERSION INFO (g++ -v):
Using built-in specs.
COLLECT_GCC=/opt/wandbox/gcc-head/bin/g++
COLLECT_LTO_WRAPPER=/opt/wandbox/gcc-head/libexec/gcc/x86_64-pc-linux-gnu/8.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../source/configure --prefix=/opt/wandbox/gcc-head --enable-languages=c,c++ --disable-multilib --without-ppl --without-cloog-ppl --enable-checking=release --disable-nls --enable-lto LDFLAGS=-Wl,-rpath,/opt/wandbox/gcc-head/lib,-rpath,/opt/wandbox/gcc-head/lib64,-rpath,/opt/wandbox/gcc-head/lib32
Thread model: posix
gcc version 8.0.0 20170623 (experimental) (GCC)

### grep --version:
grep (GNU grep) 2.25
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.