Bug 83601 - std::regex_replace C++14 conformance issue: escaping in SED mode
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++
Version: 8.0
Importance: P3 normal
Target Milestone: 8.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: std::regex
Reported: 2017-12-27 12:45 UTC by Andrey Guskov
Modified: 2023-07-20 11:52 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-12-27 00:00:00


Description Andrey Guskov 2017-12-27 12:45:52 UTC
C++14 standard (page 1107, see here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf#1121), 28.5.2 [Bitmask type regex_constants::match_flag_type]:

...
format_sed
When a regular expression match is to be replaced by a new string, the new string shall be constructed using the rules used by the sed utility in POSIX.
...


The rules which SED uses are documented in IEEE 1003.1 (p. 3221):

An <ampersand> ('&') appearing in the replacement shall be replaced by the string matching the BRE. The special meaning of '&' in this context can be suppressed by preceding it by a <backslash>. The characters "\n", where n is a digit, shall be replaced by the text matched by the corresponding back-reference expression.
...
The special meaning of "\n" where n is a digit in
this context, can be suppressed by preceding it by a <backslash>.
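
The quoted rules can be sketched as a small stand-alone expansion routine. This is only an illustration of the POSIX wording, not the libstdc++ implementation: the name expand_sed and the groups vector are hypothetical, and the handling of "\0", of a backslash before any other character, and of a trailing lone backslash are assumptions made for the sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): expands a sed-style
// replacement string for a single match. groups[0] is the whole match and
// groups[n] is the text of the n-th parenthesized subexpression.
std::string expand_sed(const std::string& fmt,
                       const std::vector<std::string>& groups) {
    assert(!groups.empty());  // groups[0] must hold the whole match
    std::string out;
    for (std::size_t i = 0; i < fmt.size(); ++i) {
        const char c = fmt[i];
        if (c == '&') {
            out += groups[0];  // unescaped '&' -> whole match
        } else if (c == '\\' && i + 1 < fmt.size()) {
            const char next = fmt[++i];
            if (std::isdigit(static_cast<unsigned char>(next))) {
                // "\n" -> n-th group; assumption: "\0" means the whole
                // match and an out-of-range group expands to nothing.
                const std::size_t n = static_cast<std::size_t>(next - '0');
                if (n < groups.size()) out += groups[n];
            } else {
                // Escaped character loses its special meaning:
                // "\&" -> "&", "\\" -> "\". Assumption: any other escaped
                // character is copied as-is.
                out += next;
            }
        } else {
            out += c;  // ordinary character (or a trailing lone backslash)
        }
    }
    return out;
}
```

Under these assumptions, expanding the replacement \\1\&\\2 for a match "ab" with groups "a" and "b" yields \1&\2, because each doubled backslash is consumed as an escaped backslash before the digit is seen.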


The current implementation of std::regex_replace does not conform to the standard: the special meanings of &, \0, and \2 cannot be suppressed by escaping them with backslashes.


Reproducer:

#include <cstdio>    // printf
#include <iterator>  // std::back_inserter
#include <regex>
#include <string>
int frep(const wchar_t *istr, const wchar_t *rstr, const wchar_t *ostr) {
    std::basic_regex<wchar_t> wrgx(L"(a*)(b+)");
    std::basic_string<wchar_t> wstr = istr, wret = ostr, test;
    std::regex_replace(std::back_inserter(test), wstr.begin(), wstr.end(),
                       wrgx, std::basic_string<wchar_t>(rstr),
                       std::regex_constants::format_sed);
    return !printf("'%ls' %c= '%ls'\n",
                   test.c_str(), (test == wret)? '=' : '!', wret.c_str());
}
int main() {
    frep(L"xbbyabz", L"!\\\\2!", L"x!\\2!y!\\2!z");
    frep(L"xbbyabz", L"!\\\\0!", L"x!\\0!y!\\0!z");
    return frep(L"xbbyabz", L"!\\&!", L"x!&!y!&!z");
}
Comment 1 Jonathan Wakely 2017-12-27 22:50:49 UTC
Reduced:

#include <regex>
int main() {
  auto format = std::regex_constants::format_sed;
  auto out = regex_replace("ab", std::regex("(a)(b)"), R"(\\1\&\\2)", format);
  if (out != R"(\1&\2)")
    throw 1;
}


Tim, is there an easy fix for this that I can try, or should I leave it to you?
Comment 2 Tim Shen 2018-01-14 00:49:02 UTC
Author: timshen
Date: Sun Jan 14 00:48:30 2018
New Revision: 256654

URL: https://gcc.gnu.org/viewcvs?rev=256654&root=gcc&view=rev
Log:
	PR libstdc++/83601
	* include/bits/regex.tcc (regex_replace): Fix escaping in sed.
	* testsuite/28_regex/algorithms/regex_replace/char/pr83601.cc: Tests.
	* testsuite/28_regex/algorithms/regex_replace/wchar_t/pr83601.cc: Tests.

Added:
    trunk/libstdc++-v3/testsuite/28_regex/algorithms/regex_replace/char/pr83601.cc
    trunk/libstdc++-v3/testsuite/28_regex/algorithms/regex_replace/wchar_t/pr83601.cc
Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/bits/regex.tcc
Comment 3 Tim Shen 2018-01-14 01:01:21 UTC
Mark as fixed.
Comment 4 Dominik Haumann 2018-09-17 12:53:21 UTC
If there is interest, another (smaller) test case would be:

    const std::string input    = R"((.))";
    const std::string expected = R"(\(\.\))";
    const std::string obtained_std = std::regex_replace(input, std::regex(R"([.^$|()\[\]{}*+?\\])"), R"(\\&)",
			           std::regex_constants::match_default | std::regex_constants::format_sed);
    const std::string obtained_boost = boost::regex_replace(input, boost::regex(R"([.^$|()\[\]{}*+?\\])"), R"(\\&)",
			               boost::regex_constants::match_default | boost::regex_constants::format_sed);
    
    std::cout << "expected.......='" << expected       << "'" << std::endl;
    std::cout << "obtained(std)..='" << obtained_std   << "'" << std::endl;
    std::cout << "obtained(boost)='" << obtained_boost << "'" << std::endl;

Output with GCC < 8:

    expected.......='\(\.\)'
    obtained(std)..='\\(\\(\\.\\)\\)'
    obtained(boost)='\(\(\.\)\)'

With GCC >= 8, it works and it's the same as with boost.
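
For anyone who wants to check this without Boost, a minimal sketch of the same idea follows. The function name escape_metachars is made up for illustration; the regex and the replacement string are the ones from the snippet above, and the result described below assumes a standard library with the fix (libstdc++ from GCC 8 on).

```cpp
#include <regex>
#include <string>

// Hypothetical helper: backslash-escapes every regex metacharacter in
// `input` by relying on sed-style replacement rules.
std::string escape_metachars(const std::string& input) {
    // Character class listing the metacharacters to escape.
    static const std::regex meta(R"([.^$|()\[\]{}*+?\\])");
    // In sed mode "\\" is an escaped backslash and the following "&" is the
    // whole match, so each metacharacter c is rewritten as "\c".
    return std::regex_replace(input, meta, R"(\\&)",
                              std::regex_constants::format_sed);
}
```

With a fixed library, escape_metachars("(.)") should return \(\.\), while an affected library treats both backslashes as literals and doubles them.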