Bug 91412 - Unexpectedly correct raw string literal
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: preprocessor
Version: 9.1.1
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-08-09 18:49 UTC by Alisdair Meredith
Modified: 2020-09-24 16:34 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-09-24 00:00:00


Description Alisdair Meredith 2019-08-09 18:49:55 UTC
Per several existing bug reports (e.g., https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433), in phase 1 of translation, when mapping the source character set to the basic character set, a '\' character followed by trailing whitespace up to the newline is mapped to a single '\' character and the newline.  Therefore, a comment containing what its author believes to be significant trailing whitespace (e.g., to preserve ASCII art in documentation) is mapped into a line-continuation in a comment, potentially swallowing code on the following line.
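
The swallowing effect can be sketched with a comment whose '\' sits immediately before the newline, which is spliced regardless of the whitespace extension; per the reports cited above, GCC treats a '\' followed by trailing whitespace the same way. The function name value_after_comment is hypothetical and only serves the illustration.

```cpp
#include <cassert>

// Hypothetical helper: returns the value left in x after a comment
// that ends in a backslash.
int value_after_comment() {
    int x = 1;
    // ASCII art whose last character is a backslash: \
    x = 2;   // spliced into the comment above, so this never executes
    return x;
}
```

Calling value_after_comment() yields 1, not 2, because the assignment has become part of the comment. GCC typically reports this with a "multi-line comment" warning under -Wcomment.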

So far, so good.

However, the following program does not follow that same mapping:

#include <iostream>

int main() {
   std::cout << R"(Hello\   
World!)";
}

(note that there are 3 space characters after the '\' that may get swallowed by HTML/bugzilla)

In this case, the line-splice for '\' occurs in phase 2 of translation, and then gets undone in phase 7.  However, this does not undo the source-to-basic character mapping in phase 1, only the splicing of a '\' immediately followed by a newline, so there should be no whitespace following 'Hello' in the emitted output.  Yet when the program is compiled and run, three space characters are indeed present.
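
For comparison, a minimal sketch of the case where the '\' is immediately followed by the newline, so no trailing whitespace is involved. Here the reversion of the splice inside a raw string literal leaves exactly a backslash and a newline between the two words. The names raw_with_splice and spaces_after_backslash are hypothetical helpers, not part of any library; the second one can be pointed at the literal from the program above to count the spaces the compiler retained.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the space characters that directly follow
// the first backslash in s (0 if there is no backslash).
std::size_t spaces_after_backslash(const std::string& s) {
    std::size_t pos = s.find('\\');
    if (pos == std::string::npos) return 0;
    std::size_t count = 0;
    for (std::size_t i = pos + 1; i < s.size() && s[i] == ' '; ++i) ++count;
    return count;
}

// Raw string whose backslash is immediately followed by the newline.
// The line splice is reverted, so the literal keeps both characters.
const char* raw_with_splice() {
    return R"(Hello\
World!)";
}
```

With no trailing whitespace the literal is "Hello", a backslash, a newline, and "World!", and spaces_after_backslash reports 0. Applied to the literal in the program above, the behaviour described in this report would correspond to a count of 3.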


Either the source-to-basic character set mapping needs updating to further special-case trailing whitespace in what will later be determined to be a raw string literal, or the raw string literal should not contain the three spaces.
Comment 1 Tom Honermann 2020-09-13 20:03:15 UTC
My understanding is that the usual rationale for removal of trailing whitespace is to consider it part of a newline sequence; similar to considering <cr><lf> as a single newline.  Using that rationale, it seems appropriate that the spaces be retained as part of translation phase 2 reversion; just as it would presumably be desirable to preserve a <cr><lf> sequence through such reversion.
Comment 2 Nathan Sidwell 2020-09-24 16:34:01 UTC
libcpp/lex.c (lex_raw_string)

	  after_backslash:
	    if (note->type == ' ')
	      /* GNU backslash whitespace newline extension.  FIXME
		 could be any sequence of non-vertical space.  When we
		 can properly restore any such sequence, we should
		 mark this note as handled so _cpp_process_line_notes
		 doesn't warn.  */
	      accum.append (pfile, UC" ", 1);