Bug 38990 - preprocessing different in g++ -E and regular compiling.
Summary: preprocessing different in g++ -E and regular compiling.
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: preprocessor
Version: 4.3.2
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: accepts-invalid, diagnostic
Depends on:
Blocks:
 
Reported: 2009-01-27 19:16 UTC by Kaz Kylheku
Modified: 2013-06-07 09:27 UTC
CC List: 1 user

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2009-01-28 11:07:45


Description Kaz Kylheku 2009-01-27 19:16:45 UTC
(This might also affect the C front end; I have not tried.)

Basically, there are situations where g++ -E accepts multi-line string literals and glues them together into a proper string literal token. If the resulting preprocessor output is compiled, there are no diagnostics. But if the compiler is run on the original code, there are diagnostics. Test case:

#include <cstdio>
 
#define pf(FMT, ARGS ...) printf(FMT, ## ARGS)
 
int main()
{
  pf("Hello,
     %s!",
     "world");
}

If this is passed through g++ -E (4.3.2), there are warnings from the preprocessor about unmatched " characters. But the problem is repaired in the output, where the pf expression turns into:

  printf("Hello, %s!", "world");

The preprocessor output, if captured, can be compiled without diagnostics, because it no longer contains a multi-line string literal. If the program is compiled directly, there are syntax errors from the compiler, in addition to the warnings from the preprocessor. The behavior of the preprocessor, when integrated into the compiler, is different; it does not repair the broken string literal.

In g++ 4.1.1, the behavior is similar, except that preprocessing does not emit any diagnostics at all in either situation, so if the compilation is broken into two steps (preprocess with g++ -E and then compile), then it's completely silent.

To answer the question of why anyone would preprocess and then compile: there are tools which do this, like ccache. The ccache program preprocesses a translation unit and digests it to a hash. Using the hash, it can decide to retrieve a cached object file, or to pass the preprocessed output to the compiler (avoiding double preprocessing). Other tools like distmake and distcc work similarly.
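
For illustration only, the preprocess-then-compile flow described above can be sketched as follows. This is not ccache's code: Cache, build and compile_preprocessed are hypothetical names, and std::hash stands in for whatever digest a real tool uses. The point is that the compiler proper only ever sees the preprocessed text, never the original source.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical cache: hash of the preprocessed text -> cached object file.
struct Cache {
    std::map<std::size_t, std::string> objects;
    int compiler_runs = 0;  // how often the compiler proper was invoked
};

// Stand-in for running the compiler on already-preprocessed text.
std::string compile_preprocessed(const std::string& text) {
    return "object(" + std::to_string(text.size()) + " bytes of input)";
}

// Look up the preprocessed text by hash; compile only on a miss.
std::string build(Cache& cache, const std::string& preprocessed) {
    const std::size_t key = std::hash<std::string>{}(preprocessed);
    const auto hit = cache.objects.find(key);
    if (hit != cache.objects.end())
        return hit->second;  // cache hit: the compiler is not run at all
    ++cache.compiler_runs;
    const std::string object = compile_preprocessed(preprocessed);
    cache.objects.emplace(key, object);
    return object;
}

// Usage: building the same preprocessed text twice compiles it once.
void cache_demo() {
    Cache cache;
    const std::string text = "int main() { return 0; }\n";
    const std::string first = build(cache, text);
    const std::string second = build(cache, text);
    assert(first == second);
    assert(cache.compiler_runs == 1);
}
```

Because the input to build is the output of g++ -E, any repair the preprocessor makes while writing out its text is invisible to the compile step.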

So when ccache is used with g++, in effect the combo turns into a C++ implementation that accepts broken string literals when they are macro arguments. If a developer accidentally writes such string literals, then the build breaks for anyone who runs it without ccache.

(If there is a way to run g++ or gcc as a preprocessor in such a way that it does not accept multi-line string literals as macro arguments, please advise; I can hack it into ccache as a workaround).
Comment 1 Richard Biener 2009-01-28 11:07:45 UTC
Confirmed.

void foo(const char *, ...);

#define pf(FMT, ARGS ...) foo(FMT, ## ARGS)

int bar(void)
{
    pf("Hello,
            %s!",
            "world");
}
Comment 2 Kaz Kylheku 2009-01-28 16:30:00 UTC
(In reply to comment #1)
> Confirmed.

Thanks. By the way, I started looking at patching this. My suspicions were confirmed that this is a case of pasting together invalid tokens. The compiler sees the tokens individually, because it's closely integrated with the preprocessor. But when the tokens are converted to text, they resemble a valid string literal. The embedded newlines are gone.

What's happening is that the input:

"Hello,\n      %s!",\n      "world");

is being tokenized like this:

{"Hello,}{%}{s}{!}{",}{"world"}{)}

The "Hello, and ", are assigned the special lexical category CPP_OTHER, because they are improper tokens. Of course % is an operator and s is a CPP_NAME identifier. Also note how everything becomes one argument to the macro, since the comma is never seen as an independent token.
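
A minimal sketch of the effect, assuming a toy lexer written for this comment (it is not libcpp; Kind, Token, lex and spell are hypothetical names, and escape sequences are ignored). An unterminated string stops just before the newline and becomes an Other token. When the tokens are spelled out again with single spaces, the text re-lexes as one valid string literal.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories; only the idea of "Other" mirrors CPP_OTHER.
enum class Kind { String, Other, Name, Punct };

struct Token {
    Kind kind;
    std::string spelling;
    bool space_before;  // whitespace (including a newline) preceded the token
};

// Toy lexer: a string that hits a newline before its closing quote becomes
// an Other token ending just before that newline.
std::vector<Token> lex(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    bool space = false;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { space = true; ++i; continue; }
        std::size_t j = i + 1;
        Kind kind = Kind::Punct;  // default: single-character punctuator
        if (c == '"') {
            while (j < src.size() && src[j] != '"' && src[j] != '\n') ++j;
            if (j < src.size() && src[j] == '"') { kind = Kind::String; ++j; }
            else kind = Kind::Other;  // unterminated: newline is not consumed
        } else if (std::isalpha(c) || c == '_') {
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) || src[j] == '_'))
                ++j;
            kind = Kind::Name;
        }
        out.push_back({kind, src.substr(i, j - i), space});
        space = false;
        i = j;
    }
    return out;
}

// Spell tokens back to text, collapsing any preceding whitespace to one space.
std::string spell(const std::vector<Token>& tokens) {
    std::string text;
    for (std::size_t k = 0; k < tokens.size(); ++k) {
        if (k > 0 && tokens[k].space_before) text += ' ';
        text += tokens[k].spelling;
    }
    return text;
}

// Usage: the broken literal is "repaired" by the round trip through text.
void round_trip_demo() {
    const std::string src = "\"Hello,\n      %s!\",\n      \"world\");";
    const std::vector<Token> direct = lex(src);
    assert(direct[0].kind == Kind::Other);      // what the integrated compiler sees
    const std::string text = spell(direct);
    assert(text == "\"Hello, %s!\", \"world\");");
    assert(lex(text)[0].kind == Kind::String);  // what a second pass sees
}
```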

A possible way to fix this bug would be in the function lex_string to not back up over the \n that is found in the middle of a string literal, so that the newline becomes part of the CPP_OTHER token.  This behavior might have to be language-dependent, though. It looks like assembly language programs may be depending on the current behavior, hence this test in lex_string:

  if (type == CPP_OTHER && CPP_OPTION (pfile, lang) != CLK_ASM)
    cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character",
	       (int) terminator);

Or maybe CPP_OTHER tokens should never be pasted together with anything that follows them because even inserting a space is not good enough; maybe a newline should be emitted between CPP_OTHER and the next token instead of a space, if the language is other than CLK_ASM.
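
A rough sketch of that second idea, again with hypothetical names (Kind, Token, spell and the is_asm flag are made up for illustration and are not libcpp code): when spelling tokens out, put a newline rather than a space after any Other token, unless the language is assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token model; Other plays the role of CPP_OTHER.
enum class Kind { String, Other, Name, Punct };

struct Token {
    Kind kind;
    std::string spelling;
    bool space_before;  // whitespace preceded the token in the source
};

// Spell tokens to text. After an Other token, emit a newline instead of the
// usual optional space, unless is_asm asks for the old behaviour.
std::string spell(const std::vector<Token>& tokens, bool is_asm) {
    std::string text;
    for (std::size_t k = 0; k < tokens.size(); ++k) {
        if (k > 0) {
            const bool after_other = tokens[k - 1].kind == Kind::Other;
            if (after_other && !is_asm) text += '\n';
            else if (tokens[k].space_before) text += ' ';
        }
        text += tokens[k].spelling;
    }
    return text;
}

// Usage: the broken literal stays broken in the output for non-asm input.
void newline_after_other_demo() {
    const std::vector<Token> tokens = {
        {Kind::Other, "\"Hello,", false},
        {Kind::Punct, "%", true},
        {Kind::Name, "s", false},
        {Kind::Punct, "!", false},
        {Kind::Other, "\",", false},
        {Kind::String, "\"world\"", true},
        {Kind::Punct, ")", false},
    };
    assert(spell(tokens, false) == "\"Hello,\n%s!\",\n\"world\")");
    assert(spell(tokens, true) == "\"Hello, %s!\", \"world\")");
}
```

With the newline kept, compiling the captured output would hit the same unterminated literal as compiling the original source.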

Will experiment.


Comment 3 Jan Smets 2013-06-07 08:43:11 UTC
Confirmed. This issue is easily hit when using distcc.
Comment 4 Jan Smets 2013-06-07 09:20:17 UTC
Known to work: GCC 4.6