Reading code from stdin doesn't work. strace shows that it attempts to read from a file "<stdin>" repeatedly. ~> echo "pragma(msg, int(__VERSION__));" | /usr/bin/gdc -x d - <stdin>:2:1: error: Outside Unicode code space <stdin>:2:1: error: Outside Unicode code space <stdin>:2:1: error: Outside Unicode code space <stdin>:2:1: error: Outside Unicode code space ... <stdin>:2: error: no identifier for declarator �����������... Expected behaviour: ~> echo "pragma(msg, int(__VERSION__));" > ver.d; /usr/bin/gdc-11 -x d /tmp/ver.d 2076 /usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/lib64/gcc/x86_64-suse-linux/11/../../../../lib64/Scrt1.o: in function `_start': /home/abuild/rpmbuild/BUILD/glibc-2.35/csu/../sysdeps/x86_64/start.S:103: undefined reference to `main' collect2: error: ld returned 1 exit status
Forgot to mention: Happens with gdc 10, 11 and 12 the same way.
What version of glibc are you using? Not encountered this myself from debian's gcc packages. $ echo "pragma(msg, int(__VERSION__));" | /usr/bin/gdc-9 -c -x d - 2076 $ echo "pragma(msg, int(__VERSION__));" | /usr/bin/gdc-10 -c -x d - 2076 $ echo "pragma(msg, int(__VERSION__));" | /usr/bin/gdc-11 -c -x d - 2076 $ echo "pragma(msg, int(__VERSION__));" | /usr/bin/gdc-12 -c -x d - 2099
Information for package glibc: ------------------------------ Repository : Main Repository (OSS) Name : glibc Version : 2.35-2.1 Arch : x86_64 Vendor : openSUSE Installed Size : 7.0 MiB Installed : Yes Status : up-to-date Source package : glibc-2.35-2.1.src Upstream URL : http://www.gnu.org/software/libc/libc.html Summary : Standard Shared Libraries (from the GNU C Library)
Ah, it's likely something even more fiendish than that. What encodings are you using? (i.e: locale -a)
locale LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_ALL=
I had a quick debugging session: The DMD lexer code doesn't really care about the size of the buffer and instead runs until it encounters either a 0 or 0x1A byte. The stdin read loop in d_parse_file doesn't explicitly 0-terminate the buffer, which means that it works randomly... With explicit termination it works: ~> echo -e "pragma(msg, int(__VERSION__));\0" | /usr/bin/gdc -x d - 2076 /usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/lib64/gcc/x86_64-suse-linux/11/../../../../lib64/Scrt1.o: in function `_start': /home/abuild/rpmbuild/BUILD/glibc-2.35/csu/../sysdeps/x86_64/start.S:103: undefined reference to `main' collect2: error: ld returned 1 exit status This also explains the huge amount of diagnostic messages, which are random based on the content past the module source buffer. It just runs until it encounters a terminator by accident. strace -k shows that the attempts to open "<stdin>" are actually caused by the printing of diagnostic messages itself, arguably a separate bug.
Seen by valgrind: ==23477== Conditional jump or move depends on uninitialised value(s) ==23477== at 0x159B539: Lexer::scan(Token*) (lexer.c:267) ==23477== by 0x15CE1F3: UnknownInlinedFun (lexer.c:161) ==23477== by 0x15CE1F3: Parser::parseDeclDefs(int, Dsymbol**, PrefixAttributes*) (parse.c:858) ==23477== by 0x15CE44C: Parser::parseModule() (parse.c:204) ==23477== by 0x14EA392: Module::parse() (dmodule.c:623) ==23477== by 0x166B7F8: d_parse_file() [clone .lto_priv.0] (d-lang.cc:983) ==23477== by 0x1385B54: compile_file() [clone .lto_priv.0] (toplev.c:457) ==23477== by 0x1370773: UnknownInlinedFun (toplev.c:2201) ==23477== by 0x1370773: toplev::main(int, char**) (toplev.c:2340) ==23477== by 0x136F9B3: main (main.c:39) ==23477== ==23477== Use of uninitialised value of size 8 ==23477== at 0x159B543: Lexer::scan(Token*) (lexer.c:267) ==23477== by 0x15CE1F3: UnknownInlinedFun (lexer.c:161) ==23477== by 0x15CE1F3: Parser::parseDeclDefs(int, Dsymbol**, PrefixAttributes*) (parse.c:858) ==23477== by 0x15CE44C: Parser::parseModule() (parse.c:204) ==23477== by 0x14EA392: Module::parse() (dmodule.c:623) ==23477== by 0x166B7F8: d_parse_file() [clone .lto_priv.0] (d-lang.cc:983) ==23477== by 0x1385B54: compile_file() [clone .lto_priv.0] (toplev.c:457) ==23477== by 0x1370773: UnknownInlinedFun (toplev.c:2201) ==23477== by 0x1370773: toplev::main(int, char**) (toplev.c:2340) ==23477== by 0x136F9B3: main (main.c:39) ==23477== 2076
(In reply to Fabian Vogt from comment #6) > I had a quick debugging session: The DMD lexer code doesn't really care > about the size of the buffer and instead runs until it encounters either a 0 > or 0x1A byte. The stdin read loop in d_parse_file doesn't explicitly > 0-terminate the buffer, which means that it works randomly... > OK, so the suggestion would be to zero the padding at the end of the input buffer then. --- a/gcc/d/d-lang.cc +++ b/gcc/d/d-lang.cc @@ -1072,6 +1072,10 @@ d_parse_file (void) global.params.doHdrGeneration); modules.push (m); + /* Zero the padding past the end of the buffer so the D lexer has a + sentinel. The lexer only reads up to 4 bytes at a time. */ + memset (buffer + len, '\0', 16); + /* Overwrite the source file for the module, the one created by Module::create would have a forced a `.d' suffix. */ m->src.length = len;
(In reply to ibuclaw from comment #8) > (In reply to Fabian Vogt from comment #6) > > I had a quick debugging session: The DMD lexer code doesn't really care > > about the size of the buffer and instead runs until it encounters either a 0 > > or 0x1A byte. The stdin read loop in d_parse_file doesn't explicitly > > 0-terminate the buffer, which means that it works randomly... > > > > OK, so the suggestion would be to zero the padding at the end of the input > buffer then. > > --- a/gcc/d/d-lang.cc > +++ b/gcc/d/d-lang.cc > @@ -1072,6 +1072,10 @@ d_parse_file (void) > global.params.doHdrGeneration); > modules.push (m); > > + /* Zero the padding past the end of the buffer so the D lexer has a > + sentinel. The lexer only reads up to 4 bytes at a time. */ > + memset (buffer + len, '\0', 16); > + > /* Overwrite the source file for the module, the one created by > Module::create would have a forced a `.d' suffix. */ > m->src.length = len; Yep, that should work. Though I wonder why 16B of padding and not just a single byte for the 0. FWICT the lexer reads a single byte at a time only (utf8_t is an unsigned char), so it should stop at the first 0. The comment above explaining the padding mentions a "final '\n'" which should probably be adjusted with the change to \0.
(In reply to Fabian Vogt from comment #9) > (In reply to ibuclaw from comment #8) > > (In reply to Fabian Vogt from comment #6) > > > I had a quick debugging session: The DMD lexer code doesn't really care > > > about the size of the buffer and instead runs until it encounters either a 0 > > > or 0x1A byte. The stdin read loop in d_parse_file doesn't explicitly > > > 0-terminate the buffer, which means that it works randomly... > > > > > > > OK, so the suggestion would be to zero the padding at the end of the input > > buffer then. > > > > --- a/gcc/d/d-lang.cc > > +++ b/gcc/d/d-lang.cc > > @@ -1072,6 +1072,10 @@ d_parse_file (void) > > global.params.doHdrGeneration); > > modules.push (m); > > > > + /* Zero the padding past the end of the buffer so the D lexer has a > > + sentinel. The lexer only reads up to 4 bytes at a time. */ > > + memset (buffer + len, '\0', 16); > > + > > /* Overwrite the source file for the module, the one created by > > Module::create would have a forced a `.d' suffix. */ > > m->src.length = len; > > Yep, that should work. Though I wonder why 16B of padding and not just a > single byte for the 0. FWICT the lexer reads a single byte at a time only > (utf8_t is an unsigned char), so it should stop at the first 0. > > The comment above explaining the padding mentions a "final '\n'" which > should probably be adjusted with the change to \0. The lexer scans spaces 4 bytes at a time (*cast(uint*)p == 0x20202020). So should zero at least 4 bytes to avoid asan complaining about reading uninitialized memory.
(In reply to ibuclaw from comment #10) > (In reply to Fabian Vogt from comment #9) > > (In reply to ibuclaw from comment #8) > > > (In reply to Fabian Vogt from comment #6) > > > > I had a quick debugging session: The DMD lexer code doesn't really care > > > > about the size of the buffer and instead runs until it encounters either a 0 > > > > or 0x1A byte. The stdin read loop in d_parse_file doesn't explicitly > > > > 0-terminate the buffer, which means that it works randomly... > > > > > > > > > > OK, so the suggestion would be to zero the padding at the end of the input > > > buffer then. > > > > > > --- a/gcc/d/d-lang.cc > > > +++ b/gcc/d/d-lang.cc > > > @@ -1072,6 +1072,10 @@ d_parse_file (void) > > > global.params.doHdrGeneration); > > > modules.push (m); > > > > > > + /* Zero the padding past the end of the buffer so the D lexer has a > > > + sentinel. The lexer only reads up to 4 bytes at a time. */ > > > + memset (buffer + len, '\0', 16); > > > + > > > /* Overwrite the source file for the module, the one created by > > > Module::create would have a forced a `.d' suffix. */ > > > m->src.length = len; > > > > Yep, that should work. Though I wonder why 16B of padding and not just a > > single byte for the 0. FWICT the lexer reads a single byte at a time only > > (utf8_t is an unsigned char), so it should stop at the first 0. > > > > The comment above explaining the padding mentions a "final '\n'" which > > should probably be adjusted with the change to \0. > > The lexer scans spaces 4 bytes at a time (*cast(uint*)p == 0x20202020). So > should zero at least 4 bytes to avoid asan complaining about reading > uninitialized memory. Indeed, that's the case with GCC 12. I've been looking at the code from GCC 11, where it doesn't do that yet (and the frontend is still in C).
The master branch has been updated by Iain Buclaw <ibuclaw@gcc.gnu.org>: https://gcc.gnu.org/g:a0bc7fd42136f124726985b1ab4dcde739cd260e commit r13-867-ga0bc7fd42136f124726985b1ab4dcde739cd260e Author: Iain Buclaw <ibuclaw@gdcproject.org> Date: Tue May 31 14:45:02 2022 +0200 d: Fix D lexer sometimes fails to compile code read from stdin As of gdc-12, the lexer expects there 4 bytes of zero padding at the end of the source buffer to mark the end of input. Sometimes when reading from stdin, the data at the end of input is garbage rather than zeroes. Fix that by explicitly calling memset past the end of the buffer. PR d/105544 gcc/d/ChangeLog: * d-lang.cc (d_parse_file): Zero padding past the end of the stdin buffer so the D lexer has a sentinel to stop parsing at.
The releases/gcc-12 branch has been updated by Iain Buclaw <ibuclaw@gcc.gnu.org>: https://gcc.gnu.org/g:f106ef53024cc464ae446189fbad373caaff058e commit r12-8439-gf106ef53024cc464ae446189fbad373caaff058e Author: Iain Buclaw <ibuclaw@gdcproject.org> Date: Tue May 31 14:45:02 2022 +0200 d: Fix D lexer sometimes fails to compile code read from stdin As of gdc-12, the lexer expects there 4 bytes of zero padding at the end of the source buffer to mark the end of input. Sometimes when reading from stdin, the data at the end of input is garbage rather than zeroes. Fix that by explicitly calling memset past the end of the buffer. PR d/105544 gcc/d/ChangeLog: * d-lang.cc (d_parse_file): Zero padding past the end of the stdin buffer so the D lexer has a sentinel to stop parsing at. (cherry picked from commit a0bc7fd42136f124726985b1ab4dcde739cd260e)
Committed patch posted here to mainline and backported to gcc-12.