Bug 105544 - gdc fails to compile d source from stdin
Summary: gdc fails to compile d source from stdin
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: d (show other bugs)
Version: 12.1.0
: P3 normal
Target Milestone: ---
Assignee: Iain Buclaw
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-10 09:33 UTC by Fabian Vogt
Modified: 2022-05-31 16:37 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work: 11.1.0
Known to fail: 12.1.0
Last reconfirmed: 2022-05-10 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Fabian Vogt 2022-05-10 09:33:43 UTC
Reading code from stdin doesn't work. strace shows that it attempts to read from a file "<stdin>" repeatedly.

~> echo "pragma(msg, int(__VERSION__));" | /usr/bin/gdc -x d -
<stdin>:2:1: error: Outside Unicode code space
<stdin>:2:1: error: Outside Unicode code space
<stdin>:2:1: error: Outside Unicode code space
<stdin>:2:1: error: Outside Unicode code space
...
<stdin>:2: error: no identifier for declarator �����������...

Expected behaviour:

~> echo "pragma(msg, int(__VERSION__));" > ver.d; /usr/bin/gdc-11 -x d /tmp/ver.d
2076
/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/lib64/gcc/x86_64-suse-linux/11/../../../../lib64/Scrt1.o: in function `_start':
/home/abuild/rpmbuild/BUILD/glibc-2.35/csu/../sysdeps/x86_64/start.S:103: undefined reference to `main'
collect2: error: ld returned 1 exit status
Comment 1 Fabian Vogt 2022-05-10 09:34:44 UTC
Forgot to mention: Happens with gdc 10, 11 and 12 the same way.
Comment 2 ibuclaw 2022-05-10 12:11:47 UTC
What version of glibc are you using?

Not encountered this myself from debian's gcc packages.

$ echo "pragma(msg, int(__VERSION__));" | /usr/bin/gdc-9 -c -x d -
2076
$ echo "pragma(msg, int(__VERSION__));" | /usr/bin/gdc-10 -c -x d -
2076
$ echo "pragma(msg, int(__VERSION__));" | /usr/bin/gdc-11 -c -x d -
2076
$ echo "pragma(msg, int(__VERSION__));" | /usr/bin/gdc-12 -c -x d -
2099
Comment 3 Martin Liška 2022-05-10 12:14:29 UTC
Information for package glibc:
------------------------------
Repository     : Main Repository (OSS)
Name           : glibc
Version        : 2.35-2.1
Arch           : x86_64
Vendor         : openSUSE
Installed Size : 7.0 MiB
Installed      : Yes
Status         : up-to-date
Source package : glibc-2.35-2.1.src
Upstream URL   : http://www.gnu.org/software/libc/libc.html
Summary        : Standard Shared Libraries (from the GNU C Library)
Comment 4 ibuclaw 2022-05-11 22:06:41 UTC
Ah, it's likely something even more fiendish than that.

What encodings are you using?  (i.e: locale -a)
Comment 5 Martin Liška 2022-05-12 06:58:52 UTC
locale
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Comment 6 Fabian Vogt 2022-05-12 07:39:07 UTC
I had a quick debugging session: The DMD lexer code doesn't really care about the size of the buffer and instead runs until it encounters either a 0 or 0x1A byte. The stdin read loop in d_parse_file doesn't explicitly 0-terminate the buffer, which means that it works randomly...

With explicit termination it works:

~> echo -e "pragma(msg, int(__VERSION__));\0" | /usr/bin/gdc -x d -
2076
/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/lib64/gcc/x86_64-suse-linux/11/../../../../lib64/Scrt1.o: in function `_start':
/home/abuild/rpmbuild/BUILD/glibc-2.35/csu/../sysdeps/x86_64/start.S:103: undefined reference to `main'
collect2: error: ld returned 1 exit status

This also explains the huge amount of diagnostic messages, which are random based on the content past the module source buffer. It just runs until it encounters a terminator by accident.

strace -k shows that the attempts to open "<stdin>" are actually caused by the printing of diagnostic messages itself, arguably a separate bug.
Comment 7 Martin Liška 2022-05-12 07:47:32 UTC
Seen by valgrind:

==23477== Conditional jump or move depends on uninitialised value(s)
==23477==    at 0x159B539: Lexer::scan(Token*) (lexer.c:267)
==23477==    by 0x15CE1F3: UnknownInlinedFun (lexer.c:161)
==23477==    by 0x15CE1F3: Parser::parseDeclDefs(int, Dsymbol**, PrefixAttributes*) (parse.c:858)
==23477==    by 0x15CE44C: Parser::parseModule() (parse.c:204)
==23477==    by 0x14EA392: Module::parse() (dmodule.c:623)
==23477==    by 0x166B7F8: d_parse_file() [clone .lto_priv.0] (d-lang.cc:983)
==23477==    by 0x1385B54: compile_file() [clone .lto_priv.0] (toplev.c:457)
==23477==    by 0x1370773: UnknownInlinedFun (toplev.c:2201)
==23477==    by 0x1370773: toplev::main(int, char**) (toplev.c:2340)
==23477==    by 0x136F9B3: main (main.c:39)
==23477== 
==23477== Use of uninitialised value of size 8
==23477==    at 0x159B543: Lexer::scan(Token*) (lexer.c:267)
==23477==    by 0x15CE1F3: UnknownInlinedFun (lexer.c:161)
==23477==    by 0x15CE1F3: Parser::parseDeclDefs(int, Dsymbol**, PrefixAttributes*) (parse.c:858)
==23477==    by 0x15CE44C: Parser::parseModule() (parse.c:204)
==23477==    by 0x14EA392: Module::parse() (dmodule.c:623)
==23477==    by 0x166B7F8: d_parse_file() [clone .lto_priv.0] (d-lang.cc:983)
==23477==    by 0x1385B54: compile_file() [clone .lto_priv.0] (toplev.c:457)
==23477==    by 0x1370773: UnknownInlinedFun (toplev.c:2201)
==23477==    by 0x1370773: toplev::main(int, char**) (toplev.c:2340)
==23477==    by 0x136F9B3: main (main.c:39)
==23477== 
2076
Comment 8 ibuclaw 2022-05-12 09:32:06 UTC
(In reply to Fabian Vogt from comment #6)
> I had a quick debugging session: The DMD lexer code doesn't really care
> about the size of the buffer and instead runs until it encounters either a 0
> or 0x1A byte. The stdin read loop in d_parse_file doesn't explicitly
> 0-terminate the buffer, which means that it works randomly...
> 

OK, so the suggestion would be to zero the padding at the end of the input buffer then.

--- a/gcc/d/d-lang.cc
+++ b/gcc/d/d-lang.cc
@@ -1072,6 +1072,10 @@ d_parse_file (void)
 				      global.params.doHdrGeneration);
 	  modules.push (m);
 
+	  /* Zero the padding past the end of the buffer so the D lexer has a
+	     sentinel.  The lexer only reads up to 4 bytes at a time.  */
+	  memset (buffer + len, '\0', 16);
+
 	  /* Overwrite the source file for the module, the one created by
 	     Module::create would have a forced a `.d' suffix.  */
 	  m->src.length = len;
Comment 9 Fabian Vogt 2022-05-12 09:44:56 UTC
(In reply to ibuclaw from comment #8)
> (In reply to Fabian Vogt from comment #6)
> > I had a quick debugging session: The DMD lexer code doesn't really care
> > about the size of the buffer and instead runs until it encounters either a 0
> > or 0x1A byte. The stdin read loop in d_parse_file doesn't explicitly
> > 0-terminate the buffer, which means that it works randomly...
> > 
> 
> OK, so the suggestion would be to zero the padding at the end of the input
> buffer then.
>
> --- a/gcc/d/d-lang.cc
> +++ b/gcc/d/d-lang.cc
> @@ -1072,6 +1072,10 @@ d_parse_file (void)
>  				      global.params.doHdrGeneration);
>  	  modules.push (m);
>  
> +	  /* Zero the padding past the end of the buffer so the D lexer has a
> +	     sentinel.  The lexer only reads up to 4 bytes at a time.  */
> +	  memset (buffer + len, '\0', 16);
> +
>  	  /* Overwrite the source file for the module, the one created by
>  	     Module::create would have a forced a `.d' suffix.  */
>  	  m->src.length = len;

Yep, that should work. Though I wonder why 16B of padding and not just a single byte for the 0. FWICT the lexer reads a single byte at a time only (utf8_t is an unsigned char), so it should stop at the first 0.

The comment above explaining the padding mentions a "final '\n'" which should probably be adjusted with the change to \0.
Comment 10 ibuclaw 2022-05-12 12:58:47 UTC
(In reply to Fabian Vogt from comment #9)
> (In reply to ibuclaw from comment #8)
> > (In reply to Fabian Vogt from comment #6)
> > > I had a quick debugging session: The DMD lexer code doesn't really care
> > > about the size of the buffer and instead runs until it encounters either a 0
> > > or 0x1A byte. The stdin read loop in d_parse_file doesn't explicitly
> > > 0-terminate the buffer, which means that it works randomly...
> > > 
> > 
> > OK, so the suggestion would be to zero the padding at the end of the input
> > buffer then.
> >
> > --- a/gcc/d/d-lang.cc
> > +++ b/gcc/d/d-lang.cc
> > @@ -1072,6 +1072,10 @@ d_parse_file (void)
> >  				      global.params.doHdrGeneration);
> >  	  modules.push (m);
> >  
> > +	  /* Zero the padding past the end of the buffer so the D lexer has a
> > +	     sentinel.  The lexer only reads up to 4 bytes at a time.  */
> > +	  memset (buffer + len, '\0', 16);
> > +
> >  	  /* Overwrite the source file for the module, the one created by
> >  	     Module::create would have a forced a `.d' suffix.  */
> >  	  m->src.length = len;
> 
> Yep, that should work. Though I wonder why 16B of padding and not just a
> single byte for the 0. FWICT the lexer reads a single byte at a time only
> (utf8_t is an unsigned char), so it should stop at the first 0.
> 
> The comment above explaining the padding mentions a "final '\n'" which
> should probably be adjusted with the change to \0.

The lexer scans spaces 4 bytes at a time (*cast(uint*)p == 0x20202020). So should zero at least 4 bytes to avoid asan complaining about reading uninitialized memory.
Comment 11 Fabian Vogt 2022-05-12 13:18:22 UTC
(In reply to ibuclaw from comment #10)
> (In reply to Fabian Vogt from comment #9)
> > (In reply to ibuclaw from comment #8)
> > > (In reply to Fabian Vogt from comment #6)
> > > > I had a quick debugging session: The DMD lexer code doesn't really care
> > > > about the size of the buffer and instead runs until it encounters either a 0
> > > > or 0x1A byte. The stdin read loop in d_parse_file doesn't explicitly
> > > > 0-terminate the buffer, which means that it works randomly...
> > > > 
> > > 
> > > OK, so the suggestion would be to zero the padding at the end of the input
> > > buffer then.
> > >
> > > --- a/gcc/d/d-lang.cc
> > > +++ b/gcc/d/d-lang.cc
> > > @@ -1072,6 +1072,10 @@ d_parse_file (void)
> > >  				      global.params.doHdrGeneration);
> > >  	  modules.push (m);
> > >  
> > > +	  /* Zero the padding past the end of the buffer so the D lexer has a
> > > +	     sentinel.  The lexer only reads up to 4 bytes at a time.  */
> > > +	  memset (buffer + len, '\0', 16);
> > > +
> > >  	  /* Overwrite the source file for the module, the one created by
> > >  	     Module::create would have a forced a `.d' suffix.  */
> > >  	  m->src.length = len;
> > 
> > Yep, that should work. Though I wonder why 16B of padding and not just a
> > single byte for the 0. FWICT the lexer reads a single byte at a time only
> > (utf8_t is an unsigned char), so it should stop at the first 0.
> > 
> > The comment above explaining the padding mentions a "final '\n'" which
> > should probably be adjusted with the change to \0.
> 
> The lexer scans spaces 4 bytes at a time (*cast(uint*)p == 0x20202020). So
> should zero at least 4 bytes to avoid asan complaining about reading
> uninitialized memory.

Indeed, that's the case with GCC 12. I've been looking at the code from GCC 11, where it doesn't do that yet (and the frontend is still in C).
Comment 12 GCC Commits 2022-05-31 16:32:59 UTC
The master branch has been updated by Iain Buclaw <ibuclaw@gcc.gnu.org>:

https://gcc.gnu.org/g:a0bc7fd42136f124726985b1ab4dcde739cd260e

commit r13-867-ga0bc7fd42136f124726985b1ab4dcde739cd260e
Author: Iain Buclaw <ibuclaw@gdcproject.org>
Date:   Tue May 31 14:45:02 2022 +0200

    d: Fix D lexer sometimes fails to compile code read from stdin
    
    As of gdc-12, the lexer expects there 4 bytes of zero padding at the end
    of the source buffer to mark the end of input.  Sometimes when reading
    from stdin, the data at the end of input is garbage rather than zeroes.
    Fix that by explicitly calling memset past the end of the buffer.
    
            PR d/105544
    
    gcc/d/ChangeLog:
    
            * d-lang.cc (d_parse_file): Zero padding past the end of the stdin
            buffer so the D lexer has a sentinel to stop parsing at.
Comment 13 GCC Commits 2022-05-31 16:33:49 UTC
The releases/gcc-12 branch has been updated by Iain Buclaw <ibuclaw@gcc.gnu.org>:

https://gcc.gnu.org/g:f106ef53024cc464ae446189fbad373caaff058e

commit r12-8439-gf106ef53024cc464ae446189fbad373caaff058e
Author: Iain Buclaw <ibuclaw@gdcproject.org>
Date:   Tue May 31 14:45:02 2022 +0200

    d: Fix D lexer sometimes fails to compile code read from stdin
    
    As of gdc-12, the lexer expects there 4 bytes of zero padding at the end
    of the source buffer to mark the end of input.  Sometimes when reading
    from stdin, the data at the end of input is garbage rather than zeroes.
    Fix that by explicitly calling memset past the end of the buffer.
    
            PR d/105544
    
    gcc/d/ChangeLog:
    
            * d-lang.cc (d_parse_file): Zero padding past the end of the stdin
            buffer so the D lexer has a sentinel to stop parsing at.
    
    (cherry picked from commit a0bc7fd42136f124726985b1ab4dcde739cd260e)
Comment 14 Iain Buclaw 2022-05-31 16:37:40 UTC
Committed patch posted here to mainline and backported to gcc-12.