
Re: Universal Character Names, v2


Martin v. Löwis wrote:-

> The actual processing would, of course, take one character at a
> time. It would count characters to determine columns.

Yes, that's straightforward.  But it doesn't help to simplify the
lexer.

> Reading the input file would *only* do conversion from the input
> charset to, say, UTF-8. This normally won't produce any diagnostics,
> unless there is an actual encoding error. In that case, further
> compilation needs to terminate, so that will be the last error you see.

As I said in a mail upthread, since we're scanning the whole line for
charset conversion before tokenization, perhaps we should take the
opportunity to convert trigraphs and splice lines at the same time, for
efficiency reasons.

This is the fundamental trade-off: do we keep the awkwardness of
trigraph / escaped newline handling that is unavoidable when tokens are
not guaranteed to be contiguous (you've seen the pain it causes with
UCNs), or do we clean that up in the same pre-scan pass that we're doing
anyway to convert to UTF-8?  Getting rid of all the nastiness of phases
1 and 2 is very appealing.  The (only?) downside is that it makes line /
column tracking a little trickier; handling that would probably take the
form of some kind of enhancement to line-map.[ch].  But there's more...

A long-standing project I have in mind when thinking about these things
is caret diagnostics, where a section of the offending source line is
output and a caret on the following line points to the exact location of
the complaint.  That location might even be mid-token, e.g. when
complaining about an invalid character in a UCN escape in the middle of
a string literal; it looks much better to point to the exact character.
See e.g. any EDG-based compiler.
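For concreteness, the kind of output I mean, with the exact wording and
layout made up purely for illustration:

foo.c:7000: error: non-hex digit in universal character name
  s = "one\u12g4two";
              ^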

To do this, we need the codec involved both in printing the offending
line and in translating line + logical column into a character location.
If we have a diagnostic for line 7000 of foo.c, and we've converted the
whole file as you suggest, then we either have to start from the
beginning and convert all the way to line 7000 again to find where it
is, which sucks, or we have to keep some kind of table mapping line
number to file offset every N lines or M bytes, say.  This is not so
hard if we keep things more localized by converting a line at a time
instead.
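For reference, the table for the whole-file approach would be something
along these lines (the names and the stride are invented; it's only
meant to show what we would have to maintain):

#include <stddef.h>

#define LINE_STRIDE 64   /* Record every 64th line, say.  */

/* Record the file offset of every LINE_STRIDE'th source line as the
   file is converted, so that reprinting line 7000 for a diagnostic
   only means re-converting from the nearest recorded offset rather
   than from the start of the file.  offsets[0] is always 0 (line 1).  */
struct line_index
{
  size_t *offsets;       /* offsets[i] is the byte offset of line
                            i * LINE_STRIDE + 1 in the raw file.  */
  unsigned int count;    /* Number of recorded offsets (>= 1).  */
};

/* Return the byte offset from which to start re-converting in order to
   reach LINE, and set *START_LINE to the line number at that offset.  */
static size_t
nearest_offset (const struct line_index *idx, unsigned int line,
                unsigned int *start_line)
{
  unsigned int i = (line - 1) / LINE_STRIDE;

  if (i >= idx->count)
    i = idx->count - 1;
  *start_line = i * LINE_STRIDE + 1;
  return idx->offsets[i];
}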

So I'm thinking that perhaps a bit more complexity here to track the
location of escaped newlines would be a price worth paying for the
benefit of having phases 1 and 2 completely done by the time we come to
tokenization, and for a clear path to caret diagnostics.  At the very
least we could get rid of the existing get_effective_char(), the tab
tracking, skip_escaped_newlines() and your UCN problems.  The tracking
itself could be quite modest; see the sketch below.  Does this make
sense?
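Something like this, say (the names are invented, and it glosses over
the fact that an escaped newline also moves us onto the next physical
line):

/* One note per place in a cleaned line where source characters were
   removed, so that a column in the clean buffer can be mapped back to
   a column in the original source.  */
struct line_note
{
  unsigned int col;     /* Column in the cleaned line.  */
  unsigned int extra;   /* Source characters dropped there: 2 for a
                           trigraph, 2 for a backslash-newline.  */
};

/* Map a column in the cleaned line back to a source column.  */
static unsigned int
source_column (const struct line_note *notes, unsigned int n_notes,
               unsigned int col)
{
  unsigned int adjust = 0;
  unsigned int i;

  for (i = 0; i < n_notes && notes[i].col <= col; i++)
    adjust += notes[i].extra;

  return col + adjust;
}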

I'm not sure which of the two approaches is better, but the
simplification of the current lexer that would come from tokens being
guaranteed contiguous is attractive.  It would have further minor
benefits for -C handling, and would make it straightforward for CPP
output to preserve the form of whitespace.

> I doubt that. If stdio is used, many files will completely live in the
> stdio buffer, anyway.

stdio is a redundant layer of buffering that we don't control, and one
that Zack and I are, I think, agreed we have no intention of using.
We'll either stick with the current mmap() / full-file read(), or
possibly read() into a buffer in chunks of, say, 16K.
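By the latter I mean nothing fancier than, roughly (a sketch only, with
an invented helper name):

#include <errno.h>
#include <unistd.h>

#define CHUNK_SIZE (16 * 1024)

/* Read up to CHUNK_SIZE bytes from FD into BUF, retrying on EINTR,
   with no stdio layer in between.  Returns the byte count, 0 at end
   of file, or -1 on error.  */
static ssize_t
read_chunk (int fd, unsigned char *buf)
{
  ssize_t got;

  do
    got = read (fd, buf, CHUNK_SIZE);
  while (got == -1 && errno == EINTR);

  return got;
}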

Zack, I'd be interested in your current thoughts on any of the above.

Neil.

