This is the mail archive of the
java@gcc.gnu.org
mailing list for the Java project.
Re: Universal Character Names, v2
- From: Neil Booth <neil at daikokuya dot co dot uk>
- To: Zack Weinberg <zack at codesourcery dot com>
- Cc: "Martin v. Löwis" <martin at v dot loewis dot de>, gcc-patches at gcc dot gnu dot org, java at gcc dot gnu dot org
- Date: Mon, 2 Dec 2002 00:24:41 +0000
- Subject: Re: Universal Character Names, v2
- References: <200211282334.gASNYdTA004058@mira.informatik.hu-berlin.de> <87r8d5rq2b.fsf@egil.codesourcery.com> <20021129071218.GB8045@daikokuya.co.uk> <87u1hxbe0z.fsf@egil.codesourcery.com>
Zack Weinberg wrote:-
> modulo the fact that we may not support binary encodings yet.
I've had more thoughts about arbitrary charsets. Rather than converting
to UTF-8 on a per-character basis, the obvious approach is to convert
a line at a time from the newline handler (plus a call when starting
a buffer to get the process started). This would greatly reduce the
overhead. We're best off using our own converters, and adding them
one-by-one on demand (a la GNAT), rather than relying on host
implementations of mbtowc or iconv, IMO. Since they're scanning the
line, they may as well do trigraph conversion at the same time, and
possibly splice lines.
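A minimal sketch of what a combined scan might look like: copying a line into the output buffer while replacing trigraphs in the same pass. The function names and structure here are illustrative only, not actual cpplib identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Map the third character of a trigraph (??X) to its replacement,
   or return 0 if X does not form a trigraph.  */
static char trigraph_map(char third)
{
  switch (third)
    {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Hypothetical single-pass line converter: copy SRC into DST,
   replacing trigraphs as we go.  Charset conversion to UTF-8 could
   be folded into this same loop.  Returns the converted length.  */
static size_t convert_line(char *dst, const char *src)
{
  size_t out = 0;
  while (*src)
    {
      char repl;
      if (src[0] == '?' && src[1] == '?'
          && (repl = trigraph_map(src[2])) != 0)
        {
          dst[out++] = repl;
          src += 3;
        }
      else
        dst[out++] = *src++;
    }
  dst[out] = '\0';
  return out;
}
```

Line splicing (a backslash-newline check at the end of each converted line) would slot naturally into the same pass.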
That would leave the question of whether we do such a scan for the
normal case too (and thereby stop mmapping and drop our NUL trick).
Good caret diagnostics in this situation are best handled, I think,
by changing from line/col location via line-map to a single "unsigned
int" representing the position in the translation unit in logical
characters.
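To illustrate the idea (this is a sketch, not the actual line-map API): if each line's starting offset in logical characters is recorded as it is converted, a single unsigned int position can be mapped back to line/column for caret diagnostics by binary search.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative table: the logical-character offset at which each
   line of the translation unit begins.  */
struct line_table
{
  const unsigned int *line_starts;  /* offset of first char of each line */
  size_t n_lines;
};

/* Map a flat logical-character position back to 1-based line and
   column, using binary search over the recorded line starts.  */
static void locate(const struct line_table *t, unsigned int pos,
                   unsigned int *line, unsigned int *col)
{
  size_t lo = 0, hi = t->n_lines;
  while (hi - lo > 1)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (t->line_starts[mid] <= pos)
        lo = mid;
      else
        hi = mid;
    }
  *line = (unsigned int) lo + 1;
  *col = pos - t->line_starts[lo] + 1;
}
```

The point is that tokens only need to carry one integer; the line/column split is recovered lazily, and only when a diagnostic is actually issued.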
Thoughts?
Neil.