This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
[RFC] Use wide chars to represent Fortran source internally
- From: FX <fxcoudert at gmail dot com>
- To: Fortran List <fortran at gcc dot gnu dot org>, "gcc-patches@gcc.gnu.org patches" <gcc-patches at gcc dot gnu dot org>
- Date: Tue, 29 Apr 2008 20:06:31 +0200
- Subject: [RFC] Use wide chars to represent Fortran source internally
Hi all,
This patch is a first step towards handling non-ASCII source file
encodings and non-default character kinds in gfortran. It is a rather
tedious patch, but neither very invasive nor very difficult. I make no
promise to implement non-default character kinds in 4.4, but I have
recently thought quite a bit about it (with help from Tobias), and at
some point I saw what to do and also had some time to do it (long
train journey and conference), hence this patch.
This patch doesn't (shouldn't?) change the current behaviour of the
front-end except for one thing: when -fbackslash is used, we
now allow \x??, \u???? and \U???????? escape sequences (where each
“?” is a hexadecimal digit), which are translated to wide characters.
All I have done is change the internal representation of
each line of source from a sequence of “char” to a sequence of 32-bit
integers, adapt the front-end code accordingly (including locus
handling), and audit the code consuming source characters to check
that we do sensible things with them. As a result, after this patch,
switching to source file encodings other than ASCII will only require
changes in the few functions that actually read files, and allowing
more character kinds will only (sic!) require changes to
string/substring handling functions. The middle ground between these
two has been carefully audited and should not need any further
modification.
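To make the escape sequences described above concrete, here is a minimal sketch of how a \x??, \u???? or \U???????? sequence could be decoded into a 32-bit value. It is not taken from the patch: wide_char_t, hex_digit_value and decode_wide_escape are hypothetical names chosen for illustration, and the real front-end routine may be organised quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's gfc_char_t:
   an unsigned integer type of at least 32 bits.  */
typedef uint32_t wide_char_t;

/* Return the value of one hexadecimal digit, or -1 if C is not one.  */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode the escape sequence starting at S (which points at the
   backslash).  On success, store the character in *OUT and return the
   number of source characters consumed; return 0 if S does not start
   with a complete \x??, \u???? or \U???????? sequence.  */
size_t decode_wide_escape(const char *s, wide_char_t *out)
{
    size_t ndigits, i;
    wide_char_t value = 0;

    assert(s != NULL && out != NULL);
    if (s[0] != '\\')
        return 0;

    switch (s[1]) {
    case 'x': ndigits = 2; break;
    case 'u': ndigits = 4; break;
    case 'U': ndigits = 8; break;
    default:  return 0;
    }

    for (i = 0; i < ndigits; i++) {
        int d = hex_digit_value(s[2 + i]);
        if (d < 0)
            return 0;  /* Also stops safely at the terminating '\0'.  */
        value = (value << 4) | (wide_char_t) d;
    }

    *out = value;
    return 2 + ndigits;
}
```

In this sketch an incomplete sequence such as \u12 is rejected rather than partially decoded; whether the front-end reports an error or treats the text literally in that case is not specified here.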
Here's a description of the major parts of the patch, to make it
easier to review:
-- create a gfc_char_t type that is an unsigned integer type of at
least 32 bits
-- switch all structures dealing with source lines to this type
-- provide functions to handle strings of wide characters (their
name starts with "gfc_wide_": gfc_wide_is_printable,
gfc_wide_is_digit, gfc_wide_fits_in_byte, gfc_wide_tolower,
gfc_wide_strlen)
-- make two new versions of the gfc_next_char and gfc_peek_char
functions, named gfc_next_ascii_char and gfc_peek_ascii_char: most
functions consuming source characters (including all matchers except
the literal string matcher) will only ever deal with one-byte
characters, so we give them exactly that (the trick is to return a
dummy one-byte value, such as UCHAR_MAX, in place of any character
wider than one byte; this is fine, as no consumer ever expects
UCHAR_MAX)
-- in the routine handling backslash-escaped sequences when the
-fbackslash option is used, also handle \x??, \u???? and \U????????
wide-char escape patterns
-- modify show_locus to display large character source lines, by
escape-encoding characters that are not ASCII-printable (extending
what was previously done)
-- when matching literal strings, error out for now when a
character wider than a byte is encountered (this can only be reached
by using \u and \U escape sequences right now)
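The pieces listed above can be sketched roughly as follows. This is an illustration only, not code from the patch: the type and function names are hypothetical (loosely modelled on gfc_char_t and the gfc_wide_* helpers), their signatures are assumptions, and the escape format used for display is a guess at what show_locus might print.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for gfc_char_t.  */
typedef uint32_t wide_char_t;

/* Nonzero if C can be stored in a single byte.  */
int wide_fits_in_byte(wide_char_t c) { return c <= UCHAR_MAX; }

/* Nonzero if C is a printable ASCII character (locale-independent).  */
int wide_is_printable(wide_char_t c) { return c >= 0x20 && c <= 0x7E; }

/* Nonzero if C is an ASCII decimal digit.  */
int wide_is_digit(wide_char_t c) { return c >= '0' && c <= '9'; }

/* Lower-case ASCII letters only; every other value is returned as is.  */
wide_char_t wide_tolower(wide_char_t c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Length of a zero-terminated wide string.  */
size_t wide_strlen(const wide_char_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* What a one-byte consumer (e.g. a matcher) would see: characters
   wider than one byte are replaced by the dummy value UCHAR_MAX.  */
unsigned char narrow_for_matcher(wide_char_t c)
{
    return wide_fits_in_byte(c) ? (unsigned char) c : UCHAR_MAX;
}

/* Write a displayable form of C into BUF (at least 11 bytes) and
   return its length.  Non-printable characters are escape-encoded;
   the exact format is an assumption made for this sketch.  */
int escape_for_display(wide_char_t c, char *buf, size_t bufsize)
{
    if (wide_is_printable(c))
        return snprintf(buf, bufsize, "%c", (int) c);
    if (wide_fits_in_byte(c))
        return snprintf(buf, bufsize, "\\x%02lx", (unsigned long) c);
    if (c <= 0xFFFF)
        return snprintf(buf, bufsize, "\\u%04lx", (unsigned long) c);
    return snprintf(buf, bufsize, "\\U%08lx", (unsigned long) c);
}
```

Note that in this sketch a genuine byte value of 255 and the dummy value are indistinguishable to the narrow consumer, which is acceptable only because, as stated above, no consumer expects UCHAR_MAX.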
As a consequence of this patch, the memory used to store the source
file roughly quadruples. I have not seen a single case where this
increases the total memory used during compilation by more than 7%,
even at -O0: for the huge cp2k-in-one-file, a 26 MB, 430k-line source
file, compilation at -O0 requires 1.5 GB of memory, so the additional
75 MB goes unnoticed.
If other maintainers feel it is an important issue, I'm open to
suggestions, but I strongly suggest we consider maintainability: a
variable-width source representation would probably be more
trouble. Even if there is some consensus that it is needed, I would
suggest doing that as a second step, since the separation of types
and the helper functions provided in this patch are already a
first step in the right direction.
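The memory figures quoted above can be checked with a back-of-the-envelope calculation. The sketch below assumes the source was previously stored at one byte per character, ignores any per-line bookkeeping overhead, and treats 1.5 GB as 1500 MB; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Extra megabytes needed when each source character grows from
   1 byte to sizeof(uint32_t) bytes.  */
unsigned long extra_source_mb(unsigned long source_mb)
{
    return source_mb * (sizeof(uint32_t) - 1);
}

/* Extra memory as a percentage of total compilation memory,
   rounded down.  */
unsigned long extra_percent(unsigned long extra_mb, unsigned long total_mb)
{
    assert(total_mb > 0);
    return extra_mb * 100 / total_mb;
}
```

For the 26 MB file this gives 78 MB of extra storage, in line with the roughly 75 MB mentioned, or about 5% of 1500 MB.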
I've already built it on my laptop (i686-darwin), bootstrapped and
regtested on the compile farm (x86_64-linux). I welcome your comments
and reviews. OK to commit?
FX
--
François-Xavier Coudert
http://www.homepages.ucl.ac.uk/~uccafco/
Attachment:
wide_char_1.ChangeLog
Description: Binary data
Attachment:
wide_char_1.diff
Description: Binary data