This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



[RFC] Use wide chars to represent Fortran source internally


Hi all,

This patch is a first step towards handling non-ASCII-encoded source files and non-default character kinds in gfortran. It is a rather tedious patch, but neither very invasive nor very difficult. I don't promise to implement non-default character kinds in 4.4, but I have recently thought quite a bit about it (with help from Tobias); at some point I saw what to do and also had some time to do it (long train journey and conference), hence this patch.

This patch doesn't (shouldn't?) change the current behaviour of the front-end except for one single thing: when -fbackslash is used, we now allow \x??, \u???? and \U???????? escape sequences (where each “?” is a hexadecimal digit), which are translated to wide characters. All I have done is change the internal representation of each line of source from a sequence of “char” to a sequence of 32-bit integers, adapt the front-end code accordingly (including locus handling), and audit the code consuming source characters to check that we do sensible things with them. As a result, after this patch, switching to source file encodings other than ASCII will only require changes in the few functions that actually read files, and allowing more character kinds will only (sic!) require changes to string/substring handling functions. The middle ground between these two has been carefully audited and should not need further modification.
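To make the escape-sequence rule concrete, here is a minimal standalone sketch of decoding \x??, \u???? and \U???????? into a 32-bit value. It is not the code from the patch: wide_char_t, hex_value and decode_wide_escape are hypothetical names chosen for illustration, and the choice to return the number of consumed characters is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's gfc_char_t. */
typedef uint32_t wide_char_t;

/* Return the value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode the escape starting at s (s points at the backslash).
   On success, store the character in *out and return the number of
   source characters consumed.  Return 0 if s does not start with a
   well-formed \x, \u or \U escape.  */
static size_t decode_wide_escape(const char *s, wide_char_t *out)
{
    size_t ndigits, i;
    wide_char_t value = 0;

    if (s[0] != '\\')
        return 0;

    switch (s[1]) {
    case 'x': ndigits = 2; break;
    case 'u': ndigits = 4; break;
    case 'U': ndigits = 8; break;
    default:  return 0;
    }

    for (i = 0; i < ndigits; i++) {
        int d = hex_value(s[2 + i]);
        if (d < 0)              /* also stops at the terminating NUL */
            return 0;
        value = (value << 4) | (wide_char_t) d;
    }

    *out = value;
    return 2 + ndigits;
}
```

With this sketch, the input "\u00e9" consumes six characters and yields the value 0xE9, while a truncated sequence such as "\u12" is rejected.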

Here's a description of the major parts of the patch, to make it easier to review:
-- create a gfc_char_t type that is an unsigned integer type of at least 32 bits
-- switch all structures dealing with source lines to this type
-- provide functions to handle strings of wide characters (their names start with "gfc_wide_": gfc_wide_is_printable, gfc_wide_is_digit, gfc_wide_fits_in_byte, gfc_wide_tolower, gfc_wide_strlen)
-- add two new variants of the gfc_next_char and gfc_peek_char functions, named gfc_next_ascii_char and gfc_peek_ascii_char: most functions consuming source characters (including all matchers except the literal string matcher) only ever deal with one-byte characters, so we give them exactly that. The trick is to return a dummy one-byte value, like UCHAR_MAX, in place of any character that is wider than one byte; this is fine, as no consumer ever expects UCHAR_MAX
-- in the routine handling backslash-escaped sequences when the -fbackslash option is used, also handle the \x??, \u???? and \U???????? wide-char escape patterns
-- modify show_locus to display large character source lines, by escape-encoding characters that are not ASCII-printable (extending what was previously done)
-- when matching literal strings, error out for now when a character wider than a byte is encountered (this can only be reached through the \u and \U escape sequences right now)
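For reviewers who want a feel for the helper layer before opening the diff, here is a rough standalone sketch of what such helpers and the one-byte narrowing could look like. It is an illustration only: wide_char_t stands in for gfc_char_t, the wide_* functions mirror the gfc_wide_* names listed above but their signatures are guesses, and narrow_for_matcher is a hypothetical name for the narrowing step done by the *_ascii_char variants.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for gfc_char_t: unsigned, at least 32 bits. */
typedef uint32_t wide_char_t;

/* Nonzero if c can be stored in a single byte. */
static int wide_fits_in_byte(wide_char_t c)
{
    return c <= UCHAR_MAX;
}

/* Nonzero if c is a printable ASCII character (space through tilde). */
static int wide_is_printable(wide_char_t c)
{
    return c >= 0x20 && c <= 0x7E;
}

/* Nonzero if c is an ASCII decimal digit. */
static int wide_is_digit(wide_char_t c)
{
    return c >= '0' && c <= '9';
}

/* Lower-case ASCII letters only; every other value is returned unchanged. */
static wide_char_t wide_tolower(wide_char_t c)
{
    return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Length of a zero-terminated wide string, in characters. */
static size_t wide_strlen(const wide_char_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* What a one-byte consumer sees: the character itself if it fits in a
   byte, otherwise the UCHAR_MAX placeholder that no matcher expects.  */
static unsigned char narrow_for_matcher(wide_char_t c)
{
    return wide_fits_in_byte(c) ? (unsigned char) c : UCHAR_MAX;
}
```

In this sketch a matcher that calls narrow_for_matcher never has to know about wide characters: anything outside the one-byte range simply fails to match, and only the literal string matcher needs to look at the full 32-bit value.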



As a consequence of this patch, the memory used to store the source file roughly quadruples. I have not seen a single case where this increases the total memory used during compilation by more than 7%, even at -O0: for the huge cp2k-in-one-file, a 26 MB, 430k-line source file, compilation at -O0 requires 1.5 GB of memory, so the additional 75 MB goes unnoticed. If other maintainers feel this is an important issue, I'm open to suggestions, but I strongly suggest we consider maintainability: a variable-width encoding for the source representation would probably be more trouble. Even if there is some consensus that it is needed, I would suggest doing that as a second step, as the separation of types and the helper functions provided in this patch are already a first step in the right direction.
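The figures above can be checked with a back-of-the-envelope calculation. The sketch below assumes one byte per stored character before the patch and four bytes after, treats 1 MB as 10^6 bytes, and ignores any per-line bookkeeping; extra_source_bytes and percent_of are hypothetical helper names.

```c
#include <stdint.h>

/* Extra bytes needed when each stored source character grows from
   old_width to new_width bytes (per-line bookkeeping is ignored).  */
static uint64_t extra_source_bytes(uint64_t nchars,
                                   unsigned old_width, unsigned new_width)
{
    return nchars * (uint64_t) (new_width - old_width);
}

/* extra as a percentage of total, rounded down. */
static unsigned percent_of(uint64_t extra, uint64_t total)
{
    return (unsigned) (extra * 100 / total);
}
```

For a 26 MB source, extra_source_bytes(26000000, 1, 4) gives 78,000,000 bytes, in line with the roughly 75 MB quoted above, and percent_of applied to that figure and a 1.5 GB total gives 5, which is below the 7% upper bound.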


I've already built it on my laptop (i686-darwin), bootstrapped and regtested on the compile farm (x86_64-linux). I welcome your comments and reviews. OK to commit?


FX


--
François-Xavier Coudert
http://www.homepages.ucl.ac.uk/~uccafco/

Attachment: wide_char_1.ChangeLog
Description: Binary data

Attachment: wide_char_1.diff
Description: Binary data

