This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.
Re: RFC Wide Characters in I/O
- From: Jerry DeLisle <jvdelisle at verizon dot net>
- To: Janne Blomqvist <blomqvist dot janne at gmail dot com>
- Cc: Fortran List <fortran at gcc dot gnu dot org>
- Date: Fri, 16 May 2008 19:54:02 -0700
- Subject: Re: RFC Wide Characters in I/O
- References: <482DD108.5080006@verizon.net> <482E00FD.10702@gmail.com>
Janne Blomqvist wrote:
Jerry DeLisle wrote:
Modify the front-end in trans-io.c to build a call to a new function
called transfer_wide_character. This function is used if we have an
ENCODING= specifier or a wide character to transfer. This will retain
the existing transfer_character call to maintain compatibility.
Repeating myself, the new function will be used for kind > 1, while
transfer_character is retained for kind=1 and hollerith as is.
This sounds good. As you imply, I agree that it's ok to restrict
hollerith to plain ASCII, as hollerith was removed before any support
for non-ASCII was added to the Fortran standard.
The downstream functions in the library such as list_formatted_read,
list_formatted_write, formatted_transfer, unformatted_read,
unformatted_write, already accept a kind parameter which
transfer_character sets to 1.
Yes. But needless to say, there are probably a number of bugs lurking
there, since so far kind has always been 1.
Always, always, always
If we are given a wide character to transfer as unformatted, we simply
transfer all the bytes as is. (I will confirm with the standard on this.)
Yes, I think this is ok.
If the user has specified an ENCODING="default" [snip]... If kind is
greater than 1, I suggest we transfer each byte as is. So for kind=4,
we would transfer 4 bytes.
Like FX said on IRC, I think the sensible thing here would be to convert
each external byte into a 4-byte internal representation. Anything else
would be very confusing, IMHO.
Agreed
Transferring the bytes as is would have enabled some packed 4x1 byte character
tricks, meaning packing 4 ASCII characters into one kind=4 character and doing
I/O with the kind=4 character. Let's just forget about this one. :)
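For illustration, a minimal sketch of the widening agreed on above: each external byte becomes one 4-byte unit in memory. It is written in Python for brevity rather than in the library's C, and the function name widen_to_kind4 and the little-endian layout are assumptions made only for this example.

```python
import struct

def widen_to_kind4(external: bytes) -> bytes:
    """Widen each external byte to one 4-byte unit.

    Little-endian layout is an assumption chosen so the result is the
    same on every host; a real runtime would use native byte order.
    """
    # One unsigned 32-bit integer per input byte.
    return struct.pack(f"<{len(external)}I", *external)

if __name__ == "__main__":
    wide = widen_to_kind4(b"AB")
    print(len(wide), wide)
```

Two external bytes occupy eight bytes internally, so a kind=4 variable holds one character per 4-byte unit instead of four packed ASCII characters.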
(Note: It is my understanding, in reading about UTF-8, that it is
internally represented by a fixed width 4 byte hexadecimal number
which, when transferred to an external file, is translated on the fly
into a variable width encoding. When reading back in, it is
translated back to the 4 byte code.)
To clarify, Unicode basically is a mapping between integer numbers
("code points" in Unicode jargon) and characters. UTF-8 is a particular
encoding for Unicode, encoding the code points into a variable number of
8-bit octets (with the nice property that the first 128 code points, 0-127,
are encoded identically to ASCII). UTF-8 thus has no concept of internal,
external, in-memory or on-disk formats.
The concept of internal and external is entirely mine for the sole purpose of
explaining what we want to do. ;)
In practice, what we want to do is exactly what you describe. I.e.
externally data is UTF-8 encoded, and is converted to a fixed width
4-byte representation in memory for easier handling.
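A rough sketch of that conversion, in Python for brevity: one code point held as a fixed-width integer is encoded into 1 to 4 UTF-8 octets, and a UTF-8 byte stream is decoded back into code points. The names utf8_encode and utf8_decode are hypothetical, and the decoder is deliberately simplified (it does not reject overlong forms), so this is not a fully conforming implementation.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as 1 to 4 UTF-8 octets."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                      # ASCII range: one octet, unchanged
        return bytes([cp])
    if cp < 0x800:                     # two octets
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # three octets
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # four octets
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def utf8_decode(data: bytes) -> list:
    """Decode UTF-8 octets into a list of code points (simplified:
    overlong encodings are not rejected)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n, cp = 0, lead
        elif lead >> 5 == 0b110:
            n, cp = 1, lead & 0x1F
        elif lead >> 4 == 0b1110:
            n, cp = 2, lead & 0x0F
        elif lead >> 3 == 0b11110:
            n, cp = 3, lead & 0x07
        else:
            raise ValueError("invalid lead octet")
        tail = data[i + 1:i + 1 + n]
        if len(tail) != n or any(t >> 6 != 0b10 for t in tail):
            raise ValueError("truncated or invalid continuation octets")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)   # append 6 payload bits per octet
        out.append(cp)
        i += 1 + n
    return out

if __name__ == "__main__":
    print(utf8_encode(0x20AC))               # the euro sign takes 3 octets
    print(utf8_decode(utf8_encode(0x20AC)))  # and decodes back to one code point
```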
--- snip ---
I think the usual thing is to use a "?" or something like that for a
non-representable character, rather than a runtime error.
Looks like "?" is the consensus. Works for me.
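A small sketch of the "?" substitution, in Python for brevity. It assumes, purely for illustration, that the external default encoding can represent only code points 0-127; the function name narrow_with_fallback and the limit parameter are hypothetical.

```python
def narrow_with_fallback(code_points, limit=0x7F):
    """Write wide characters as single bytes, substituting "?" for any
    code point the one-byte external encoding cannot represent.

    limit=0x7F (plain ASCII) is an assumption for this example only.
    """
    return bytes(cp if 0 <= cp <= limit else ord("?") for cp in code_points)

if __name__ == "__main__":
    # "H", the euro sign (not representable), "i"
    print(narrow_with_fallback([0x48, 0x20AC, 0x69]))
```

The write completes and the non-representable character shows up as "?", instead of the whole I/O statement failing with a runtime error.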
For kind=4 and ENCODING="UTF-8", we do a complete conforming
translation of the 4 byte hex to/from the variable width UTF-8 encoding.
Yeah; there are probably library functions available for doing a lot of
these conversions back and forth. One problem here might be that wchar_t
is not 32 bits on all platforms, so perhaps we can't rely on the libc
wide char functions?
If there is something we can use, fine, but it tends to give us less control
over what is happening. We will plan to look into it before we decide whether
to roll our own or use someone else's.
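The wchar_t concern is easy to check on a given host. A minimal sketch using Python's ctypes module, which reports the size of the platform's C wchar_t; the helper name wchar_bits is hypothetical.

```python
import ctypes

def wchar_bits() -> int:
    """Return the width in bits of the platform's C wchar_t."""
    return ctypes.sizeof(ctypes.c_wchar) * 8

if __name__ == "__main__":
    bits = wchar_bits()
    # A kind=4 character needs a full 32 bits per character.
    print(f"wchar_t is {bits} bits; wide enough for kind=4: {bits >= 32}")
```

On platforms where wchar_t is 16 bits, a single wchar_t cannot hold every Unicode code point, which is one argument for handling the conversion inside the library.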
Thanks everyone for comments.
Jerry