This is the mail archive of the
fortran@gcc.gnu.org
mailing list for the GNU Fortran project.
Re: RFC Wide Characters in I/O
- From: Tobias Burnus <burnus at net-b dot de>
- To: Jerry DeLisle <jvdelisle at verizon dot net>
- Cc: Fortran List <fortran at gcc dot gnu dot org>
- Date: Fri, 16 May 2008 23:01:44 +0200
- Subject: Re: RFC Wide Characters in I/O
- References: <482DD108.5080006@verizon.net>
Jerry DeLisle wrote:
> If we are given a wide character to transfer as unformatted, we simply
> transfer all the bytes as is. (I will confirm with the standard on this.)
I think this is OK, as the standard does not define the on-disk format;
additionally, ENCODING= can only be given for formatted I/O. One should
make sure that the endianness is properly honored. As a wide character is
simply an INTEGER(4), one can write them as an integer(4) [array], which
is what they actually are.
> If the user has specified an ENCODING="default" and the kind is 1, we
> do what we do now and transfer as 8-bit (mostly ASCII).
OK. This automatically supports 8-bit encodings and is backward
compatible.
[still formatted and encoding='default':]
> If kind is greater than 1, I suggest we transfer each byte as is. So
> for kind=4, we would transfer 4 bytes. This would enable doing some
> packed 4x1 byte character stuff. I think the standard would allow
> this and that's why "default" is so loosely defined.
I think for I/O, almost everything is allowed by the standard. Note that
the standard distinguishes between DEFAULT and ASCII characters. In case
of gfortran, they are the same and thus the better-defined ASCII applies.
Somehow I think your suggestion is not necessarily the best one. For
write(foo,'(a)') ucs4_"Hello"
I expect that it writes 5 bytes and not 20 bytes to the unit "foo" or
internal file "foo". Regarding internal files one finds in the standard:
"An input/output list shall not contain an item of nondefault character
type if the input/output statement specifies an internal file of default
character type. An input/output list shall not contain an item of
nondefault character type other than ISO 10646 or ASCII character type
if the input/output statement specifies an internal file of ISO 10646
character type. An input/output list shall not contain a character item
of any character type other than ASCII character type if the
input/output statement specifies an internal file of ASCII character type."
If I read this correctly, the standard requires the same. Thus I think
it is OK to give an error if a character is not representable in a
byte. (That is, characters with a code point above 127, or above 255,
should be rejected.) Alternatively, one can convert these characters
into a "?".
For files one is more flexible, but I still think it is useful to be able
to write a normal ASCII string without first assigning the ucs4-kind
string to a default-kind variable. Again, either an error or conversion
to "?" could be a solution.
For the sake of completeness, the following can also be found in the
standard:
"During input from a Unicode file,
(1) characters in the record that correspond to an ASCII character
variable shall have a position in the ISO 10646 character type collating
sequence of 127 or less, and
(2) characters in the record that correspond to a default character
variable shall be representable in the default character type.
During input from a non-Unicode file,
(1) characters in the record that correspond to a character variable
shall have the kind of the character variable, and
(2) characters in the record that correspond to a numeric or logical
variable shall be of default character type.
During output to a Unicode file, all characters transmitted to the
record are of ISO 10646 character type. If a character input/output list
item or character string edit descriptor contains a character that is
not representable in the ISO 10646 character type, the result is
processor-dependent.
During output to a non-Unicode file, characters transmitted to the
record as a result of processing a character string edit descriptor or
as a result of evaluating a numeric, logical, or default character data
entity, are of type default character."
I think writing a wide-char variable into a default-kind internal file
should give an error for characters which cannot fit into the variable.
(One can argue whether only ASCII characters or also ISO-8859-1 (Latin1)
characters fit.)
> (Note: It is my understanding, in reading about UTF-8, that it is
> internally represented by a fixed-width 4-byte hexadecimal number
> which, when transferred to an external file, is translated on the fly
> into a variable-width encoding. When reading back in, it is
> translated back to the 4-byte code.)
I'm not 100% sure whether I read this correctly, but the UTF-8
characters should be represented as UCS-4 (integer(4)) in the library
until the record is finally written.
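
As a rough sketch of that last step, the conversion from one UCS-4 code
point to its variable-width UTF-8 form could look as follows in C. The
function name utf8_encode is hypothetical and this is not taken from
libgfortran; it only shows the standard UTF-8 bit layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one UCS-4 code point as UTF-8 into out,
   which must hold at least 4 bytes.  Returns the number of bytes
   written, or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Pure ASCII text comes out byte-for-byte unchanged, which is why keeping
UCS-4 internally and encoding only when the record is written costs
nothing for existing ASCII files.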
> If the user has specified an ENCODING="UTF-8" and the kind is less
> than 4, strictly speaking, that's an error, but we can give a warning
> and assume a truncation.
I guess you are talking about READING from a file; and (see quote
above) this is valid as long as the characters are representable. Thus I
would again go for "?" or for an error.
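
For the reading direction, a minimal C sketch of the "?" variant might
look like this. Both function names are hypothetical, the limit max_cp
(127 or 255) is an assumption left to the caller, and treating a
malformed byte as a single "?" is just one possible policy, not something
the standard prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (len bytes
   available).  Stores the code point in *cp and returns the number of
   bytes consumed, or 0 if the sequence is malformed or truncated. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need;        /* number of continuation bytes expected */
    uint32_t c, min;    /* min rejects overlong encodings */

    if (len == 0)
        return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 1; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 2; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 3; c = s[0] & 0x07; min = 0x10000; }
    else return 0;

    if (len < need + 1)
        return 0;
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return need + 1;
}

/* Hypothetical helper: read a UTF-8 record into a kind=1 buffer of
   capacity cap.  Characters above max_cp, and malformed bytes, become
   '?'.  Returns the number of characters stored. */
size_t read_utf8_as_kind1(const unsigned char *rec, size_t len,
                          unsigned char *dst, size_t cap, uint32_t max_cp)
{
    size_t pos = 0, n = 0;
    while (pos < len && n < cap) {
        uint32_t cp;
        size_t used = utf8_decode(rec + pos, len - pos, &cp);
        if (used == 0) {            /* malformed: emit '?' and skip a byte */
            dst[n++] = '?';
            pos++;
            continue;
        }
        dst[n++] = (cp <= max_cp) ? (unsigned char) cp : '?';
        pos += used;
    }
    return n;
}
```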
Tobias
PS: I think it is a matter of taste, but I slightly favour printing a
"?" to giving a run-time error.