[Bug libfortran/99210] X editing for reading file with encoding='utf-8'

Sun Feb 28 03:25:35 GMT 2021

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99210

--- Comment #3 from Jerry DeLisle <jvdelisle at gcc dot gnu.org> ---
Here is the real issue. The X format specifier is a position modifier. UTF-8 is
a variable-length character encoding, so moving one character could mean moving
1, 2, 3, or 4 bytes depending on the content of the file.

Up to now we have chosen to move "position" by 1 byte.

13.8.1.1 Position editing

1 The position edit descriptors T, TL, TR, and X, specify the position at which
the next character will be transmitted to or from the record. If any character
skipped by a position edit descriptor is of type nondefault character,
and the unit is a default character internal file or an external non-Unicode
file, the result of that position editing is processor dependent.

Our interpretation of this has been that the example provided in this PR is
processor dependent. However, the file is opened with encoding='UTF-8', so it
is a Unicode file and the processor-dependent clause does not apply.

So, we have to use UTF-8 based skips for READs.  The following patch does this:

diff --git a/libgfortran/io/read.c b/libgfortran/io/read.c
index 7515d912c51..30ff0e0deb7 100644
--- a/libgfortran/io/read.c
+++ b/libgfortran/io/read.c
@@ -1255,6 +1255,23 @@ read_x (st_parameter_dt *dtp, size_t n)

   if (n == 0)
     return;
+    
+  if (dtp->u.p.current_unit->flags.encoding == ENCODING_UTF8)
+    {
+      gfc_char4_t c;
+      size_t nbytes, j;
+    
+      /* Proceed with decoding one character at a time.  */
+      for (j = 0; j < n; j++)
+       {
+         c = read_utf8 (dtp, &nbytes);
+    
+         /* Check for a short read and if so, break out.  */
+         if (nbytes == 0 || c == (gfc_char4_t)0)
+           break;
+       }
+      return;
+    }

   length = n;

The remaining part of this is what to do for end-of-file conditions.  So, I am
doing a little more testing.