[PATCH] Multibyte awareness for diagnostics (PR 49973)

Mon Dec 9 23:00:00 GMT 2019

On Mon, Dec 09, 2019 at 03:12:31PM -0500, David Malcolm wrote:
> On Fri, 2019-12-06 at 15:31 -0500, Lewis Hyatt wrote:
> > On Fri, Dec 06, 2019 at 10:54:30AM -0500, David Malcolm wrote:
> 
> [...]
> 
> > > The patch is OK for trunk with the nits above fixed.  Do you have
> > > commit access?  (I've got my own patch [1] that touches diagnostic-
> > > show-locus.c which I'll need to refresh once yours goes in); need
> > > to
> > > remember to include those data files when committing.
> > > 
> > > Thanks very much for fixing this, and sorry again for the delay in
> > > reviewing it.
> > 
> > That's wonderful, thanks very much. Happy to contribute something. I
> > do not
> > have commit access so if you could please apply it, that would be
> > great.
> 
> Thanks; I've committed it on your behalf to trunk as r279137 (including
> the data files).
> 
> FWIW I made a few minor fixes:
> * I added a ChangeLog entry for contrib/unicode/unicode-license.txt,
> and added the PR reference to the gcc/testsuite/ChangeLog.
> * I also moved the top-level ChangeLog to contrib/ChangeLog, updating
> the paths.
> * I updated the instructions to reflect the path of the generated file:
> -3.  Run ./gen_wcwidth.py X.Y > ../../libcpp/generated_wcwidth.h
> +3.  Run ./gen_wcwidth.py X.Y > ../../libcpp/generated_cpp_wcwidth.h
> (I tested regenerating the file and got the same result as you, on
> supplying "12.1.0")
> 
> and did another bootstrap&regression test (on x86_64-pc-linux-gnu) for
> good measure.
>

> 
> > Might I also trouble you to please have a look at this one:
> > https://gcc.gnu.org/ml/gcc-patches/2019-10/msg00766.html ?
> 
> Looking now.
> 
> > It is very much shorter than this one and fixes just a couple
> > glitches in
> > pretty-print.c -- UTF-8 gets mangled when printed via %q, and line-
> > wrapping
> > shouldn't wrap in the middle of a UTF-8 sequence. That's the only
> > other
> > issue I am aware of as far as diagnostics with multibyte characters
> > go.
> 
> If you feel like tackling a related issue, maybe have a look at how we
> print tab characters in diagnostic_show_locus; we should probably
> resolve them into spaces (respecting -ftab-stop [1]), since when we
> print them as tabs after a left-margin for line-numbers the
> "tabification" is mangled.  I don't know if there's a bug for this in
> bugzilla, but it seems tightly related to this column-handling work.
> 
> > Thanks again!
> 
> Thanks for all your work putting the patch together.
> 
> Dave
> 
> [1] see also c-indentation.c; gah
> 

That's great, thanks very much.

I'm happy to look at the tab situation. My understanding is that -ftabstop is
only used to implement the logic in c-indentation.c and does not propagate to
diagnostics otherwise. Currently, cpp_wcwidth() returns 1 for all control
characters, including tabs. The tab is also counted as 1 byte for purpose of
the column number calculation. Then in diagnostic-show-locus, we convert all
tabs to a single space on output. So that much is at least consistent and
doesn't lead to any spacing issues AFAIK, as far as internal consistency of
diagnostic-show-locus's generated output goes. But it does make the source
which is output look weird for mixed-tabs-and-spaces styles.

Below patch would fix it including respecting -ftabstop. This patch causes
cpp_wcwidth() to return the actual tab width for '\t', and also changes
diagnostic-show-locus to convert the tab to the same number of spaces. My main
concern with this approach is I am not quite sure what's the "right" way to
expose the option value everywhere it needs to be. It lives inside a
cpp_options struct, but I think this is only accessible from a cpp_file
object, which I don't have in all relevant contexts. For this illustration I
used a global variable but I wonder whether there's a more preferred way to do
it. Other issue would be that the -ftabstop option is only processed in
c-family but presumably it's applicable to other languages too. Anyway if that
much seems OK, I could prepare this properly including comments + tests.

There would of course be the question of what the column number should show in
this case. It would be trivial to change it to show the display width of all
characters (multibyte and tabs). Changing it to show the byte count of
multibyte characters as it does now, but the actual width of tabs newly, would
require adding some new concepts that don't exist yet.

-Lewis

diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index c913291c07c..ff85344d63f 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -506,7 +506,7 @@ c_common_handle_option (size_t scode, const char *arg, HOST_WIDE_INT value,
     case OPT_ftabstop_:
       /* It is documented that we silently ignore silly values.  */
       if (value >= 1 && value <= 100)
-	cpp_opts->tabstop = value;
+	cpp_opts->tabstop = cpp_tab_width = value;
       break;
 
     case OPT_fexec_charset_:
diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c
index 9cd16952ce0..4f9a1800030 100644
--- a/gcc/diagnostic-show-locus.c
+++ b/gcc/diagnostic-show-locus.c
@@ -1508,15 +1508,23 @@ layout::print_source_line (linenum_type row, const char *line, int line_bytes,
 	    m_colorizer.set_normal_text ();
 	}
       char c = *line;
-      if (c == '\0' || c == '\t' || c == '\r')
-	c = ' ';
-      if (c != ' ')
+      if (c == '\t')
 	{
-	  last_non_ws = col_byte;
-	  if (first_non_ws == INT_MAX)
-	    first_non_ws = col_byte;
+	  for (int i = 0; i < cpp_tab_width; ++i)
+	    pp_space (m_pp);
+	}
+      else
+	{
+	  if (c == '\0' || c == '\r')
+	    c = ' ';
+	  else if (c != ' ')
+	    {
+	      last_non_ws = col_byte;
+	      if (first_non_ws == INT_MAX)
+		first_non_ws = col_byte;
+	    }
+	  pp_character (m_pp, c);
 	}
-      pp_character (m_pp, c);
       line++;
     }
   print_newline ();
diff --git a/gcc/input.c b/gcc/input.c
index 1dc6b339afe..811627d88c8 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -3612,8 +3612,10 @@ void test_cpp_utf8 ()
   {
     int w_bad = cpp_display_width ("\xf0!\x9f!\x98!\x82!", 8);
     ASSERT_EQ (8, w_bad);
-    int w_ctrl = cpp_display_width ("\r\t\n\v\0\1", 6);
-    ASSERT_EQ (6, w_ctrl);
+    int w_ctrl = cpp_display_width ("\r\n\v\0\1", 5);
+    ASSERT_EQ (5, w_ctrl);
+    int w_tab = cpp_display_width ("a\tb\tc", 5);
+    ASSERT_EQ (3 + 2*cpp_tab_width, w_tab);
   }
 
   /* Verify that wcwidth of valid UTF-8 is as expected.  */
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 956d2dad5c8..b700856a5cc 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -2328,6 +2328,8 @@ cpp_display_column_to_byte_column (const char *data, int data_length,
   return data_length - inbytesleft + MAX (0, display_col - column);
 }
 
+int cpp_tab_width = 8;
+
 /* Our own version of wcwidth().  We don't use the actual wcwidth() in glibc,
    because that will inspect the user's locale, and in particular in an ASCII
    locale, it will not return anything useful for extended characters.  But GCC
@@ -2344,7 +2346,7 @@ cpp_display_column_to_byte_column (const char *data, int data_length,
 int cpp_wcwidth (cppchar_t c)
 {
   if (__builtin_expect (c <= wcwidth_range_ends[0], true))
-    return wcwidth_widths[0];
+    return c == '\t' ? cpp_tab_width : wcwidth_widths[0];
 
   /* Binary search the tables.  */
   int begin = 1;
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index e199aecfa48..214f2b868d3 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -1321,6 +1321,7 @@ extern const char * cpp_get_userdef_suffix
   (const cpp_token *);
 
 /* In charset.c */
+extern int cpp_tab_width;
 int cpp_byte_column_to_display_column (const char *data, int data_length,
 				       int column);
 inline int cpp_display_width (const char *data, int data_length)