This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: [PATCH] Multibyte awareness for diagnostics (PR 49973)

From: David Malcolm <dmalcolm at redhat dot com>
To: Lewis Hyatt <lhyatt at gmail dot com>
Cc: gcc-patches at gcc dot gnu dot org, joseph at codesourcery dot com
Date: Thu, 21 Nov 2019 21:43:58 -0500
Subject: Re: [PATCH] Multibyte awareness for diagnostics (PR 49973)
References: <20190926201639.GA82807@ldh.local> <alpine.DEB.2.21.1909262040450.14912@digraph.polyomino.org.uk> <20190927204144.GA86720@ldh.local> <8a30a5a30078714f399822b0513b070af59c3e88.camel@redhat.com> <20191120162708.GA45237@ldh.local> <20191120163543.GA45766@ldh.local>
On Wed, 2019-11-20 at 11:35 -0500, Lewis Hyatt wrote:
> My apologies, that patch had whitespace issues. Corrected version
> attached.
> 
> On Wed, Nov 20, 2019 at 11:27:08AM -0500, Lewis Hyatt wrote:
> > On Tue, Nov 19, 2019 at 12:30:39PM -0500, David Malcolm wrote:
> > > Thanks for posting this patch; I'm sorry about how long it's
> > > taken me
> > > to review it.
> > > 
> > 
> > Thank you very much for the thorough review and the great
> > suggestions. I
> > know it was a lot to look through...  I attached an updated patch
> > that
> > incorporates your comments. I also put some more responses inline
> > below.
> >

Thanks for the updated patch.

Various comments inline throughout.

[...]

> > > I should confess that it took me a while to realize the whole
> > > multi-column display thing (I had an "aha" moment, then felt
> > > rather
> > > foolish, given that I'd been playing with the examples in the PR;
> > > it was
> > > on reading through the ASCII art in the new selftests and going
> > > "huh"
> > > that I had my epiphany on the problem your patch is solving).
> > > 
> > > I think this file could use a high-level introductory comment at
> > > the top
> > > talking about the various meanings of "column".  I liked the two
> > > example
> > > code points you used below, so perhaps have a comment up at the
> > > top
> > > talking about the distinction between byte vs display column,
> > > using
> > > those code points as examples (and a plain ASCII character, by
> > > way of
> > > contrast).  Perhaps have the comment describing the enum be the
> > > big
> > > introductory comment.
> > > 
> > > It would be good for that introductory comment to have a copy of
> > > the
> > > ASCII art you used in the selftests below, or similar.
> > > 
> > 
> > Done, I beefed up the introductory comments as you suggested.

Thanks.

[...snip...]

> > > > @@ -574,20 +621,23 @@ test_layout_range_for_single_point ()
> > > >  
> > > >    /* Tests for layout_range::contains_point.  */
> > > >  
> > > > -  /* Before the line. */
> > > > -  ASSERT_FALSE (point.contains_point (6, 1));
> > > > +  for (int use_display = 0; use_display <= 1; ++use_display)
> > > > +    {
> > > > +      /* Before the line.  */
> > > > +      ASSERT_FALSE (point.contains_point (6, 1, use_display));
> > > 
> > > [...snip...]
> > > 
> > > Here you generalize the layout_range tests to iterate over both
> > > meanings of "column".
> > > 
> > > If I'm reading things right, implicit here is that the
> > > layout_point
> > > ctors within the layout range are now calling:
> > >   m_display_col (location_compute_display_column (exploc))
> > > which in this selftest is looking for a file named "test.c",
> > > presumably not finding it, and hitting the case of a NULL "line"
> > > char_span.
> > > 
> > > So if there happens to be a test.c in the current directory
> > > containing the "right" characters, this selftest could break.
> > > 
> > > Previously it's never mattered to this selftest whether or not
> > > there was an actual test.c, so it might be good to modify it
> > > to use temp_source_file (and maybe even to have some multicolumn
> > > chars in it, though that might be taking things too far).
> > > 
> > 
> > Assuming we're comfortable that the other selftests exercise the
> > multibyte logic sufficiently, it seems simplest just to use an
> > empty
> > string rather than "test.c", since nothing is actually expecting to
> > read a file here.

Fair enough.

> > I did that for now, and also made sure that
> > location_get_display_column() will not try to open an empty
> > filename
> > either.

location_compute_display_column FWIW



> > > >  static int
> > > >  get_line_width_without_trailing_whitespace (const char *line,
> > > > int line_width)
> > > 
> > > Why is get_line_width_without_trailing_whitespace done in bytes?
> > > It's used for calculating the maximum number of printed columns,
> > > to try
> > > to cope with extra wide source lines, offsetting things to fit
> > > within the width
> > > of the user's terminal.
> > > 
> > 
> > It's used in two places. Once in the layout::layout() constructor
> > to
> > compute the m_x_offset in display columns for wide lines, and then
> > again in layout::print_source_line(), which handles printing
> > arbitrary
> > source lines, not necessarily the primary line that was inspected
> > in
> > the constructor. So the m_x_offset needs to be in display column
> > units, that are then translated back to bytes for each different
> > line
> > to which it applies. For this it seemed most convenient for
> > get_line_width_without_trailing_whitespace() to return the bytes
> > offset, since that's what we need in layout::print_source_line() to
> > find the end of the line.

Ah.  Thanks.

I think the code would be clearer if the patch also renamed
layout's m_x_offset - say to "m_x_offset_display"?

[...snip...]

> > > Thanks for exercising all this with selftests.
> > > 
> > > Presumably this involved a big copy-and-paste from the existing
> > > selftests.
> > > 
> > > How did you generate the expected output for the various _utf8
> > > selftests?  Was it a lot of tedious manual editing, or is there a
> > > handy
> > > way to do this? (I'm nervous about how much work it will be to
> > > update
> > > these if e.g. we want to experiment with new ways of printing
> > > fix-it
> > > hints)
> > > 
> > 
> > I just went through the existing tests one by one and adapted them
> > manually. I made them with the actual UTF-8 chars initially so that
> > it
> > made sense visually, and then replaced to hex escapes and aligned
> > the
> > lines manually at the end. Wasn't so bad this way. I don't think it
> > would have been especially easy to automate because there were some
> > non-mechanical adjustments made, e.g. to make sure to exercise edge
> > cases like where two strings would overlap in byte units, but not
> > in
> > display units. I feel like it wouldn't be necessary to add UTF-8
> > duplicates of all new future tests, hopefully -- rather could just
> > insure that all new test cases include a multibyte character or
> > two?
> > Anyway I am happy to help with that if it comes up in the future
> > too.

OK.

[...snip...]

> > > > +    ASSERT_STREQ ("\n"
> > > > +		  " \xf0\x9f\x98\x82"
> > > > +		     "_foo = \xcf\x80"
> > > > +			     "_bar.\xf0\x9f\x98\x82"
> > > > +				    "_field\xcf\x80"
> > > > +					   ";\n"
> > > > +		  " ^~~~~~   ~~~~~ ~~~~~~~~~\n"
> > > > +		  " |        |     |\n"
> > > > +		  " |        |     c\n"
> > > > +		  " aaaaa\xf0\x9f\x98\x82\xcf\x80"
> > > > +			   "
> > > > bb\xf0\x9f\x98\x82\xf0\x9f\x98\x82\n",
> > > > +		  pp_formatted_text (dc.printer));
> > > 
> > > It's hard to tell from the escaped expected string, but
> > > presumably this
> > > matches the comment about boundary conditions, right? (compared
> > > with
> > > the ASCII case).
> > > 
> > 
> > Correct. What I found useful for testing, was just to add an extra
> > character to the expected string to make the test fail. Then the
> > expected output goes to stderr the way it should look with the
> > actual
> > extended characters, and you can verify it tests what was intended.

Excellent (indeed, that's what I did when I wrote the ASCII cases).

> > > > diff --git a/gcc/diagnostic.c b/gcc/diagnostic.c
> > > > index 96b6fa30052..8638fbebb2d 100644
> > > > --- a/gcc/diagnostic.c
> > > > +++ b/gcc/diagnostic.c
> > > > @@ -346,9 +346,13 @@ diagnostic_get_location_text
> > > > (diagnostic_context *context,
> > > >    const char *locus_cs = colorize_start (pp_show_color (pp),
> > > > "locus");
> > > >    const char *locus_ce = colorize_stop (pp_show_color (pp));
> > > >    const char *file = s.file ? s.file : progname;
> > > > -  int line = strcmp (file, N_("<built-in>")) ? s.line : 0;
> > > > -  int col = context->show_column ? s.column : 0;
> > > > -
> > > > +  int line = 0;
> > > > +  int col = 0;
> > > > +  if (strcmp (file, N_("<built-in>")))
> > > > +    {
> > > > +      line = s.line;
> > > > +      col = context->show_column ?
> > > > location_compute_display_column (s) : 0;
> > > > +    }
> > > 
> > > Why does the patch use the display column here?
> > > 
> > > Ideally it would be the count of unicode characters, but I think
> > > we want
> > > to preserve the current behavior of using a byte offset.
> > > 
> > 
> > So that column number was actually the original motivation for the
> > PR
> > 49973 from several years ago. It seems to me that the discussion
> > there
> > concluded that the column number should be the display column
> > (Joseph's comment
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973#c10). It makes
> > sense to me that this column should match what you see in your
> > editor,
> > which would be the display column, but I don't know if some
> > standards
> > have evolved here.

This is the column number as reported in the diagnostic i.e the COL_NUM
when printing e.g.
  warning: FILENAME:LINE_NUM:COL_NUM: some message

It seems to me that PR 49973 and this patch cover two separate things:
(a) bytes vs display columns in diagnostic-show-locus.c
(b) the "COL_NUM" mentioned above.

I'd prefer to omit (b) from the patch, and have the focus of the patch
be (a), to tackle (b) in a separate patch.

[There's also the meaning of column numbers in the JSON output, and in
the output of -fdiagnostics-parseable-fixits (which is intended to mimic
clang's output format)]

It's unclear to me what the reported COL_NUM should be.
There are various possibilities:

Units:
  (A) [status quo] report a count of bytes within the line
  (B) report a count of unicode characters
  (C) report a count of unicode graphemes
  (D) report based on the wcwidth of the characters
  etc

Origin/baseline:
  (A) [status quo] use 1 for the leftmost column
  (B) use 0 for the leftmost column

Tab-handling:
  (A) [status quo] don't give any kind of special status to tab characters
  (B) implement tab stops, somehow.  For example, get_visual_column in
      c-family/c-indentation implements tab stops based on bytes.

(so at least 4*2*2 = 16 possible meanings, ugh)

See also e.g.:
  https://github.com/oasis-tcs/sarif-spec/issues/178

The GNU Coding Standards say

   Line numbers should start from 1 at the beginning of the file, and
   column numbers should start from 1 at the beginning of the line.
   (Both of these conventions are chosen for compatibility.) Calculate
   column numbers assuming that space and all ASCII printing characters
   have equal width, and assuming tab stops every 8 columns. For
   non-ASCII characters, Unicode character widths should be used when in
   a UTF-8 locale; GNU libc and GNU gnulib provide suitable wcwidth
   functions.
(https://www.gnu.org/prep/standards/standards.html#Errors)

I think if we do change the meaning of the "COL_NUM" output, we should
probably add an option for it, to help with the transition (so that
people can easily revert to the old behavior).

Perhaps something like:

  -fdiagnostics-column-unit=[bytes|gnu]

     bytes: [status-quo]; 1-based count of bytes, not respecting tab stops
     gnu: as per GNU Coding Standards above

and have gcc 10 default to "gnu" (or whatever we call it), so that
people can override it back to "bytes".

(again, I'm thinking aloud here)

But please can you split that out as a separate patch? (it's arguably
still in time for GCC 10, as it's from a patch was posted before the
stage 1 deadline).

[...]

> > > Maybe a brute force test for inverse, something like:
> > > 
> > > for (int display_column = 0; display_column < 20;
> > > display_column++)
> > >   {
> > >     int byte_column = cpp_display_column_to_byte_column (str, 15,
> > > display_column);
> > >     ASSERT_EQ (cpp_byte_column_to_display_column (str, 15,
> > > byte_column),
> > >                display_column);
> > >   }
> > > 
> > > or similar?
> > > 
> > > What happens if you request a display column that's in the middle
> > > of a
> > > character?  It feels like we ought to have selftest coverage for
> > > that.
> > > 
> > 
> > In this case the UTF-8 conversion fails so it treats each byte as a
> > display width of 1, basically falling back to the existing behavior
> > of
> > GCC whenever things don't make sense. I added a selftest for the
> > round-trip conversion as you suggested; this test needs to make an
> > exception for partial codepoints, so it tests both things
> > effectively.

Thanks.

> > > > +int cpp_wcwidth (cppchar_t c)
> > > > +{
> > > > +  if (__builtin_expect (c <= wcwidth_range_ends[0], true))
> > > > +    return wcwidth_widths[0];
> > > > +
> > > > +  /* Binary search the tables.  */
> > > > +  int begin = 1;
> > > > +  static const int end
> > > > +      = sizeof wcwidth_range_ends / sizeof
> > > > (*wcwidth_range_ends);
> > > > +  int len = end - begin;
> > > > +  do
> > > > +    {
> > > > +      int half = len/2;
> > > > +      int middle = begin + half;
> > > > +      if (c > wcwidth_range_ends[middle])
> > > > +	{
> > > > +	  begin = middle + 1;
> > > > +	  len -= half + 1;
> > > > +	}
> > > > +      else
> > > > +	len = half;
> > > > +    } while (len);
> > > > +
> > > > +  if (__builtin_expect (begin != end, true))
> > > > +    return wcwidth_widths[begin];
> > > > +  return 1;
> > > > +}
> > > 
> > > Please can you add some unit-testing for this function in
> > > selftest form to input.c
> > > (i.e. testing a few specific code points).
> > > 
> > > [...snip...]
> > 
> > The second block of test_cpp_utf() in input.c contained tests for a
> > couple codepoints already (pi, an emoji, ascii, and invalid utf8).
> > I
> > added a couple more specific codepoints, something from latin-1, a
> > combining character, and a Chinese character. Please let me know if
> > that seems good now or if there should be more. It's easy enough to
> > add more.

That's great, thanks.

> > > Again, thanks for this patch, and sorry again for the delay in
> > > reviewing it.
> > 
> > Thanks for your time, I appreciate it! I think it would be great if
> > this can get in for GCC 10, since otherwise the new support for
> > UTF-8
> > identifiers would feel rather incomplete.
> > 
> > -Lewis

AIUI, it's OK for us to iterate on a patch like this that was posted
before the stage 1 deadline, for a while into stage 3, at least.


Working through the updated patch...

> diff --git a/contrib/unicode/README b/contrib/unicode/README
> new file mode 100644
> index 00000000000..fbee919647b
> --- /dev/null
> +++ b/contrib/unicode/README
> @@ -0,0 +1,36 @@
> +This directory contains a mechanism for GCC to have its own internal
> +implementation of wcwidth functionality.  (cpp_wcwidth () in libcpp/charset.c).
> +
> +The idea is to produce the necessary lookup table
> +(../../libcpp/generated_cpp_wcwidth.h) in a reproducible way, starting from the
> +following files that are distributed by the Unicode Consortium:
> +
> +ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
> +ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt
> +ftp://ftp.unicode.org/Public/UNIDATA/PropList.txt
> +
> +These three files have been added to source control in this directory.

Presumably the licence for these files is:
  https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode-license.txt

We should include a copy of that license, like glibc does.

> +In order to keep in sync with glibc's wcwidth as much as possible, it is
> +desirable for the logic that processes the Unicode data to be the same as
> +glibc's.  To that end, we also put in this directory, in the from_glibc/
> +directory, the glibc python code that implements their logic.  This code was
> +copied verbatim from glibc, and it can be updated at any time from the glibc
> +source code repository.

Specifially from glibc's localedata/unicode-gen/, presumably?
Please can the README state that.

[...]

> diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c
> index cb920f6b9d0..3460da3cf32 100644
> --- a/gcc/diagnostic-show-locus.c
> +++ b/gcc/diagnostic-show-locus.c

[...]

> @@ -112,18 +113,81 @@ class colorizer
>    const char *m_stop_color;
>  };
>  
> -/* A point within a layout_range; similar to an expanded_location,
> +/* In order to handle multibyte sources properly, all of this logic needs to be
> +   aware of the distinction between the number of bytes and the number of
> +   display columns occupied by a character, which are not the same for non-ASCII
> +   characters.  For example, the Unicode pi symbol, U+03C0, is encoded in UTF-8
> +   as "\xcf\x80", and thus occupies 2 bytes of space while only occupying 1
> +   display column when it is output.  A typical emoji, such as U+1F602 (in
> +   UTF-8, "\xf0\x9f\x98\x82"), requires 4 bytes and has a display width of 2.
> +
> +   The below example line, which is also used for selftests below, shows how the
> +   display column and byte column are related:

I don't love the term "byte column" here, but it's probably the least
bad name to describe the status quo concept.

> +
> +     0000000001111111111222222   display
> +     1234567890123456789012345   columns
> +     SS_foo = P_bar.SS_fieldP;
> +     0000000111111111222222223   byte
> +     1356789012456789134567891   columns
> +
> +   Here SS represents the two display columns for the U+1F602 emoji, and P
> +   represents the one display column for the U+03C0 pi symbol.  As an example, a
> +   diagnostic pointing to the final P on this line is at byte column 29 and
> +   display column 24.  This reflects the fact that the three extended characters
> +   before the final P occupy cumulatively 5 more bytes than they do display
> +   columns (a difference of 2 for each of the two SSs, and one for the other P).
> +
> +   One or the other of the two column units is more useful depending on the
> +   context.  For instance, in order to output the caret at the correct location,
> +   we need to count display columns; in order to colorize a source line, we need
> +   to count the bytes.  All locations are provided to us as byte counts, which
> +   we augment with the display column on demand so that it can be used when
> +   needed.  This is not the most efficient way to do things since it requires
> +   looping over the whole line each time, but it should be fine for the purpose
> +   of outputting diagnostics.
> +
> +   In order to keep straight which units (byte or display) are in use at a
> +   given time, the following enum lets us specify that explicitly.  */
> +
> +enum column_unit {
> +  /* Measured in raw bytes.  */
> +  CU_BYTES = 0,
> +
> +  /* Measured in display units.  */
> +  CU_DISPLAY_COLS,
> +
> +  /* For arrays indexed by column_unit.  */
> +  CU_NUM_UNITS
> +};

[...snip...]

> @@ -554,9 +636,9 @@ static layout_range
>  make_range (int start_line, int start_col, int end_line, int end_col)
>  {
>    const expanded_location start_exploc
> -    = {"test.c", start_line, start_col, NULL, false};
> +    = {"", start_line, start_col, NULL, false};
>    const expanded_location finish_exploc
> -    = {"test.c", end_line, end_col, NULL, false};
> +    = {"", end_line, end_col, NULL, false};
>    return layout_range (&start_exploc, &finish_exploc, SHOW_RANGE_WITHOUT_CARET,
>  		       &start_exploc, 0, NULL);
>  }

Please can you add a comment explaining the significance of the empty
string here (summarizing the discussion above, so it's in the source
rather than just in this list archive).

> @@ -687,8 +784,8 @@ test_layout_range_for_multiple_lines ()
>  
>  #endif /* #if CHECKING_P */
>  
> -/* Given a source line LINE of length LINE_WIDTH, determine the width
> -   without any trailing whitespace.  */
> +/* Given a source line LINE of length LINE_WIDTH bytes, determine the width
> +   (in bytes, not display cols) without any trailing whitespace.  */
>  
>  static int
>  get_line_width_without_trailing_whitespace (const char *line, int line_width)

We really should rename that "line_width" param to "line_width_bytes" or similar.

> @@ -897,17 +994,35 @@ layout::layout (diagnostic_context * context,
>       will be adjusted accordingly.  */
>    size_t max_width = m_context->caret_max_width;
>    char_span line = location_get_source_line (m_exploc.file, m_exploc.line);
> -  if (line && (size_t)m_exploc.column <= line.length ())
> +  if (line && max_width)
>      {
> -      size_t right_margin = CARET_LINE_MARGIN;
> -      size_t column = m_exploc.column;
> -      if (m_show_line_numbers_p)
> -	column += m_linenum_width + 2;
> -      right_margin = MIN (line.length () - column, right_margin);
> -      right_margin = max_width - right_margin;
> -      if (line.length () >= max_width && column > right_margin)
> -	m_x_offset = column - right_margin;
> -      gcc_assert (m_x_offset >= 0);
> +      size_t column = m_exploc.m_display_col;
> +      int line_width
> +	= get_line_width_without_trailing_whitespace (line.get_buffer (),
> +						      line.length ());
> +      size_t eol = cpp_display_width (line.get_buffer (), line_width);
> +      const size_t eol_before_linenum = eol;
> +
> +      if (column <= eol)
> +	{
> +	  if (m_show_line_numbers_p)
> +	    {
> +	      column += m_linenum_width + 2;
> +	      eol += m_linenum_width + 2;
> +	    }
> +	  size_t right_margin = CARET_LINE_MARGIN;
> +	  right_margin = MIN (eol - column, right_margin);
> +	  right_margin = max_width - right_margin;
> +	  /* Note: if right_margin > max_width, we end up failing this next
> +	     check due to wrapping, and we don't offset anything.  Otherwise we
> +	     would conclude we can't output the line at all.  */
> +	  if (eol >= max_width && column > right_margin)
> +	    {
> +	      m_x_offset = column - right_margin;
> +	      m_x_offset = MIN (m_x_offset, (int) eol_before_linenum - 1);
> +	    }
> +	  gcc_assert (m_x_offset >= 0);
> +	}
>      }

As noted above, m_x_offset should be renamed to clarify its units
("m_x_offset_display"?)

Can you move this calculation of the offset to a subroutine please.
(I wonder if it can be unit-tested, but don't feel obliged to).

[...snip...]

Thanks again for the updated patch; this feels close to being ready.

[FWIW, part of me would love to express the two different units in the
C++ type system (using templates) so that the compiler can check them,
rather than relying on naming conventions... but let's get this patch
finished without trying to be too clever, especially as we can only
rely on C++98]

Dave
Follow-Ups:
- Re: [PATCH] Multibyte awareness for diagnostics (PR 49973)
  - From: David Malcolm
- Re: [PATCH] Multibyte awareness for diagnostics (PR 49973)
  - From: Lewis Hyatt
References:
- Re: [PATCH] Multibyte awareness for diagnostics (PR 49973)
  - From: David Malcolm
- Re: [PATCH] Multibyte awareness for diagnostics (PR 49973)
  - From: Lewis Hyatt
- Re: [PATCH] Multibyte awareness for diagnostics (PR 49973)
  - From: Lewis Hyatt
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]