Bug 49973 - Column numbers count multibyte characters as multiple columns
Summary: Column numbers count multibyte characters as multiple columns
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: preprocessor (show other bugs)
Version: 4.5.2
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL: https://gcc.gnu.org/ml/gcc-patches/20...
Keywords: easyhack, patch
Depends on:
Blocks:
 
Reported: 2011-08-04 10:21 UTC by Timothy Liang
Modified: 2019-12-09 20:39 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-08-04 15:18:20


Attachments
partial patch illustrating the intended approach (1.39 KB, patch)
2019-09-15 04:21 UTC, Lewis Hyatt
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Timothy Liang 2011-08-04 10:21:25 UTC
int main()
{
	/* 中 */ asdf;
}

g++ -finput-charset=utf8 hello.cpp

hello.cpp: In function ‘int main()’:
hello.cpp:3:12: error: ‘asdf’ was not declared in this scope

The column number should be 10, not 12.
Comment 1 Jakub Jelinek 2011-08-04 10:27:53 UTC
Depends on how the column numbers are defined.  I think gcc uses bytes from the beginning of the line, then 12 is correct (and e.g. for tab characters gcc counts them as one instead of 1-8 depending on position too).
Comment 2 Timothy Liang 2011-08-04 10:36:53 UTC
(In reply to comment #1)
> Depends on how the column numbers are defined.  I think gcc uses bytes from the
> beginning of the line, then 12 is correct (and e.g. for tab characters gcc
> counts them as one instead of 1-8 depending on position too).

That isn't the case here.  Substituting the '中' for another character makes the column number 10.  Setting -finput-charset=latin1 makes the column number 15.
Comment 3 Andreas Schwab 2011-08-04 10:55:13 UTC
Why 10? "    /* 中 */ " has 12 characters (and 14 bytes as utf8).
Comment 4 Timothy Liang 2011-08-04 11:03:32 UTC
(In reply to comment #3)
> Why 10? "    /* 中 */ " has 12 characters (and 14 bytes as utf8).

The four spaces is supposed to be a tab.  Also, the column number starts from one.  So:

 /* 中 */ asdf
         |
1234567890

Since I set the input charset as UTF-8, g++ should count the '中' as one character, not three.  And when I set it to latin1, g++ should count the '中' as three, not six.
Comment 5 Manuel López-Ibáñez 2011-08-04 12:18:19 UTC
GNU Emacs 23.2.1 counts it as two, and puts the cursor at s.

For the simpler case of:

/* ñ */ asdf;

we print 

test.c:3:10: error: ‘asdf’ was not declared in this scope

whereas emacs counts only 1 char, so it again puts the cursor at s. I am not sure whether Emacs is following some GNU standard, but the case of ñ versus n, should at least produce the same result.

Unfortunately, I don't have time to work on this.
Comment 6 joseph@codesourcery.com 2011-08-04 14:38:16 UTC
The GCS says "column numbers should start from 1 at the beginning of the 
line ... Calculate column numbers assuming that space and all ASCII 
printing characters have equal width, and assuming tab stops every 8 
columns.".  This doesn't say how other characters should be counted, 
although for the results of wcswidth seem appropriate.
Comment 7 Manuel López-Ibáñez 2011-08-04 15:18:20 UTC
(In reply to comment #6)
> The GCS says "column numbers should start from 1 at the beginning of the 
> line ... Calculate column numbers assuming that space and all ASCII 
> printing characters have equal width, and assuming tab stops every 8 
> columns.".  This doesn't say how other characters should be counted, 
> although for the results of wcswidth seem appropriate.

Then GCC is not using wcswidth to count or it is setting the locale inappropriately because it is counting 2 for ñ and 3 for 中, while it should be 1 and 2.
Comment 8 Timothy Liang 2011-08-04 19:52:31 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > The GCS says "column numbers should start from 1 at the beginning of the 
> > line ... Calculate column numbers assuming that space and all ASCII 
> > printing characters have equal width, and assuming tab stops every 8 
> > columns.".  This doesn't say how other characters should be counted, 
> > although for the results of wcswidth seem appropriate.
> 
> Then GCC is not using wcswidth to count or it is setting the locale
> inappropriately because it is counting 2 for ñ and 3 for 中, while it should be
> 1 and 2.

I'm confused.  Shouldn't 中 be 1?
Comment 9 Tom Tromey 2011-12-07 17:59:52 UTC
(In reply to comment #6)
> The GCS says "column numbers should start from 1 at the beginning of the 
> line ... Calculate column numbers assuming that space and all ASCII 
> printing characters have equal width, and assuming tab stops every 8 
> columns.".  This doesn't say how other characters should be counted, 
> although for the results of wcswidth seem appropriate.

Note that GCC also handles the tab case incorrectly here.
This shows up if you M-x next-error in Emacs in the case where
gcc emits column numbers and your source code includes tabs.\

Refiling this to preprocessor.
Comment 10 joseph@codesourcery.com 2011-12-07 20:56:01 UTC
On Wed, 7 Dec 2011, tromey at gcc dot gnu.org wrote:

> Note that GCC also handles the tab case incorrectly here.

Yes, GCC should be fixed to follow the GCS there as well.

The GCS now explicitly say "For non-ASCII characters, Unicode character 
widths should be used when in a UTF-8 locale; GNU libc and GNU gnulib 
provide suitable @code{wcwidth} functions."
Comment 11 Manuel López-Ibáñez 2016-02-04 02:14:42 UTC
This should be fixed in libcpp, probably in lex.c, but maybe other places also. A good testcase to start with would be:

/* ñ /* */
/* a /* */

cc1 -Wcomment

prog.cc:1:7: warning: "/*" within comment [-Wcomment]
 /* ñ /* */
       ^
prog.cc:2:6: warning: "/*" within comment [-Wcomment]
 /* a /* */
      ^

Both locations should point to column 6. Look for places where column info is converted to source_location (linemap_position_for_column or linemap_position_for_line_and_colum). Figure out where the column info got the wrong value. Use something like wcwidth() to measure the width of non-ASCII characters and get the right width of 'ñ'.

Unfortunately, GCC 6 seems to be broken for the above testcase (https://gcc.gnu.org/PR69664). Revision 229665 seems fine.
Comment 12 Eric Gallager 2018-01-27 19:47:34 UTC
(In reply to Manuel López-Ibáñez from comment #11)
> This should be fixed in libcpp, probably in lex.c, but maybe other places
> also. A good testcase to start with would be:
> 
> /* ñ /* */
> /* a /* */
> 
> cc1 -Wcomment
> 
> prog.cc:1:7: warning: "/*" within comment [-Wcomment]
>  /* ñ /* */
>        ^
> prog.cc:2:6: warning: "/*" within comment [-Wcomment]
>  /* a /* */
>       ^
> 
> Both locations should point to column 6. Look for places where column info
> is converted to source_location (linemap_position_for_column or
> linemap_position_for_line_and_colum). Figure out where the column info got
> the wrong value. Use something like wcwidth() to measure the width of
> non-ASCII characters and get the right width of 'ñ'.
> 
> Unfortunately, GCC 6 seems to be broken for the above testcase
> (https://gcc.gnu.org/PR69664). Revision 229665 seems fine.

Bug 69664 is fixed now.
Comment 13 Lewis Hyatt 2019-09-15 04:21:32 UTC
Created attachment 46882 [details]
partial patch illustrating the intended approach

Hello-

I would like to help fix this up. I have another patch, hopefully being approved soon, that will implement support for extended characters in identifiers. If that sees some use, then this problem will become more noticeable, so it would be nice to address the column numbering issue at the same time.

It's worth noting that there is a related problem not specifically mentioned so far in this bug report -- in addition to the column number, the caret output in diagnostic-show-locus.c also goes to the same wrong column.

Attached for comment is a preliminary patch (no tests yet) that fixes the column number only (not the caret). If the overall approach seems sound, I am happy to flesh it out to handle all the fancy diagnostics too. I have a couple questions about the implementation, though, which I thought worthwhile to clear up before proceeding further.

Here is a useful test case for this:

=======
const char* smile = "😊😊😊😊😊😊😊😊"; int x = y;
=======

Currently it outputs:

t.cpp:1:65: error: ‘y’ was not declared in this scope
 const char* smile = "😊😊😊😊😊😊😊😊"; int x = y;
                                                                 ^

And indeed the column number and caret are both incorrect (but consistent with each other). Not visible here is that the 'y' is properly colorized in the source line. That works as-is because the ANSI color escapes should be inserted at the proper byte, not the proper display column. So the machinery in diagnostic-show-locus.c requires both the byte count and the logical column count in order to both colorize the source line and put the caret at the right place. 

Therefore the approach I chose was to keep most things exactly as they are -- all locations are tracked via byte counts just as now. Then all that's necessary is to compute the logical display column just when it's needed, at the time the diagnostic is output. This seems tenable because location_get_source_line() already lets us retrieve the line and work with it efficiently. That's the idea I went with here to adjust the column number; happy to hear any feedback before diving in to applying the same idea throughout diagnostic-show-locus.c.

I have one other question though. This quick attempt uses wchar.h, namely the mbrtowc() and wcwidth() functions. Firstly, it seems unfortunate to introduce a dependency on those, which may be problematic for Windows, etc. (I borrowed the same configure checks used in intl.c for that.) But also, this will always convert multibyte input files using the user's locale, whereas GCC doesn't work like this when lexing files -- it ignores the locale and rather defaults to UTF-8 unless overridden by -finput-charset. It seems that in a perfect world diagnostics parsing of the source should work the same way.

I feel like the most portable solution is just to use directly the necessary code (from glibc or gnulib or from scratch or wherever) to handle the wcwidth() functionality, and tweak it for this purpose. It's in essence just a binary search in a table. Basically I would convert the source line from the input charset to UTF-8 the same way the file is read on original input (using the facilities in libcpp/charset.c), and then I would just need a variant of wcwidth() that, rather than dealing with wchar_t and locales, simply takes UTF-8 input and returns the necessary width. Does this seem like a good approach? Otherwise, if I keep the use of wchar APIs, I think it would be necessary to make a temporary change to the locale when computing the display column, as dictated by the input charset.

Note that this also brings up an unrelated question with -finput-charset. Say right now, I have a file in latin1 encoding and it is processed with -finput-charset=latin1, and then I compile it from a UTF-8 locale. If a source line is output in diagnostics, currently it gets output in the latin1 encoding and produces unreadable characters on the terminal. With changes along the lines I propose in the previous paragraph, the diagnostics would end up being output in UTF-8 (or whatever the current locale demands), which I think is maybe an improvement as well.

Thanks for any suggestions.

-Lewis
Comment 14 joseph@codesourcery.com 2019-09-16 20:32:53 UTC
On Sun, 15 Sep 2019, lhyatt at gmail dot com wrote:

> I feel like the most portable solution is just to use directly the necessary
> code (from glibc or gnulib or from scratch or wherever) to handle the wcwidth()
> functionality, and tweak it for this purpose. It's in essence just a binary

Both of those use data generated in some way from Unicode data (stored in 
the locale in the glibc case).

A standalone generator implementing UAX#11 rules for character width 
should be straightforward (we'd need to check in the generator sources as 
well as the generated table).

> search in a table. Basically I would convert the source line from the input
> charset to UTF-8 the same way the file is read on original input (using the
> facilities in libcpp/charset.c), and then I would just need a variant of

Yes, sources need to be processed consistently (converted from input 
charset to UTF-8).  And then of course converted from UTF-8 to the locale 
character set for the final output on the terminal (with some form of 
graceful degradation if the source character set has characters that can't 
be represented in the locale character set - extended identifiers 
diagnostics use UCNs in that case, but I don't know if that's best in 
general).

If a source line in the default -finput-charset=utf-8 case contains 
non-UTF-8 bytes in strings or comments, we can't safely display them on 
the terminal, so my inclination in such a case would be to treat such 
bytes as width-1 and output them as '?'.
Comment 15 Manuel López-Ibáñez 2019-09-17 17:20:30 UTC
(In reply to Lewis Hyatt from comment #13)
> I have one other question though. This quick attempt uses wchar.h, namely
> the mbrtowc() and wcwidth() functions. Firstly, it seems unfortunate to
> introduce a dependency on those, which may be problematic for Windows, etc.

Apart from Joseph suggestion, in a more general sense, there is no issue with introducing a dependency on gnulib. It is a long-term goal to replace parts of libiberty with gnulib: https://gcc.gnu.org/wiki/replacelibibertywithgnulib
Comment 16 Lewis Hyatt 2019-09-17 22:35:06 UTC
Thank you both for the feedback so far. Regarding the use of wcwidth(), one thing I noticed is that glibc has a much different result than gnulib does, say for instance emojis return width 2 in the former rather than 1. (Which seems better based on what I can tell.) It seems that glibc has undergone a fair amount of tweaking to match what applications expect and so what it provides is not coming directly from parsing the Unicode specs, although that's probably the bulk of it. But I wonder, perhaps this is a sign that it might be better to just make use of glibc and not try to add in a third implementation to the mix?

In any case, the underlying source of wcwidth() could easily be changed as a drop-in replacement so I guess it can also be decided later. The use of mbrtowc() is the bigger problem, since this converts from the user's locale and it needs to convert from what -finput-charset asked for (or else UTF-8) instead.

I have a more or less fully-baked patch at this point, that fixes up all diagnostics that I am aware of (changes mostly in diagnostic.c and diagnostic-show-locus.c) to be multi-byte aware. That includes column numbers, carets, annotations, notes, fixit hints, etc. The patch still ignores the input-charset issue and uses mbrtowc(), so that is the last thing for me to add before I think it is worth sharing. I was wondering if I could get some advice as to where to start here please?

It seems that basically location_get_source_line() in input.c needs to return the lines converted to UTF-8, since all parsing has been working with the lines in this form, and all the byte offsets they populated rich_locations with, etc, are relative to the converted data too. I am not sure what's the correct way though for location_get_source_line() to know the value of the -finput-charset option. Typically this is inspected from a cpp_reader object, but none is available in the context where this runs, that I understand anyway. It seems that in order to make use of the existing conversion machinery in libcpp/charset.c, I need to have a cpp_reader instance available too. Appreciate any suggestions here. Thanks!

-Lewis
Comment 17 joseph@codesourcery.com 2019-09-17 22:46:34 UTC
On Tue, 17 Sep 2019, lhyatt at gmail dot com wrote:

> In any case, the underlying source of wcwidth() could easily be changed as a
> drop-in replacement so I guess it can also be decided later. The use of
> mbrtowc() is the bigger problem, since this converts from the user's locale and
> it needs to convert from what -finput-charset asked for (or else UTF-8)
> instead.

If __STDC_ISO_10646__ is defined, wchar_t is Unicode and so local code 
converting from UTF-8 to wchar_t can be used (together with wcwidth from 
libc if available).

If __STDC_ISO_10646__ is not defined, the encoding of wchar_t is unknown.  
Maybe in that case it's best to avoid libc's wcwidth (if any) and just use 
a local implementation of wcwidth on the results of converting UTF-8 to 
Unicode code points.
Comment 18 Lewis Hyatt 2019-09-26 20:19:45 UTC
(In reply to joseph@codesourcery.com from comment #17)
> On Tue, 17 Sep 2019, lhyatt at gmail dot com wrote:
> 
> > In any case, the underlying source of wcwidth() could easily be changed as a
> > drop-in replacement so I guess it can also be decided later. The use of
> > mbrtowc() is the bigger problem, since this converts from the user's locale and
> > it needs to convert from what -finput-charset asked for (or else UTF-8)
> > instead.
> 
> If __STDC_ISO_10646__ is defined, wchar_t is Unicode and so local code 
> converting from UTF-8 to wchar_t can be used (together with wcwidth from 
> libc if available).
> 
> If __STDC_ISO_10646__ is not defined, the encoding of wchar_t is unknown.  
> Maybe in that case it's best to avoid libc's wcwidth (if any) and just use 
> a local implementation of wcwidth on the results of converting UTF-8 to 
> Unicode code points.

It seems to erase a lot of complexity just to always use an internal wcwidth(), so that's what I ended up doing. Patch was posted to https://gcc.gnu.org/ml/gcc-patches/2019-09/msg01558.html for your consideration. This one just addresses diagnostics, not the input-charset or user locale conversion stuff. I will submit those separately after this one is reviewed. Thanks!
Comment 19 David Malcolm 2019-12-09 20:04:19 UTC
Author: dmalcolm
Date: Mon Dec  9 20:03:47 2019
New Revision: 279137

URL: https://gcc.gnu.org/viewcvs?rev=279137&root=gcc&view=rev
Log:
Byte vs column awareness for diagnostic-show-locus.c (PR 49973)

contrib/ChangeLog

2019-12-09  Lewis Hyatt  <lhyatt@gmail.com>

	PR preprocessor/49973
	* unicode/from_glibc/unicode_utils.py: Support script from
	glibc (commit 464cd3) to extract character widths from Unicode data
	files.
	* unicode/from_glibc/utf8_gen.py: Likewise.
	* unicode/UnicodeData.txt: Unicode v. 12.1.0 data file.
	* unicode/EastAsianWidth.txt: Likewise.
	* unicode/PropList.txt: Likewise.
	* unicode/gen_wcwidth.py: New utility to generate
	libcpp/generated_cpp_wcwidth.h with help from the glibc support
	scripts and the Unicode data files.
	* unicode/unicode-license.txt: Added.
	* unicode/README: New explanatory file.

libcpp/ChangeLog

2019-12-09  Lewis Hyatt  <lhyatt@gmail.com>

	PR preprocessor/49973
	* generated_cpp_wcwidth.h: New file generated by
	../contrib/unicode/gen_wcwidth.py, supports new cpp_wcwidth function.
	* charset.c (compute_next_display_width): New function to help
	implement display columns.
	(cpp_byte_column_to_display_column): Likewise.
	(cpp_display_column_to_byte_column): Likewise.
	(cpp_wcwidth): Likewise.
	* include/cpplib.h (cpp_byte_column_to_display_column): Declare.
	(cpp_display_column_to_byte_column): Declare.
	(cpp_wcwidth): Declare.
	(cpp_display_width): New function.

gcc/ChangeLog

2019-12-09  Lewis Hyatt  <lhyatt@gmail.com>

	PR preprocessor/49973
	* input.c (location_compute_display_column): New function to help with
	multibyte awareness in diagnostics.
	(test_cpp_utf8): New self-test.
	(input_c_tests): Call the new test.
	* input.h (location_compute_display_column): Declare.
	* diagnostic-show-locus.c: Pervasive changes to add multibyte awareness
	to all classes and functions.
	(enum column_unit): New enum.
	(class exploc_with_display_col): New class.
	(class layout_point): Convert m_column member to array m_columns[2].
	(layout_range::contains_point): Add col_unit argument.
	(test_layout_range_for_single_point): Pass new argument.
	(test_layout_range_for_single_line): Likewise.
	(test_layout_range_for_multiple_lines): Likewise.
	(line_bounds::convert_to_display_cols): New function.
	(layout::get_state_at_point): Add col_unit argument.
	(make_range): Use empty filename rather than dummy filename.
	(get_line_width_without_trailing_whitespace): Rename to...
	(get_line_bytes_without_trailing_whitespace): ...this.
	(test_get_line_width_without_trailing_whitespace): Rename to...
	(test_get_line_bytes_without_trailing_whitespace): ...this.
	(class layout): m_exploc changed to exploc_with_display_col from
	plain expanded_location.
	(layout::get_linenum_width): New accessor member function.
	(layout::get_x_offset_display): Likewise.
	(layout::calculate_linenum_width): New subroutine for the constuctor.
	(layout::calculate_x_offset_display): Likewise.
	(layout::layout): Use the new subroutines. Add multibyte awareness.
	(layout::print_source_line): Add multibyte awareness.
	(layout::print_line): Likewise.
	(layout::print_annotation_line): Likewise.
	(line_label::line_label): Likewise.
	(layout::print_any_labels): Likewise.
	(layout::annotation_line_showed_range_p): Likewise.
	(get_printed_columns): Likewise.
	(class line_label): Rename m_length to m_display_width.
	(get_affected_columns): Rename to...
	(get_affected_range): ...this; add col_unit argument and multibyte
	awareness.
	(class correction): Add m_affected_bytes and m_display_cols
	members.  Rename m_len to m_byte_length for clarity.  Add multibyte
	awareness throughout.
	(correction::insertion_p): Add multibyte awareness.
	(correction::compute_display_cols): New function.
	(correction::ensure_terminated): Use new member name m_byte_length.
	(line_corrections::add_hint): Add multibyte awareness.
	(layout::print_trailing_fixits): Likewise.
	(layout::get_x_bound_for_row): Likewise.
	(test_one_liner_simple_caret_utf8): New self-test analogous to the one
	with _utf8 suffix removed, testing multibyte awareness.
	(test_one_liner_caret_and_range_utf8): Likewise.
	(test_one_liner_multiple_carets_and_ranges_utf8): Likewise.
	(test_one_liner_fixit_insert_before_utf8): Likewise.
	(test_one_liner_fixit_insert_after_utf8): Likewise.
	(test_one_liner_fixit_remove_utf8): Likewise.
	(test_one_liner_fixit_replace_utf8): Likewise.
	(test_one_liner_fixit_replace_non_equal_range_utf8): Likewise.
	(test_one_liner_fixit_replace_equal_secondary_range_utf8): Likewise.
	(test_one_liner_fixit_validation_adhoc_locations_utf8): Likewise.
	(test_one_liner_many_fixits_1_utf8): Likewise.
	(test_one_liner_many_fixits_2_utf8): Likewise.
	(test_one_liner_labels_utf8): Likewise.
	(test_diagnostic_show_locus_one_liner_utf8): Likewise.
	(test_overlapped_fixit_printing_utf8): Likewise.
	(test_overlapped_fixit_printing): Adapt for changes to
	get_affected_columns, get_printed_columns and class corrections.
	(test_overlapped_fixit_printing_2): Likewise.
	(test_linenum_sep): New constant.
	(test_left_margin): Likewise.
	(test_offset_impl): Helper function for new test.
	(test_layout_x_offset_display_utf8): New test.
	(diagnostic_show_locus_c_tests): Call new tests.

gcc/testsuite/ChangeLog:

2019-12-09  Lewis Hyatt  <lhyatt@gmail.com>

	PR preprocessor/49973
	* gcc.dg/plugin/diagnostic_plugin_test_show_locus.c
	(test_show_locus): Tweak so that expected output is the same as
	before the diagnostic-show-locus.c changes.
	* gcc.dg/cpp/pr66415-1.c: Likewise.


Added:
    trunk/contrib/unicode/
    trunk/contrib/unicode/EastAsianWidth.txt
    trunk/contrib/unicode/PropList.txt
    trunk/contrib/unicode/README
    trunk/contrib/unicode/UnicodeData.txt
    trunk/contrib/unicode/from_glibc/
    trunk/contrib/unicode/from_glibc/unicode_utils.py
    trunk/contrib/unicode/from_glibc/utf8_gen.py   (with props)
    trunk/contrib/unicode/gen_wcwidth.py   (with props)
    trunk/contrib/unicode/unicode-license.txt
    trunk/libcpp/generated_cpp_wcwidth.h
Modified:
    trunk/contrib/ChangeLog
    trunk/gcc/ChangeLog
    trunk/gcc/diagnostic-show-locus.c
    trunk/gcc/input.c
    trunk/gcc/input.h
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/cpp/pr66415-1.c
    trunk/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_show_locus.c
    trunk/libcpp/ChangeLog
    trunk/libcpp/charset.c
    trunk/libcpp/include/cpplib.h

Propchange: trunk/contrib/unicode/from_glibc/utf8_gen.py
            ('svn:executable' added)

Propchange: trunk/contrib/unicode/gen_wcwidth.py
            ('svn:executable' added)
Comment 20 David Malcolm 2019-12-09 20:27:37 UTC
I've committed r279137 on Lewis's behalf, which fixes the issues identified in patch #13.

As noted in review of the patch, we didn't attempt to change the behavior of diagnostic_get_location_text with this change.  Quoting myself from:
  https://gcc.gnu.org/ml/gcc-patches/2019-11/msg02171.html

> This is the column number as reported in the diagnostic i.e the COL_NUM
> when printing e.g.
>   warning: FILENAME:LINE_NUM:COL_NUM: some message
> 
> It seems to me that PR 49973 and this patch cover two separate things:
> (a) bytes vs display columns in diagnostic-show-locus.c
> (b) the "COL_NUM" mentioned above.
> 
> I'd prefer to omit (b) from the patch, and have the focus of the patch
> be (a), to tackle (b) in a separate patch.
> 
> [There's also the meaning of column numbers in the JSON output, and in
> the output of -fdiagnostics-parseable-fixits (which is intended to mimic
> clang's output format)]
> 
> It's unclear to me what the reported COL_NUM should be.
> There are various possibilities:
> 
> Units:
>   (A) [status quo] report a count of bytes within the line
>   (B) report a count of unicode characters
>   (C) report a count of unicode graphemes
>   (D) report based on the wcwidth of the characters
>   etc
> 
> Origin/baseline:
>   (A) [status quo] use 1 for the leftmost column
>   (B) use 0 for the leftmost column
> 
> Tab-handling:
>   (A) [status quo] don't give any kind of special status to tab characters
>   (B) implement tab stops, somehow.  For example, get_visual_column in
>       c-family/c-indentation implements tab stops based on bytes.
> 
> (so at least 4*2*2 = 16 possible meanings, ugh)
> 
> See also e.g.:
>   https://github.com/oasis-tcs/sarif-spec/issues/178
> 
> The GNU Coding Standards say
> 
>    Line numbers should start from 1 at the beginning of the file, and
>    column numbers should start from 1 at the beginning of the line.
>    (Both of these conventions are chosen for compatibility.) Calculate
>    column numbers assuming that space and all ASCII printing characters
>    have equal width, and assuming tab stops every 8 columns. For
>    non-ASCII characters, Unicode character widths should be used when in
>    a UTF-8 locale; GNU libc and GNU gnulib provide suitable wcwidth
>    functions.
> (https://www.gnu.org/prep/standards/standards.html#Errors)
> 
> I think if we do change the meaning of the "COL_NUM" output, we should
> probably add an option for it, to help with the transition (so that
> people can easily revert to the old behavior).
> 
> Perhaps something like:
> 
>   -fdiagnostics-column-unit=[bytes|gnu]
> 
>      bytes: [status-quo]; 1-based count of bytes, not respecting tab stops
>      gnu: as per GNU Coding Standards above
> 
> and have gcc 10 default to "gnu" (or whatever we call it), so that
> people can override it back to "bytes".
> 
> (again, I'm thinking aloud here)
> 
> But please can you split that out as a separate patch? (it's arguably
> still in time for GCC 10, as it's from a patch was posted before the
> stage 1 deadline).
Comment 21 David Malcolm 2019-12-09 20:39:31 UTC
(In reply to David Malcolm from comment #20)
> I've committed r279137 on Lewis's behalf, which fixes the issues identified
> in patch #13.
comment #13, I meant to say