Bug 49973 - Column numbers count special characters as multiple columns
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: preprocessor
Version: 4.5.2
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: easyhack
Depends on:
Blocks:
 
Reported: 2011-08-04 10:21 UTC by Timothy Liang
Modified: 2016-02-04 02:14 UTC
CC List: 3 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-08-04 15:18:20


Description Timothy Liang 2011-08-04 10:21:25 UTC
int main()
{
	/* 中 */ asdf;
}

g++ -finput-charset=utf8 hello.cpp

hello.cpp: In function ‘int main()’:
hello.cpp:3:12: error: ‘asdf’ was not declared in this scope

The column number should be 10, not 12.
Comment 1 Jakub Jelinek 2011-08-04 10:27:53 UTC
Depends on how the column numbers are defined.  I think gcc uses bytes from the beginning of the line, then 12 is correct (and e.g. for tab characters gcc counts them as one instead of 1-8 depending on position too).
Comment 2 Timothy Liang 2011-08-04 10:36:53 UTC
(In reply to comment #1)
> Depends on how the column numbers are defined.  I think gcc uses bytes from the
> beginning of the line, then 12 is correct (and e.g. for tab characters gcc
> counts them as one instead of 1-8 depending on position too).

That isn't the case here.  Replacing the '中' with another character makes the column number 10.  Setting -finput-charset=latin1 makes the column number 15.
Comment 3 Andreas Schwab 2011-08-04 10:55:13 UTC
Why 10? "    /* 中 */ " has 12 characters (and 14 bytes as utf8).
Comment 4 Timothy Liang 2011-08-04 11:03:32 UTC
(In reply to comment #3)
> Why 10? "    /* 中 */ " has 12 characters (and 14 bytes as utf8).

The four spaces are supposed to be a tab.  Also, the column number starts from one.  So:

 /* 中 */ asdf
         |
1234567890

Since I set the input charset to UTF-8, g++ should count the '中' as one character, not three.  And when I set it to latin1, g++ should count the '中' as three, not six.
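
For what it's worth, the three numbers reported so far (12 with utf8, 10 with a different character, 15 with latin1) are all consistent with counting bytes of the line after it has been converted to UTF-8. The Python sketch below reproduces them. It is only a model of the observed behaviour, not code taken from GCC; byte_column and char_column are names made up for this illustration, and the 'x' replacement assumes the substituted character was plain ASCII.

```python
def byte_column(line, target, source_charset="utf-8"):
    """1-based column of `target`, counted in UTF-8 bytes.

    The file bytes are first interpreted as `source_charset`
    (a stand-in for -finput-charset) and then re-encoded as UTF-8.
    """
    raw = line.encode("utf-8")            # bytes as stored in the file
    text = raw.decode(source_charset)     # how the charset option reads them
    prefix = text[: text.index(target)]
    return len(prefix.encode("utf-8")) + 1


def char_column(line, target):
    """1-based column of `target`, counting every character (tab included) as one."""
    return line.index(target) + 1


line = "\t/* 中 */ asdf;"

print(byte_column(line, "asdf"))                # 12, what g++ prints with utf8
print(char_column(line, "asdf"))                # 10, what the report asks for
print(byte_column(line, "asdf", "latin-1"))     # 15, what g++ prints with latin1
print(byte_column("\t/* x */ asdf;", "asdf"))   # 10, ASCII character instead of 中
```

Under this model, 中 contributes 3 columns with utf8 (its three UTF-8 bytes) and 6 with latin1 (each of those bytes becomes a two-byte UTF-8 sequence).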
Comment 5 Manuel López-Ibáñez 2011-08-04 12:18:19 UTC
GNU Emacs 23.2.1 counts it as two, and puts the cursor at s.

For the simpler case of:

/* ñ */ asdf;

we print 

test.c:3:10: error: ‘asdf’ was not declared in this scope

whereas Emacs counts only 1 char, so it again puts the cursor at s. I am not sure whether Emacs is following some GNU standard, but ñ and n should at least produce the same result.

Unfortunately, I don't have time to work on this.
Comment 6 joseph@codesourcery.com 2011-08-04 14:38:16 UTC
The GCS says "column numbers should start from 1 at the beginning of the 
line ... Calculate column numbers assuming that space and all ASCII 
printing characters have equal width, and assuming tab stops every 8 
columns.".  This doesn't say how other characters should be counted, 
although the results of wcswidth seem appropriate.
Comment 7 Manuel López-Ibáñez 2011-08-04 15:18:20 UTC
(In reply to comment #6)
> The GCS says "column numbers should start from 1 at the beginning of the 
> line ... Calculate column numbers assuming that space and all ASCII 
> printing characters have equal width, and assuming tab stops every 8 
> columns.".  This doesn't say how other characters should be counted, 
> although the results of wcswidth seem appropriate.

Then GCC is not using wcswidth to count or it is setting the locale inappropriately because it is counting 2 for ñ and 3 for 中, while it should be 1 and 2.
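
To make the expected widths concrete, here is a rough Python sketch of the GCS rule quoted above: columns start at 1, tab stops every 8 columns, and wide characters take two columns. It approximates wcwidth with unicodedata.east_asian_width, which is an assumption of this sketch and not how GCC or glibc compute widths; char_width and display_column are hypothetical helper names.

```python
import unicodedata

TAB_STOP = 8  # tab stops every 8 columns, as the GCS text says


def char_width(ch):
    """Approximate terminal width of one character (rough stand-in for wcwidth)."""
    if unicodedata.combining(ch):
        return 0                      # combining marks occupy no column
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # wide and fullwidth characters
    return 1


def display_column(line, index):
    """1-based display column of the character at `index` in `line`."""
    col = 0
    for ch in line[:index]:
        if ch == "\t":
            col += TAB_STOP - col % TAB_STOP   # advance to the next tab stop
        else:
            col += char_width(ch)
    return col + 1


print(char_width("ñ"), char_width("中"))        # 1 2

tab_line = "\t/* 中 */ asdf;"
print(display_column(tab_line, tab_line.index("asdf")))   # 18

plain_line = "/* ñ */ asdf;"
print(display_column(plain_line, plain_line.index("asdf")))  # 9
```

Note that with tab stops applied, the column for the original testcase would be 18 rather than 10 or 12, because the leading tab alone advances to column 9.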
Comment 8 Timothy Liang 2011-08-04 19:52:31 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > The GCS says "column numbers should start from 1 at the beginning of the 
> > line ... Calculate column numbers assuming that space and all ASCII 
> > printing characters have equal width, and assuming tab stops every 8 
> > columns.".  This doesn't say how other characters should be counted, 
> > although the results of wcswidth seem appropriate.
> 
> Then GCC is not using wcswidth to count or it is setting the locale
> inappropriately because it is counting 2 for ñ and 3 for 中, while it should be
> 1 and 2.

I'm confused.  Shouldn't 中 be 1?
Comment 9 Tom Tromey 2011-12-07 17:59:52 UTC
(In reply to comment #6)
> The GCS says "column numbers should start from 1 at the beginning of the 
> line ... Calculate column numbers assuming that space and all ASCII 
> printing characters have equal width, and assuming tab stops every 8 
> columns.".  This doesn't say how other characters should be counted, 
> although the results of wcswidth seem appropriate.

Note that GCC also handles the tab case incorrectly here.
This shows up if you M-x next-error in Emacs in the case where
gcc emits column numbers and your source code includes tabs.

Refiling this to preprocessor.
Comment 10 joseph@codesourcery.com 2011-12-07 20:56:01 UTC
On Wed, 7 Dec 2011, tromey at gcc dot gnu.org wrote:

> Note that GCC also handles the tab case incorrectly here.

Yes, GCC should be fixed to follow the GCS there as well.

The GCS now explicitly says "For non-ASCII characters, Unicode character 
widths should be used when in a UTF-8 locale; GNU libc and GNU gnulib 
provide suitable @code{wcwidth} functions."
Comment 11 Manuel López-Ibáñez 2016-02-04 02:14:42 UTC
This should be fixed in libcpp, probably in lex.c, but maybe in other places as well. A good testcase to start with would be:

/* ñ /* */
/* a /* */

cc1 -Wcomment

prog.cc:1:7: warning: "/*" within comment [-Wcomment]
 /* ñ /* */
       ^
prog.cc:2:6: warning: "/*" within comment [-Wcomment]
 /* a /* */
      ^

Both locations should point to column 6. Look for places where column info is converted to source_location (linemap_position_for_column or linemap_position_for_line_and_column). Figure out where the column info gets the wrong value. Use something like wcwidth() to measure the width of non-ASCII characters and get the right width for 'ñ'.

Unfortunately, GCC 6 seems to be broken for the above testcase (https://gcc.gnu.org/PR69664). Revision 229665 seems fine.
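
As a quick sanity check of the expectation for the testcase above, the following Python sketch compares a byte-based column with a width-based column for the nested "/*" on each line. It is an illustration only and does not touch libcpp; byte_column and width_column are made-up names, and the width rule (2 for East Asian wide or fullwidth characters, 1 otherwise) is a simplification of what wcwidth() would return.

```python
import unicodedata


def byte_column(line, index):
    """1-based column counting UTF-8 bytes before `index`."""
    return len(line[:index].encode("utf-8")) + 1


def width_column(line, index):
    """1-based column counting display width before `index` (simplified rule)."""
    width = 0
    for ch in line[:index]:
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        width += 2 if wide else 1
    return width + 1


for line in ("/* ñ /* */", "/* a /* */"):
    nested = line.index("/*", 2)   # position of the "/*" inside the comment
    print(line, byte_column(line, nested), width_column(line, nested))

# /* ñ /* */ 7 6
# /* a /* */ 6 6
```

The byte-based numbers match the 1:7 and 2:6 that cc1 prints, and the width-based numbers are 6 for both lines, which is the result a fix should produce.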