Bug 33167

Summary:	Hex constant characters with \x escape not parsing correctly
Product:	gcc	Reporter:	Weston Hopkins <weston>
Component:	c	Assignee:	Not yet assigned to anyone <unassigned>
Status:	RESOLVED INVALID
Severity:	minor	CC:	gcc-bugs, weston
Priority:	P3
Version:	4.1.0
Target Milestone:	---
Host:	i586-suse-linux	Target:	i586-suse-linux
Build:	i586-suse-linux	Known to work:
Known to fail:		Last reconfirmed:

Description Weston Hopkins 2007-08-23 21:49:33 UTC

There seems to be a problem with how gcc parses \x escape sequences.  It doesn't look at just the first 2 hex digits; it appears to take the rightmost 2 hex digits in a run of hex digits.

[Recreate]
---------------[ SNIP ]---------------------
// test.c
#include <stdio.h>
#include <string.h>

int main(void) {
	const char *string = "\x01\x02\x03Bob";
	/* strlen() returns size_t, so print it with %zu rather than %d */
	printf("len: %zu\n", strlen(string));
	return 0;
}
---------------[ SNIP ]---------------------

[Compilation options]
gcc -Wall test.c -o test

[Expected Results]
You would expect this to print "len: 6", but it actually prints "len: 5".  It seems that it's parsing the last \x escape as the hex value 0x3B instead of as 2 characters, 0x03 and 'B'.

[Platforms]
I've noticed this problem in gcc 4.1.0 and 4.0.1 (on a Mac).  Here's more info on one of the systems I've experienced this on:

gcc (GCC) 4.1.0 (SUSE Linux)
Using built-in specs.
Target: i586-suse-linux
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib --libexecdir=/usr/lib --enable-languages=c,c++,objc,fortran,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.1.0 --enable-ssp --disable-libssp --enable-java-awt=gtk --enable-gtk-cairo --disable-libjava-multilib --with-slibdir=/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --enable-libstdcxx-allocator=new --without-system-libunwind --with-cpu=generic --host=i586-suse-linux
Thread model: posix
gcc version 4.1.0 (SUSE Linux)

$ uname -a
linux haldol 2.6.16.13-4-default #1 Wed May 3 04:53:23 UTC 2006 i686 athlon i386 GNU/Linux

Comment 1 Andrew Pinski 2007-08-23 21:59:37 UTC

No, GCC is correct.
The standard says:
Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.

So that means the B is going to be taken and be used for the hexadecimal escape sequence.
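
As an illustration of that rule applied to the test case, here is a minimal sketch. The function name is made up for this example and is not part of the original report.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical check function, named for illustration only. */
void check_greedy_hex(void)
{
    /* 'B' is a valid hex digit, so the last escape is \x03B, not \x03. */
    const char *s = "\x01\x02\x03Bob";

    /* Bytes are 0x01, 0x02, 0x3B, 'o', 'b': five characters, not six. */
    assert(strlen(s) == 5);
    assert((unsigned char)s[2] == 0x3B);
    assert(s[3] == 'o');
}
```

The 'o' ends the escape only because it is not a hex digit.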

Comment 2 Weston Hopkins 2007-08-24 16:04:37 UTC

Yep, looks like you are right according to the standard.  That sucks then. I wish it were the other way, because I don't see a way to enter a literal single character in hex followed by a single character [A-Z0-9] without escape sequences.  Thanks for the quick response.

Comment 3 Ken Raeburn 2007-08-24 19:45:36 UTC

(In reply to comment #2)
> Yep, looks like you are right from the standard.  That sucks then. I wish it
> were the other way because I don't see a way to enter a literal single
> character in hex followed by a single character [A-Z0-9] without escape
> sequences.

char *string = "\x01\x02\x03" "Bob";

should work.
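
A short sketch of why splitting the literal helps: escape sequences are converted within each literal before adjacent literals are concatenated, so the closing quote ends the \x03 escape. The function name below is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical check function, named for illustration only. */
void check_split_literal(void)
{
    /* The closing quote terminates \x03, so "Bob" keeps all three letters. */
    const char *s = "\x01\x02\x03" "Bob";
    const char expected[] = { 1, 2, 3, 'B', 'o', 'b', 0 };

    assert(strlen(s) == 6);
    /* Compare all seven bytes, including the terminating null. */
    assert(memcmp(s, expected, sizeof expected) == 0);
}
```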

Comment 4 Albert Chan 2018-01-07 17:16:29 UTC

If gcc's hex escape handling is right, then its octal escape handling is wrong
(it just looks at the first 3 octal digits)

"\123" = "S"
"\0123" = "\n3"      ??
"\00123" = "\1" "23" ??

Personally, I like this octal escape "bug".

Comment 5 jsm-csl@polyomino.org.uk 2018-01-09 22:01:11 UTC

The standard syntax production for octal-escape-sequence (C11 6.4.4.4#1) 
only allows one, two or three digits.
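
A minimal sketch of the three-digit limit, which also suggests that an octal escape could serve the original test case without splitting the literal. The function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical check function, named for illustration only. */
void check_octal_limit(void)
{
    /* \012 uses up the three allowed digits; '3' is an ordinary character. */
    assert(sizeof "\0123" == 3);          /* 012, '3', terminating null */
    assert("\0123"[0] == 012);
    assert("\0123"[1] == '3');

    /* \001 stops after three digits; "23" follows as plain text. */
    assert(sizeof "\00123" == 4);         /* 001, '2', '3', terminating null */
    assert("\00123"[0] == 1);
    assert("\00123"[1] == '2');

    /* 'B' is not an octal digit, so \003 stands alone here. */
    assert(strlen("\001\002\003Bob") == 6);
}
```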

Comment 6 Albert Chan 2018-01-09 22:48:58 UTC

If gcc's hex escape AND octal escape handling are both right, doesn't that contradict comment #1?

"octal or hexadecimal ... longest sequence that constitute escape sequence"

I noticed that OLD Python (2.0) also used the C rule for hex escapes,
but later switched to a more sensible 2 hex = 1 byte rule (\xX or \xXX)

Comment 7 jsm-csl@polyomino.org.uk 2018-01-09 23:14:39 UTC

"longest sequence of characters that can constitute the escape sequence" 
resolves an ambiguity between alternative parses permitted by the syntax; 
it doesn't need to deal with anything that is not permitted by the syntax.  
Four or more octal characters in an octal sequence are not a parse 
permitted by the syntax, whereas more than two hex characters in a hex 
sequence are a parse permitted by the syntax.
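
To make the distinction concrete, here is a rough sketch of a scanner that follows the two syntax productions. It is a hypothetical helper written for this illustration, not GCC's lexer: the digit count is capped only for octal, because only the octal production has an upper bound.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not GCC's implementation. s points at a backslash
   in source text. Returns how many source characters the octal or hex
   escape spans, or 0 if it is neither kind of escape. */
size_t escape_span(const char *s)
{
    assert(s != NULL);
    if (s[0] != '\\')
        return 0;

    if (s[1] == 'x') {
        /* Hex: \x plus one or more hex digits, no upper bound in the
           syntax, so keep consuming while digits remain. */
        size_t n = 2;
        while (isxdigit((unsigned char)s[n]))
            n++;
        return n > 2 ? n : 0;
    }

    /* Octal: backslash plus one, two or three octal digits. */
    size_t n = 1;
    while (n < 4 && s[n] >= '0' && s[n] <= '7')
        n++;
    return n > 1 ? n : 0;
}
```

Under these assumptions, the source text `\x03Bob` gives a span of 5 (the escape swallows the B), while `\0123` gives a span of 4 (the trailing 3 is left over).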