This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Make GO string literals properly NUL terminated


On Mon, 20 Aug 2018, Bernd Edlinger wrote:

> On 08/20/18 13:01, Richard Biener wrote:
> > On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> >>
> >>
> >>
> >> On 08/01/18 11:29, Richard Biener wrote:
> >>>
> >>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
> >>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
> >>> for your check above.  Because the '\0' doesn't belong to the
> >>> string.  Then build_string internally appends a '\0' outside
> >>> of TREE_STRING_LENGTH.
> >>>
> >>
> >> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
> >> character.
> > 
> > That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
> > parameter to build_string and allocate as many extra 0s as needed.
> > 
> >    There are STRING_CSTs which are not string literals,
> >> for instance attribute tags, Pragmas, asm constrants, etc.
> >> They use the '\0' outside, and have probably no TREE_TYPE.
> >>
> >>>
> >>>> So I would like to be able to assume that the STRING_CST objects
> >>>> are internally always generated properly by the front end.
> >>>
> >>> Yeah, I guess we need to define what "properly" is ;)
> >>>
> >> Yes.
> >>
> >>>> And that the ARRAY_TYPE of the string literal either has the
> >>>> same length than the TREE_STRING_LENGTH or if it is shorter,
> >>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
> >>>
> >>> I think it should be always the same...
> >>>
> >>
> >> One could not differentiate between "\0" without zero-termination
> >> and "" with zero-termination, theoretically.
> > 
> > Is that important?  Doesn't the C standard say how to parse string literals?
> > 
> >> We also have char x[100] = "ab";
> >> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
> >> Of course one could create it with a TREE_STRING_LENGTH = 100,
> >> but imagine char x[100000000000] = "ab"
> > 
> > The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
> > hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
> > unconditionally be [], thus incomplete (because the literals "size" depends
> > on the context/LHS it is used on).
> > 
> 
> Sorry, but I must say, it is not at all like that.
> 
> If I compile x.c:
> const char x[100] = "ab";
> 
> and set a breakpoint at output_constant:
> 
> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
>      reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
> 4821	  if (size == 0 || flag_syntax_only)
> (gdb) p size
> $1 = 100
> (gdb) call debug(exp)
> "ab"
> (gdb) p *exp
> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
> (gdb) p exp->typed.type->type_common.size_unit
> $5 = (tree) 0x7ffff6ff9d80
> (gdb) call debug(exp->typed.type->type_common.size_unit)
> 100
> (gdb) p exp->string.length
> $6 = 3
> (gdb) p exp->string.str[0]
> $8 = 97 'a'
> (gdb) p exp->string.str[1]
> $9 = 98 'b'
> (gdb) p exp->string.str[2]
> $10 = 0 '\000'
> (gdb) p exp->string.str[3]
> $11 = 0 '\000'
> 
> 
> This is an important property of string_cst objects, that is used in c_strlen:
> 
> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
> is guaranteed to be zero up to the type size.
> 
> I would not have spent one thought on implementing an optimization like that,
> but that's how it is right now.

Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
they have zero-padding up to its type size.  I don't see this documented
anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
with appropriate TYPE_SIZE.

This is also a relatively new thing on trunk (coming out of the added
mem_size parameter of string_constant).  That this looks at the STRING_CST
type like

  if (TREE_CODE (array) == STRING_CST)
    {
      *ptr_offset = fold_convert (sizetype, offset);
      if (mem_size)
        *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
      return array;

I'd call simply a bug.  As said, interpretation of STRING_CSTs should
depend on the context.  And for

char a[4] = "abc";
char b[5] = "abc";

we should better be able to share STRING_CSTs [you can see LTO
sharing the nodes if you do b[4] but not when b[5] I suppose].

> All I want to do, is make sure that all string constants have the same look and feel
> in the middle-end, and restrict the variations that are allowed by the current
> implementation.

Sure, I understand that.  But I'd like to simplify things and not add
complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
whether sth is 0-terminated.

Richard.

> 
> Bernd.
> 
> 
> >>>> The idea is to use this property of string literals where needed,
> >>>> and check rigorously in varasm.c.
> >>>>
> >>>> Does that make sense?
> >>>
> >>> So if it is not the same then the excess character needs to be
> >>> a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
> >>> that.
> >>>
> >>
> >> I think it does.
> >>
> >>
> >> Bernd.
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]