This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Make GO string literals properly NUL terminated


On Wed, Aug 22, 2018 at 6:57 AM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> On 08/21/18 10:33, Richard Biener wrote:
> > On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> >
> >> On 08/20/18 15:19, Richard Biener wrote:
> >>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> >>>
> >>>> On 08/20/18 13:01, Richard Biener wrote:
> >>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 08/01/18 11:29, Richard Biener wrote:
> >>>>>>>
> >>>>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
> >>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
> >>>>>>> for your check above.  Because the '\0' doesn't belong to the
> >>>>>>> string.  Then build_string internally appends a '\0' outside
> >>>>>>> of TREE_STRING_LENGTH.
> >>>>>>>
> >>>>>>
> >>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
> >>>>>> character.
> >>>>>
> >>>>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
> >>>>> parameter to build_string and allocate as many extra 0s as needed.
> >>>>>
> >>>>>      There are STRING_CSTs which are not string literals,
> >>>>>> for instance attribute tags, Pragmas, asm constrants, etc.
> >>>>>> They use the '\0' outside, and have probably no TREE_TYPE.
> >>>>>>
> >>>>>>>
> >>>>>>>> So I would like to be able to assume that the STRING_CST objects
> >>>>>>>> are internally always generated properly by the front end.
> >>>>>>>
> >>>>>>> Yeah, I guess we need to define what "properly" is ;)
> >>>>>>>
> >>>>>> Yes.
> >>>>>>
> >>>>>>>> And that the ARRAY_TYPE of the string literal either has the
> >>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
> >>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
> >>>>>>>
> >>>>>>> I think it should be always the same...
> >>>>>>>
> >>>>>>
> >>>>>> One could not differentiate between "\0" without zero-termination
> >>>>>> and "" with zero-termination, theoretically.
> >>>>>
> >>>>> Is that important?  Doesn't the C standard say how to parse string literals?
> >>>>>
> >>>>>> We also have char x[100] = "ab";
> >>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
> >>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
> >>>>>> but imagine char x[100000000000] = "ab"
> >>>>>
> >>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
> >>>>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
> >>>>> unconditionally be [], thus incomplete (because the literals "size" depends
> >>>>> on the context/LHS it is used on).
> >>>>>
> >>>>
> >>>> Sorry, but I must say, it is not at all like that.
> >>>>
> >>>> If I compile x.c:
> >>>> const char x[100] = "ab";
> >>>>
> >>>> and set a breakpoint at output_constant:
> >>>>
> >>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
> >>>>        reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
> >>>> 4821         if (size == 0 || flag_syntax_only)
> >>>> (gdb) p size
> >>>> $1 = 100
> >>>> (gdb) call debug(exp)
> >>>> "ab"
> >>>> (gdb) p *exp
> >>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
> >>>> (gdb) p exp->typed.type->type_common.size_unit
> >>>> $5 = (tree) 0x7ffff6ff9d80
> >>>> (gdb) call debug(exp->typed.type->type_common.size_unit)
> >>>> 100
> >>>> (gdb) p exp->string.length
> >>>> $6 = 3
> >>>> (gdb) p exp->string.str[0]
> >>>> $8 = 97 'a'
> >>>> (gdb) p exp->string.str[1]
> >>>> $9 = 98 'b'
> >>>> (gdb) p exp->string.str[2]
> >>>> $10 = 0 '\000'
> >>>> (gdb) p exp->string.str[3]
> >>>> $11 = 0 '\000'
> >>>>
> >>>>
> >>>> This is an important property of string_cst objects, that is used in c_strlen:
> >>>>
> >>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
> >>>> is guaranteed to be zero up to the type size.
> >>>>
> >>>> I would not have spent one thought on implementing an optimization like that,
> >>>> but that's how it is right now.
> >>>
> >>> Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
> >>> they have zero-padding up to its type size.  I don't see this documented
> >>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
> >>> with appropriate TYPE_SIZE.
> >>>
> >>> This is also a relatively new thing on trunk (coming out of the added
> >>> mem_size parameter of string_constant).  That this looks at the STRING_CST
> >>> type like
> >>>
> >>>     if (TREE_CODE (array) == STRING_CST)
> >>>       {
> >>>         *ptr_offset = fold_convert (sizetype, offset);
> >>>         if (mem_size)
> >>>           *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
> >>>         return array;
> >>>
> >>> I'd call simply a bug.  As said, interpretation of STRING_CSTs should
> >>> depend on the context.  And for
> >>>
> >>
> >> This use of the TYPE_SIZE_UNIT was pre-existing to r263607
> >> before that (but not in the gcc-8-branch) we had this in c_strlen:
> >>
> >>     HOST_WIDE_INT maxelts = strelts;
> >>     tree type = TREE_TYPE (src);
> >>     if (tree size = TYPE_SIZE_UNIT (type))
> >>       if (tree_fits_shwi_p (size))
> >>         {
> >>          maxelts = tree_to_uhwi (size);
> >>          maxelts = maxelts / eltsize - 1;
> >>         }
> >>
> >> which I moved to string_constant and return the result through memsize.
> >>
> >> It seems to be working somehow, and I'd bet removing that will immediately
> >> break twenty or thirty of the strlenopt tests.
> >>
> >> Obviously the tree string objects allow way too much variations,
> >> and it has to be restricted in some way or another.
> >>
> >> In the moment my approach is: look at what most string constants do
> >> and add assertions to ensure that there are no exceptions.
> >>
> >>
> >>> char a[4] = "abc";
> >>> char b[5] = "abc";
> >>>
> >>> we should better be able to share STRING_CSTs [you can see LTO
> >>> sharing the nodes if you do b[4] but not when b[5] I suppose].
> >>>
> >>>> All I want to do, is make sure that all string constants have the same look and feel
> >>>> in the middle-end, and restrict the variations that are allowed by the current
> >>>> implementation.
> >>>
> >>> Sure, I understand that.  But I'd like to simplify things and not add
> >>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
> >>> whether sth is 0-terminated.
> >>>
> >>
> >> This is not only about 0-terminated. (*)
> >>
> >> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
> >> there are places in the middle-end that assume that the object contains
> >> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
> >> Those I want to protect.
> >
> > Well, but string_constant handles &STRING_CST just fine but in that
> > context there's no "object" we look at.
> >
> > IMHO whenever string_constant ends up with a DECL, looking at
> > ctor_for_folding and we end up with a STRING_CST that doesn't fit
> > in the DECL we looked at we have a bug (non-truncated STRING_CST)
> > and either should do the truncation or simply return NULL.
> >
> > So we should look at DECL_SIZE_UNIT and compare that with
> > TREE_STRING_LENGTH.  Otherwise you are relying on TYPE_SIZE_UNIT
> > of the STRING_CST being "correct" (fitting the decl context).
> >
> >> Bernd.
> >>
> >>
> >> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
> >> the string is only returned when it is properly zero terminated.",
> >> but maybe I should have written:
> >> "If MEM_SIZE is zero, the string is only returned when the storage size
> >> of the string object is at least TREE_STRING_LENGTH."
> >> That's at least exactly what the check does.
> >
> > Well, as said above
> >
> >    if (TREE_CODE (array) == STRING_CST)
> >      {
> >        *ptr_offset = fold_convert (sizetype, offset);
> >        if (mem_size)
> >          *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
> >        return array;
> >      }
> >
> > simply assumes the "storage size" of a STRING_CST is determined
> > by its TYPE_SIZE_UNIT.  That may be true as you noted in the folloup,
> > but clearly in the above case there's nothing to protect?  And in
> > the case we pulled the STRING_CST from some DECL_INITIAL it doesn't
> > protect from overflow of the FIELD_DECL unless frontends never
> > generate "wrong" STRING_CSTs?
> >
>
> Hmm, I digged some more in varasm.c
>
> Normal STRING_CST use get_constant_size to allocate
> MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH)
> thus they can have excess zero bytes.
>
> while STRING_CST that are in DECL_INITIALIZERs
> are shrinked to DECL_SIZE_UNIT.
>
> At least that difference looks unnecessary to me.
>
>
> But compare_constant does not compare the TYPE_SIZE_UNIT:
>
>      case STRING_CST:
>        if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2)))
>          return 0;
>
>        return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2)
>                && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2),
>                           TREE_STRING_LENGTH (t1)));
>
>
> This looks now like a bug.

Well.  I think that we emit excess zero bytes for "truncated" STRING_CSTs isn't
a designed feature.  I'd rather not have it that way because the length of a
STRING_CST becomes defined in multiple places if you consider STRING_CST
with [] or too small type then TREE_STRING_LENGTH is authorative while
if the type is larger then that is authorative.

The length of a STRING_CST if output should be only determined from
TREE_STRING_LENGTH.

>
> > Btw, get_constant_size / mergeable_string_section suggsts that
> > we may view STRING_CST as general target-encoded byte blob.
> > That may be useful to compress CONSTRUCTORs memory use.
> >
> > It looks like mergeable_string_section wrongly would reject
> > a char[] typed string because int_size_in_bytes returns -1
> > for incomplete types.  I still believe using char[] for most
> > STRING_CSTs generated by FEs would be a good thing for
> > IL memory use.  Shouldn't the mem_size initializations use
> > sth like get_constant_size as well?
> >
>
> I think incomplete types exist only in structs with variable
> length:
>
> struct {
>    int x;
>    char y[];
> } s = { 1, "test" };
>
> this is no candidate for a mergeable string.
>
> but
> char y[] = "test";
>
> is fixed by the FE to:
> char y[5] = "test";
>
> So that does not make problems with mergeable_string_section.
>
>
> > Also
> >
> >    if (mem_size)
> >      *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init));
> >    else if (compare_tree_int (array_size, length + 1) < 0)
> >      return NULL_TREE;
> >
> > the latter check doesn't seem to honor 'offset' and *mem_size
> > is again looking at the STRING_CST, not at the FIELD_DECL or
> > wherever we got the STRING_CST from.
> >
>
> the caller compares offset with *mem_size, but we do not have
> the FIELD_DECL at hand here (Or I did not know where it was).
> I'd rather fail the cases when the TYPE_SIZE_UNIT is null,
> or something non-constant.
>
> All those are variations of vla with initializer and similar.
> Once the STRING_CST has a type, I would like to use it,
> when it doesn't the whole thing is probably not worth it
> anyway.

I would rather have STRING_CSTs only have a type for the
elements, not the whole thing...  you can get that
if you drop TYPE_DOMAIN on all STRING_CSTs array types.

Not sure what breaks if we change get_constant_size to
return TREE_STRING_LENGTH unconditionally.  At least
somehow the need to special-case STRING_CSTs looks
odd and I wonder if we do this in other functions as well...

Richard.

>
>
> Bernd.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]