This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Make GO string literals properly NUL terminated


On Fri, Aug 24, 2018 at 1:49 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> On 08/24/18 12:52, Richard Biener wrote:
> > On Wed, Aug 22, 2018 at 6:57 AM Bernd Edlinger
> > <bernd.edlinger@hotmail.de> wrote:
> >>
> >> On 08/21/18 10:33, Richard Biener wrote:
> >>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> >>>
> >>>> On 08/20/18 15:19, Richard Biener wrote:
> >>>>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> >>>>>
> >>>>>> On 08/20/18 13:01, Richard Biener wrote:
> >>>>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 08/01/18 11:29, Richard Biener wrote:
> >>>>>>>>>
> >>>>>>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
> >>>>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
> >>>>>>>>> for your check above.  Because the '\0' doesn't belong to the
> >>>>>>>>> string.  Then build_string internally appends a '\0' outside
> >>>>>>>>> of TREE_STRING_LENGTH.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
> >>>>>>>> character.
> >>>>>>>
> >>>>>>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
> >>>>>>> parameter to build_string and allocate as many extra 0s as needed.
> >>>>>>>
> >>>>>>>       There are STRING_CSTs which are not string literals,
> >>>>>>>> for instance attribute tags, Pragmas, asm constrants, etc.
> >>>>>>>> They use the '\0' outside, and have probably no TREE_TYPE.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> So I would like to be able to assume that the STRING_CST objects
> >>>>>>>>>> are internally always generated properly by the front end.
> >>>>>>>>>
> >>>>>>>>> Yeah, I guess we need to define what "properly" is ;)
> >>>>>>>>>
> >>>>>>>> Yes.
> >>>>>>>>
> >>>>>>>>>> And that the ARRAY_TYPE of the string literal either has the
> >>>>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
> >>>>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
> >>>>>>>>>
> >>>>>>>>> I think it should be always the same...
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> One could not differentiate between "\0" without zero-termination
> >>>>>>>> and "" with zero-termination, theoretically.
> >>>>>>>
> >>>>>>> Is that important?  Doesn't the C standard say how to parse string literals?
> >>>>>>>
> >>>>>>>> We also have char x[100] = "ab";
> >>>>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
> >>>>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
> >>>>>>>> but imagine char x[100000000000] = "ab"
> >>>>>>>
> >>>>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
> >>>>>>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
> >>>>>>> unconditionally be [], thus incomplete (because the literals "size" depends
> >>>>>>> on the context/LHS it is used on).
> >>>>>>>
> >>>>>>
> >>>>>> Sorry, but I must say, it is not at all like that.
> >>>>>>
> >>>>>> If I compile x.c:
> >>>>>> const char x[100] = "ab";
> >>>>>>
> >>>>>> and set a breakpoint at output_constant:
> >>>>>>
> >>>>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
> >>>>>>         reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
> >>>>>> 4821         if (size == 0 || flag_syntax_only)
> >>>>>> (gdb) p size
> >>>>>> $1 = 100
> >>>>>> (gdb) call debug(exp)
> >>>>>> "ab"
> >>>>>> (gdb) p *exp
> >>>>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
> >>>>>> (gdb) p exp->typed.type->type_common.size_unit
> >>>>>> $5 = (tree) 0x7ffff6ff9d80
> >>>>>> (gdb) call debug(exp->typed.type->type_common.size_unit)
> >>>>>> 100
> >>>>>> (gdb) p exp->string.length
> >>>>>> $6 = 3
> >>>>>> (gdb) p exp->string.str[0]
> >>>>>> $8 = 97 'a'
> >>>>>> (gdb) p exp->string.str[1]
> >>>>>> $9 = 98 'b'
> >>>>>> (gdb) p exp->string.str[2]
> >>>>>> $10 = 0 '\000'
> >>>>>> (gdb) p exp->string.str[3]
> >>>>>> $11 = 0 '\000'
> >>>>>>
> >>>>>>
> >>>>>> This is an important property of string_cst objects, that is used in c_strlen:
> >>>>>>
> >>>>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
> >>>>>> is guaranteed to be zero up to the type size.
> >>>>>>
> >>>>>> I would not have spent one thought on implementing an optimization like that,
> >>>>>> but that's how it is right now.
> >>>>>
> >>>>> Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
> >>>>> they have zero-padding up to its type size.  I don't see this documented
> >>>>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
> >>>>> with appropriate TYPE_SIZE.
> >>>>>
> >>>>> This is also a relatively new thing on trunk (coming out of the added
> >>>>> mem_size parameter of string_constant).  That this looks at the STRING_CST
> >>>>> type like
> >>>>>
> >>>>>      if (TREE_CODE (array) == STRING_CST)
> >>>>>        {
> >>>>>          *ptr_offset = fold_convert (sizetype, offset);
> >>>>>          if (mem_size)
> >>>>>            *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
> >>>>>          return array;
> >>>>>
> >>>>> I'd call simply a bug.  As said, interpretation of STRING_CSTs should
> >>>>> depend on the context.  And for
> >>>>>
> >>>>
> >>>> This use of the TYPE_SIZE_UNIT was pre-existing to r263607
> >>>> before that (but not in the gcc-8-branch) we had this in c_strlen:
> >>>>
> >>>>      HOST_WIDE_INT maxelts = strelts;
> >>>>      tree type = TREE_TYPE (src);
> >>>>      if (tree size = TYPE_SIZE_UNIT (type))
> >>>>        if (tree_fits_shwi_p (size))
> >>>>          {
> >>>>           maxelts = tree_to_uhwi (size);
> >>>>           maxelts = maxelts / eltsize - 1;
> >>>>          }
> >>>>
> >>>> which I moved to string_constant and return the result through memsize.
> >>>>
> >>>> It seems to be working somehow, and I'd bet removing that will immediately
> >>>> break twenty or thirty of the strlenopt tests.
> >>>>
> >>>> Obviously the tree string objects allow way too much variations,
> >>>> and it has to be restricted in some way or another.
> >>>>
> >>>> In the moment my approach is: look at what most string constants do
> >>>> and add assertions to ensure that there are no exceptions.
> >>>>
> >>>>
> >>>>> char a[4] = "abc";
> >>>>> char b[5] = "abc";
> >>>>>
> >>>>> we should better be able to share STRING_CSTs [you can see LTO
> >>>>> sharing the nodes if you do b[4] but not when b[5] I suppose].
> >>>>>
> >>>>>> All I want to do, is make sure that all string constants have the same look and feel
> >>>>>> in the middle-end, and restrict the variations that are allowed by the current
> >>>>>> implementation.
> >>>>>
> >>>>> Sure, I understand that.  But I'd like to simplify things and not add
> >>>>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
> >>>>> whether sth is 0-terminated.
> >>>>>
> >>>>
> >>>> This is not only about 0-terminated. (*)
> >>>>
> >>>> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
> >>>> there are places in the middle-end that assume that the object contains
> >>>> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
> >>>> Those I want to protect.
> >>>
> >>> Well, but string_constant handles &STRING_CST just fine but in that
> >>> context there's no "object" we look at.
> >>>
> >>> IMHO whenever string_constant ends up with a DECL, looking at
> >>> ctor_for_folding and we end up with a STRING_CST that doesn't fit
> >>> in the DECL we looked at we have a bug (non-truncated STRING_CST)
> >>> and either should do the truncation or simply return NULL.
> >>>
> >>> So we should look at DECL_SIZE_UNIT and compare that with
> >>> TREE_STRING_LENGTH.  Otherwise you are relying on TYPE_SIZE_UNIT
> >>> of the STRING_CST being "correct" (fitting the decl context).
> >>>
> >>>> Bernd.
> >>>>
> >>>>
> >>>> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
> >>>> the string is only returned when it is properly zero terminated.",
> >>>> but maybe I should have written:
> >>>> "If MEM_SIZE is zero, the string is only returned when the storage size
> >>>> of the string object is at least TREE_STRING_LENGTH."
> >>>> That's at least exactly what the check does.
> >>>
> >>> Well, as said above
> >>>
> >>>     if (TREE_CODE (array) == STRING_CST)
> >>>       {
> >>>         *ptr_offset = fold_convert (sizetype, offset);
> >>>         if (mem_size)
> >>>           *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
> >>>         return array;
> >>>       }
> >>>
> >>> simply assumes the "storage size" of a STRING_CST is determined
> >>> by its TYPE_SIZE_UNIT.  That may be true as you noted in the folloup,
> >>> but clearly in the above case there's nothing to protect?  And in
> >>> the case we pulled the STRING_CST from some DECL_INITIAL it doesn't
> >>> protect from overflow of the FIELD_DECL unless frontends never
> >>> generate "wrong" STRING_CSTs?
> >>>
> >>
> >> Hmm, I digged some more in varasm.c
> >>
> >> Normal STRING_CST use get_constant_size to allocate
> >> MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH)
> >> thus they can have excess zero bytes.
> >>
> >> while STRING_CST that are in DECL_INITIALIZERs
> >> are shrinked to DECL_SIZE_UNIT.
> >>
> >> At least that difference looks unnecessary to me.
> >>
> >>
> >> But compare_constant does not compare the TYPE_SIZE_UNIT:
> >>
> >>       case STRING_CST:
> >>         if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2)))
> >>           return 0;
> >>
> >>         return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2)
> >>                 && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2),
> >>                            TREE_STRING_LENGTH (t1)));
> >>
> >>
> >> This looks now like a bug.
> >
> > Well.  I think that we emit excess zero bytes for "truncated" STRING_CSTs isn't
> > a designed feature.  I'd rather not have it that way because the length of a
> > STRING_CST becomes defined in multiple places if you consider STRING_CST
> > with [] or too small type then TREE_STRING_LENGTH is authorative while
> > if the type is larger then that is authorative.
> >
> > The length of a STRING_CST if output should be only determined from
> > TREE_STRING_LENGTH.
> >
>
> Hmm. I think about that.  It is possible that I change my mind....
>
> Currently the semantic of STRING_CST objects differs for literals
> and for initializer elements.
>
> For literals the TREE_STRING_LENGTH is guaranteed to resemble
> the runtime string object.
>
> For initializer elements the type domain overrides the TREE_STRING_LENGTH
> if it is available.  If it is a flexible array member the type domain
> is not available, and the TREE_STRING_LENGTH determines the memory size.
>
> The regression in tree-ssa-forwprop.c (PR 86714) is caused by
> the fact that previously constant_string did always return string literals
> while now it can as well return initializer elements.
> But tree-ssa-forwprop expects string literal semantics.
>
> I would like to remove the semantic differences.
>
> I have a patch in the test, that fixes the TYPE_DOMAIN of flexible
> array members.
>
> Maybe the most important property is TYPE_DOMAIN of the STRING_CST should never
> be shorter than the TREE_STRING_LENGTH. Then it the difference between
> string literals and string initializers will disappear.
>
> And that should probably take precedence over the zero-termination property of
> the string_cst objects.
>
> Which will of course put one more question mark on the correctness of the
> string_constant, c_strlen etc.
>
>
> >>
> >>> Btw, get_constant_size / mergeable_string_section suggsts that
> >>> we may view STRING_CST as general target-encoded byte blob.
> >>> That may be useful to compress CONSTRUCTORs memory use.
> >>>
> >>> It looks like mergeable_string_section wrongly would reject
> >>> a char[] typed string because int_size_in_bytes returns -1
> >>> for incomplete types.  I still believe using char[] for most
> >>> STRING_CSTs generated by FEs would be a good thing for
> >>> IL memory use.  Shouldn't the mem_size initializations use
> >>> sth like get_constant_size as well?
> >>>
> >>
> >> I think incomplete types exist only in structs with variable
> >> length:
> >>
> >> struct {
> >>     int x;
> >>     char y[];
> >> } s = { 1, "test" };
> >>
> >> this is no candidate for a mergeable string.
> >>
> >> but
> >> char y[] = "test";
> >>
> >> is fixed by the FE to:
> >> char y[5] = "test";
> >>
> >> So that does not make problems with mergeable_string_section.
> >>
> >>
> >>> Also
> >>>
> >>>     if (mem_size)
> >>>       *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init));
> >>>     else if (compare_tree_int (array_size, length + 1) < 0)
> >>>       return NULL_TREE;
> >>>
> >>> the latter check doesn't seem to honor 'offset' and *mem_size
> >>> is again looking at the STRING_CST, not at the FIELD_DECL or
> >>> wherever we got the STRING_CST from.
> >>>
> >>
> >> the caller compares offset with *mem_size, but we do not have
> >> the FIELD_DECL at hand here (Or I did not know where it was).
> >> I'd rather fail the cases when the TYPE_SIZE_UNIT is null,
> >> or something non-constant.
> >>
> >> All those are variations of vla with initializer and similar.
> >> Once the STRING_CST has a type, I would like to use it,
> >> when it doesn't the whole thing is probably not worth it
> >> anyway.
> >
> > I would rather have STRING_CSTs only have a type for the
> > elements, not the whole thing...  you can get that
> > if you drop TYPE_DOMAIN on all STRING_CSTs array types.
> >
>
> This will break the string_constant code even more.
> It depends on the TYPE_DOMAIN to mean something.
>
> > Not sure what breaks if we change get_constant_size to
> > return TREE_STRING_LENGTH unconditionally.  At least
> > somehow the need to special-case STRING_CSTs looks
> > odd and I wonder if we do this in other functions as well...
> >
> I think as well this special handling in get_constant_size
> is something that has to go away.
>
> It is even worse, the DECL_SIZE_UNIT is also NULL in
> flexible array members.
>
> But I think I can make it work the other way round.
> So that STRING_CST always have a TYPE_DOMAIN which
> may only be larger than the TREE_STRING_LENGTH.

OK, that would then be at least clear semantics and follow
that of CONSTRUCTORs (missing "elements" are zero).

Richard.

>
>
> Bernd.
>
> > Richard.
> >
> >>
> >>
> >> Bernd.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]