[PATCH] Check the STRING_CSTs in varasm.c

Bernd Edlinger bernd.edlinger@hotmail.de
Wed Aug 22 22:52:00 GMT 2018


On 08/22/18 22:54, Martin Sebor wrote:
> On 08/22/2018 08:27 AM, Bernd Edlinger wrote:
>> Hi,
>>
>>
>> this is an updated version of my STRING_CSTs checking in varasm.c
>> patch.
>>
>> It tries to answer the questions that were raised in th GO string literals
>> thread.
>>
>> The answers are:
>> a) strings with TYPE_SIZE_UNIT == NULL do exist, but only for STRING_CSTs
>> in constructors of flexible array struct members.  Not for string literals.
>> b) In all cases where the STRING_CSTs do have a TYPE_SIZE_UNIT, the
>> DECL_SIZE_UNIT has the same value.
>> c) When STRING_CSTs do not have a TYPE_SIZE_UNIT, that is in the constructor
>> of a flexible array member.  In that case the TREE_STRING_LENGTH
>> determines the flexible array size.
>>
>>
>> It changes varasm's get_constant_size to return the TYPE_SIZE_UNIT of
>> a STRING_CST literal as it is, without increasing the storage size
>> to TREE_STRING_LENGTH.  I added an assertion to make sure that
>> all STRING_CSTs have a type size; size == 0 can happen for empty Ada
>> strings.
>>
>> Furthermore it adds code to compare_constant to also compare the
>> STRING_CSTs TYPE_SIZE_UNIT if available.
>>
>> So I want to remove that from get_constant_size in order to not change
>> the memory layout of GO and Ada strings, while still having them look
>> mostly like C string literals.
>>
>> Furthermore I added one more consistency check to check_string_literal,
>> that makes sure that all STRING_CSTs that do have a TYPE_SIZE_UNIT,
>> the size matches the DECL_SIZE_UNIT.
>>
>> Those newly discovered properties of string literals make the code in
>> c_strlen and friends a lot simpler.
>>
>>
>>
>> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
>> Is it OK for trunk?
> 
> 
> @@ -1162,9 +1162,13 @@ These nodes represent string-constants.
> 
>   returns the length of the string, as an @code{int}.  The
> 
>   @code{TREE_STRING_POINTER} is a @code{char*} containing the string
> 
>   itself.  The string may not be @code{NUL}-terminated, and it may contain
> 
> -embedded @code{NUL} characters.  Therefore, the
> 
> -@code{TREE_STRING_LENGTH} includes the trailing @code{NUL} if it is
> 
> -present.
> 
> +embedded @code{NUL} characters.  However, the
> 
> +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
> 
> +is not part of the language string literal but appended by the front end.
> 
> +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
> 
> 
> If the string is not NUL-terminated (not shall not be -- that
> makes it sound like it must not be nul-terminated).
> 
> +is one character shorter than @code{TREE_STRING_LENGTH}.
> 
> 
> Presumably the phrase "the @code{TREE_TYPE} is shorter" means
> that the type of the string is an array whose domain is
> [0, TREE_STRING_LENGTH - 1], or something like that.  It would
> help to make this clear in a less informal way, especially if
> not all STRING_CST types have a domain (sounds like some don't
> if they have a null TYPE_SIZE_UNIT).
> 

Yes, good point.

> +Excess characters other than one trailing @code{NUL} character are not
> 
> +permitted.
> 
> 
> 
> Does this mean that they can be counted on not to exist
> because the front ends make sure they don't and the middle
> end doesn't create them, or that should not be created but
> that they might still exist?
> 
> You also mentioned a lot of detail in your answers above that
> isn't captured here.  Can that be added?  E.g., the part about
> TYPE_SIZE_UNIT of a STRING_CST being allowed to be null seems
> especially important, as does the bit about flexible array
> members.
>
Yes, initially I just saw the strings which are returned
from string_constant were much longer than the memory they lived in.
Therefore I wrote that patch that shortens the overlength
C strings.  I wanted to keep the property that the bare
string constants are null terminated, and not longer than they
should be, but that the type domain in the array type can
be used to strip off just the wide NUL termination, where the
frontend creates those properly, and the middle-end can
make use of these properties, and varasm.c checks that.

It was Richard's suggestion to add some checking code that the
string constants are actually how we want them to be in varasm.c,
so I did it and discovered a lot of things that are probably
not like we expected them to be.
Some background, and Richards comments are in this thread:
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01328.html

Richard does not think that it is safe to use the
TYPE_SIZE_UNIT like c_strlen, and other places does, but most
of the time the TYPE_SIZE_UNIT of the STRING_CST is valid.
So I used the assertions in this patch to locate places where
that assumption fails.

This revealed that string literals do always have a type domain,
and it is so far always such that
TYPE_SIZE_UNIT(TREE_TYPE) is something, most of the time = TREE_STRING_LENGTH.

Even if it is smaller than TREE_STRING_LENGTH that does not
matter for varasm, when these are actually streamed out they
are made as large as MAX(TREE_STRING_LENGTH, TYPE_SIZE_UNIT).

Thus code that looked at STRING_CST that were previously returned from
string_constant (before the initializer handling was implemented),
were in fact correct to use all bytes up to TREE_STRING_LENGTH-1.

Now for STRING_CST that are in DECL_INITIAL the rules are different.
Their memory size is MIN(TYPE_SIZE_UNIT,TREE_STRING_LENGTH) when
TYPE_SIZE_UNIT or the TYPE_DOMAIN MAX is != NULL.  If it is NULL
we usually have an initializer of a flexible array structure members.
The memory size is then simly TREE_STRING_LENGTH.  I do not know
if string_constant is able to return such incomplete strings or not,
maybe it is better not to return them.

However incomplete array initializers like
const char x[] = "abc";
are only incomplete while parsing (where you call braced_list_to_string)
and are made complete types a bit later (where I would like to move the
braced_list_to_string, after the array type domain is complete).
See: https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01361.html

So with the change to return the DECL_INITIAL STRING_CSTs
from string_constant, all places where the STRING_CSTs are used
are now broken.

This is an attempt to restore correctness on the STRING_CSTs.

So something has to change here.

I do currently think that I want to use the same rules for
string literals and string initializers.  That if they have
a type domain, that defines the memory size.  The Frontends
create always string constants that use a zero-termination.
And the array type domain for C literals is always
[0..TREE_STRING_LENGTH-1]
while Ada uses [0..TREE_STRING_LENGTH-2].

For sting_constant it is okay to use the array type domain,
at least when it is available, and varasm protects the
overall integrity.

Richard wants to use incomplete types everywhere.
But I cannot say if it is always easy to get a DECL_SIZE_UNIT
when one needs it.
I am also not sure that this is also an even larger change.

It is also possible to not do the NUL termination and
striping again for Ada/GO, However, even Ada (and Go) has a need to
get the strings in the string merge section, and that
plays nice with the additional NUL termination when the
string goes into the merge section.

And of course it is also possible to do everything without
any NUL termination, the termination could be simply encoded
in the array type domain.

My suggestions are mostly due to how one can make c_getstr
and c_strlen's life more easy.

But that is I would say details are still in discussion.


Thanks
Bernd.


More information about the Gcc-patches mailing list