This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Make GO string literals properly NUL terminated


On 08/20/18 17:59, Bernd Edlinger wrote:
> On 08/20/18 15:19, Richard Biener wrote:
>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
>>
>>> On 08/20/18 13:01, Richard Biener wrote:
>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 08/01/18 11:29, Richard Biener wrote:
>>>>>>
>>>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
>>>>>> for your check above.  Because the '\0' doesn't belong to the
>>>>>> string.  Then build_string internally appends a '\0' outside
>>>>>> of TREE_STRING_LENGTH.
>>>>>>
>>>>>
>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
>>>>> character.
>>>>
>>>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
>>>> parameter to build_string and allocate as many extra 0s as needed.
>>>>
>>>>     There are STRING_CSTs which are not string literals,
>>>>> for instance attribute tags, Pragmas, asm constrants, etc.
>>>>> They use the '\0' outside, and have probably no TREE_TYPE.
>>>>>
>>>>>>
>>>>>>> So I would like to be able to assume that the STRING_CST objects
>>>>>>> are internally always generated properly by the front end.
>>>>>>
>>>>>> Yeah, I guess we need to define what "properly" is ;)
>>>>>>
>>>>> Yes.
>>>>>
>>>>>>> And that the ARRAY_TYPE of the string literal either has the
>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
>>>>>>
>>>>>> I think it should be always the same...
>>>>>>
>>>>>
>>>>> One could not differentiate between "\0" without zero-termination
>>>>> and "" with zero-termination, theoretically.
>>>>
>>>> Is that important?  Doesn't the C standard say how to parse string literals?
>>>>
>>>>> We also have char x[100] = "ab";
>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
>>>>> but imagine char x[100000000000] = "ab"
>>>>
>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
>>>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
>>>> unconditionally be [], thus incomplete (because the literals "size" depends
>>>> on the context/LHS it is used on).
>>>>
>>>
>>> Sorry, but I must say, it is not at all like that.
>>>
>>> If I compile x.c:
>>> const char x[100] = "ab";
>>>
>>> and set a breakpoint at output_constant:
>>>
>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
>>>       reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
>>> 4821      if (size == 0 || flag_syntax_only)
>>> (gdb) p size
>>> $1 = 100
>>> (gdb) call debug(exp)
>>> "ab"
>>> (gdb) p *exp
>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
>>> (gdb) p exp->typed.type->type_common.size_unit
>>> $5 = (tree) 0x7ffff6ff9d80
>>> (gdb) call debug(exp->typed.type->type_common.size_unit)
>>> 100
>>> (gdb) p exp->string.length
>>> $6 = 3
>>> (gdb) p exp->string.str[0]
>>> $8 = 97 'a'
>>> (gdb) p exp->string.str[1]
>>> $9 = 98 'b'
>>> (gdb) p exp->string.str[2]
>>> $10 = 0 '\000'
>>> (gdb) p exp->string.str[3]
>>> $11 = 0 '\000'
>>>
>>>
>>> This is an important property of string_cst objects, that is used in c_strlen:
>>>
>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
>>> is guaranteed to be zero up to the type size.
>>>
>>> I would not have spent one thought on implementing an optimization like that,
>>> but that's how it is right now.
>>
>> Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
>> they have zero-padding up to its type size.  I don't see this documented
>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
>> with appropriate TYPE_SIZE.
>>
>> This is also a relatively new thing on trunk (coming out of the added
>> mem_size parameter of string_constant).  That this looks at the STRING_CST
>> type like
>>
>>    if (TREE_CODE (array) == STRING_CST)
>>      {
>>        *ptr_offset = fold_convert (sizetype, offset);
>>        if (mem_size)
>>          *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
>>        return array;
>>
>> I'd call simply a bug.  As said, interpretation of STRING_CSTs should
>> depend on the context.  And for
>>
> 
> This use of the TYPE_SIZE_UNIT was pre-existing to r263607
> before that (but not in the gcc-8-branch) we had this in c_strlen:
> 
>    HOST_WIDE_INT maxelts = strelts;
>    tree type = TREE_TYPE (src);
>    if (tree size = TYPE_SIZE_UNIT (type))
>      if (tree_fits_shwi_p (size))
>        {
>         maxelts = tree_to_uhwi (size);
>         maxelts = maxelts / eltsize - 1;
>        }
> 
> which I moved to string_constant and return the result through memsize.
> 
> It seems to be working somehow, and I'd bet removing that will immediately
> break twenty or thirty of the strlenopt tests.
> 
> Obviously the tree string objects allow way too much variations,
> and it has to be restricted in some way or another.
> 
> In the moment my approach is: look at what most string constants do
> and add assertions to ensure that there are no exceptions.
> 
> 

I found some more places where TYPE_SIZE_UNIT (TREE_TYPE (str))
and TREE_STRING_LENGTH (str) are used together, in some very old code
and clearly Martin is not to blame for them:

varasm.c:

static HOST_WIDE_INT
get_constant_size (tree exp)
{
   HOST_WIDE_INT size;

   size = int_size_in_bytes (TREE_TYPE (exp));
   if (TREE_CODE (exp) == STRING_CST)
     size = MAX (TREE_STRING_LENGTH (exp), size);
   return size;
}

mergeable_string_section (tree decl ATTRIBUTE_UNUSED,
                           unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED,
                           unsigned int flags ATTRIBUTE_UNUSED)
{
   HOST_WIDE_INT len;

   if (HAVE_GAS_SHF_MERGE && flag_merge_constants
       && TREE_CODE (decl) == STRING_CST
       && TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE
       && align <= 256
       && (len = int_size_in_bytes (TREE_TYPE (decl))) > 0
       && TREE_STRING_LENGTH (decl) >= len)

So, no sorry, I don't see how to change the STRING_CST from using the
TREE_TYPE in that way.


Bernd.

>> char a[4] = "abc";
>> char b[5] = "abc";
>>
>> we should better be able to share STRING_CSTs [you can see LTO
>> sharing the nodes if you do b[4] but not when b[5] I suppose].
>>
>>> All I want to do, is make sure that all string constants have the same look and feel
>>> in the middle-end, and restrict the variations that are allowed by the current
>>> implementation.
>>
>> Sure, I understand that.  But I'd like to simplify things and not add
>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
>> whether sth is 0-terminated.
>>
> 
> This is not only about 0-terminated. (*)
> 
> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
> there are places in the middle-end that assume that the object contains
> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
> Those I want to protect.
> 
> Bernd.
> 
> 
> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
> the string is only returned when it is properly zero terminated.",
> but maybe I should have written:
> "If MEM_SIZE is zero, the string is only returned when the storage size
> of the string object is at least TREE_STRING_LENGTH."
> That's at least exactly what the check does.
> 
>> Richard.
>>
>>>
>>> Bernd.
>>>
>>>
>>>>>>> The idea is to use this property of string literals where needed,
>>>>>>> and check rigorously in varasm.c.
>>>>>>>
>>>>>>> Does that make sense?
>>>>>>
>>>>>> So if it is not the same then the excess character needs to be
>>>>>> a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
>>>>>> that.
>>>>>>
>>>>>
>>>>> I think it does.
>>>>>
>>>>>
>>>>> Bernd.
>>>
>>

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]