This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: PING [PATCH] warn for strlen of arrays with missing nul (PR 86552, 86711, 86714) )


On 08/24/18 08:36, Jeff Law wrote:
> On 08/02/2018 07:26 AM, Bernd Edlinger wrote:
>> On 08/02/18 04:44, Martin Sebor wrote:
>>> Since the foundation of the patch is detecting and avoiding
>>> the overly aggressive folding of unterminated char arrays,
>>> besides issuing a warning for such arguments to strlen,
>>> the patch also fixes pr86711 - wrong folding of memchr, and
>>> pr86714 - tree-ssa-forwprop.c confused by too long initializer.
>>>
>>> The substance of the attached updated patch is unchanged,
>>> I have just added test cases for the two additional bugs.
>>>
>>> Bernd, as I mentioned Wednesday, the patch supersedes
>>> yours here:
>>> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01800.html
>>>
>>
>> No problem, but I hope you understand, that I still uphold
>> my patch.
>>
>> So we have two patches now:
>> - mine, fixing a wrong code bug,
>> - yours, implementing a new warning and fixing a wrong
>> code bug at the same time.
>>
>> I will add a few comments to your patch below.
> [ ... ]
> 
> So a lot of the comments are out of date, presumably because Martin
> fixed the issues you pointed out in his second version of the patch.
> But there's still some useful nuggets in your comments that are still
> relevant.
> 

Yes, possible, and therefore I would like updated patches summarize
the changes.

> FYI it appears that sometimes you comment above a chunk of code, and
> other times below.  That makes it exceedingly difficult to figure out
> the issue you're trying to raise.
> 

Okydoky.


> 
>>
>>> Martin
>>>
>>> On 07/30/2018 01:17 PM, Martin Sebor wrote:
>>>> Attached is an updated version of the patch that handles more
>>>> instances of calling strlen() on a constant array that is not
>>>> a nul-terminated string.
>>>>
>>>> No other functions except strlen are explicitly handled yet,
>>>> and neither are constant arrays with braced-initializer lists
>>>> like const char a[] = { 'a', 'b', 'c' };  I am testing
>>>> an independent solution for those (bug 86552).  Once those
>>>> are handled the warning will be able to detect those as well.
>>>>
>>>> Tested on x86_64-linux.
>>>>
>>>> On 07/25/2018 05:38 PM, Martin Sebor wrote:
>>>>> Ping: https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01124.html
>>>>>
>>>>> The fix for bug 86532 has been checked in so this enhancement
>>>>> can now be applied on top of it (with only minor adjustments).
>>>>>
>>>>> On 07/19/2018 02:08 PM, Martin Sebor wrote:
>>>>>> In the discussion of my patch for pr86532 Bernd noted that
>>>>>> GCC silently accepts constant character arrays with no
>>>>>> terminating nul as arguments to strlen (and other string
>>>>>> functions).
>>>>>>
>>>>>> The attached patch is a first step in detecting these kinds
>>>>>> of bugs in strlen calls by issuing -Wstringop-overflow.
>>>>>> The next step is to modify all other handlers of built-in
>>>>>> functions to detect the same problem (not part of this patch).
>>>>>> Yet another step is to detect these problems in arguments
>>>>>> initialized using the non-string form:
>>>>>>
>>>>>>    const char a[] = { 'a', 'b', 'c' };
>>>>>>
>>>>>> This patch is meant to apply on top of the one for bug 86532
>>>>>> (I tested it with an earlier version of that patch so there
>>>>>> is code in the context that does not appear in the latest
>>>>>> version of the other diff).
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>
>>>>
>>>
>>> PR tree-optimization/86714 - tree-ssa-forwprop.c confused by too long initializer
>>> PR tree-optimization/86711 - wrong folding of memchr
>>> PR tree-optimization/86552 - missing warning for reading past the end of non-string arrays
>>>
>>> gcc/ChangeLog:
>>>
>>> 	PR tree-optimization/86714
>>> 	PR tree-optimization/86711
>>> 	PR tree-optimization/86552
>>> 	* builtins.h (warn_string_no_nul): Declare..
>>> 	(c_strlen): Add argument.
>>> 	* builtins.c (warn_string_no_nul): New function.
>>> 	(fold_builtin_strlen): Add argument.  Detect missing nul.
>>> 	(fold_builtin_1): Adjust.
>>> 	(string_length): Add argument and use it.
>>> 	(c_strlen): Same.
>>> 	(expand_builtin_strlen): Detect missing nul.
>>> 	* expr.c (string_constant): Add arguments.  Detect missing nul
>>> 	terminator and outermost declaration it's missing in.
>>> 	* expr.h (string_constant): Add argument.
>>> 	* fold-const.c (c_getstr): Change argument to bool*, rename
>>> 	other arguments.
>>> 	* fold-const-call.c (fold_const_call): Detect missing nul.
>>> 	* gimple-fold.c (get_range_strlen): Add argument.
>>> 	(get_maxval_strlen): Adjust.
>>> 	* gimple-fold.h (get_range_strlen): Add argument.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> 	PR tree-optimization/86714
>>> 	PR tree-optimization/86711
>>> 	PR tree-optimization/86552
>>> 	* gcc.c-torture/execute/memchr-1.c: New test.
>>> 	* gcc.c-torture/execute/pr86714.c: New test.
>>> 	* gcc.dg/warn-strlen-no-nul.c: New test.
>>>
>>> diff --git a/gcc/builtins.c b/gcc/builtins.c
>>> index aa3e0d8..f4924d5 100644
>>> --- a/gcc/builtins.c
>>> +++ b/gcc/builtins.c
>>> @@ -150,7 +150,7 @@ static tree stabilize_va_list_loc (location_t, tree, int);
>>>   static rtx expand_builtin_expect (tree, rtx);
>>>   static tree fold_builtin_constant_p (tree);
>>>   static tree fold_builtin_classify_type (tree);
>>> -static tree fold_builtin_strlen (location_t, tree, tree);
>>> +static tree fold_builtin_strlen (location_t, tree, tree, tree);
>>>   static tree fold_builtin_inf (location_t, tree, int);
>>>   static tree rewrite_call_expr (location_t, tree, int, tree, int, ...);
>>>   static bool validate_arg (const_tree, enum tree_code code);
>>> @@ -550,6 +550,36 @@ string_length (const void *ptr, unsigned eltsize, unsigned maxelts)
>>>     return n;
>>>   }
>>>   
>>> +/* For a call expression EXP to a function that expects a string argument,
>>> +   issue a diagnostic due to it being a called with an argument NONSTR
>>> +   that is a character array with no terminating NUL.  */
>>> +
>>> +void
>>> +warn_string_no_nul (location_t loc, tree exp, tree fndecl, tree nonstr)
>>> +{
>>> +  loc = expansion_point_location_if_in_system_header (loc);
>>> +
>>> +  bool warned;
>>> +  if (exp)
>>> +    {
>>> +      if (!fndecl)
>>> +	fndecl = get_callee_fndecl (exp);
>>> +      warned = warning_at (loc, OPT_Wstringop_overflow_,
>>> +			   "%K%qD argument missing terminating nul",
>>> +			   exp, fndecl);
>>> +    }
>>> +  else
>>> +    {
>>> +      gcc_assert (fndecl);
>>> +      warned = warning_at (loc, OPT_Wstringop_overflow_,
>>> +			   "%qD argument missing terminating nul",
>>> +			   fndecl);
>>> +    }
>>> +
>>> +  if (warned && DECL_P (nonstr))
>>> +    inform (DECL_SOURCE_LOCATION (nonstr), "referenced argument declared here");
>>> +}
>>> +
>>>   /* Compute the length of a null-terminated character string or wide
>>>      character string handling character sizes of 1, 2, and 4 bytes.
>>>      TREE_STRING_LENGTH is not the right way because it evaluates to
>>> @@ -567,37 +597,60 @@ string_length (const void *ptr, unsigned eltsize, unsigned maxelts)
>>>      accesses.  Note that this implies the result is not going to be emitted
>>>      into the instruction stream.
>>>   
>>> +   When ARR is non-null and the string is not properly nul-terminated,
>>> +   set *ARR to the declaration of the outermost constant object whose
>>> +   initializer (or one of its elements) is not nul-terminated.
>>> +
>>>      The value returned is of type `ssizetype'.
>>>   
>>>      Unfortunately, string_constant can't access the values of const char
>>>      arrays with initializers, so neither can we do so here.  */
>>>   
>>>   tree
>>> -c_strlen (tree src, int only_value)
>>> +c_strlen (tree src, int only_value, tree *arr /* = NULL */)
>>>   {
>>>     STRIP_NOPS (src);
>>> +
>>> +  /* Used to detect non-nul-terminated strings in subexpressions
>>> +     of a conditional expression.  When ARR is null, point it at
>>> +     one of the elements for simplicity.  */
>>> +  tree arrs[] = { NULL_TREE, NULL_TREE };
>>> +  if (!arr)
>>> +    arr = arrs;
>>
>> now arr is always != NULL
> Right.  It's renamed in the follow-up patch from Martin, but yes, it's
> always non-null.  Note in this case your comment is after the code
> you're referring to.
> 
>>
>>> +
>>>     if (TREE_CODE (src) == COND_EXPR
>>>         && (only_value || !TREE_SIDE_EFFECTS (TREE_OPERAND (src, 0))))
>>>       {
>>> -      tree len1, len2;
>>> -
>>> -      len1 = c_strlen (TREE_OPERAND (src, 1), only_value);
>>> -      len2 = c_strlen (TREE_OPERAND (src, 2), only_value);
>>> +      tree len1 = c_strlen (TREE_OPERAND (src, 1), only_value, arrs);
>>> +      tree len2 = c_strlen (TREE_OPERAND (src, 2), only_value, arrs + 1);
>>>         if (tree_int_cst_equal (len1, len2))
>>> -	return len1;
>>> +	{
>>
>> funny, if called with NULL *arr and arrs[0] alias each other.
> 
>>
>>> +	  *arr = arrs[0] ? arrs[0] : arrs[1];
>>> +	  return len1;
>>> +	}
>>>       }
> 
> And in this case it looks like your comment is before the code you're
> commenting about.  It was fairly obvious in this case because the code
> prior to your "funny, if called..." comment didn't reference arr or arrs
> at all.
> 
> And more importantly, why are you concerned about the aliasing?
> 

It is just *arr = arrs[0] does nothing, but it looks like the author
was not aware of it.  It may be okay, but causes head-scratching.
If you don't have the context you will think this does something different.

> 
> 
> 
> 
>>>   
>>>     if (TREE_CODE (src) == COMPOUND_EXPR
>>>         && (only_value || !TREE_SIDE_EFFECTS (TREE_OPERAND (src, 0))))
>>> -    return c_strlen (TREE_OPERAND (src, 1), only_value);
>>> +    return c_strlen (TREE_OPERAND (src, 1), only_value, arr);
>>>   
>>>     location_t loc = EXPR_LOC_OR_LOC (src, input_location);
>>>   
>>>     /* Offset from the beginning of the string in bytes.  */
>>>     tree byteoff;
>>> -  src = string_constant (src, &byteoff);
>>> -  if (src == 0)
>>> -    return NULL_TREE;
>>> +  /* Set if array is nul-terminated, false otherwise.  */
>>> +  bool nulterm;
>>
>> note arr is always != null or pointing to arrs[0].
>>
>>> +  src = string_constant (src, &byteoff, &nulterm, arr);
>>> +  if (!src)
>>> +    {
>>> +      *arr = arrs[0] ? arrs[0] : arrs[1];
>>> +      return NULL_TREE;
>>> +    }
>>> +
>>> +  /* Clear *ARR when the string is nul-terminated.  It should be
>>> +     of no interest to callers.  */
>>> +  if (nulterm)
>>> +    *arr = NULL_TREE;
>>>   
>>>     /* Determine the size of the string element.  */
>>>     unsigned eltsize
>>> @@ -621,6 +674,12 @@ c_strlen (tree src, int only_value)
>>>   	maxelts = maxelts / eltsize - 1;
>>>         }
>>>   
>>> +  /* Unless the caller is prepared to handle it by passing in a non-null
>>> +     ARR, fail if the terminating nul doesn't fit in the array the string
>>> +     is stored in (as in const char a[3] = "123";  */
>>
>> note arr is always != NULL, thus this if is never taken.
>>
>>> +  if (!arr && maxelts < strelts)
>>> +    return NULL_TREE;
>>> +
> Right.  And I think that check is gone in the second version of Martin's
> patch.
> 
> 
>>>     /* PTR can point to the byte representation of any string type, including
>>>        char* and wchar_t*.  */
>>>     const char *ptr = TREE_STRING_POINTER (src);
>>> @@ -650,7 +709,8 @@ c_strlen (tree src, int only_value)
>>>         offsave = fold_convert (ssizetype, offsave);
>>>         tree condexp = fold_build2_loc (loc, LE_EXPR, boolean_type_node, offsave,
>>>   				      build_int_cst (ssizetype, len * eltsize));
>>
>> this computation is wrong, it computes not in units of eltsize,
>> I am however not sure if it is really good that this function tries to
>> compute strlen of wide character strings.
> Note that in the second version of Martin's patch eltsize will always be
> 1 when we get here.  Furthermore, the multiplication by eltsize is gone.
>   It looks like you switched back to commenting after the code for that
> comment, but then immediately thereafter...
> 

Acutally I had no idea that the second patch did resolve some of my comments
and which those were.

I had the impression that it is just splitting up of a large patch into
several smaller without reworking at the same time.

Once again, a summary what was changed would be helpful.

> 
>>
>> That said, please fix this computation first, in a different patch
>> instead of just fixing the indentation. (I know I pointed that lines are too
>> long here, but that was before I realized that the whole length computation
>> here is wrong).
>>
>>> -      tree lenexp = size_diffop_loc (loc, ssize_int (strelts * eltsize), offsave);
>>> +      tree lenexp = size_diffop_loc (loc, ssize_int (strelts * eltsize),
>>> +				     offsave);
>>>         return fold_build3_loc (loc, COND_EXPR, ssizetype, condexp, lenexp,
>>>   			      build_zero_cst (ssizetype));
>>>       }
>>> @@ -690,7 +750,7 @@ c_strlen (tree src, int only_value)
>>>        Since ELTOFF is our starting index into the string, no further
>>>        calculation is needed.  */
> The comment shown above appears to refer to the code below the comment.
> Again, this makes it exceedingly confusing to understand your comments
> and take appropriate action.
> 
> 
> 
> 
>>
>> What are you fixing here, I think that was another bug.
>> If this fixes something then it should be in a different patch,
>> just handling this.
>>
>>>     unsigned len = string_length (ptr + eltoff * eltsize, eltsize,
>>> -				maxelts - eltoff);
>>> +				strelts - eltoff);
>>>   
>>>     return ssize_int (len);
>>>   }
> I'm guessing the comment in this case refers to the code after.
> Presumably questioning the change from maxelts to strelts.
> 
> 

Yes.  I was thinking that could be a patch of its own.

> 
> 
>>> @@ -2855,7 +2915,6 @@ expand_builtin_strlen (tree exp, rtx target,
>>>   
>>>     struct expand_operand ops[4];
>>>     rtx pat;
>>> -  tree len;
>>>     tree src = CALL_EXPR_ARG (exp, 0);
>>>     rtx src_reg;
>>>     rtx_insn *before_strlen;
>>> @@ -2864,20 +2923,39 @@ expand_builtin_strlen (tree exp, rtx target,
>>>     unsigned int align;
>>>   
>>>     /* If the length can be computed at compile-time, return it.  */
>>> -  len = c_strlen (src, 0);
>>> +  tree array;
>>> +  tree len = c_strlen (src, 0, &array);
>>
>> You know the c_strlen tries to compute wide character sizes,
>> but strlen does not do that, strlen (L"abc") should give 1
>> (or 0 on a BE machine)
>> I wonder if that is correct.
> So I think this is fixed by your change which restored the default
> behavior of c_strlen to count bytes.  Which restores the behavior of
> c_strlen to match that of the strlen library call.
> 
> So for something like L"abc", the right value is 1 or 0 for LE and BE
> targets respectively.
> 

Well, yes, although I changed my mind on strlen(L"abc") meanwhile.

This may appear as if I contradict myself but, it is more like I learn.

The installed patch for the ELTSIZE parameter was in part inspired by the
thought about "not folding invalid stuff" that was forming in my mind
at that time, even before I wrote it down and sent it to this list.

So the patch does reject to give a value for strlen(L"abc") since
it is an invalid call.

Previously that was counting bytes in a wide char string, but I doubt
that was the original intention though.


> 
> 
>>
>>>     if (len)
>>> -    return expand_expr (len, target, target_mode, EXPAND_NORMAL);
>>> +    {
>>> +      if (array)
>>> +	{
>>> +	  /* Array refers to the non-nul terminated constant array
>>> +	     whose length is attempted to be computed.  */
>>
>> I really wonder if it would not make more sense to have a
>> nonterminated_string_constant_p instead.
> Rather than a boolean I think we want a tree * so that we can bubble up
> more information to the caller WRT non-terminated strings.
> 
>>
>> Last time I wanted to implement a warning in expand I faced the
>> problem that inlined functions will get one warning per invocation?
> Yea, but that's just the way things are.  I don't think it's really an
> issue for the patches we're looking at right now.
> 

No.

>>> @@ -8255,19 +8333,30 @@ fold_builtin_classify_type (tree arg)
>>>     return build_int_cst (integer_type_node, type_to_class (TREE_TYPE (arg)));
>>>   }
>>>   
>>> -/* Fold a call to __builtin_strlen with argument ARG.  */
>>> +/* Fold a strlen call to FNDECL of TYPE, and with argument ARG.  */
>>>   
>>>   static tree
>>> -fold_builtin_strlen (location_t loc, tree type, tree arg)
>>> +fold_builtin_strlen (location_t loc, tree fndecl, tree type, tree arg)
>>>   {
>>>     if (!validate_arg (arg, POINTER_TYPE))
>>>       return NULL_TREE;
>>>     else
>>>       {
>>> -      tree len = c_strlen (arg, 0);
>>> -
>>> +      tree arr = NULL_TREE;
>>> +      tree len = c_strlen (arg, 0, &arr);
>>
>> Is it possible to write a test case where strlen(L"test") reaches this point?
>> what will c_strlen return then?
> Given your fix to have c_strlen count bytes by default, I think we're OK
> here.
> 

Yes.  But my fix was still incomplete.
So incremental progress is necessary here.

>>
>>> @@ -11328,7 +11332,7 @@ string_constant (tree arg, tree *ptr_offset)
>>>   	return NULL_TREE;
>>>   
>>>         tree offset;
>>> -      if (tree str = string_constant (arg0, &offset))
>>> +      if (tree str = string_constant (arg0, &offset, nulterm, decl))
>>>   	{
>>>   	  /* Avoid pointers to arrays (see bug 86622).  */
>>>   	  if (POINTER_TYPE_P (TREE_TYPE (arg))
>>> @@ -11368,6 +11372,10 @@ string_constant (tree arg, tree *ptr_offset)
>>>     if (TREE_CODE (array) == STRING_CST)
>>
>> Well, actually I think there _are_ STING_CSTs which are not null terminated.
>> Maybe not in C. But Fortran, Ada, Go...
>>
>>>       {
>>>         *ptr_offset = fold_convert (sizetype, offset);
>>> +      if (nulterm)
>>> +	*nulterm = true;
>>> +      if (decl)
>>> +	*decl = NULL_TREE;
>>>         return array;
>>>       }
> So your comment seems to be referring to the code above the comment as
> well as below.  Confusing.  Consistency really helps.
> 
> Are we at a point where we're ready to declare STRING_CSTs as always
> being properly terminated?  If not the hunk of code above needs some
> rethinking since we can't guarantee the string is properly terminated.
> 

No.  That is still in the flux.
And I am not sure if the "properly terminated" will survive.

I am digging deep to the ground now.
And the correctness of the code in questing depends heavily on the
results.  Therefore I would like to fix this from the ground.

Adding lots of new code that is based on wrong assumptions
will not help.


> 
>>>   
>>> @@ -11414,6 +11422,49 @@ string_constant (tree arg, tree *ptr_offset)
>>>     if (!array_size || TREE_CODE (array_size) != INTEGER_CST)
>>>       return NULL_TREE;
>>>   
>>> +  unsigned HOST_WIDE_INT array_elts = tree_to_uhwi (array_size);
>>> +
>>
>> I don't understand why this is necessary at all.
>> It looks way too complicated, to say the least.
>>
>> TREE_TYPE (init) has already the type of the member.
>>
>>> +  /* When ARG refers to an aggregate (of arrays) try to determine
>>> +     the size of the character array within the aggregate.  */
>>> +  tree ref = arg;
>>> +  tree reftype = TREE_TYPE (arg);
>>> +
>>> +  if (TREE_CODE (ref) == MEM_REF)
>>> +    {
>>> +      ref = TREE_OPERAND (ref, 0);
>>> +      if (TREE_CODE (ref) == ADDR_EXPR)
>>> +	{
>>> +	  ref = TREE_OPERAND (ref, 0);
>>> +	  reftype = TREE_TYPE (ref);
>>> +	}
>>> +    }
>>> +  else
>>> +    while (TREE_CODE (ref) == ARRAY_REF)
>>> +      {
>>> +	reftype = TREE_TYPE (ref);
>>> +	ref = TREE_OPERAND (ref, 0);
>>> +      }
>>> +
>>> +  if (TREE_CODE (ref) == COMPONENT_REF)
>>> +    reftype = TREE_TYPE (ref);
>>> +
>>> +  while (TREE_CODE (reftype) == ARRAY_TYPE)
>>> +    {
>>> +      tree next = TREE_TYPE (reftype);
>>> +      if (TREE_CODE (next) == INTEGER_TYPE)
>>> +	{
>>> +	  if (tree size = TYPE_SIZE_UNIT (reftype))
>>> +	    if (tree_fits_uhwi_p (size))
>>> +	      array_elts = tree_to_uhwi (size);
>>
>> so array_elts is measued in bytes.
> So I'm guessing your comment about the code looking way too complicated
> is referring to the code *after* your comment.  That code is not in the
> v2 patch.  At least not 01/06 which addresses the codegen/opt issues.  I
> haven't checked the full kit to see if this code appears in a subsequent
> patch.
> 

No I meant the code between
/* When ARG refers to an aggregate (of arrays) try to determine
    the size of the character array within the aggregate.  */

and here.  It is wrong (assuming C semantic on GIMPLE).

> 
>>
>>> +	  break;
>>> +	}
>>> +
>>> +      reftype = TREE_TYPE (reftype);
>>> +    }
>>> +
>>> +  if (decl)
>>> +    *decl = array;
>>> +
>>>     /* Avoid returning a string that doesn't fit in the array
>>>        it is stored in, like
>>>        const char a[4] = "abcde";
>>> @@ -11427,7 +11478,9 @@ string_constant (tree arg, tree *ptr_offset)
>>>     unsigned HOST_WIDE_INT length = TREE_STRING_LENGTH (init);
>>>     length = string_length (TREE_STRING_POINTER (init), charsize,
>>>   			  length / charsize);
>>
>> Some callers especially those where the wrong code happens, expect
>> to be able to access the STRING_CST up to TREE_STRING_LENGTH,
>> But using string_length assume thy stop at the first nul char.
> Example please?
> 

tree-ssa-forwprop.c:
               str1 = string_constant (src1, &off1);
               if (str1 == NULL_TREE)
                 break;
               if (!tree_fits_uhwi_p (off1)
                   || compare_tree_int (off1, TREE_STRING_LENGTH (str1) - 1) > 0
                   || compare_tree_int (len1, TREE_STRING_LENGTH (str1)
                                              - tree_to_uhwi (off1)) > 0



Bernd.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]