This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Make strlen range computations more conservative


On 08/20/18 22:42, Martin Sebor wrote:
> On 08/20/2018 09:15 AM, Bernd Edlinger wrote:
>> On 08/20/18 16:26, Jeff Law wrote:
>>> On 08/20/2018 04:23 AM, Bernd Edlinger wrote:
>>>> On 08/20/18 12:12, Richard Biener wrote:
>>>>> On Wed, Aug 15, 2018 at 6:39 AM Jeff Law <law@redhat.com> wrote:
>>>>>>
>>>>>> On 08/10/2018 10:56 AM, Martin Sebor wrote:
>>>>>>> On 08/08/2018 11:36 PM, Jeff Law wrote:
>>>>>>>> On 08/02/2018 09:42 AM, Martin Sebor wrote:
>>>>>>>>
>>>>>>>>> The warning bits are definitely not okay by me.  The purpose
>>>>>>>>> of the warnings (-W{format,sprintf}-{overflow,truncation} is
>>>>>>>>> to detect buffer overflows.  When a warning doesn't have access
>>>>>>>>> to string length information for dynamically created strings
>>>>>>>>> (like the strlen pass does) it uses array sizes as a proxy.
>>>>>>>>> This is useful both to detect possible buffer overflows and
>>>>>>>>> to prevent false positives for overflows that cannot happen
>>>>>>>>> in correctly written programs.
>>>>>>>> So how much of this falling-back to array sizes as a proxy would become
>>>>>>>> unnecessary if sprintf had access to the strlen pass as an analysis
>>>>>>>> module?
>>>>>>>>
>>>>>>>> As you know we've been kicking that around and from my investigations
>>>>>>>> that doesn't really look hard to do.  Encapsulate the data structures in
>>>>>>>> a class, break up the statement handling into analysis and optimization
>>>>>>>> and we should be good to go.  I'm hoping to start prototyping this week.
>>>>>>>>
>>>>>>>> If we think that has a reasonable chance to eliminate the array-size
>>>>>>>> fallback, then that seems like the most promising path forward.
>>>>>>>
>>>>>>> We discussed this idea this morning so let me respond here and
>>>>>>> reiterate the answer.  Using the strlen data will help detect
>>>>>>> buffer overflow where the array size isn't available but it
>>>>>>> cannot replace the array size heuristic. Here's a simple
>>>>>>> example:
>>>>>>>
>>>>>>>     struct S { char a[8]; };
>>>>>>>
>>>>>>>     char d[8];
>>>>>>>     void f (struct S *s, int i)
>>>>>>>     {
>>>>>>>       sprintf (d, "%s-%i",  s[i].a, i);
>>>>>>>     }
>>>>>>>
>>>>>>> We don't know the length of s->a but we do know that it can
>>>>>>> be up to 7 bytes long (assuming it's nul-terminated of course)
>>>>>>> so we know the sprintf call can overflow.  Conversely, if
>>>>>>> the size of the destination is increased to 20  the sprintf
>>>>>>> call cannot overflow so the diagnostic can be avoided.
>>>>>>>
>>>>>>> Removing the array size heuristic would force us to either give
>>>>>>> up on diagnosing the first case or issue false positives for
>>>>>>> the second case.  I think the second alternative would make
>>>>>>> the warning too noisy to be useful.
>>>>>>>
>>>>>>> The strlen pass will help detect buffer overflows in cases
>>>>>>> where the array size isn't known (e.g., with dynamically
>>>>>>> allocated buffers) but where the length of the string store
>>>>>>> in the array is known.  It will also help avoid false positives
>>>>>>> in cases where the string stored in an array of known size is
>>>>>>> known to be too short to cause an overflow.  For instance here:
>>>>>>>
>>>>>>>     struct S { char a[8]; };
>>>>>>>
>>>>>>>     char d[8];
>>>>>>>     void f (struct S *s, int i)
>>>>>>>     {
>>>>>>>       if (strlen (s->a) < 4 && i >= 0 && i < 100)
>>>>>>>         sprintf (d, "%s-%i",  s->a, i);
>>>>>>>     }
>>>>>> ACK.  Thanks for explaining things here too.  I can't speak for others,
>>>>>> but seeing examples along with the explanation is easier for me to absorb.
>>>>>>
>>>>>> For Bernd and others -- after kicking things around a bit with Martin,
>>>>>> what we're recommending is that compute_string_length we compute the
>>>>>> bounds using both GIMPLE and C semantics and return both.
>>>>>
>>>>> But you can't do this because GIMPLE did transforms that are not valid in
>>>>> C, thus you can't interpret the GIMPLE IL as "C", you can only interpret
>>>>> it as GIMPLE.  What you'd do is return GIMPLE semantics length
>>>>> and "foobar" semantics length which doesn't match the original source.
>>>>>
>>>>
>>>> If I understood that suggestion right, it means, we
>>>> live with some false positive or missing warnings due to those transformations.
>>>> That means, get_range_strlen with the 2-parameter overload is used
>>>> for warnings only.  And it returns most of the time a correct range info,
>>>> that is good enough for warnings.
>>> Correct.  99.9% of the time using the ranges implied by the array types
>>> is better for the warning code.  So the idea is to return two ranges.
>>> One which uses GIMPLE semantics for code generation and optimization
>>> purposes, the other which uses ranges implied by the array types for
>>> warning purposes.
>>>
>>> Martin suggested that we always compute and return both rather than
>>> having a boolean argument to select between the behavior.
>>>
>>>
>>
>> Okay, but there is already the "strict" parameter:
>>
>> /* Determine the minimum and maximum value or string length that ARG
>>     refers to and store each in the first two elements of MINMAXLEN.
>>     For expressions that point to strings of unknown lengths that are
>>     character arrays, use the upper bound of the array as the maximum
>>     length.  For example, given an expression like 'x ? array : "xyz"'
>>     and array declared as 'char array[8]', MINMAXLEN[0] will be set
>>     to 0 and MINMAXLEN[1] to 7, the longest string that could be
>>     stored in array.
>>     Return true if the range of the string lengths has been obtained
>>     from the upper bound of an array at the end of a struct.  Such
>>     an array may hold a string that's longer than its upper bound
>>     due to it being used as a poor-man's flexible array member.
>>
>>     STRICT is true if it will handle PHIs and COND_EXPRs conservatively
>>     and false if PHIs and COND_EXPRs are to be handled optimistically,
>>     if we can determine string length minimum and maximum; it will use
>>     the minimum from the ones where it can be determined.
>>     STRICT false should be only used for warning code.
>>
>>     ELTSIZE is 1 for normal single byte character strings, and 2 or
>>     4 for wide characer strings.  ELTSIZE is by default 1.  */
>>
>> bool
>> get_range_strlen (tree arg, tree minmaxlen[2], unsigned eltsize, bool strict)
>>
>>
>> I found out, that there is only one call with strict = true: gimple_fold_builtin_strlen
>> and that is the only one that needs GIMPLE semantics.
>> because it does:
>>
>>    if (minlen == maxlen)
>>      {
>>        lenrange[0] = force_gimple_operand_gsi (gsi, lenrange[0], true, NULL,
>>                                                true, GSI_SAME_STMT);
>>        replace_call_with_value (gsi, lenrange[0]);
>>        return true;
>>      }
>>
>>    if (tree lhs = gimple_call_lhs (stmt))
>>      if (TREE_CODE (lhs) == SSA_NAME
>>          && INTEGRAL_TYPE_P (TREE_TYPE (lhs)))
>>        set_range_info (lhs, VR_RANGE, minlen, maxlen);
>>
>>
>>
>> All other places are dealing with warnings, and have no use for GIMPLE semantics.
>>
>> You do not suggest to have set_range_info to hanlde two sets of range info, right?
> 
> The idea is to extend get_range_strlen() to take an array of three
> elements as MINMAXLEN:
> 
>    bool
>    get_range_strlen (tree arg, tree minmaxlen[3], unsigned eltsize,
>                      bool strict)
> 
> The first two elements would correspond to the strict setting used
> for optimization and the last one to the opposite for warnings.

min/max strict + min/max warning = 4, isn't it?

> That way only one call would be needed by clients like sprintf

I must say that I don't like the return value optimization
on the sprintf pass, because it uses knowledge of the glibc implementation
of sprintf to do it's job (mind the 4K buffer limit).

And I'd say compared to the actual sprintf library call the savings should be
negligible when we ignore the result register.  Or have you done a benchmark,
to measure the savings?


Bernd.

> that need both.  That should also eliminate the need for the STRICT
> argument.
> 
> Martin

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]