[PATCH] warn for integer overflow in allocation calls (PR 96838)
Martin Sebor
msebor@gmail.com
Sat Sep 19 21:22:21 GMT 2020
On 9/18/20 12:29 AM, Aldy Hernandez wrote:
>
>
> On 9/17/20 10:18 PM, Martin Sebor wrote:
>> On 9/17/20 12:39 PM, Andrew MacLeod wrote:
>>> On 9/17/20 12:08 PM, Martin Sebor via Gcc-patches wrote:
>>>> On 9/16/20 9:23 PM, Jeff Law wrote:
>>>>>
>>>>> On 9/15/20 1:47 PM, Martin Sebor wrote:
>>>>>> Overflowing the size of a dynamic allocation (e.g., malloc or VLA)
>>>>>> can lead to a subsequent buffer overflow corrupting the heap or
>>>>>> stack. The attached patch diagnoses a subset of these cases where
>>>>>> the overflow/wraparound is still detectable.
>>>>>>
>>>>>> Besides regtesting GCC on x86_64-linux I also verified the warning
>>>>>> doesn't introduce any false positives into Glibc or Binutils/GDB
>>>>>> builds on the same target.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>> gcc-96838.diff
>>>>>>
>>>>>> PR middle-end/96838 - missing warning on integer overflow in calls
>>>>>> to allocation functions
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>> PR middle-end/96838
>>>>>> * calls.c (eval_size_vflow): New function.
>>>>>> (get_size_range): Call it. Add argument.
>>>>>> (maybe_warn_alloc_args_overflow): Diagnose overflow/wraparound.
>>>>>> * calls.h (get_size_range): Add argument.
>>>>>>
>>>>>> gcc/testsuite/ChangeLog:
>>>>>>
>>>>>> PR middle-end/96838
>>>>>> * gcc.dg/Walloc-size-larger-than-19.c: New test.
>>>>>> * gcc.dg/Walloc-size-larger-than-20.c: New test.
>>>>>
>>>>> If an attacker can control an integer overflow that feeds an
>>>>> allocation, then they can do all kinds of bad things. In fact,
>>>>> when my son was asking me attack vectors, this is one I said I'd
>>>>> look at if I were a bad guy.
>>>>>
>>>>>
>>>>> I'm a bit surprised you can't just query the range of the argument
>>>>> and get the overflow status out of that range, but I don't see that
>>>>> in the APIs. How painful would it be to make that part of the API?
>>>>> The conceptual model would be to just ask for the range of the
>>>>> argument to malloc which would include the range and a status bit
>>>>> indicating the computation might have overflowed.
>>>>
>>> Do we know if it did/would have wrapped? sure. since we have to do
>>> the math. so you are correct in that the information is there. but
>>> is it useful?
>>>
>>> We are in the very annoying habit of subtracting one by adding
>>> 0xFFFFFFF. which means you get an overflow for unsigned when you
>>> subtract one. From what I have seen of unsigned math, we would be
>>> flagging very many operations as overflows, so you would still have
>>> the difficulty of figuring out whether its a "real" overflow or a
>>> fake one because of the way we do unsigned math
>>
>> You and me both :)
>>
>>>
>>> At the very start, I did have an overflow flag in the range class...
>>> but it was turning out to be fairly useless so it was removed.
>>> .
>>>
>>>> I agree that being able to evaluate an expression in an as-if
>>>> infinite precision (in addition to its type) would be helpful.
>>>
>>> SO again, we get back to adding 0x0fffff when we are trying to
>>> subtract one... now, with infinite precision you are going to see
>>>
>>> [2,10] - 1 we end up with [2,10]+0xFFFFFF, which will now
>>> give you [0x100000001, 0x100000009] so its going to look like it
>>> overflowed?
>>>
>>>
>>>
>>>> But just to make sure I understood correctly, let me ask again
>>>> using an example:
>>>>
>>>> void* f (size_t n)
>>>> {
>>>> if (n < PTRDIFF_MAX / 2)
>>>> n = PTRDIFF_MAX / 2;
>>>>
>>>> return malloc (n * sizeof (int));
>>>> }
>>>>
>>>> Can the unsigned wraparound in the argument be readily detected?
>>>>
>>>> On trunk, this ends up with the following:
>>>>
>>>> # RANGE [4611686018427387903, 18446744073709551615]
>>>> _6 = MAX_EXPR <n_2(D), 4611686018427387903>;
>>>> # RANGE [0, 18446744073709551615] NONZERO 18446744073709551612
>>>> _1 = _6 * 4;
>>>> ...
>>>> p_5 = mallocD.1206 (_1); [tail call]
>>>> ...
>>>> return p_5;
>>>>
>>>> so _1's range reflects the wraparound in size_t, but _6's range
>>>> has enough information to uncover it. So detecting it is possible
>>>> and is done in the patch so we get a warning:
>>>>
>>>> warning: argument 1 range [18446744073709551612,
>>>> 0x3fffffffffffffffc] is too large to represent in ‘long unsigned
>>>> int’ [-Walloc-size-larger-than=]
>>>> 6 | return malloc (n * sizeof (int));
>>>> | ^~~~~~~~~~~~~~~~~~~~~~~~~
>>>>
>>>> The code is very simplistic and only handles a small subset of cases.
>>>> It could be generalized and exposed by a more generic API but it does
>>>> seem like the ranger must already have all the logic built into it so
>>>> if it isn't exposed now it should be a matter of opening it up.
>>>
>>> everything is exposed in range-ops. well, mostly.
>>> if we have _1 = _6 * 4
>>>
>>> if one wanted to do that infinite precision, you query the range for
>>> _6, and the range for 4 (which would be [4,4] :-)
>>> range_of_expr (op1r, _6, stmt)
>>> range_of_expr (op2r, 4, stmt)
>>>
>>> you could take their current types, and cast those ranges to whatever
>>> the next higher precsion is,
>>> range_cast (op1r, highertype)
>>> range_cast (op2r, highertype)
>>> then invoke the operation on those parameters
>>>
>>> gimple_range_fold (r, stmt, op1r, op2r)
>>>
>>> and that will do your operation in the higher precision. you could
>>> compare that to the value in regular precision too i suppose.
>>
>> The patch does pretty much exactly what you described, except in
>> offset_int, and only for a limited set of arithmetic operations.
>> It figures out if an unsigned expression wrapped simply by checking
>> to see if the mathematically correct result fits in its type.
>>
>> It sounds like I should be able to use the range APIs above instead
>> and get the same result, but for arbitrary expressions. I don't
>> need to (and very well can't) compare the infinitely precise result
>> to the regular result because when it's a range that wraps the result
>> is the full range of the type (i.e., [0, SIZE_MAX] in the test above).
>>
>>>
>>> The ranger is designed to track ranges as they are represented in the
>>> IL. You are asking for us to interpret the IL as something other
>>> than what is there, and increase the precision. Thats a different task.
>>>
>>> could that be done? maybe. It might involve parallel tracking of
>>> ssa-Name ranges in a higher precision... and recalculating every
>>> expression using those values. Im not convinced that a generalized
>>> ranger which works in higher precision is really going to solve the
>>> problem the way you hope it would.. i think we'd get a lot of false
>>> info when unsigned values are involved
>>
>> I don't think that's necessary for unsigned wrapping. Most of
>> the time the (possibly) wrapped result is what we need because
>> that's what the languages say is supposed to happen. The cases
>> when we need to check for the dangerous wraparound (or overflow)
>> should be rare (just allocation calls) and can be handled on
>> demand by walking the IL and doing the computation in the infinite
>> precision.
>>
>> For signed overflow, though, computing the mathematically correct
>> result seems preferable. Or at least tracking (and propagating)
>> the overflow bit should be. That way we could tell when
>> an expression is definitely invalid.
>>
>>>
>>> If I were to see a clear solution/description as to what is a
>>> problematic overflow and what isnt, there might be something more
>>> general that could be done...
>>
>> Does the distinction above (unsigned wrapping/signed overflow) help?
>>
>>>
>>>> I mentioned it to Aldy in response to the irange best practices
>>>> document. He says there's no way to do that and no plans to
>>>> make it possible.
>>>>
>>> Thats not quite what he said. . He said "If the operation may wrap
>>> or overflow, it will show up as so in the range. I don't think we
>>> have any plans of changing this behavior."
>>>
>>> In fact. looking back, he said the same thing I just did.... There
>>> IS a mechanism there for working in higher precision should one
>>> desire to do so, but as I pointed out above, Im not convinced as a
>>> general thing in the ranger it will work how you imagine.
>>>
>>> My guess is what you really want is to be invoking range-ops on stmts
>>> you care about whether they might overflow or not with higher
>>> precision versions and then identify the cases which trigger that you
>>> actually care about.
>>
>> Yes, it looks like that's the way to go. I was hoping the ranger
>> had an API that would do all that for me (i.e., walk the IL and
>> compute the range of the result on demand in the precision of
>> my choice), since it has to do that as it is, except that it uses
>> the precision of the expression. But if not, it sounds like with
>> range_ops I should be able to fairly easily write one myself.
>
> The range_ops class is too low-level for this. It folds pairs of
> ranges, and has no concept of gimple or tree expressions. To roll your
> own solution you'd have to walk the IL, parse the gimple statements and
> then call range-ops on your adjusted arguments. A better approach would
> be to overload range_of_stmt as Andrew suggested (see below).
>
>>
>>> So maybe you want an "overflow calculator" which is similar to the
>>> ranger, and identifies which kinds of stmts expect certain range of
>>> results, and if they get results outside of the expected ranges,
>>> triggers a flag.
>>> maybe it could convert all unsigned and signed values to signed, and
>>> then work in a higher precision, and then if any intermediate result
>>> cant be accurately cast back to the original type, it gets flagged?
>>> is that the kind of overflow that is cared about?
>>
>> There are two kinds of "overflow:" unsigned wrapping and signed
>> overflow. Both are a problem in allocation calls like:
>>
>> int *p = malloc (nelts * size);
>>
>> If nelts and size are signed, the compile-time result seems to
>> be done in saturation arithmetic (its range looks to be capped
>> at TYPE_MAX). If they are unsigned, they wrap around zero.
>> Neither is usually intended by the user but the wraparound is
>> the more dangerous of the two because it's well-defined and
>> because it shrinks the size to a very small number.
>>
>>>
>>> I dont know... this isnt my space :-) Im just telling you that
>>> I'm not sure the current range query mechanism is the right place to
>>> be asking if an "overflow" occurred because I think the general
>>> answer wont be useful.. too many kinds of overflows thrown together.
>>>
>>> I think this requires more specifoc detailed information, and maybe a
>>> simple overflow analyzer will do the trick based on range-ops.... I
>>> think range-ops has the required abilities. whether it needs some
>>> enhancing, or something else exposed, I'm not sure yet
>>
>> Thank you! This was a useful clarification to what Aldy said.
>> I'd been meaning to follow up with him on this point when I was
>> done with what I'm working on but Jeff's question just got me
>> the answers I was looking for quicker. Let me experiment with
>> range_ops and get back to you.
>>
>>>
>>>
>>> Andrew
>>>
>>> PS One could always take a ranger and overload the range_of_stmt()
>>> routine that calculates results, so that it calls current
>>> functionality, then try converting the arguements to higher
>>> precision ranges and re-invoking range-ops on those arguments, get a
>>> new higher precision range back and then look at the results vs the
>>> normal ones. Might be a place to start experimenting... then pass a
>>> flag along sayign an overflow occurred.
>>
>> That sounds like something to look into. I don't see range_of_stmt
>
> This would be my preferred approach. It's clean and builds on top of
> the current infrastructure without polluting other users of the ranger.
> It definitely sounds like something worth pursuing.
>
> > on trunk. Is that the right name or has it not been merged yet?
>
> This is in the ranger proper which should come in, in the next week or
> two (??). You can see it in our staging branch
> (users/aldyh/ranger-staging):
>
> class gimple_ranger : public range_query
> {
> public:
> gimple_ranger (bool use_loop_info) : m_use_loop_info (use_loop_info) { }
> virtual bool range_of_stmt (irange &r, gimple *, tree name = NULL)
> OVERRIDE;
> virtual bool range_of_expr (irange &r, tree name, gimple * = NULL)
> OVERRIDE;
> virtual bool range_on_edge (irange &r, edge e, tree name) OVERRIDE;
> virtual void range_on_entry (irange &r, basic_block bb, tree name);
> virtual void range_on_exit (irange &r, basic_block bb, tree name);
> ...
> ...
> };
>
> The idea is that range_of_stmt() is called for all statements as the IL
> is walked backwards. You could just overload this, build a throwaway
> statement you could pass to gimple_ranger::range_of_stmt, and work from
> there.
Thanks for the tutorial! I see the patch you posted with this API
so I'll give it a try once it's in.
Martin
More information about the Gcc-patches
mailing list