[PATCH] warn for integer overflow in allocation calls (PR 96838)
Aldy Hernandez
aldyh@redhat.com
Sun Sep 20 06:39:17 GMT 2020
On 9/19/20 11:22 PM, Martin Sebor wrote:
> On 9/18/20 12:29 AM, Aldy Hernandez wrote:
>>
>>
>> On 9/17/20 10:18 PM, Martin Sebor wrote:
>>> On 9/17/20 12:39 PM, Andrew MacLeod wrote:
>>>> On 9/17/20 12:08 PM, Martin Sebor via Gcc-patches wrote:
>>>>> On 9/16/20 9:23 PM, Jeff Law wrote:
>>>>>>
>>>>>> On 9/15/20 1:47 PM, Martin Sebor wrote:
>>>>>>> Overflowing the size of a dynamic allocation (e.g., malloc or VLA)
>>>>>>> can lead to a subsequent buffer overflow corrupting the heap or
>>>>>>> stack. The attached patch diagnoses a subset of these cases where
>>>>>>> the overflow/wraparound is still detectable.
>>>>>>>
>>>>>>> Besides regtesting GCC on x86_64-linux I also verified the warning
>>>>>>> doesn't introduce any false positives into Glibc or Binutils/GDB
>>>>>>> builds on the same target.
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>> gcc-96838.diff
>>>>>>>
>>>>>>> PR middle-end/96838 - missing warning on integer overflow in
>>>>>>> calls to allocation functions
>>>>>>>
>>>>>>> gcc/ChangeLog:
>>>>>>>
>>>>>>> PR middle-end/96838
>>>>>>> * calls.c (eval_size_vflow): New function.
>>>>>>> (get_size_range): Call it. Add argument.
>>>>>>> (maybe_warn_alloc_args_overflow): Diagnose overflow/wraparound.
>>>>>>> * calls.h (get_size_range): Add argument.
>>>>>>>
>>>>>>> gcc/testsuite/ChangeLog:
>>>>>>>
>>>>>>> PR middle-end/96838
>>>>>>> * gcc.dg/Walloc-size-larger-than-19.c: New test.
>>>>>>> * gcc.dg/Walloc-size-larger-than-20.c: New test.
>>>>>>
>>>>>> If an attacker can control an integer overflow that feeds an
>>>>>> allocation, then they can do all kinds of bad things. In fact,
>>>>>> when my son was asking me about attack vectors, this is one I said I'd
>>>>>> look at if I were a bad guy.
>>>>>>
>>>>>>
>>>>>> I'm a bit surprised you can't just query the range of the argument
>>>>>> and get the overflow status out of that range, but I don't see
>>>>>> that in the APIs. How painful would it be to make that part of
>>>>>> the API? The conceptual model would be to just ask for the range
>>>>>> of the argument to malloc which would include the range and a
>>>>>> status bit indicating the computation might have overflowed.
>>>>>
>>>> Do we know if it did/would have wrapped? Sure, since we have to
>>>> do the math. So you are correct in that the information is there.
>>>> But is it useful?
>>>>
>>>> We are in the very annoying habit of subtracting one by adding
>>>> 0xFFFFFFFF, which means you get an overflow for unsigned when you
>>>> subtract one. From what I have seen of unsigned math, we would be
>>>> flagging very many operations as overflows, so you would still have
>>>> the difficulty of figuring out whether it's a "real" overflow or a
>>>> fake one because of the way we do unsigned math.
>>>
>>> You and me both :)
>>>
>>>>
>>>> At the very start, I did have an overflow flag in the range class...
>>>> but it was turning out to be fairly useless so it was removed.
>>>>
>>>>> I agree that being able to evaluate an expression in an as-if
>>>>> infinite precision (in addition to its type) would be helpful.
>>>>
>>>> So again, we get back to adding 0xFFFFFFFF when we are trying to
>>>> subtract one... now, with infinite precision you are going to see
>>>>
>>>> [2,10] - 1 end up as [2,10] + 0xFFFFFFFF, which will now
>>>> give you [0x100000001, 0x100000009], so it's going to look like it
>>>> overflowed?
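To see that effect outside of GCC, here is a small standalone snippet (purely illustrative, not from the patch under discussion): it redoes the "subtract one by adding the all-ones constant" trick in a wider type, and both bounds land above UINT32_MAX even though the 32-bit result is simply [1,9].

#include <cstdint>
#include <cstdio>

int main ()
{
  // The range [2,10]; "n - 1" is emitted as "n + 0xFFFFFFFF" in uint32_t.
  uint32_t lo = 2, hi = 10;
  uint32_t narrow_lo = lo + 0xFFFFFFFFu;            // wraps to 1
  uint32_t narrow_hi = hi + 0xFFFFFFFFu;            // wraps to 9

  // The same addition redone in 64 bits: both bounds exceed UINT32_MAX,
  // so a naive "did it exceed the type?" check fires even though the
  // 32-bit computation did exactly what was intended.
  uint64_t wide_lo = (uint64_t) lo + 0xFFFFFFFFu;   // 0x100000001
  uint64_t wide_hi = (uint64_t) hi + 0xFFFFFFFFu;   // 0x100000009

  printf ("narrow: [%u, %u]  wide: [%#llx, %#llx]\n",
          narrow_lo, narrow_hi,
          (unsigned long long) wide_lo, (unsigned long long) wide_hi);
  return 0;
}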
>>>>
>>>>
>>>>
>>>>> But just to make sure I understood correctly, let me ask again
>>>>> using an example:
>>>>>
>>>>>   void* f (size_t n)
>>>>>   {
>>>>>     if (n < PTRDIFF_MAX / 2)
>>>>>       n = PTRDIFF_MAX / 2;
>>>>>
>>>>>     return malloc (n * sizeof (int));
>>>>>   }
>>>>>
>>>>> Can the unsigned wraparound in the argument be readily detected?
>>>>>
>>>>> On trunk, this ends up with the following:
>>>>>
>>>>> # RANGE [4611686018427387903, 18446744073709551615]
>>>>> _6 = MAX_EXPR <n_2(D), 4611686018427387903>;
>>>>> # RANGE [0, 18446744073709551615] NONZERO 18446744073709551612
>>>>> _1 = _6 * 4;
>>>>> ...
>>>>> p_5 = mallocD.1206 (_1); [tail call]
>>>>> ...
>>>>> return p_5;
>>>>>
>>>>> so _1's range reflects the wraparound in size_t, but _6's range
>>>>> has enough information to uncover it. So detecting it is possible
>>>>> and is done in the patch so we get a warning:
>>>>>
>>>>> warning: argument 1 range [18446744073709551612,
>>>>> 0x3fffffffffffffffc] is too large to represent in ‘long unsigned
>>>>> int’ [-Walloc-size-larger-than=]
>>>>> 6 | return malloc (n * sizeof (int));
>>>>> | ^~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>> The code is very simplistic and only handles a small subset of cases.
>>>>> It could be generalized and exposed by a more generic API but it does
>>>>> seem like the ranger must already have all the logic built into it so
>>>>> if it isn't exposed now it should be a matter of opening it up.
>>>>
>>>> everything is exposed in range-ops. well, mostly.
>>>> if we have _1 = _6 * 4
>>>>
>>>> if one wanted to do that in infinite precision, you query the range for
>>>> _6, and the range for 4 (which would be [4,4] :-)
>>>> range_of_expr (op1r, _6, stmt)
>>>> range_of_expr (op2r, 4, stmt)
>>>>
>>>> you could take their current types, and cast those ranges to
>>>> whatever the next higher precision is,
>>>> range_cast (op1r, highertype)
>>>> range_cast (op2r, highertype)
>>>> then invoke the operation on those parameters
>>>>
>>>> gimple_range_fold (r, stmt, op1r, op2r)
>>>>
>>>> and that will do your operation in the higher precision. You could
>>>> compare that to the value in regular precision too, I suppose.
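Putting those calls together, a rough sketch of such a check might look like the following. The signatures (range_of_expr, range_cast, gimple_range_fold) are taken as written in this thread, so both they and the helper as a whole should be read as assumptions rather than as the final ranger API:

/* Sketch only: evaluate a binary statement such as "_1 = _6 * 4" a second
   time with both operand ranges cast to a wider type, and report whether
   the exact result can fall outside the statement's own type (i.e. the
   in-type computation may have wrapped or overflowed).  */

static bool
stmt_may_overflow_p (range_query &query, gimple *stmt)
{
  tree lhs_type = TREE_TYPE (gimple_assign_lhs (stmt));
  tree op1 = gimple_assign_rhs1 (stmt);
  tree op2 = gimple_assign_rhs2 (stmt);

  int_range_max op1r, op2r;
  query.range_of_expr (op1r, op1, stmt);      /* e.g. range of _6 */
  query.range_of_expr (op2r, op2, stmt);      /* e.g. [4,4] */

  /* Cast both ranges to a wider (here double-width, signed) type.  */
  tree highertype
    = build_nonstandard_integer_type (2 * TYPE_PRECISION (lhs_type), 0);
  range_cast (op1r, highertype);
  range_cast (op2r, highertype);

  /* Redo the statement's operation on the widened ranges.  */
  int_range_max wide;
  gimple_range_fold (wide, stmt, op1r, op2r);

  /* Anything in WIDE that lies outside the value range of the original
     type means the narrow computation cannot match the exact result.  */
  int_range_max in_type (TYPE_MIN_VALUE (lhs_type), TYPE_MAX_VALUE (lhs_type));
  range_cast (in_type, highertype);
  in_type.invert ();
  in_type.intersect (wide);
  return !in_type.undefined_p ();
}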
>>>
>>> The patch does pretty much exactly what you described, except in
>>> offset_int, and only for a limited set of arithmetic operations.
>>> It figures out if an unsigned expression wrapped simply by checking
>>> to see if the mathematically correct result fits in its type.
>>>
>>> It sounds like I should be able to use the range APIs above instead
>>> and get the same result, but for arbitrary expressions. I don't
>>> need to (and very well can't) compare the infinitely precise result
>>> to the regular result because when it's a range that wraps the result
>>> is the full range of the type (i.e., [0, SIZE_MAX] in the test above).
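For comparison, a sketch of that fits-in-its-type check, assuming constant operands for brevity (the helper name is made up here and this is not the patch itself):

/* Compute N * M exactly in offset_int (wider than any size_t GCC supports)
   and ask whether the exact product still fits TYPE.  If it does not, the
   in-type multiplication wrapped (unsigned) or overflowed (signed).  */

static bool
product_wraps_p (tree n, tree m, tree type)
{
  offset_int wn = wi::to_offset (n);   /* exact value of N */
  offset_int wm = wi::to_offset (m);   /* exact value of M */
  offset_int prod = wn * wm;           /* mathematically correct result */
  return !wi::fits_to_tree_p (prod, type);
}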
>>>
>>>>
>>>> The ranger is designed to track ranges as they are represented in
>>>> the IL. You are asking for us to interpret the IL as something
>>>> other than what is there, and increase the precision. That's a
>>>> different task.
>>>>
>>>> Could that be done? Maybe. It might involve parallel tracking of
>>>> SSA name ranges in a higher precision... and recalculating every
>>>> expression using those values. I'm not convinced that a generalized
>>>> ranger which works in higher precision is really going to solve the
>>>> problem the way you hope it would... I think we'd get a lot of false
>>>> info when unsigned values are involved.
>>>
>>> I don't think that's necessary for unsigned wrapping. Most of
>>> the time the (possibly) wrapped result is what we need because
>>> that's what the languages say is supposed to happen. The cases
>>> when we need to check for the dangerous wraparound (or overflow)
>>> should be rare (just allocation calls) and can be handled on
>>> demand by walking the IL and doing the computation in the infinite
>>> precision.
>>>
>>> For signed overflow, though, computing the mathematically correct
>>> result seems preferable. Or at least tracking (and propagating)
>>> the overflow bit should be. That way we could tell when
>>> an expression is definitely invalid.
>>>
>>>>
>>>> If I were to see a clear solution/description as to what is a
>>>> problematic overflow and what isn't, there might be something more
>>>> general that could be done...
>>>
>>> Does the distinction above (unsigned wrapping/signed overflow) help?
>>>
>>>>
>>>>> I mentioned it to Aldy in response to the irange best practices
>>>>> document. He says there's no way to do that and no plans to
>>>>> make it possible.
>>>>>
>>>> That's not quite what he said. He said "If the operation may
>>>> wrap or overflow, it will show up as so in the range. I don't think
>>>> we have any plans of changing this behavior."
>>>>
>>>> In fact, looking back, he said the same thing I just did.... There
>>>> IS a mechanism there for working in higher precision should one
>>>> desire to do so, but as I pointed out above, I'm not convinced as a
>>>> general thing in the ranger it will work how you imagine.
>>>>
>>>> My guess is that what you really want is to invoke range-ops, in
>>>> higher-precision versions, on the stmts whose overflow you care
>>>> about, and then identify the triggered cases you actually care
>>>> about.
>>>
>>> Yes, it looks like that's the way to go. I was hoping the ranger
>>> had an API that would do all that for me (i.e., walk the IL and
>>> compute the range of the result on demand in the precision of
>>> my choice), since it has to do that as it is, except that it uses
>>> the precision of the expression. But if not, it sounds like with
>>> range_ops I should be able to fairly easily write one myself.
>>
>> The range_ops class is too low-level for this. It folds pairs of
>> ranges, and has no concept of gimple or tree expressions. To roll
>> your own solution you'd have to walk the IL, parse the gimple
>> statements and then call range-ops on your adjusted arguments. A
>> better approach would be to overload range_of_stmt as Andrew suggested
>> (see below).
>>
>>>
>>>> So maybe you want an "overflow calculator" which is similar to the
>>>> ranger, and identifies which kinds of stmts expect certain range of
>>>> results, and if they get results outside of the expected ranges,
>>>> triggers a flag.
>>>> maybe it could convert all unsigned and signed values to signed,
>>>> and then work in a higher precision, and then if any intermediate
>>>> result can't be accurately cast back to the original type, it gets
>>>> flagged? Is that the kind of overflow that is cared about?
>>>
>>> There are two kinds of "overflow:" unsigned wrapping and signed
>>> overflow. Both are a problem in allocation calls like:
>>>
>>> int *p = malloc (nelts * size);
>>>
>>> If nelts and size are signed, the compile-time result seems to
>>> be done in saturation arithmetic (its range looks to be capped
>>> at TYPE_MAX). If they are unsigned, they wrap around zero.
>>> Neither is usually intended by the user but the wraparound is
>>> the more dangerous of the two because it's well-defined and
>>> because it shrinks the size to a very small number.
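A standalone illustration of the unsigned case (again not taken from the patch): with 32-bit operands the product wraps to a tiny value, so the allocation succeeds but is far smaller than the caller believes.

#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main ()
{
  uint32_t nelts = 0x40000001u;            /* 2^30 + 1 elements */
  uint32_t size = sizeof (int);            /* 4 bytes each */
  uint32_t bytes = nelts * size;           /* wraps to 4, not 2^32 + 4 */

  int *p = (int *) malloc (bytes);         /* allocates only 4 bytes */
  printf ("requested %u bytes\n", bytes);  /* prints "requested 4 bytes" */
  free (p);
  return 0;
}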
>>>
>>>>
>>>> I don't know... this isn't my space :-) I'm just telling you that
>>>> I'm not sure the current range query mechanism is the right place to
>>>> be asking if an "overflow" occurred, because I think the general
>>>> answer won't be useful... too many kinds of overflows thrown together.
>>>>
>>>> I think this requires more specific, detailed information, and maybe
>>>> a simple overflow analyzer will do the trick based on range-ops....
>>>> I think range-ops has the required abilities. Whether it needs some
>>>> enhancing, or something else exposed, I'm not sure yet.
>>>
>>> Thank you! This was a useful clarification to what Aldy said.
>>> I'd been meaning to follow up with him on this point when I was
>>> done with what I'm working on but Jeff's question just got me
>>> the answers I was looking for quicker. Let me experiment with
>>> range_ops and get back to you.
>>>
>>>>
>>>>
>>>> Andrew
>>>>
>>>> PS One could always take a ranger and overload the range_of_stmt()
>>>> routine that calculates results, so that it calls current
>>>> functionality, then try converting the arguments to higher
>>>> precision ranges and re-invoking range-ops on those arguments, get a
>>>> new higher precision range back and then look at the results vs the
>>>> normal ones. Might be a place to start experimenting... then pass
>>>> a flag along saying an overflow occurred.
>>>
>>> That sounds like something to look into. I don't see range_of_stmt
>>
>> This would be my preferred approach. It's clean and builds on top of
>> the current infrastructure without polluting other users of the
>> ranger. It definitely sounds like something worth pursuing.
>>
>> > on trunk. Is that the right name or has it not been merged yet?
>>
>> This is in the ranger proper, which should come in in the next week or
>> two (??). You can see it in our staging branch
>> (users/aldyh/ranger-staging):
>>
>> class gimple_ranger : public range_query
>> {
>> public:
>>   gimple_ranger (bool use_loop_info) : m_use_loop_info (use_loop_info) { }
>>   virtual bool range_of_stmt (irange &r, gimple *, tree name = NULL) OVERRIDE;
>>   virtual bool range_of_expr (irange &r, tree name, gimple * = NULL) OVERRIDE;
>>   virtual bool range_on_edge (irange &r, edge e, tree name) OVERRIDE;
>>   virtual void range_on_entry (irange &r, basic_block bb, tree name);
>>   virtual void range_on_exit (irange &r, basic_block bb, tree name);
>>   ...
>>   ...
>> };
>>
>> The idea is that range_of_stmt() is called for all statements as the
>> IL is walked backwards. You could just overload this, build a
>> throwaway statement you could pass to gimple_ranger::range_of_stmt,
>> and work from there.
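A rough sketch of that overload, written against the class quoted above (the constructor and the OVERRIDE spelling are assumptions taken from the staging-branch snippet, and wide_result_disagrees_p is a hypothetical helper standing in for the range_cast/gimple_range_fold recipe earlier in the thread):

class overflow_noting_ranger : public gimple_ranger
{
public:
  overflow_noting_ranger ()
    : gimple_ranger (/*use_loop_info=*/false), m_saw_overflow (false) { }

  virtual bool range_of_stmt (irange &r, gimple *stmt, tree name = NULL) OVERRIDE
  {
    /* Get the normal, in-type range first.  */
    bool res = gimple_ranger::range_of_stmt (r, stmt, name);

    /* Then redo STMT with its operand ranges cast to a wider type and
       note whether the exact result can escape R's type.  */
    if (res && is_gimple_assign (stmt) && wide_result_disagrees_p (r, stmt))
      m_saw_overflow = true;
    return res;
  }

  bool saw_overflow () const { return m_saw_overflow; }

private:
  bool m_saw_overflow;
};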
>
> Thanks for the tutorial! I see the patch you posted with this API
> so I'll give it a try once it's in.
Note that the patch I posted was to standardize valuation throughout the
compiler. It's not the ranger per se, but the underlying API for
querying ranges that ranger, vr_values, etc. will use. So you need to
wait for the full ranger to work on this.
Aldy