This is the mail archive of the
gcc-help@gcc.gnu.org
mailing list for the GCC project.
Re: Optimisations and undefined behaviour
- From: Richard Earnshaw <Richard dot Earnshaw at foss dot arm dot com>
- To: David Brown <david dot brown at hesbynett dot no>, Andrew Haley <aph at redhat dot com>, Florian Weimer <fweimer at redhat dot com>
- Cc: Segher Boessenkool <segher at kernel dot crashing dot org>, [gcc-help] <gcc-help at gcc dot gnu dot org>
- Date: Tue, 10 Nov 2015 11:06:39 +0000
- Subject: Re: Optimisations and undefined behaviour
- Authentication-results: sourceware.org; auth=none
- References: <563BC190 dot 7080406 at hesbynett dot no> <563C7EB6 dot 9050401 at redhat dot com> <563C9DD3 dot 9030407 at hesbynett dot no> <563F9E4C dot 5000504 at redhat dot com> <20151108193430 dot GA28206 at gate dot crashing dot org> <56407162 dot 7050106 at redhat dot com> <56408D14 dot 2090101 at redhat dot com> <5640A8D3 dot 8060706 at redhat dot com> <5640AAC5 dot 9090509 at redhat dot com> <5640ADC5 dot 4090604 at redhat dot com> <5640B40C dot 9000906 at foss dot arm dot com> <5640B51F dot 1080401 at redhat dot com> <5640B632 dot 9040802 at foss dot arm dot com> <5640BA3E dot 3030508 at redhat dot com> <5640C248 dot 7040904 at hesbynett dot no> <5640CA59 dot 8090300 at redhat dot com> <5641B6C9 dot 6050806 at hesbynett dot no> <5641C415 dot 7050204 at foss dot arm dot com> <5641CA93 dot 3090302 at hesbynett dot no>
On 10/11/15 10:44, David Brown wrote:
> On 10/11/15 11:16, Richard Earnshaw wrote:
>> On 10/11/15 09:20, David Brown wrote:
>>> On 09/11/15 17:31, Andrew Haley wrote:
>>>> On 11/09/2015 03:56 PM, David Brown wrote:
>>>>
>>>>> We typically cannot use "sanatize" options, nor can we accept that a
>>>>> bug in one part of the program causes undue and unnecessarily
>>>>> damaging side-effects in other parts.
>>>>
>>>> Well, you have to get used to that. It is reality: computers work
>>>> that way. I'm sure you know that if you hit the wrong I/O port
>>>> with a wild write odd things will happen. Whether that's "undue" or
>>>> "unnecessary" I couldn't say: it just is.
>>>
>>> There is no doubt that with C, it is easy to make mistakes that can lead
>>> to disaster. One little typo, and you access an array outside of its
>>> bounds - with the result of corrupting a completely independent piece of
>>> data, showing up as a problem in a different part of the program that
>>> was already tested and working fine. These things are part of the risk
>>> of C programming - it's the cost you pay for maximum speed and efficiency.
>>>
>>> But let's look back at Annex L in the C11 standards, and ask what it is
>>> /for/. It is labelled "Analyzability" - the aim is to help developers
>>> (and code reviewers, and static analysers) figure out if code is correct
>>> or not. Part of that is to place bounds on the effects of some types of
>>> undefined behaviour. Now, you can certainly argue that Annex L doesn't
>>> go far enough to be useful, or it is too vague, or that it is
>>> impractical or limiting to implement, or that knock-on effects can
>>> quickly turn bounded undefined behaviour into unbounded undefined
>>> behaviour. But the very existence of this Annex in the C standards
>>> shows how important these things are.
>>>
>>> To my mind, if execution hits UB, then it's a bug in the program. It
>>> doesn't really matter if the bug is UB, or an error in the algorithm, or
>>> a coding mistake, or even a mistake in the customer's requirement
>>> specifications. It's a point where the computer does not do what the
>>> user expects it to do. Is it so unreasonable to want code that is
>>> correct up to that bug, to run correctly until the bug is hit? Is it
>>> unreasonable to want the compiler to warn if it sees such a bug, and
>>> uses it to change the past action of the code?
>>>
>>> I know that once code hits the bug, all sorts of things can go wrong.
>>> All I ask is that the compiler should not make things worse - at least
>>> not without informing the developer.
>>>
>>>
>>>>
>>>> C definitely works that way. Maybe there should be a nice small
>>>> language which is useful for embedded developers and doesn't have
>>>> all the interesting UB properties that C has. (Ada, maybe? Probably
>>>> not.) Maybe you could define a language compatible with C with the UB
>>>> removed. But defining the semantics of such a language would not be
>>>> easy. And I don't think it makes much sense to change GCC without
>>>> such a rigorous language definition.
>>>>
>>>
>>> I don't believe it is possible to make a general and efficient
>>> programming language without UB. And even if it could be eliminated at
>>> the language level, it would still exist at a higher level - a square
>>> root function could well have undefined behaviour for negative inputs.
>>> I also am happy that gcc can make many small optimisations based on its
>>> knowledge of UB - strength reduction and constant folding in integer
>>> expressions is a key example.
>>>
>>> But I can also see the potential for optimisations based on UB to make
>>> developers' lives much more difficult by working logically backwards in
>>> time.
>>>
>>
>> You have to remember that a lot (but by no means all) of UB in C comes
>> from the desire to have an efficient language that is portable to many
>> architectures. C has to support 'int' of unspecified maximum size,
>> machines that have two's complement, one's complement and sign-magnitude
>> arithmetic, multiple different floating point formats. The way the
>> language committee has enabled such support without requiring the tools
>> to produce grossly inefficient code is to say that if you step outside
>> the limits of the machine you are running on, then the behaviour is
>> undefined. However, it didn't specifically qualify what had to happen
>> in those cases and didn't distinguish them from other cases where UB can
>> occur; it just said, if this happens your program is broken.
>> Furthermore, it said quite clearly that compilers do not HAVE to detect
>> such brokenness - because in the extreme cases that would be very
>> expensive or even impossible to prove.
>
> I understand that - and I agree with the principle behind the use of UB
> in the C standards. (I think perhaps some of the outdated hardware,
> such as support for anything other than two's complement signed
> integers, could be depreciated in newer C standards - but that's another
> issue.)
>
> However, just because the C standards make certain behaviour undefined,
> does not mean that C /compilers/ must leave it that way. In particular,
> some undefined behaviour situations could usefully be turned into
> unspecified behaviour while retaining much of the optimisation
> flexibility, but with less of the danger. I think Annex L could be
> described in that way. Some types of undefined behaviour, such as
> accessing arrays out of bounds, cannot reasonably be bounded or limited
> in any way - you may be writing over some other data, and the results
> are completely unpredictable.
>
> But consider if integer overflow were made "unspecified" rather than
> "undefined" - i.e., it were guaranteed to return a valid integer, but
> you know nothing about /which/ integer it is. For efficiency, that
> would require that the hardware does not have enforced traps on
> overflow. But the compiler is still able to optimise "x * 3 / 6" to "x
> / 2", and still able to assume that "x + 1 > x". But given "x = 1290;
> foo(); y = x * x * x;" it is no longer able to eliminate the call to "foo".
>
I /think/ you've just said that x86 can no-longer use its divide
instruction. That instruction faults if you have n/0 or INT_MIN/-1.
R.
> (It is quite possible that my analysis above only applies in some
> circumstances, not others - I haven't attempted to think through a range
> of cases, but merely to give an example of how it is possible to
> generate efficient code without surprising the programmer /too/ much.)
>
>>
>> (I would add at this point, that it is also very hard at times to prove
>> something would invoke UB, so the ability to emit a diagnostic could
>> depend on previous optimizations done on the code before the check is
>> performed -- you only have to look at the history of
>> -Wmaybe-uninitialized to realize how hard it is to get things like this
>> right; and false positives can be as painful for developers as missing
>> warnings or warnings that seem to appear and disappear at random).
>
> I appreciate that this is very hard, and that giving a useful balance
> between warnings with few false positives is often a difficult process.
>
>>
>>> (Note that in all my testing, I have not found gcc to perform any of
>>> these unwanted "optimisations" that I fear - but I haven't found good
>>> reason to be sure that they won't be performed in the future.)
>>>
>>
>> And I doubt we would ever make such a guarantee. It might severely
>> hamper GCC's stated goal of being a production-quality optimizing compiler.
>
> gcc is already a highly optimising compiler - if it is not "production
> quality", then I don't know any compiler that /is/ "production quality".
> And I support the goal of making gcc ever better.
>
> But the goal of efficient code generation must not be the only one, and
> gcc must not lose sight of other important goals. For the programmer,
> writing /correct/ code must always outweigh /fast/ code - and the
> compiler should be a tool to aid the programmer.
>
>>
>>>
>>> And yes, I realise that taken to extremes, this is ultimately an
>>> impossible task. The point is merely to aim in that direction. I don't
>>> expect gcc to warn on every possible error - but I am happy to see it
>>> warning on more and more possible errors. What I don't want to see is
>>> is that since it is impossible to know /exactly/ what the programmer
>>> wanted when there is UB in the code, that means the compiler can refuse
>>> all responsibility for trying. That would IMHO be a disservice to the
>>> user, making it harder to find and fix bugs.
>>
>
>