59083 – -fisolate-erroneous-paths produces illegal instruction with enabled -fprofile-generate

Bug 59083 - -fisolate-erroneous-paths produces illegal instruction with enabled -fprofile-generate

Summary: -fisolate-erroneous-paths produces illegal instruction with enabled -fprofile...

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	c++ (show other bugs)
Version:	4.9.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-11-11 22:04 UTC by Martin Liška
Modified:	2013-12-18 13:08 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2013-11-11 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Martin Liška 2013-11-11 22:04:37 UTC

Hello,
   I've encountered a new bug that was introduced by SVN commit: 204414.

Program: gimp
commit 88ecd59c3436d302b644a5d25c1938c0e7b60ae0
Author: Michael Natterer <mitch@gimp.org>
Date:   Tue Feb 5 20:43:53 2013 +0100

GCC: 204414

uname -a:
Linux marxinbox 3.7.4-gentoo #6 SMP Mon Sep 30 20:53:52 CEST 2013 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

Build flags:
CFLAGS="-flto=9 -fno-fat-lto-objects -O2 -fprofile-generate"

When I added -fno-isolate-erroneous-paths the program run correctly.

Stacktrace:

    │0x988226 <windows_actions_dock_window_added+646>        addq   $0x1,0x6a8732(%rip)        # 0x1030960 <__gcov0.gimp_action_group_add_actions.lto_priv.4837+64>                                                                           │   │0x98822e <windows_actions_dock_window_added+654>        addq   $0x1,0x6a87d2(%rip)        # 0x1030a08 <__gcov0.gimp_action_group_add_actions.lto_priv.4837+232>                                                                          │   │0x988236 <windows_actions_dock_window_added+662>        callq  0x820fa0 <gimp_action_group_check_unique_action.lto_priv.5831>                                                                                                            │   │0x98823b <windows_actions_dock_window_added+667>        test   %eax,%eax                                                                                                                                                                 │   │0x98823d <windows_actions_dock_window_added+669>        jne    0x988258 <windows_actions_dock_window_added+696>                                                                                                                          │   │0x98823f <windows_actions_dock_window_added+671>        addq   $0x1,0x6a8721(%rip)        # 0x1030968 <__gcov0.gimp_action_group_add_actions.lto_priv.4837+72>                                                                           │   │0x988247 <windows_actions_dock_window_added+679>        addq   $0x1,0x6a87b1(%rip)        # 0x1030a00 <__gcov0.gimp_action_group_add_actions.lto_priv.4837+224>                                                                          │   │0x98824f <windows_actions_dock_window_added+687>        jmpq   0x988100 <windows_actions_dock_window_added+352>                                                                                                                          │   │0x988254 <windows_actions_dock_window_added+692>        nopl   0x0(%rax)                                                                                                                                                                 │   │0x988258 <windows_actions_dock_window_added+696>        mov    $0x5,%edx                                                                                                                                                                 │   │0x98825d <windows_actions_dock_window_added+701>        mov    $0xb8b6f6,%esi                                                                                                                                                            │   │0x988262 <windows_actions_dock_window_added+706>        xor    %edi,%edi                                                                                                                                                                 │   │0x988264 <windows_actions_dock_window_added+708>        callq  0x478da0 <dcgettext@plt>                                                                                                                                                  │  >│0x988269 <windows_actions_dock_window_added+713>        ud2                                                                                                                                                                              │   │0x98826b                                                nopl   0x0(%rax,%rax,1)                                                                                                                                                          │   │0x988270 <windows_actions_update_display_accels>        push   %r14                                                                                                                                                                      │   │0x988272 <windows_actions_update_display_accels+2>      mov    %rdi,%r14                                                                                                                                                                 │   │0x988275 <windows_actions_update_display_accels+5>      push   %r13                                                                                                                                                                      │   │0x988277 <windows_actions_update_display_accels+7>      push   %r12                                                                                                                                                                      │   │0x988279 <windows_actions_update_display_accels+9>      push   %rbp                                                                                                                                                                      │   │0x98827a <windows_actions_update_display_accels+10>     push   %rbx                                                                                                                                                                      │   │0x98827b <windows_actions_update_display_accels+11>     mov    0x20(%rdi),%rdi                                                                                                                                                           │   │0x98827f <windows_actions_update_display_accels+15>     addq   $0x1,0x7e7df9(%rip)        # 0x1170080 <__gcov0.windows_actions_update_display_accels>                                                                                    │   │0x988287 <windows_actions_update_display_accels+23>     callq  0x99e6b0 <gimp_get_display_iter>                                                                                                                                          │   │0x98828c <windows_actions_update_display_accels+28>     test   %rax,%rax                                                                                                                                                                 │   │0x98828f <windows_actions_update_display_accels+31>     mov    %rax,%rbx                                                                                                                                                                 │   │0x988292 <windows_actions_update_display_accels+34>     je     0x988398 <windows_actions_update_display_accels+296>                                                                                                                      │   │0x988298 <windows_actions_update_display_accels+40>     xor    %ebp,%ebp                                                                                                                                                                 │   │0x98829a <windows_actions_update_display_accels+42>     jmpq   0x988366 <windows_actions_update_display_accels+246>                                                                                                                      │   │0x98829f <windows_actions_update_display_accels+47>     nop        

Thank you,
Martin

Comment 1 Andrew Pinski 2013-11-11 22:08:52 UTC

We need a testcase for this.  ud2 translates to __builtin_trap so this could be a bug in their code still.

Comment 2 Jeffrey A. Law 2013-11-11 22:50:05 UTC

I need a compilable/complete testcase.

If a program is faulting with -fisolate-erroneous-paths, then that program is faulty in one way or another.  It's dereferencing a null pointer, pass a null argument in a situation where the argument must be non-null, or returning a null pointer from a function declared as returning non-null.

Comment 3 Martin Liška 2013-11-12 16:00:00 UTC

I've seen the same problem also in Inkscape. I will try to create a testcase.

Would it be possible Jeffrey to build one of these programs?

Thanks,
Martin

Comment 4 Markus Trippelsdorf 2013-11-12 16:17:42 UTC

(In reply to Jeffrey A. Law from comment #2)
> I need a compilable/complete testcase.
> 
> If a program is faulting with -fisolate-erroneous-paths, then that program
> is faulty in one way or another.  It's dereferencing a null pointer, pass a
> null argument in a situation where the argument must be non-null, or
> returning a null pointer from a function declared as returning non-null.

I'm seeing this also quite often. Xorg fails to start when
compiled with current gcc, the Linux kernel doesn't build, because
relocs (build during the kernel build) fails with illegal instruction, 
etc..
So tbe -fisolate-erroneous-paths patch should be reverted IMHO.

Comment 5 Jeffrey A. Law 2013-11-12 16:18:56 UTC

I need testcases.  "the kernel" or "x.org" isn't sufficient for a variety of reasons.

Comment 6 Markus Trippelsdorf 2013-11-12 16:26:54 UTC

(In reply to Jeffrey A. Law from comment #5)
> I need testcases.  "the kernel" or "x.org" isn't sufficient for a variety of
> reasons.

Every program that uses a custom sig_handler which only handles
SIGSEGV will fail.

Comment 7 Markus Trippelsdorf 2013-11-12 16:39:04 UTC

markus@x4 tmp % cat test.c
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

static sigjmp_buf jmpbuf;

static void
sig_handler (int signo)
{
  siglongjmp (jmpbuf, 1);
}

int
main (void)
{
  char *p = NULL;
  volatile int ret = 0;
  struct sigaction sa;

  sa.sa_handler = sig_handler;
  sigemptyset (&sa.sa_mask);
  sa.sa_flags = SA_SIGINFO;

  if (sigaction (SIGSEGV, &sa, 0))
    {
      perror ("installing SIGSEGV handler\n");
      exit (1);
    }

  puts ("Attempting to sprintf to null ptr");
  if (setjmp (jmpbuf))
    {
      puts ("Exiting main...");
      return ret;
    }

  sprintf (p, "This should segv\n");

  return 1;
}

markus@x4 tmp % gcc -O2 test.c && ./a.out
Attempting to sprintf to null ptr
[1]    28519 illegal hardware instruction  ./a.out
markus@x4 tmp % gcc -fno-isolate-erroneous-paths -O2 test.c && ./a.out
Attempting to sprintf to null ptr
Exiting main...
markus@x4 tmp %

Comment 8 Jeffrey A. Law 2013-11-12 23:10:06 UTC

Should be fixed via recent commits.  Specifically, we preserve the *0 for code that wants to catch the null pointer deref.

Comment 9 Markus Trippelsdorf 2013-11-13 05:17:25 UTC

(In reply to Jeffrey A. Law from comment #8)
> Should be fixed via recent commits.  Specifically, we preserve the *0 for
> code that wants to catch the null pointer deref.

Well, if you had run the simple glibc testcase that I've posted above,
you would have found out that your recent commits don't fix this issue.

It's obvious, not only from the amount of churn, that fisolate-erroneous-paths
shouldn't be enabled by default at -O2. The very few users you would like SIGSEGV
to be transformed to SIGILL by the compiler should set this option explicitly.

Comment 10 Jeffrey A. Law 2013-11-13 05:30:41 UTC

I ran the testcase you sent.  It worked fine for me.

THe problems we're having are within the realm of normal development and they will be resolved one way or another.

Comment 11 Jeffrey A. Law 2013-11-13 07:34:42 UTC

Damn it.  Tested the wrong compiler.

The problem with your testcase Markus is you're simply not allowed to pass a null pointer to sprintf, memcpy and a variety of other functions.  Once you execute code which attempts that, you've walked into the realm of undefined behaviour.

Once you cross that boundary the compiler is allowed to do these kinds of transformations.  Your testcode is fundamentally flawed in that it relies upon undefined behaviour.

Comment 12 Jeffrey A. Law 2013-11-13 07:42:13 UTC

I'll note further, that an implementation of sprintf, memcpy, etc could check for a NULL pointer internally and raise a trap on their own rather than dereferencing the invalid pointer and still be a conforming implementation.

In summary, you can't make the assumption that you'll actually get a segfault/bus error like your code wants to do.

Contrast that to an explicit *0 = <whatever> which might appear in user code.  While executing that code still results in undefined behaviour, one can easily make an argument from a QOI standpoint that we should still allow the memory dereference to occur to generate the SIGBUS/SIGSEGV.

Comment 13 Markus Trippelsdorf 2013-11-13 07:50:02 UTC

(In reply to Jeffrey A. Law from comment #11)
> Damn it.  Tested the wrong compiler.
> 
> The problem with your testcase Markus is you're simply not allowed to pass a
> null pointer to sprintf, memcpy and a variety of other functions.  Once you
> execute code which attempts that, you've walked into the realm of undefined
> behaviour.
> 
> Once you cross that boundary the compiler is allowed to do these kinds of
> transformations.  Your testcode is fundamentally flawed in that it relies
> upon undefined behaviour.

Understood. But the problem is that "the kernel" and "x.org" still fail.
I will try to come up with testcases for those two and see if it's also
caused by the same undefined behavior.

Comment 14 Markus Trippelsdorf 2013-11-13 09:08:23 UTC

The 'kernel' case passes a NULL pointer to qsort:

 % < test.i
struct relocs {
  int *offset;
};
static struct relocs a;
void qsort(void *, int) __attribute__((__nonnull__));
void die(int *p1, ...) {}

void sort_relocs(struct relocs *p1) { qsort(p1->offset, 0); }

int main() {
  sort_relocs(&a);
  return 0;
}

Comment 15 Richard Biener 2013-11-13 15:47:00 UTC

Note that technically invalid C programs still may be "reasonable".  For example
writing *0 = 1; and catching that via sigsetjmp/siglongjmp and a signal handler
is possible but of course invokes undefined behavior according to the C
standard.

But yes, if now random "POSIX" programs start failing then that's bad.  Does
-fnon-call-exceptions exeption handling on SJLJ targets still work?  (I suppose
we lower to SJLJ only after RTL expansion or we at least still have EH edges
around).

Comment 16 Jeffrey A. Law 2013-11-13 16:32:40 UTC

Richi,

No doubt about *0 = 1 and the like.  While it's clearly undefined, I think we've got to continue to support catching the SIGSEGV/SIGBUS from a QOI standpoint.  That's why I installed the changes to preserve the actual null dereferences yesterday.

The cases where a NULL is passed as an argument to a function which declares the argument must be non-NULL isn't as clear cut.  Similarly for a NULL which reaches a return statement in a function declared as returning non-NULL.   I feel we're doing the right thing for those by just trapping.  There's no guarantee the caller/callee will dereference those values -- all we know is a NULL pointer is disallowed in those cases by way of the attributes.

And FWIW, we're going to have the same bundle of questions when we want to start isolating and trapping out of bounds array accesses.

All cases look more and more like something we should be warning about.

WRT non-call-exceptions on SJLJ targets, we should still have the EH edges around, so fixing the abnormal edge handing should be all we need to do (of course, SJLJ targets are less and less important, but presumably they're still around?)

Comment 17 Jeffrey A. Law 2013-11-13 16:50:13 UTC

Markus,

For the kernel case, note the qsort prototype and the non-null attribute.  That explicitly states that the pointer arguments must not be null.  Any code which then passes null for those arguments has stepped into the realm of undefined behaviour.

Prior to CCP2 we have:

main ()
{
  int * _3;

;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
;;    prev block 0, next block 1, flags: (NEW, REACHABLE)
;;    pred:       ENTRY [100.0%]  (FALLTHRU,EXECUTABLE)
  _3 = a.offset;
  qsort (_3, 0);
  return 0;
;;    succ:       EXIT [100.0%]

}

a.offset gets folded to zero by CCP2 resulting in:
main ()
{
;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
;;    prev block 0, next block 1, flags: (NEW, REACHABLE)
;;    pred:       ENTRY [100.0%]  (FALLTHRU,EXECUTABLE)
  qsort (0B, 0);
  return 0;
;;    succ:       EXIT [100.0%]

}

Note we're passing NULL as the first argument to qsort, which has a declaration saying that none of its pointer arguments can be NULL.  

Note we're able to fold a.offset to zero because we can see "a"'s initializer. 

AFAICT the code is doing exactly what is expected.

Comment 18 Markus Trippelsdorf 2013-11-13 17:04:02 UTC

(In reply to Jeffrey A. Law from comment #17)
> For the kernel case, note the qsort prototype and the non-null attribute. 
> That explicitly states that the pointer arguments must not be null.  Any
> code which then passes null for those arguments has stepped into the realm
> of undefined behaviour.

Yes, I've already posted a patch to the LKML to fix this issue.

Now the question is if one could also find *real* bugs with the help of
 -fisolate-erroneous-path and if the inconvenience of breaking otherwise 
"harmless" programs is counterbalanced by this ability?

Comment 19 Jeffrey A. Law 2013-11-13 17:10:29 UTC

Yes, the glibc guys have already found real bugs that they've fixed.

Comment 20 Markus Trippelsdorf 2013-11-18 12:03:09 UTC

I've tested this further on my Gentoo box and it turned out
the many nontrivial packages that I've compiled failed
with "trap invalid opcode". From a QOI perspective this is a 
nightmare. Not every user of the compiler is a developer, e.g.
as a Gentoo user I don't care deeply about most ebuilds, I just
want them to compile without fuss. And for this I need a compiler
that is as permissive as possible.

Comment 21 Markus Trippelsdorf 2013-12-18 13:08:06 UTC

Jeff patch to split up the erroneous path optimization into two pieces
(r205689) fixes the issue.