Hello, I've encountered a new bug that was introduced by SVN commit: 204414. Program: gimp commit 88ecd59c3436d302b644a5d25c1938c0e7b60ae0 Author: Michael Natterer <mitch@gimp.org> Date: Tue Feb 5 20:43:53 2013 +0100 GCC: 204414 uname -a: Linux marxinbox 3.7.4-gentoo #6 SMP Mon Sep 30 20:53:52 CEST 2013 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux Build flags: CFLAGS="-flto=9 -fno-fat-lto-objects -O2 -fprofile-generate" When I added -fno-isolate-erroneous-paths the program run correctly. Stacktrace: │0x988226 <windows_actions_dock_window_added+646> addq $0x1,0x6a8732(%rip) # 0x1030960 <__gcov0.gimp_action_group_add_actions.lto_priv.4837+64> │ │0x98822e <windows_actions_dock_window_added+654> addq $0x1,0x6a87d2(%rip) # 0x1030a08 <__gcov0.gimp_action_group_add_actions.lto_priv.4837+232> │ │0x988236 <windows_actions_dock_window_added+662> callq 0x820fa0 <gimp_action_group_check_unique_action.lto_priv.5831> │ │0x98823b <windows_actions_dock_window_added+667> test %eax,%eax │ │0x98823d <windows_actions_dock_window_added+669> jne 0x988258 <windows_actions_dock_window_added+696> │ │0x98823f <windows_actions_dock_window_added+671> addq $0x1,0x6a8721(%rip) # 0x1030968 <__gcov0.gimp_action_group_add_actions.lto_priv.4837+72> │ │0x988247 <windows_actions_dock_window_added+679> addq $0x1,0x6a87b1(%rip) # 0x1030a00 <__gcov0.gimp_action_group_add_actions.lto_priv.4837+224> │ │0x98824f <windows_actions_dock_window_added+687> jmpq 0x988100 <windows_actions_dock_window_added+352> │ │0x988254 <windows_actions_dock_window_added+692> nopl 0x0(%rax) │ │0x988258 <windows_actions_dock_window_added+696> mov $0x5,%edx │ │0x98825d <windows_actions_dock_window_added+701> mov $0xb8b6f6,%esi │ │0x988262 <windows_actions_dock_window_added+706> xor %edi,%edi │ │0x988264 <windows_actions_dock_window_added+708> callq 0x478da0 <dcgettext@plt> │ >│0x988269 <windows_actions_dock_window_added+713> ud2 │ │0x98826b nopl 0x0(%rax,%rax,1) │ │0x988270 <windows_actions_update_display_accels> push %r14 │ │0x988272 <windows_actions_update_display_accels+2> mov %rdi,%r14 │ │0x988275 <windows_actions_update_display_accels+5> push %r13 │ │0x988277 <windows_actions_update_display_accels+7> push %r12 │ │0x988279 <windows_actions_update_display_accels+9> push %rbp │ │0x98827a <windows_actions_update_display_accels+10> push %rbx │ │0x98827b <windows_actions_update_display_accels+11> mov 0x20(%rdi),%rdi │ │0x98827f <windows_actions_update_display_accels+15> addq $0x1,0x7e7df9(%rip) # 0x1170080 <__gcov0.windows_actions_update_display_accels> │ │0x988287 <windows_actions_update_display_accels+23> callq 0x99e6b0 <gimp_get_display_iter> │ │0x98828c <windows_actions_update_display_accels+28> test %rax,%rax │ │0x98828f <windows_actions_update_display_accels+31> mov %rax,%rbx │ │0x988292 <windows_actions_update_display_accels+34> je 0x988398 <windows_actions_update_display_accels+296> │ │0x988298 <windows_actions_update_display_accels+40> xor %ebp,%ebp │ │0x98829a <windows_actions_update_display_accels+42> jmpq 0x988366 <windows_actions_update_display_accels+246> │ │0x98829f <windows_actions_update_display_accels+47> nop Thank you, Martin
We need a testcase for this. ud2 translates to __builtin_trap so this could be a bug in their code still.
I need a compilable/complete testcase. If a program is faulting with -fisolate-erroneous-paths, then that program is faulty in one way or another. It's dereferencing a null pointer, pass a null argument in a situation where the argument must be non-null, or returning a null pointer from a function declared as returning non-null.
I've seen the same problem also in Inkscape. I will try to create a testcase. Would it be possible Jeffrey to build one of these programs? Thanks, Martin
(In reply to Jeffrey A. Law from comment #2) > I need a compilable/complete testcase. > > If a program is faulting with -fisolate-erroneous-paths, then that program > is faulty in one way or another. It's dereferencing a null pointer, pass a > null argument in a situation where the argument must be non-null, or > returning a null pointer from a function declared as returning non-null. I'm seeing this also quite often. Xorg fails to start when compiled with current gcc, the Linux kernel doesn't build, because relocs (build during the kernel build) fails with illegal instruction, etc.. So tbe -fisolate-erroneous-paths patch should be reverted IMHO.
I need testcases. "the kernel" or "x.org" isn't sufficient for a variety of reasons.
(In reply to Jeffrey A. Law from comment #5) > I need testcases. "the kernel" or "x.org" isn't sufficient for a variety of > reasons. Every program that uses a custom sig_handler which only handles SIGSEGV will fail.
markus@x4 tmp % cat test.c #include <setjmp.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <sys/types.h> static sigjmp_buf jmpbuf; static void sig_handler (int signo) { siglongjmp (jmpbuf, 1); } int main (void) { char *p = NULL; volatile int ret = 0; struct sigaction sa; sa.sa_handler = sig_handler; sigemptyset (&sa.sa_mask); sa.sa_flags = SA_SIGINFO; if (sigaction (SIGSEGV, &sa, 0)) { perror ("installing SIGSEGV handler\n"); exit (1); } puts ("Attempting to sprintf to null ptr"); if (setjmp (jmpbuf)) { puts ("Exiting main..."); return ret; } sprintf (p, "This should segv\n"); return 1; } markus@x4 tmp % gcc -O2 test.c && ./a.out Attempting to sprintf to null ptr [1] 28519 illegal hardware instruction ./a.out markus@x4 tmp % gcc -fno-isolate-erroneous-paths -O2 test.c && ./a.out Attempting to sprintf to null ptr Exiting main... markus@x4 tmp %
Should be fixed via recent commits. Specifically, we preserve the *0 for code that wants to catch the null pointer deref.
(In reply to Jeffrey A. Law from comment #8) > Should be fixed via recent commits. Specifically, we preserve the *0 for > code that wants to catch the null pointer deref. Well, if you had run the simple glibc testcase that I've posted above, you would have found out that your recent commits don't fix this issue. It's obvious, not only from the amount of churn, that fisolate-erroneous-paths shouldn't be enabled by default at -O2. The very few users you would like SIGSEGV to be transformed to SIGILL by the compiler should set this option explicitly.
I ran the testcase you sent. It worked fine for me. THe problems we're having are within the realm of normal development and they will be resolved one way or another.
Damn it. Tested the wrong compiler. The problem with your testcase Markus is you're simply not allowed to pass a null pointer to sprintf, memcpy and a variety of other functions. Once you execute code which attempts that, you've walked into the realm of undefined behaviour. Once you cross that boundary the compiler is allowed to do these kinds of transformations. Your testcode is fundamentally flawed in that it relies upon undefined behaviour.
I'll note further, that an implementation of sprintf, memcpy, etc could check for a NULL pointer internally and raise a trap on their own rather than dereferencing the invalid pointer and still be a conforming implementation. In summary, you can't make the assumption that you'll actually get a segfault/bus error like your code wants to do. Contrast that to an explicit *0 = <whatever> which might appear in user code. While executing that code still results in undefined behaviour, one can easily make an argument from a QOI standpoint that we should still allow the memory dereference to occur to generate the SIGBUS/SIGSEGV.
(In reply to Jeffrey A. Law from comment #11) > Damn it. Tested the wrong compiler. > > The problem with your testcase Markus is you're simply not allowed to pass a > null pointer to sprintf, memcpy and a variety of other functions. Once you > execute code which attempts that, you've walked into the realm of undefined > behaviour. > > Once you cross that boundary the compiler is allowed to do these kinds of > transformations. Your testcode is fundamentally flawed in that it relies > upon undefined behaviour. Understood. But the problem is that "the kernel" and "x.org" still fail. I will try to come up with testcases for those two and see if it's also caused by the same undefined behavior.
The 'kernel' case passes a NULL pointer to qsort: % < test.i struct relocs { int *offset; }; static struct relocs a; void qsort(void *, int) __attribute__((__nonnull__)); void die(int *p1, ...) {} void sort_relocs(struct relocs *p1) { qsort(p1->offset, 0); } int main() { sort_relocs(&a); return 0; }
Note that technically invalid C programs still may be "reasonable". For example writing *0 = 1; and catching that via sigsetjmp/siglongjmp and a signal handler is possible but of course invokes undefined behavior according to the C standard. But yes, if now random "POSIX" programs start failing then that's bad. Does -fnon-call-exceptions exeption handling on SJLJ targets still work? (I suppose we lower to SJLJ only after RTL expansion or we at least still have EH edges around).
Richi, No doubt about *0 = 1 and the like. While it's clearly undefined, I think we've got to continue to support catching the SIGSEGV/SIGBUS from a QOI standpoint. That's why I installed the changes to preserve the actual null dereferences yesterday. The cases where a NULL is passed as an argument to a function which declares the argument must be non-NULL isn't as clear cut. Similarly for a NULL which reaches a return statement in a function declared as returning non-NULL. I feel we're doing the right thing for those by just trapping. There's no guarantee the caller/callee will dereference those values -- all we know is a NULL pointer is disallowed in those cases by way of the attributes. And FWIW, we're going to have the same bundle of questions when we want to start isolating and trapping out of bounds array accesses. All cases look more and more like something we should be warning about. WRT non-call-exceptions on SJLJ targets, we should still have the EH edges around, so fixing the abnormal edge handing should be all we need to do (of course, SJLJ targets are less and less important, but presumably they're still around?)
Markus, For the kernel case, note the qsort prototype and the non-null attribute. That explicitly states that the pointer arguments must not be null. Any code which then passes null for those arguments has stepped into the realm of undefined behaviour. Prior to CCP2 we have: main () { int * _3; ;; basic block 2, loop depth 0, count 0, freq 10000, maybe hot ;; prev block 0, next block 1, flags: (NEW, REACHABLE) ;; pred: ENTRY [100.0%] (FALLTHRU,EXECUTABLE) _3 = a.offset; qsort (_3, 0); return 0; ;; succ: EXIT [100.0%] } a.offset gets folded to zero by CCP2 resulting in: main () { ;; basic block 2, loop depth 0, count 0, freq 10000, maybe hot ;; prev block 0, next block 1, flags: (NEW, REACHABLE) ;; pred: ENTRY [100.0%] (FALLTHRU,EXECUTABLE) qsort (0B, 0); return 0; ;; succ: EXIT [100.0%] } Note we're passing NULL as the first argument to qsort, which has a declaration saying that none of its pointer arguments can be NULL. Note we're able to fold a.offset to zero because we can see "a"'s initializer. AFAICT the code is doing exactly what is expected.
(In reply to Jeffrey A. Law from comment #17) > For the kernel case, note the qsort prototype and the non-null attribute. > That explicitly states that the pointer arguments must not be null. Any > code which then passes null for those arguments has stepped into the realm > of undefined behaviour. Yes, I've already posted a patch to the LKML to fix this issue. Now the question is if one could also find *real* bugs with the help of -fisolate-erroneous-path and if the inconvenience of breaking otherwise "harmless" programs is counterbalanced by this ability?
Yes, the glibc guys have already found real bugs that they've fixed.
I've tested this further on my Gentoo box and it turned out the many nontrivial packages that I've compiled failed with "trap invalid opcode". From a QOI perspective this is a nightmare. Not every user of the compiler is a developer, e.g. as a Gentoo user I don't care deeply about most ebuilds, I just want them to compile without fuss. And for this I need a compiler that is as permissive as possible.
Jeff patch to split up the erroneous path optimization into two pieces (r205689) fixes the issue.