60406 – recover.go: test13reflect2 test failure

Bug 60406 - recover.go: test13reflect2 test failure

Summary: recover.go: test13reflect2 test failure

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	go (show other bugs)
Version:	4.9.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Ian Lance Taylor

URL:
Keywords:

Duplicates (1):	63170 (view as bug list)
Depends on:
Blocks:

Reported:	2014-03-04 07:30 UTC by Dominik Vogt
Modified:	2014-10-28 08:50 UTC (History)
CC List:	4 users (show)

See Also:	60728
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2014-08-06 00:00:00

Attachments
possible patch (1.35 KB, patch) 2014-10-03 22:58 UTC, Ian Lance Taylor	Details \| Diff
possible patch 2 (3.40 KB, patch) 2014-10-07 18:25 UTC, Ian Lance Taylor	Details \| Diff
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Dominik Vogt 2014-03-04 07:30:23 UTC

From a email discussion between me (DV) and Ian Lance Taylor (IT):

DV: The bug shows up when running recover.go from the go testsuite
DV: in the function test13reflect2 (and test14reflect2).
DV:
DV: When panic is called, the deferred reflection function is invoked
DV: by calling the thunk function.  (See below for the assembly code
DV: on s390).  The call to __go_set_defer_retaddr is passed the address
DV: of a label that was generated originally right behind the call to
DV: makeFuncStub, and the fake jump is there to make sure the
DV: automatically generated label does not get deleted during
DV: compilation.  However, in the assembler code a different address
DV: is passed to __go_set_defer_retaddr that the target address of the
DV: fake jump:
DV:
DV: -- snip assembler code --
DV:   0000000080002490 <main.$thunk0>:
DV:     ...
DV:     8000249e:  larl  %r2,800024f4 <main.$thunk0+0x64> -------
DV:     800024a4:  brasl %r14,80002c48 <__go_set_defer_retaddr>  |
DV:     800024aa:  tmll  %r2,255                                 |
DV:     # fake jump                                              |
DV:     800024ae:  jne   80002500 <main.$thunk0+0x70> --------   |
DV:     800024b2:  ltgr  %r13,%r13                            |  |
DV:     ...                                                   |  |
DV:     # code that copies T5                                 |  |
DV:     800024de:  lghi  %r3,31                               |  |
DV:     800024e2:  mvc   0(256,%r2),0(%r1)                    |  |
DV:     800024e8:  la    %r2,256(%r2)                         |  |
DV:     800024ec:  la    %r1,256(%r1)                         |  |
DV:     800024f0:  brctg %r3,800024e2 <main.$thunk0+0x52>     |  |
DV:     # wrong position of split label                       |  |
DV:     800024f4:  mvc   0(256,%r2),0(%r1)            <-------|--
DV:     800024fa:  la    %r2,160(%r15)                        |
DV:     # call makeFuncStub                                   |
DV:     800024fe:  basr  %r14,%r13  <--- makeFuncStub         |
DV:     # original position of the label "retaddr"            |
DV:     80002500:  lghi  %r2,0                        <-------
DV:     ...
DV: -- snip --
DV:
DV: The code in libgo/runtime/go-recover.c:__go_can_recover tests
DV: whether the address passed to __go_set_defer_retaddr is in a
DV: certain range around the the real return address of makeFuncStub:
DV:
DV:   if (ret <= dret && ret + 16 >= dret)
DV:     return 1;
DV:
DV: In the above assembly language code the assumption that the defer
DV: retaddr will always end up in that range is violated.  Thus,
DV: __go_can_recover returns 0, the panic is not recovered and the test
DV: case fails.
DV:
DV: --
DV:
DV: There are two problems in the current code.  First, the assumption
DV: in __go_can_recover is too strict.  I think the "correct" test
DV: would be to check whether dret is inside the thunk function, i.e.
DV:
DV:   if (dret >= thunk_start && dret < thunk_end)
DV:     return 1;

IT: Yes.


DV: Except that thunk_start and thunk_end are not easily available at
DV: that place.


IT: Well, we could get thunk_start fairly easily: we could pass it to
IT: __go_set_defer_retaddr, just as we currently pass the return label.  I
IT: can't think of a reliable way for us to get thunk_end, though.


DV: Second, the code ("hack") in gcc/go/gofrontend/statementis.cc:build_thunk()
DV: simply does not work:
DV:
DV: -- snip --
DV:   // For a defer statement, start with a call to
DV:   // __go_set_defer_retaddr.  */
DV:   Label* retaddr_label = NULL;
DV:   if (may_call_recover)
DV:     {
DV:       retaddr_label = gogo->add_label_reference("retaddr", location, false);
DV:       Expression* arg = Expression::make_label_addr(retaddr_label, location);
DV:       Expression* call = Runtime::make_call(Runtime::SET_DEFER_RETADDR,
DV:                                             location, 1, arg);
DV:
DV:       // This is a hack to prevent the middle-end from deleting the
DV:       // label.
DV:       gogo->start_block(location);
DV:       gogo->add_statement(Statement::make_goto_statement(retaddr_label,
DV:                                                          location));
DV:       ...
DV:     }
DV:
DV:   ...
DV:
DV:   // If this is a defer statement, the label comes immediately after
DV:   // the call.
DV:   if (may_call_recover)
DV:     {
DV:       gogo->add_label_definition("retaddr", location);
DV:       ...
DV:     }
DV: -- snip --
DV:
DV: The address of the label "retaddr" is passed to
DV: __go_set_defer_retaddr and then a fake jump to that label is
DV: inserted after that (fake because __go_set_defer_retaddr always
DV: returns 0).  Although the fake jump does prevent the label of the
DV: jump target from being deleted, it does not help keeping the label
DV: at the passed address alive.  What actually happens is this:
DV:
DV:  * At first, the call and the jump both use the retaddr label.
DV:  * After the cprop1 pass, the cfgcleanup pass calls
DV:    try_forward_edges().
DV:  * try_forward_edges() decides that the edge described by the jump
DV:    can be forwarded to some other place.  So, it creates a new
DV:    label at the new jump target and redirects the jump there.
DV:    Then it tries to delete the label (cfgrtl.c:delete_insn()) and
DV:    notices it's a user generated label that cannot be deleted.
DV:    Instead, the label is replaced by a NOTE_DELETED_LABEL_NAME to
DV:    keep the address reference alive.  We now have _two_ labels,
DV:    the new, live one and the old deleted one.
DV:  * The following passes move the deleted label around the function
DV:    and it ends up in a more or less random location inside the
DV:    thunk function.  (Actually, where it ends up depends on the
DV:    amount of code that is needed to copy T5.  If T5 is small
DV:    enough that it can be copied without a loop, the test succeeds.)
DV:
DV: In other words, when the edge is forwarded, only jumps to the
DV: target label are forwarded but not simple references to the
DV: address of the label.  I'm not sure whether this is a Gcc bug or
DV: not.

IT: The design assumes that the thunk is very simple, which it normally
IT: is.  A simple thunk won't have any basic blocks so the problematic
IT: optimization won't kick in.  I think that assumption is failing in
IT: this case because the function argument is so large.  The large
IT: argument is introducing a loop and thus permitting basic block
IT: optimizations to occur.
IT:
IT: I suppose we could introduce yet another thunk, and pass along the
IT: pointer to the parameters, rather than expanding them.  That is
IT: getting pretty horrible, though.  It would be better to use
IT: thunk_start and thunk_end as you suggest above, if we can figure out
IT: how to do it.

--

DV: To get the thunk address and length, the correct approach is
DV: probably to utilize the DWARF info, ...

IT: To get the thunk address I would suggest just changing the Go
IT: frontend to pass it to __go_defer_set_retaddr.
IT:
IT: I have no particular suggestions for the thunk length.  There may
IT: be some relatively simple way to pick that up as well.

Comment 1 Uroš Bizjak 2014-08-06 14:19:52 UTC

This problem also happens on alpha, please see analysis at [1].

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60874#c8

Comment 2 Uroš Bizjak 2014-08-06 17:04:41 UTC

Corrected Summary.

Comment 3 Uroš Bizjak 2014-08-07 08:47:24 UTC

It looks that a hack, mentioned in Comment #0 should use asm goto instead of goto. The following test:

--cut here--
extern void foo (void *);

int test_bad (int p1, int p2)
{
  __label__ bla1;

  foo (&&bla1);
  goto bla1;
 bla1:
  return 0;
}

int test_good (int p1, int p2)
{
  __label__ bla2;

  foo (&&bla2);
  asm goto ("" :::: bla2);
 bla2:
  return 0;
}
--cut here--

results in (-O2):

test_good:
        subl    $28, %esp
        movl    $.L5, (%esp)
        call    foo
.L6:
.L5:
        xorl    %eax, %eax
        addl    $28, %esp
        ret

which is much better than:

test_bad:
.L2:
        subl    $28, %esp
        movl    $.L2, (%esp)
        call    foo
        xorl    %eax, %eax
        addl    $28, %esp
        ret

Also checked on alpha:

test_good:
        .frame $30,16,$26,0
        .mask 0x4000000,-16
$LFB1:
        .cfi_startproc
        ldah $29,0($27)         !gpdisp!4
        lda $29,0($29)          !gpdisp!4
$test_good..ng:
        lda $30,-16($30)
        .cfi_def_cfa_offset 16
        cpys $f31,$f31,$f31
        ldah $16,$L4($29)               !gprelhigh
        ldq $27,foo($29)                !literal!5
        stq $26,0($30)
        .cfi_offset 26, -16
        .prologue 1
        lda $16,$L4($16)                !gprellow
        jsr $26,($27),foo               !lituse_jsr!5
        ldah $29,0($26)         !gpdisp!6
        lda $29,0($29)          !gpdisp!6
        .align 3 #realign
$L5:
$L4:
        mov $31,$0
        ldq $26,0($30)
        lda $30,16($30)
        .cfi_restore 26
        .cfi_def_cfa_offset 0
        ret $31,($26),1

Please note, that the check will still need some tolerance due to various label alignment requirements, additional instructions, etc:

  5c:   00 00 10 22     lda     a0,0(a0)
                        5c: GPRELLOW    .text+0x70              <- label loc
  60:   00 40 5b 6b     jsr     ra,(t12),64 <test_good+0x24>
                        60: LITUSE      .text+0x3
                        60: HINT        foo
  64:   00 00 ba 27     ldah    gp,0(ra)                        <- retaddr
                        64: GPDISP      .text+0x4
  68:   00 00 bd 23     lda     gp,0(gp)
  6c:   00 00 fe 2f     unop
  70:   00 04 ff 47     clr     v0

Comment 4 Uroš Bizjak 2014-08-07 10:34:50 UTC

Alternatively, since __go_set_defer_retaddr always returns 0, we can prevent split of the label by enclosing it in asm volatiles:

extern int foo (void *);
extern void bar (void);

int test_1 (void)
{
  __label__ bla1;

  if (foo (&&bla1))
    goto bla1;

  asm volatile ("");
  bar ();
 bla1:
  asm volatile ("");

  return 0;
}

test_1:
        subl    $28, %esp
        movl    $.L2, (%esp)
        call    foo
        testl   %eax, %eax
        jne     .L2
        call    bar
.L2:
        xorl    %eax, %eax
        addl    $28, %esp
        ret

'goto label' can also be substituted with 'asm goto ("" :::: label)', but I don't think this is necessary in the above case.

Comment 5 Ian Lance Taylor 2014-09-04 15:52:19 UTC

*** Bug 63170 has been marked as a duplicate of this bug. ***

Comment 6 Dominik Vogt 2014-09-05 07:15:08 UTC

I'll submit a fix for this problem as soon as I figure out what to do with the patch.

Comment 7 boger 2014-09-16 20:57:17 UTC

I've built with Dominik's patches against trunk on ppc64le and have been trying to run the gcc testsuite for go and libgo.

The recover.go testcase continues to fail in my build.  I did some debugging on this and found that the problem occurs in ffi_callback when it calls __go_makefunc_can_recover.  Based on the comments I think it should be called with the program address for the most recent caller on the stack which is not ffi, but instead it is called with the address for ffi_closer_helper_LINUX64.  I suspect this is happening because runtime_callers is not returning the complete set of callers.  This error leads to an incorrect setting of __makefunc_can_recover in the __go_defer_stack structure so the test prints the "missing recover" error.

Comment 8 boger 2014-09-22 20:57:09 UTC

Update on my previous work:

1) My previous update referred to a build that was done with the patches that were submitted to gcc and patches that Dominik provided me for libffi.  I found that if I only build with the gcc patches and don't use the libffi patches then the recover.go testcase passes on ppc64le.

Comment 9 boger 2014-09-23 13:16:06 UTC

The patch to fix the recover.go problem causes significant regression in gccgo when built for ppc64 (BE).  There are 32 unexpected failures in the gcc go testsuite with the patch 32 unexpected failures in the gcc libgo testsuite.  Without this patch on trunk there are 2 failures in the go testsuites and 1 failure in the libgo testsuite.

I did some debugging on one of the regression failures:  bug113.go.  I found that the problem occurs in the new code in the patch in go/gofrontend/statements.cc when creating the expression tree to call __go_set_defering_fn.  The argument that is being passed is generated through a call to make_func_code_reference:


      Expression* arg =
        Expression::make_func_code_reference(function, location);
      Expression* call = Runtime::make_call(Runtime::SET_DEFERING_FN,
                                            location, 1, arg);
      Statement* s = Statement::make_statement(call, true);
      s->determine_types();
      gogo->add_statement(s);


On ppc64le, this works as expected but on ppc64(be) the code that is generated from this is not the address of the function but the address of the .opd entry that is used to call the function.  As a result the setting of the deferring function is incorrect and it does not appear as if it can recover because of the way __go_can_recover uses the deferring function's address to see if it is in the correct range.

When I debug this using gdb:

Breakpoint 1, __go_set_defering_fn (defering_fn=0x10020150 <main.$thunk0>) at /home/boger/gccgo.work/140922/src/libgo/runtime/go-defer.c:78
78      {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-55.el7.ppc64
(gdb) info reg $r3
r3             0x10020150       268566864

From the objdump, I see this address is in the .opd:
0000000010020150 <main.$thunk0>:
    10020150:   00 00 00 00     .long 0x0
    10020154:   10 00 1c 4c     vsro    v0,v0,v3
    10020158:   00 00 00 00     .long 0x0
    1002015c:   10 02 81 c0     .long 0x100281c0

but the code is actually here, which is the address that should be passed to __go_set_defering_fn:
0000000010001c4c <.main.$thunk0>:
main.$thunk0():
    10001c4c:   7c 08 02 a6     mflr    r0
    10001c50:   f8 01 00 10     std     r0,16(r1)
    10001c54:   fb e1 ff f8     std     r31,-8(r1)
    10001c58:   f8 21 ff 71     stdu    r1,-144(r1)

Note that the actual function code address is in the first 8 bytes of the .opd entry for the function.

Comment 10 Peter Bergner 2014-09-23 13:34:29 UTC

(In reply to boger from comment #9)
> On ppc64le, this works as expected but on ppc64(be) the code that is
> generated from this is not the address of the function but the address of
> the .opd entry that is used to call the function.  As a result the setting
> of the deferring function is incorrect and it does not appear as if it can
> recover because of the way __go_can_recover uses the deferring function's
> address to see if it is in the correct range.

On BE, a "function pointer" (unlike on LE or x64* or ...) always points to a function's function descriptor (ie, the .opd entry) and not the code address.  The function descriptor contains 3 64-bit doublewords.  The first doubleword is the address of the function's code.  The second doubleword is the TOC value that needs to be in r2 while we execute the function and the third doubleword contains an environment pointer for languages such as Pascal and PL/1.

Comment 11 boger 2014-09-29 13:17:14 UTC

On ppc64 BE, the call to make_code_func_reference returns the function pointer in the .opd and not the actual address of the function.  That is what causes the failures with this patch https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00710.html
The information in this reply fixes the regressions from this patch on ppc64 BE:
https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02282.html

Essentially the change is to dereference the function pointer in __go_set_defering_fn when building for ppc64 BE as follows:

+#if defined(__powerpc64__) && _CALL_ELF != 2
+    g->defer->__defering_fn = *(void **)defering_fn;
+#else
+    g->defer->__defering_fn = defering_fn;
+#endif

Comment 12 Ian Lance Taylor 2014-10-03 22:58:59 UTC

Created attachment 33644 [details]
possible patch

I'm not really happy with Dominik's patch because 1) it doesn't work when configuring with --enable-sjlj-exceptions; 2) the current code almost always works, even on S/390, but the patch forces us to do a lookup in the FDE table every time we call recover.

I'd like to propose a different patch, that keeps the current code that almost always works, does a quick exit when there is no defer in the call stack, and does an unwind to check for a specific function in other cases.

Can people on systems for which the recover.go test currently fails please try this patch and let me know whether it fixes the problem?  Thanks.

This is https://codereview.appspot.com/153950043 .

Comment 13 Uroš Bizjak 2014-10-05 19:21:02 UTC

(In reply to Ian Lance Taylor from comment #12)

> Can people on systems for which the recover.go test currently fails please
> try this patch and let me know whether it fixes the problem?  Thanks.

Unfortunately, the patched libgo still fails on alpha.

Comment 14 Dominik Vogt 2014-10-06 15:43:46 UTC

> I'm not really happy with Dominik's patch because 1) it doesn't work when
> configuring with --enable-sjlj-exceptions;

Why is that important?

> 2) the current code almost always works, even on S/390, but the patch
> forces us to do a lookup in the FDE table every time we call recover.

The current code works unreliably as s390 uses memcopy to copy call arguments to the stack.  The control flow introduced by the function call triggers basic block reordering that may result in anything.

> I'd like to propose a different patch, that keeps the current code
> that almost always works, does a quick exit when there is no defer in
> the call stack, and does an unwind to check for a specific function
> in other cases.

I've not tested the patch yet, but see some potential problems:

* On systems that "use a leading underscore on symbol names", the test for functions beginning with "__go_" or "_go_" would yield "true" from user functions named "_go_..." (because the system adds one '_' and the patch strips it).

* Wouldn't the new patch re-introduce the bug that

  func foo(n int) {
    if (n == 0) { recover(); } else { foo(0); }
  }
  func main() {
    defer foo(1)
    panic("...")
  }

  would recover although it should not?

* The code is even more expensive than the approach I had chosen because now it needs to fetch a two level backtrace instead of just one level (and probably each level is more expensive than the one _Unwind_FindEnclosingFunc()).

----


Digging deeper at the issue that causes Lynn's problems on Power surfaces more problems with the current implementation of __go_can_recover() and the thunk, also with the patch I've posted.  Here's a summary of all the problems I am aware of (some new, some known, skipping the bugs introduced by the patch):

Problems with the current implementation only:
----------------------------------------------

1) Calculation of the label address in the thunk does not work if the basic block reordering is done (that's the issue why this bug report was created)

2) The current checks for "return address + 16" may point into a different function, allowing recover() in weird situations.

Problems with the current implementation and the proposed patch:
----------------------------------------------------------------

3) Quote from http://golang.org/ref/spec:

"The return value of recover is nil if any of the following conditions holds:
 ...
 *recover was not called directly by a deferred function."

According to the spec, the following code should recover the panic but does not:

  func main() { defer foo(); panic("..."); }
  func foo() { defer bar(); }
  func bar() { recover(); }

Note that this is also also "broken" in Golang (well, at least in the old version that comes with Ubuntu).  This may be an effect of imprecise wording of the spec.

4) __go_can_recover assumes that any call through libffi is allowed to recover.  Quote from the sources:

  "/* If the function calling recover was created by reflect.MakeFunc,
      then RETADDR will be somewhere in libffi.  Our caller is
      permitted to recover if it was called from libffi.  */"

This violates the specification.  An example that recovers the panic in a nested function but should not:

--snip--
package main
import "reflect"

func main() {
  // generate foo() and bar()
  fn := reflect.ValueOf(interface{}(&foo)).Elem()
  val := reflect.MakeFunc(fn.Type(), recover_reflect1)
  fn.Set(val)
  fn = reflect.ValueOf(interface{}(&bar)).Elem()
  val = reflect.MakeFunc(fn.Type(), recover_reflect2)
  fn.Set(val)

  defer foo()
  panic("...")
}

var foo func()
func recover_reflect1(args []reflect.Value) (results []reflect.Value) {
  bar()
  return results
}

var bar func()
func recover_reflect2(args []reflect.Value) (results []reflect.Value) {
  recover()
  return results
}
--snip--

Actually, I think this may also happen if bar is not a function created through reflection but any foreign language function

 * from a stripped object file
   (no name in the backtrace is guessed as "called by libffi"),
 * if the name begins with "[_]ffi_"

(But perhaps it's impossible to construct such an example.)

Comment 15 boger 2014-10-06 15:48:25 UTC

The testcase recover.go continues to fail on both ppc64 LE & BE with the new patch https://codereview.appspot.com/153950043.

Comment 16 Ian Lance Taylor 2014-10-06 16:27:15 UTC

>> I'm not really happy with Dominik's patch because 1) it doesn't work when
>> configuring with --enable-sjlj-exceptions;
>
> Why is that important?

It's not very important but it's still a point to consider.  Some targets default to SJLJ exceptions, albeit not very important ones.


>> 2) the current code almost always works, even on S/390, but the patch
>> forces us to do a lookup in the FDE table every time we call recover.
>
> The current code works unreliably as s390 uses memcopy to copy call
> arguments to the stack.  The control flow introduced by the function
> call triggers basic block reordering that may result in anything.

Sure, I understand, and it can obviously cause a false negative: some cases that should recover will fail to do so.  However, I don't see any way that it can ever cause a false positive: I don't see any way that it can cause recover to succeed when it should not.


> * On systems that "use a leading underscore on symbol names", the test
> for functions beginning with "__go_" or "_go_" would yield "true" from user
> functions named "_go_..." (because the system adds one '_' and the patch
> strips it).

Yes.  We are already going to have trouble on such systems.  Really the library needs to learn which systems use a leading underscore and which do not.  This is actually available as __USER_LABEL_PREFIX__, and we should use that.


> * Wouldn't the new patch re-introduce the bug that
>
>   func foo(n int) {
>     if (n == 0) { recover(); } else { foo(0); }
>   }
>   func main() {
>     defer foo(1)
>     panic("...")
>   }
> 
>   would recover although it should not?

Hmmm, I hadn't fully internalized that issue.  Your new withoutRecoverRecursive test doesn't fail for me on x86_64.  I'll have to figure out why.


> * The code is even more expensive than the approach I had chosen because
> now it needs to fetch a two level backtrace instead of just one level
> (and probably each level is more expensive than the one 
> _Unwind_FindEnclosingFunc()).

Yes, but the expensive case only happens in the rare cases where either recover should not work or when the existing code has a false negative.  In the normal case, where recover is permitted and the existing code works, we save the FDE lookup.


> 2) The current checks for "return address + 16" may point into a
> different function, allowing recover() in weird situations.

It's a potential problem but I'm not too worried about it.


> "The return value of recover is nil if any of the following conditions holds:
>  ...
>  *recover was not called directly by a deferred function."
> 
> According to the spec, the following code should recover the panic but
> does not:
> 
>   func main() { defer foo(); panic("..."); }
>   func foo() { defer bar(); }
>   func bar() { recover(); }
> 
> Note that this is also also "broken" in Golang (well, at least in the old
> version that comes with Ubuntu).  This may be an effect of imprecise
> wording of the spec.

In this case, the call to recover in bar is supposed to return nil; it should not recover the panic.  If you read the paragraph before the one you quote, you will see that recover only returns non-nil if it was called by a function that was deferred before the call to panic.  In your example, the defer of bar happens after the call to panic.  The reason Go works this way is to that the deferred function foo can itself call a function that panics and recovers without that function being confused by the earlier panic, one that it may not know anything about.


> 4) __go_can_recover assumes that any call through libffi is allowed
> to recover.

Thanks for the example.  Does your patch fix this problem?

Comment 17 Dominik Vogt 2014-10-07 09:11:55 UTC

>> * Wouldn't the new patch re-introduce the bug that
>>
>>   func foo(n int) {
>>     if (n == 0) { recover(); } else { foo(0); }
>>   }
>>   func main() {
>>     defer foo(1)
>>     panic("...")
>>   }
>> 
>>   would recover although it should not?
>
> Hmmm, I hadn't fully internalized that issue.  Your new
> withoutRecoverRecursive test doesn't fail for me on x86_64.

I think you have implicitly fixed this issue by splitting functions that call recover() into two parts (i.e. main.foo and main.foo$recover).  So recursive calls originate from the ...$recover function and never match the return address check.  Well, maybe it was only a problem with tail recursion, i.e.

  func foo(n int) int {
    if (n == 0) { recover(); return 0; }
    return foo(0)
  }

Would be optimized to a loop, removing the function call, and then the return address check would trigger.  That's not possible with two function approach.  But if there's another criterion to allow recover that simply depends on the caller's name the problem might reappear.

>> * The code is even more expensive than the approach I had chosen because
>> now it needs to fetch a two level backtrace instead of just one level
>> (and probably each level is more expensive than the one 
>> _Unwind_FindEnclosingFunc()).
>
> Yes, but the expensive case only happens in the rare cases where
> either recover should not work or when the existing code has a
> false negative.

Hm, so the patch penalises platforms that cannot deal with the 16 byte window?

>>   func main() { defer foo(); panic("..."); }
>>   func foo() { defer bar(); }
>>   func bar() { recover(); }
>
> In this case, the call to recover in bar is supposed to return nil;
> it should not recover the panic.  If you read the paragraph before
> the one you quote, you will see that recover only returns non-nil
> if it was called by a function that was deferred before the call to
> panic.

I've read it but cannot see anything that would disallow recovery in this situation.  What exactly do you mean?

>> 4) __go_can_recover assumes that any call through libffi is allowed
>> to recover.
>
> Thanks for the example.  Does your patch fix this problem?

No, I just became aware of the problem when reviewing the code last week.

Comment 18 Ian Lance Taylor 2014-10-07 13:56:19 UTC

> Well, maybe it was only a problem with tail recursion, ....

Note that because Go programs expect predictable results from runtime.Callers and other stack backtracing functions, the Go frontend disables tail recursion (go_langhook_post_options in gcc/go/go-lang.c).


> Hm, so the patch penalises platforms that cannot deal with the 16 byte window?

Yes, but, recall that on your system almost all tests pass using the code that is in the tree today, before your patch.  The only tests that fail are the very challenging ones in recover.go, that stress test the panic/recover mechanism but are in no way representative of real code.  The normal tests all works fine.  So while there is a penalty, it is one that only occurs in rare cases.


>>>   func main() { defer foo(); panic("..."); }
>>>   func foo() { defer bar(); }
>>>   func bar() { recover(); }
>>
>> In this case, the call to recover in bar is supposed to return nil;
>> it should not recover the panic.  If you read the paragraph before
>> the one you quote, you will see that recover only returns non-nil
>> if it was called by a function that was deferred before the call to
>> panic.
>
>I've read it but cannot see anything that would disallow recovery in this
> situation.  What exactly do you mean?

The spec says "Suppose a function G defers a function D that calls recover and a panic occurs in a function on the same goroutine in which G is executing."  The order is 1) G defers D; 2) a panic occurs.  In your example above, this applies to main defers foo and then a panic occurs.  It does not apply to foo defers bar, because there is no panic after foo defers bar (the panic has already occurred--that is why we are executing foo).  Since there is no panic, the recover in bar returns nil.

The recover.go file tests this pattern in, e.g., test1WithClosures.

Comment 19 Ian Lance Taylor 2014-10-07 18:25:05 UTC

Created attachment 33661 [details]
possible patch 2

Here is a new possible patch.  This time I tested it on a PPC GNU/Linux machine.  The problem was a mishandling of the stack walk when using MakeFunc with FFI closures.  This passes on both x86_64 and PPC.  It also address Dominik's case in which there is a spurious recover when one MakeFunc function calls another.

Please review this patch and let me know if this fails on your system.  Thanks.

Comment 20 boger 2014-10-07 21:15:13 UTC

The latest patch worked on ppc64 for LE & BE.  That is, the testcase recover.go now works and no new regressions were introduced.

Comment 21 Uroš Bizjak 2014-10-08 05:48:58 UTC

(In reply to boger from comment #20)
> The latest patch worked on ppc64 for LE & BE.  That is, the testcase
> recover.go now works and no new regressions were introduced.

Also works on alpha [1] without new regressions.

[1] https://gcc.gnu.org/ml/gcc-testresults/2014-10/msg00840.html

Comment 22 Dominik Vogt 2014-10-08 08:12:21 UTC

>> Hm, so the patch penalises platforms that cannot deal with the
>> 16 byte window?

> Yes, but, recall that on your system almost all tests pass using the
> code that is in the tree today, before your patch.

But they really only succeed "by accident".  The call for any platform might introduce control flow into the thunk and trigger the problem in any of the test cases.

> The spec says "Suppose a function G defers a function D that calls
> recover and a panic occurs in a function on the same goroutine in
> which G is executing."  The order is 1) G defers D; 2) a panic occurs.
> ...

I'd still not understand it that way, but from other documentation the
intention is "clear".  Maybe another bullet could be added to the list
below:

  "The return value of recover is nil if ...
    
    ...
    * defer was called when the panic already existed"

Comment 23 Dominik Vogt 2014-10-08 09:34:27 UTC

Regarding the new patch:

1) We need to call __builtin_extract_return_address(retaddr) in __go_set_defer_retaddr() too.  (On s390 (31 bit) this is necessary to remove the most significant bit of the address which indicates the addressing mode.)

I.e.

--snip--
-    g->defer->__retaddr = retaddr;
+    g->defer->__retaddr = __builtin_extract_return_addr (retaddr);
--snip--


2) The new patch does not compile for me:

--
In file included from ../../../libgo/runtime/go-check-interface.c:8:0:
../../../libgo/runtime/go-panic.h:43:13: error: conflicting types for ‘__go_makefunc_can_recover’
 extern void __go_makefunc_can_recover (void *retaddr);
             ^
In file included from ../../../libgo/runtime/go-check-interface.c:7:0:
../../../libgo/runtime/runtime.h:845:9: note: previous declaration of ‘__go_makefunc_can_recover’ was here
 void    __go_makefunc_can_recover (const void *);
         ^
--

In runtime.h, __go_makefunc_can_recover still has a "const" argument:

--snip--
-void    __go_makefunc_can_recover (const void *);
+void    __go_makefunc_can_recover (void *);
--snip--


3) Couldn't this be written more efficiently by passing a flag?

+  /* If we are called from __go_makefunc_can_recover, then we need to
+     look one level higher.  */
+  if (locs[0].function.len > 0
+      && __builtin_strcmp ((const char *) locs[0].function.str,
+			   "__go_makefunc_can_recover") == 0)

E.g.

  _Bool __go_can_recover (void *retaddr, _Bool called_by_makefunc_can_recover)
  {
    ...
    if (called_by_makefunc_can_recover)
      { do something }
    else
      { do something else }
    ...
  }


4) With the suggested changes from 1 and 2 above:

s390x (64 bit):

* PASS: test/recover.go
* the test from #4 in my earlier posting works as expected
* my private defer/recover/panic testsuite works as expected

s390 (31 bit):

* PASS: test/recover.go
* the test from #4 in my earlier posting works as expected
* my private defer/recover/panic testsuite works as expected


5) I find the assumption in the loop at the end of __go_can_recover() that if a caller's name begins with "__go_" that means the panic can be recovered, a bit hairy.  Even if it is with the current libgo, an unrelated change elsewhere could break this logic.

Comment 24 Dominik Vogt 2014-10-08 10:12:59 UTC

> --snip--
> -    g->defer->__retaddr = retaddr;
> +    g->defer->__retaddr = __builtin_extract_return_addr (retaddr);
> --snip--

We need to double check whether this would break Sparc.

Comment 25 Ian Lance Taylor 2014-10-08 13:28:35 UTC

(In reply to Dominik Vogt from comment #22)
> >> Hm, so the patch penalises platforms that cannot deal with the
> >> 16 byte window?
> 
> > Yes, but, recall that on your system almost all tests pass using the
> > code that is in the tree today, before your patch.
> 
> But they really only succeed "by accident".  The call for any platform might
> introduce control flow into the thunk and trigger the problem in any of the
> test cases.

Yes, I understand that.

You suggested that my patch penalized all cases where we have a control flow problem, because it will do a more costly unwind step.  That is true.  However, in practice, the case arises very rarely.  We know that the code in the thunk is always very simple:

    if (__go_set_defer_retaddr (&&L))
      goto L;
    real_function(args);
  L:

The compiler is not going to start randomly throwing additional control flow statements into that function.  The cases where it does do that are the cases where args is very large, so that it uses a block copy to copy them onto the stack.  But that essentially never happens.  The args here are the args in "defer real_function(args)".  90% of the time there are no arguments at all.  9.9% of the time it's just a couple of pointer/scalar arguments.  In real code nobody ever calls defer with large arguments, because it's just a strange way to write; people write a function closure instead.  The only code in the standard library that calls defer with a large argument is the recover.go test, to make sure that it works.

It's necessary to handle all cases correctly.  But there is nothing wrong with using an efficient mechanism when it works, as long as we can correctly fall back to a more expensive mechanism when it fails.  I believe that my patch does that.

If you like, we can incorporate your patch too, as long as it is only an addition to the existing code.  Before calling runtime_callers, we can call _Unwind_FindEnclosingFunction.  Then we need only go on to the runtime_callers code if _Unwind_FindEnclosingFunction returns NULL, or in the cases where d->__makefunc_can_recover is true.

Comment 26 Ian Lance Taylor 2014-10-08 13:44:06 UTC

(In reply to Dominik Vogt from comment #23)
> 
> 1) We need to call __builtin_extract_return_address(retaddr) in
> __go_set_defer_retaddr() too.  (On s390 (31 bit) this is necessary to remove
> the most significant bit of the address which indicates the addressing mode.)
> 
> I.e.
> 
> --snip--
> -    g->defer->__retaddr = retaddr;
> +    g->defer->__retaddr = __builtin_extract_return_addr (retaddr);
> --snip--

Will do.


> 2) The new patch does not compile for me:
> 
> --
> In file included from ../../../libgo/runtime/go-check-interface.c:8:0:
> ../../../libgo/runtime/go-panic.h:43:13: error: conflicting types for
> ‘__go_makefunc_can_recover’
>  extern void __go_makefunc_can_recover (void *retaddr);
>              ^
> In file included from ../../../libgo/runtime/go-check-interface.c:7:0:
> ../../../libgo/runtime/runtime.h:845:9: note: previous declaration of
> ‘__go_makefunc_can_recover’ was here
>  void    __go_makefunc_can_recover (const void *);
>          ^
> --
> 
> In runtime.h, __go_makefunc_can_recover still has a "const" argument:
> 
> --snip--
> -void    __go_makefunc_can_recover (const void *);
> +void    __go_makefunc_can_recover (void *);
> --snip--

I think that must be a local change in your tree.  In my tree runtime.h does not declare __go_makefunc_can_recover.


> 3) Couldn't this be written more efficiently by passing a flag?
> 
> +  /* If we are called from __go_makefunc_can_recover, then we need to
> +     look one level higher.  */
> +  if (locs[0].function.len > 0
> +      && __builtin_strcmp ((const char *) locs[0].function.str,
> +			   "__go_makefunc_can_recover") == 0)
> 
> E.g.
> 
>   _Bool __go_can_recover (void *retaddr, _Bool
> called_by_makefunc_can_recover)
>   {
>     ...
>     if (called_by_makefunc_can_recover)
>       { do something }
>     else
>       { do something else }
>     ...
>   }

Yes, but I would rather do that later if it seems useful.  It seems silly to change the compiler to always pass an extra argument that will always be false.  Introducing a new function called by both __go_can_recover and __go_makefunc_can_recover changes the stack layout depending on the compiler's inlining choices, so must be treated with care.  And it's worth remembering that this case never ever happens outside of tests.  It only happens when somebody constructs a function using reflect.MakeFunc and then defers a call to that function.  That is a bizarre way to code and I am confident that absolutely no real Go code does it.  Making that case slightly less efficient is not important.


> 4) With the suggested changes from 1 and 2 above:
> 
> s390x (64 bit):
> 
> * PASS: test/recover.go
> * the test from #4 in my earlier posting works as expected
> * my private defer/recover/panic testsuite works as expected
> 
> s390 (31 bit):
> 
> * PASS: test/recover.go
> * the test from #4 in my earlier posting works as expected
> * my private defer/recover/panic testsuite works as expected

Thanks for testing.


> 5) I find the assumption in the loop at the end of __go_can_recover() that
> if a caller's name begins with "__go_" that means the panic can be
> recovered, a bit hairy.  Even if it is with the current libgo, an unrelated
> change elsewhere could break this logic.

I agree that it's imperfect.  However, it's fairly difficult to construct a case that causes the wrong thing to happen.  No Go function can ever start with __go.  Very little code in libgo calls a function written in Go; it's easy to find such code because it must call __go_set_closure (the defer thunks are a special case that are known to have no closure).  So it can only happen if somebody writes a Go function that calls recover, and then passes it to C code, and that C code names a function starting with "__go_" and then calls the Go function.  And this has to happen from a deferred function called while there is a panic in progress.  The result will be that the function called by the C code will incorrectly recover the panic.

Comment 27 ian@gcc.gnu.org 2014-10-08 14:03:44 UTC

Author: ian
Date: Wed Oct  8 14:03:13 2014
New Revision: 216003

URL: https://gcc.gnu.org/viewcvs?rev=216003&root=gcc&view=rev
Log:
	PR go/60406
runtime: Check callers in can_recover if return address	doesn't match.

Also use __builtin_extract_return_address and tighten up the
checks in FFI code.

Fixes PR 60406.

Modified:
    trunk/libgo/go/reflect/makefunc_ffi_c.c
    trunk/libgo/runtime/go-defer.c
    trunk/libgo/runtime/go-panic.h
    trunk/libgo/runtime/go-recover.c
    trunk/libgo/runtime/panic.c

Comment 28 Ian Lance Taylor 2014-10-08 14:15:15 UTC

Fixed on trunk.

Comment 29 Dominik Vogt 2014-10-28 08:50:32 UTC

Thanks!

> It's necessary to handle all cases correctly.  But there is nothing wrong
> with using an efficient mechanism when it works, as long as we can correctly
> fall back to a more expensive mechanism when it fails.  I believe that my
> patch does that.

I'm fine with that.  But couldn't this also happen when copying multiple arguments to the stack?  E.g. if you have an array A and call the function with something like

  defer(foo(A[0], A[1], ... A[N])

gcc could turn that into a loop for a sufficiently large N?  This may be also an unlikely case, but a test for this wouldn't hurt either.

> If you like, we can incorporate your patch too, as long as it is only an
> addition to the existing code.  Before calling runtime_callers, we can call
> _Unwind_FindEnclosingFunction.  Then we need only go on to the
> runtime_callers code if _Unwind_FindEnclosingFunction returns NULL, or in the
> cases where d->__makefunc_can_recover is true.

Given that my patch fails to work with indirect function pointers (Power) that does not seem to be very helpful.