Although the following code uses symmetric transfer, it crashes due to a stack-overflow. The crash is also reproducible when using the task<> type of the cppcoro library. The crash does not occur when using clang.

```
// main.cc
#include <coroutine>
#include <exception>

class Task {
 public:
  struct promise_type {
    Task get_return_object() { return Handle::from_promise(*this); }

    struct FinalAwaitable {
      bool await_ready() const noexcept { return false; }

      // Use symmetric transfer. Resuming coro.promise().m_continuation should
      // not require extra stack space
      std::coroutine_handle<> await_suspend(
          std::coroutine_handle<promise_type> coro) noexcept {
        if (coro.promise().m_continuation) {
          return coro.promise().m_continuation;
        } else {
          // The top-level task started from within main() does not have a
          // continuation. This will give control back to the main function.
          return std::noop_coroutine();
        }
      }

      void await_resume() noexcept {}
    };

    std::suspend_always initial_suspend() noexcept { return {}; }

    FinalAwaitable final_suspend() noexcept { return {}; }

    void unhandled_exception() noexcept { std::terminate(); }

    void set_continuation(std::coroutine_handle<> continuation) noexcept {
      m_continuation = continuation;
    }

    void return_void() noexcept {}

   private:
    std::coroutine_handle<> m_continuation;
  };

  using Handle = std::coroutine_handle<promise_type>;

  Task(Handle coroutine) : m_coroutine(coroutine) {}

  ~Task() {
    if (m_coroutine) {
      m_coroutine.destroy();
    }
  }

  void start() noexcept { m_coroutine.resume(); }

  auto operator co_await() const noexcept { return Awaitable{m_coroutine}; }

 private:
  struct Awaitable {
    Handle m_coroutine;

    Awaitable(Handle coroutine) noexcept : m_coroutine(coroutine) {}

    bool await_ready() const noexcept { return false; }

    // Use symmetric transfer. Resuming m_coroutine should not require extra
    // stack space
    std::coroutine_handle<> await_suspend(
        std::coroutine_handle<> awaitingCoroutine) noexcept {
      m_coroutine.promise().set_continuation(awaitingCoroutine);
      return m_coroutine;
    }

    void await_resume() {}
  };

  Handle m_coroutine;
};

Task inner() { co_return; }

Task outer() {
  // Use large number of iterations to trigger stack-overflow
  for (int i = 0; i != 50000000; ++i) {
    co_await inner();
  }
}

int main() {
  auto task = outer();
  task.start();
}
```

I compile the code with `g++-11 main.cc -std=c++20 -O3 -fsanitize=address`. Here is the output:

```
$ ./a.out
AddressSanitizer:DEADLYSIGNAL
=================================================================
==21002==ERROR: AddressSanitizer: stack-overflow on address 0x7fffc666dff8 (pc 0x7f6ec2dfa16d bp 0x7fffc666e870 sp 0x7fffc666e000 T0)
    #0 0x7f6ec2dfa16d in __sanitizer::BufferedStackTrace::UnwindImpl(unsigned long, unsigned long, void*, bool, unsigned int) ../../../../src/libsanitizer/asan/asan_stack.cpp:57
    #1 0x7f6ec2df00eb in __sanitizer::BufferedStackTrace::Unwind(unsigned long, unsigned long, void*, bool, unsigned int) ../../../../src/libsanitizer/sanitizer_common/sanitizer_stacktrace.h:122
    #2 0x7f6ec2df00eb in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:160
    #3 0x560193552e57 in _Z5innerv.destroy(inner()::_Z5innerv.frame*) (/home/leonard/Desktop/hiwi/async_io_uring/stack-overflow/a.out+0x1e57)
    #4 0x560193553b30 in _Z5outerv.actor(outer()::_Z5outerv.frame*) (/home/leonard/Desktop/hiwi/async_io_uring/stack-overflow/a.out+0x2b30)
    #5 0x560193552bbb in _Z5innerv.actor(inner()::_Z5innerv.frame*) (/home/leonard/Desktop/hiwi/async_io_uring/stack-overflow/a.out+0x1bbb)
...
```
Symmetric transfer relies on an indirect tail call to work without stack overflow. On some GCC targets, indirect tail calls are not available without some additional support (see PR94794).

I tried to reproduce this (with a test case I use regularly for this) on a target that normally completes symmetric transfers successfully when the optimisation level is > 1 (x86_64, darwin).

The failure also occurs with my regular test case with -fsanitize=address - so it seems that the inclusion of the address sanitiser is preventing or interfering with the tail call. Note that there are also other known issues with coroutines and the sanitizers (PR95137).
(In reply to Iain Sandoe from comment #1)
> for symmetric transfer to work without stack overflow, it relies on an
> indirect tailcall.
>
> For some GCC targets indirect tail-calls are not available without some
> additional support (see PR94794).
>
> I tried to reproduce this (with a test case I use regularly for this) on a
> target that normally completes symmetric transfers successfully when the
> optimisation level is > 1. (x86_64, darwin).
>
> The fail also occurs with my regular test case with -fsanitize=address - so,
> it seems that the inclusion of the address sanitiser is preventing or
> interfering with the tailcall. Note that there are also other known issues
> with coroutines and the sanitizers (PR95137).

Thank you for your comment. I tried it out and can confirm that I don't get a stack-overflow anymore if I omit -fsanitize=address and use an optimization level > 1.

If the issues with coroutines and sanitizers are already known, then this bug report can be marked as resolved.

Of course, it would be nice if the stack-overflow would not occur even when using an optimization level <= 1, but this probably does not qualify as a bug.
(In reply to Leonard von Merzljak from comment #2)
> (In reply to Iain Sandoe from comment #1)
> Thank you for your comment. I tried it out and can confirm that I don't get
> a stack-overflow anymore if I omit -fsanitize=address and use an
> optimization level > 1.

So that's a workaround (on platforms that support indirect tail calls at all).

> If the issues with coroutines and sanitizers are
> already known, then this bug report can be marked as resolved.

For the present, I will leave this open - until (at least) there's a chance to confirm the hypothesis and determine if the problems are the same ones as mentioned in other PRs.

> Of course, it would be nice if the stack-overflow would not occur even when
> using an optimization level <= 1, but this probably does not qualify as a
> bug.

Note that the inability to support indirect tail calls is not usually a failing in GCC; rather, some platform ABIs cannot support it (e.g. because they require initialisation of some per-DSO data).

For platforms that support indirect tail calls, it is actually feasible to support symmetric transfer at O0 (at least as per my local testing) - the front end can demand a tail call "for correctness". The issue is that coroutines are not a target-specific implementation, and therefore demanding the tail call will cause compile failures on targets that cannot support it. Of course, one can argue that the code will *probably* fail on those targets if arbitrary recursion is needed - but it was decided not to make this demand until a solution is found for supporting continuations on all targets.

JFTR, my outline sketch for this would be to allocate some area in the coroutine frame that is reserved for target-specific continuation support, and then to use a builtin to implement the continuation rather than relying on the indirect tail call mechanism.
Just wanted to add: if any space is used in the coroutine frame to fix this, make sure it's in the target coroutine frame, not the frame of the coroutine that just suspended. Otherwise you'll end up with the same broken behavior MSVC suffered from, where any attempt to destroy the just-suspended coroutine (either in the same thread or a different thread) before returning the continuation from await_suspend resulted in a use-after-free. They only recently fixed it, and I'm not sure what their solution was.

I am not convinced tail calls are truly necessary to implement symmetric transfer, though. Another option is having two versions of .resume(): the public-facing version that works as expected, and a private version that does the actual resuming and just returns the next coroutine handle. The public-facing version can call the private one in a loop, overwriting the coroutine handle on the stack and stopping when it reaches the noop coroutine. However, this would be an ABI break if it's not already implemented this way. It's also less efficient than the tail-call version, because it has to branch on every iteration to decide whether to exit the loop. Maybe that can be avoided with some magic in the noop coroutine's resume implementation, though.