40124 – Inline asm should support limited control flow

Bug 40124 - Inline asm should support limited control flow

Summary: Inline asm should support limited control flow

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	inline-asm (show other bugs)
Version:	unknown

Importance:	P3 enhancement
Target Milestone:	4.5.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-05-12 16:00 UTC by Ryan Johnson
Modified:	2009-11-21 09:13 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Ryan Johnson 2009-05-12 16:00:58 UTC

Right now gcc does not officially support local control flow of any kind. The user is free to jump around within an asm block using local labels but this is cumbersome and bug-prone. Jumping out of an asm block in any way is (un?)officially verboten. 

This RFE is for support of jumping from an asm block to a label defined somewhere in the enclosing function. It does not deal with jumps between asm blocks or non-local control flow like function calls (both of which are understandably nasty).

Rationale:

1. There is demand for this feature. MSVC supports it (see http://msdn.microsoft.com/en-us/library/aa279405(VS.60).aspx), and people currently kludge it in gcc by defining assembler labels in other asm blocks and jumping to them, even though the gcc docs explicitly forbid it (e.g. http://stackoverflow.com/questions/744055/gcc-inline-assembly-jump-to-label-outside-block). Jumping to a C label is less error-prone, more general, and significantly easier to implement than asm-asm jumps (see #4).

2. It's already implemented. The assembler label corresponding to a C label is accessible like this -- asm("jmp %l0" ::"i"(&&label)) -- where "%l" emits the label "with the syntax expected of a jump target" (Brennan's inline asm guide). All that's left is for the compiler to mark such labels as used-by-computed-goto so the optimizer plays nice; even that can be faked with judicious use of computed gotos (see example code, below).

3. Its effect is basically the same as for a computed goto, which is already supported. I suspect that every argument that jumping from asm to C is inherently unsafe or difficult to implement applies equally well to computed gotos (e.g. Bug #37722).

4. It would encourage smaller asm blocks because internal control flow would seldom be necessary any more. The resulting code would be easier to write and debug, more portable, and less fragile. Further, the example below suggests that the compiler may generate better code as well. 

Example: Efficient use of condition codes

Currently the only way to use condition codes generated within an asm block is to either convert them to an output value (which requires extra instructions to create and test) or to embed the if/else inside the asm block (forcing the user to write extra assembly code even if it was otherwise portable). Support for jumping from asm to local labels would allow a better idiom:

void handle_overflow(int a, int b, int* dest);

int ideal_foo(int a, int b) {
    int rval = a;
    asm("adc    %[src], %[dest]\n\t"
        "jno    %l[no_overflow]"
        : [dest] "+r"(rval)
        : [src] "r"(b), [no_overflow] "i"(&&done));
        handle_overflow(a, b, &rval);
 done:
    return rval;
}

int current_foo(int a, int b) {
    int overflow = 0;
    int rval = a;
    asm("adc    %[src], %[dest]\n\t"
        "cmovo  %[true], %[overflow]"
        : [dest] "+r"(rval), [overflow] "+r"(overflow)
        : [src] "r"(b), [true] "r"(1));
    if(overflow)
        handle_overflow(a, b, &rval);
    return rval;
}

int hacked_foo(int a, int b) {
    static void* volatile labels[] = {&&overflow, &&no_overflow};
    int rval=a;
    asm("adc    %[src], %[dest]\n\t"
        "jno    %l[no_overflow]\n\t"
        "jmp    %l[overflow]"
        : [dest] "=r"(rval)
        : [src] "r"(b), "[dest]"(rval),
        [no_overflow] "i"(&&no_overflow), [overflow] "i"( &&overflow));
    goto *labels[0];
 overflow:
        handle_overflow(a, b, &rval);
 no_overflow:
    return rval;
}


Extracts of x86_64 assembler output for "x86_64-unknown-linux-gnu-gcc-4.2.2 -S -O3" follows. Notice that ideal_foo is clearly the best, *except* that the optimizer broke it by moving .L9 to the top of the loop (it should be just above the addq). If it were optimized better it would take only four instructions on the short path. hacked_foo would also beat current_foo by quite a bit, except the compiler decided to set up a stack frame for it:


ideal_foo:
.L9:
        subq    $16, %rsp
        movl    %edi, %eax
        leaq    12(%rsp), %rdx
        movl    %edi, 12(%rsp)
#APP
        adc     %esi, %eax
        jno     .L9
#NO_APP
        movl    %eax, 12(%rsp)
        call    handle_overflow
        movl    12(%rsp), %eax
        addq    $16, %rsp
        ret


current_foo:
        pushq   %rbx
        xorl    %eax, %eax
        movl    $1, %edx
        movl    %eax, %ebx
        movl    %edi, %ecx
        subq    $16, %rsp
#APP
        adc     %esi, %ecx
        cmovo   %edx, %ebx
#NO_APP
        testl   %ebx, %ebx
        movl    %edi, 12(%rsp)
        movl    %ecx, %eax
        movl    %ecx, 12(%rsp)
        je      .L12
        leaq    12(%rsp), %rdx
        call    handle_overflow
        movl    12(%rsp), %eax
.L12:
        addq    $16, %rsp
        popq    %rbx
        ret


hacked_foo:
        movq    %rbx, -16(%rsp)
        movq    %rbp, -8(%rsp)
        subq    $32, %rsp
        movl    %edi, 12(%rsp)
        movl    %edi, %eax
#APP
        adc     %esi, %eax
        jno     .L4
        jmp     .L5
#NO_APP
        movl    %eax, 12(%rsp)
        movq    labels.1894(%rip), %rax
        jmp     *%rax
        .p2align 4,,7
.L5:
        leaq    12(%rsp), %rdx
        call    handle_overflow
.L4:
        movl    12(%rsp), %eax
        movq    16(%rsp), %rbx
        movq    24(%rsp), %rbp
        addq    $32, %rsp
        ret

Comment 1 Andrew Pinski 2009-05-12 16:05:32 UTC

GCC can already produce good code for overflow catching on x86.  I think it is better if you don't use inline-asm for this case at all and just write the overflow detection in C90/C99 (which means don't cause an overflow).

Comment 2 Ryan Johnson 2009-05-12 16:13:20 UTC

Overflow and adc were only examples. Other instructions that set cc, or other conditions (e.g. parity) would not have that optimization.

Another use is the ability to jump out of an inline asm to handle an uncommon case (if writing hand-tuned asm for speed, for example).

Comment 3 H.J. Lu 2009-05-12 16:23:19 UTC

We are investigating if we need to add non-SSE intrinsics into gcc.
If you can show me an example where gcc can't generate similar
code automatically vs inline asm statement, I will look into if
we should add some appropriate intrinsics.

Comment 4 Ryan Johnson 2009-05-12 16:36:01 UTC

I'm actually running sparcv9-sun-solaris2.10 (the examples used x86 because more people know it and its asm is easier to read).

My use case is the following: I'm implementing high-performance synchronization primitives and the compiler isn't generating good enough code -- partly because it doesn't pipeline spinloops, and partly because it has no way to know what stuff is truly critical path and what just needs to happen "eventually."

Here's a basic idea of what I've been looking at:

long mcs_lock_acquire(mcs_lock* lock, mcs_qnode* me) {
 again:
    /* initialize qnode, etc */
    membar_producer();
    mcs_qnode* pred = atomic_swap(&lock->tail, me);
    if(pred) {
        pred->next = me;
        while(int flags=me->wait_flags) {
            if(flags & ERROR) {
                /* recovery code */
                goto again;
            }
        }
    }
    membar_enter();
    return (long) pred;
}

This code is absolutely performance-critical because every instruction on the critical path delays O(N) other threads -- even a single extra load or store causes noticeable delays. I was trying to rewrite just the while loop above in asm to be more efficient, but it is hard because of that goto inside. Basically, the error isn't going anywhere once it shows up, so we don't have to check it nearly as often as the flags==0 case, and it can be interleaved across as many loop iterations as needed to make its overhead disappear. Manually unrolling and pipelining the loop helped a bit, but the compiler still tended to cluster things together more than was strictly necessary (leading to bursts of saturated pipeline alternating with slack).

For CC stuff, especially x86-related, I bet places like fftw and gmp are good sources of frustration to mine.

Comment 5 H.J. Lu 2009-05-12 16:42:22 UTC

(In reply to comment #4)
> I'm actually running sparcv9-sun-solaris2.10 (the examples used x86 because
> more people know it and its asm is easier to read).
> 
> My use case is the following: I'm implementing high-performance synchronization
> primitives and the compiler isn't generating good enough code -- partly because
> it doesn't pipeline spinloops, and partly because it has no way to know what
> stuff is truly critical path and what just needs to happen "eventually."
> 

I think intrinsic is a better solution than inline asm statement for
this. If you find any x86 intrinsics are missing, please let me know.

Comment 6 Andrew Pinski 2009-05-12 16:48:27 UTC

Sounds like you want to use __builtin_expect.

Also GCC uses heuristics to figure out:
    if(pred)
is taken most of the time because it is comparing a pointer to NULL.

Comment 7 Ryan Johnson 2009-05-12 17:01:19 UTC

Isn't __builtin_expect a way to send branch prediction hints? I'm not having trouble with that AFAIK.

Comment 8 Andrew Pinski 2009-05-12 17:08:37 UTC

(In reply to comment #7)
> Isn't __builtin_expect a way to send branch prediction hints?

Not just prediction hints, it helps other optimizations of not moving code around inter block which seems like what you need.

Maybe it is better to file a bug about the code you want optimized and the way you want it optimized instead of asking for another feature.

Comment 9 Ryan Johnson 2009-05-13 07:55:53 UTC

RE: __builtin_expect -- Thanks! It did help quite a bit, even though the compiler was already emitting not-taken branch hints on its own.

RE: Filing bugs -- I have. This RFE arose out of Bug #40078, which was triggered by attempts to work around Bug #40067. I still have some issues with overconservative use of branch delay slots and possibly loop pipelining, but haven't gotten to filing them yet. I've also filed other bugs in the past where it would have been nice to work around using inline asm but control flow was a pain.

In the end, is there any particular reason *not* to make inline asm easier to use and more transparent to the compiler, given points #1 and #2? Invoking point #3, what significant uses of computed gotos exist, other than to work around switch statements that compile suboptimally? The docs don't mention any, and yet we have them instead of (or in addition to) bug reports. 

I'd take a stab at implementing this myself -- it's probably a one-liner -- but I've never hacked gcc before and have no clue where that one line might lurk. 

BTW, how does one exploit the compiler's overflow catching? I tried testing a+b < a and a+b < b (for unsigned ints) with no luck, and there's no __builtin test for overflow or carry.

Comment 10 Jakub Jelinek 2009-05-13 08:19:31 UTC

Of course there is a very important reason.  If you allow inline asms to change control flow, even just to labels whose address has been taken through &&label,
you penalize a lot of code which doesn't change the control flow, as the compiler will have to assume each inline asm which could possibly get at an label address (not just directly, but through global variables, pointers etc.) can jump to it.  That's an awful lot of extra flow graph edges and optimization limitations.  Unless you'd invent some new syntax to say if an inline asm can change control flow and by default assume it can't.

Comment 11 Ryan Johnson 2009-05-13 09:51:22 UTC

> If you allow inline asms to change control flow, even just 
> to labels whose address has been taken through &&label, you 
> penalize a lot of code which doesn't change the control 
> flow, as the compiler will have to assume each inline asm 
> which could possibly get at an label address (not just 
> directly, but through global variables, pointers etc.) can
> jump to it.

I'm going to invoke #3 again to respond to these concerns:

a. This RFE is specifically limited to local control flow only, so the compiler can safely ignore any label not in the asm's enclosing function, as well as labels whose addresses are never taken (or provably never used). Computed gotos appear to make the same assumptions, based on the docs' strong warning not allow labels to leak out of their enclosing function in any way. 

b. While it's always possible that an asm could jump to a value loaded from an arbitrary, dynamically-generated address, the same is true for computed gotos. Either way, compiler analysis or not, doing so would almost certainly send you to la-la land because label values aren't known until assembler time or later and have no guaranteed relationship with each other. The only way to get a valid label address is using one directly, or computing it with some sort of base+(label-base). Either way requires taking the address of the desired label at some point and tipping off the compiler.

c. It's pretty easy to write functions whose computed gotos defy static analysis, but most of the time the compiler does pretty well. Well-written asm blocks should access memory via "m" constraints -- which the compiler can analyze -- rather than manually dereferencing a pointer passed in with an "r" constraint. This is especially true for asm blocks with no internal control flow, which this RFE encourages. 

d. If a code path is short/simple enough that incoming jumps penalize it heavily (whether from computed gotos or jumps from asm), it's probably also small enough that the compiler (or programmer, if need be) can duplicate it for the short path. A big, ugly code path probably wouldn't even notice an extra control flow arc or two.

In the end, a big goal of this RFE is to allow programmers to make the compiler aware of control flow arcs they're already adding (or tempted to add) behind its back. It therefore wouldn't strike me as much of a limitation if "jumps to labels not explicitly passed to the asm are unsupported and may lead to undefined behavior."

Comment 12 Richard Biener 2009-05-13 09:58:15 UTC

I strongly oppose to making asm handling even more complex.

Comment 13 Andrew Pinski 2009-11-21 09:12:02 UTC

Reopening to ...

Comment 14 Andrew Pinski 2009-11-21 09:13:18 UTC

Mark this as fixed for 4.5:
http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Extended%20asm%20with%20gotoAs of GCC version 4.5, asm goto may be used to have the assembly jump to one or more C labels. In this form, a fifth section after the clobber list contains a list of all C labels to which the assembly may jump. Each label operand is implicitly self-named. The asm is also assumed to fall through to the next statement.

This form of asm is restricted to not have outputs. This is due to a internal restriction in the compiler that control transfer instructions cannot have outputs. This restriction on asm goto may be lifted in some future version of the compiler. In the mean time, asm goto may include a memory clobber, and so leave outputs in memory.

     int frob(int x)
     {
       int y;
       asm goto ("frob %%r5, %1; jc %l[error]; mov (%2), %%r5"
                 : : "r"(x), "r"(&y) : "r5", "memory" : error);
       return y;
      error:
       return -1;
     }
In this (inefficient) example, the frob instruction sets the carry bit to indicate an error. The jc instruction detects this and branches to the error label. Finally, the output of the frob instruction (%r5) is stored into the memory for variable y, which is later read by the return statement.

     void doit(void)
     {
       int i = 0;
       asm goto ("mfsr %%r1, 123; jmp %%r1;"
                 ".pushsection doit_table;"
     	    ".long %l0, %l1, %l2, %l3;"
     	    ".popsection"
     	    : : : "r1" : label1, label2, label3, label4);
       __builtin_unreachable ();
     
      label1:
       f1();
       return;
      label2:
       f2();
       return;
      label3:
       i = 1;
      label4:
       f3(i);
     }
In this (also inefficient) example, the mfsr instruction reads an address from some out-of-band machine register, and the following jmp instruction branches to that address. The address read by the mfsr instruction is assumed to have been previously set via some application-specific mechanism to be one of the four values stored in the doit_table section. Finally, the asm is followed by a call to __builtin_unreachable to indicate that the asm does not in fact fall through.

     #define TRACE1(NUM)                         \
       do {                                      \
         asm goto ("0: nop;"                     \
                   ".pushsection trace_table;"   \
                   ".long 0b, %l0;"              \
                   ".popsection"                 \
                   : : : : trace#NUM);           \
         if (0) { trace#NUM: trace(); }          \
       } while (0)
     #define TRACE  TRACE1(__COUNTER__)
In this example (which in fact inspired the asm goto feature) we want on rare occasions to call the trace function; on other occasions we'd like to keep the overhead to the absolute minimum. The normal code path consists of a single nop instruction. However, we record the address of this nop together with the address of a label that calls the trace function. This allows the nop instruction to be patched at runtime to be an unconditional branch to the stored label. It is assumed that an optimizing compiler will move the labeled block out of line, to optimize the fall through path from the asm.