.. Copyright (C) 2014-2024 Free Software Foundation, Inc. Originally contributed by David Malcolm This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . Tutorial part 4: Adding JIT-compilation to a toy interpreter ------------------------------------------------------------ In this example we construct a "toy" interpreter, and add JIT-compilation to it. Our toy interpreter ******************* It's a stack-based interpreter, and is intended as a (very simple) example of the kind of bytecode interpreter seen in dynamic languages such as Python, Ruby etc. For the sake of simplicity, our toy virtual machine is very limited: * The only data type is `int` * It can only work on one function at a time (so that the only function call that can be made is to recurse). * Functions can only take one parameter. * Functions have a stack of `int` values. * We'll implement function call within the interpreter by calling a function in our implementation, rather than implementing our own frame stack. * The parser is only good enough to get the examples to work. Naturally, a real interpreter would be much more complicated that this. The following operations are supported: .. list-table:: :header-rows: 1 :widths: 35 50 25 25 * - Operation - Meaning - Old Stack - New Stack * - DUP - Duplicate top of stack. - ``[..., x]`` - ``[..., x, x]`` * - ROT - Swap top two elements of stack. - ``[..., x, y]`` - ``[..., y, x]`` * - BINARY_ADD - Add the top two elements on the stack. - ``[..., x, y]`` - ``[..., (x+y)]`` * - BINARY_SUBTRACT - Likewise, but subtract. - ``[..., x, y]`` - ``[..., (x-y)]`` * - BINARY_MULT - Likewise, but multiply. - ``[..., x, y]`` - ``[..., (x*y)]`` * - BINARY_COMPARE_LT - Compare the top two elements on the stack and push a nonzero/zero if (x D.80; signed int D.81; signed int D.82; signed int D.83; signed int D.84; signed int D.85; signed int y; signed int x; signed int stack_depth; signed int stack[8]; try { initial: stack_depth = 0; stack[stack_depth] = arg; stack_depth = stack_depth + 1; goto instr0; instr0: /* DUP */: stack_depth = stack_depth + -1; x = stack[stack_depth]; stack[stack_depth] = x; stack_depth = stack_depth + 1; stack[stack_depth] = x; stack_depth = stack_depth + 1; goto instr1; instr1: /* PUSH_CONST */: stack[stack_depth] = 2; stack_depth = stack_depth + 1; goto instr2; /* etc */ You can see the generated machine code in assembly form via: .. code-block:: c gcc_jit_context_set_bool_option ( ctxt, GCC_JIT_BOOL_OPTION_DUMP_GENERATED_CODE, 1); result = gcc_jit_context_compile (ctxt); which shows that (on this x86_64 box) the compiler has unrolled the loop and is using MMX instructions to perform several multiplications simultaneously: .. code-block:: gas .file "fake.c" .text .Ltext0: .p2align 4,,15 .globl factorial .type factorial, @function factorial: .LFB0: .file 1 "factorial.toy" .loc 1 14 0 .cfi_startproc .LVL0: .L2: .loc 1 26 0 cmpl $1, %edi jle .L13 leal -1(%rdi), %edx movl %edx, %ecx shrl $2, %ecx leal 0(,%rcx,4), %esi testl %esi, %esi je .L14 cmpl $9, %edx jbe .L14 leal -2(%rdi), %eax movl %eax, -16(%rsp) leal -3(%rdi), %eax movd -16(%rsp), %xmm0 movl %edi, -16(%rsp) movl %eax, -12(%rsp) movd -16(%rsp), %xmm1 xorl %eax, %eax movl %edx, -16(%rsp) movd -12(%rsp), %xmm4 movd -16(%rsp), %xmm6 punpckldq %xmm4, %xmm0 movdqa .LC1(%rip), %xmm4 punpckldq %xmm6, %xmm1 punpcklqdq %xmm0, %xmm1 movdqa .LC0(%rip), %xmm0 jmp .L5 # etc - edited for brevity This is clearly overkill for a function that will likely overflow the ``int`` type before the vectorization is worthwhile - but then again, this is a toy example. Turning down the optimization level to 2: .. code-block:: c gcc_jit_context_set_int_option ( ctxt, GCC_JIT_INT_OPTION_OPTIMIZATION_LEVEL, 3); yields this code, which is simple enough to quote in its entirety: .. code-block:: gas .file "fake.c" .text .p2align 4,,15 .globl factorial .type factorial, @function factorial: .LFB0: .cfi_startproc .L2: cmpl $1, %edi jle .L8 movl $1, %edx jmp .L4 .p2align 4,,10 .p2align 3 .L6: movl %eax, %edi .L4: .L5: leal -1(%rdi), %eax imull %edi, %edx cmpl $1, %eax jne .L6 .L3: .L7: imull %edx, %eax ret .L8: movl %edi, %eax movl $1, %edx jmp .L7 .cfi_endproc .LFE0: .size factorial, .-factorial .ident "GCC: (GNU) 4.9.0 20131023 (Red Hat 0.2)" .section .note.GNU-stack,"",@progbits Note that the stack pushing and popping have been eliminated, as has the recursive call (in favor of an iteration). Putting it all together *********************** The complete example can be seen in the source tree at ``gcc/jit/docs/examples/tut04-toyvm/toyvm.c`` along with a Makefile and a couple of sample .toy scripts: .. code-block:: console $ ls -al drwxrwxr-x. 2 david david 4096 Sep 19 17:46 . drwxrwxr-x. 3 david david 4096 Sep 19 15:26 .. -rw-rw-r--. 1 david david 615 Sep 19 12:43 factorial.toy -rw-rw-r--. 1 david david 834 Sep 19 13:08 fibonacci.toy -rw-rw-r--. 1 david david 238 Sep 19 14:22 Makefile -rw-rw-r--. 1 david david 16457 Sep 19 17:07 toyvm.c $ make toyvm g++ -Wall -g -o toyvm toyvm.c -lgccjit $ ./toyvm factorial.toy 10 interpreter result: 3628800 compiler result: 3628800 $ ./toyvm fibonacci.toy 10 interpreter result: 55 compiler result: 55 Behind the curtain: How does our code get optimized? **************************************************** Our example is done, but you may be wondering about exactly how the compiler turned what we gave it into the machine code seen above. We can examine what the compiler is doing in detail by setting: .. code-block:: c gcc_jit_context_set_bool_option (state.ctxt, GCC_JIT_BOOL_OPTION_DUMP_EVERYTHING, 1); gcc_jit_context_set_bool_option (state.ctxt, GCC_JIT_BOOL_OPTION_KEEP_INTERMEDIATES, 1); This will dump detailed information about the compiler's state to a directory under ``/tmp``, and keep it from being cleaned up. The precise names and their formats of these files is subject to change. Higher optimization levels lead to more files. Here's what I saw (edited for brevity; there were almost 200 files): .. code-block:: console intermediate files written to /tmp/libgccjit-KPQbGw $ ls /tmp/libgccjit-KPQbGw/ fake.c.000i.cgraph fake.c.000i.type-inheritance fake.c.004t.gimple fake.c.007t.omplower fake.c.008t.lower fake.c.011t.eh fake.c.012t.cfg fake.c.014i.visibility fake.c.015i.early_local_cleanups fake.c.016t.ssa # etc The gimple code is converted into Static Single Assignment form, with annotations for use when generating the debuginfo: .. code-block:: console $ less /tmp/libgccjit-KPQbGw/fake.c.016t.ssa .. code-block:: c ;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0) factorial (signed int arg) { signed int stack[8]; signed int stack_depth; signed int x; signed int y; _20; signed int _21; signed int _38; signed int _44; signed int _51; signed int _56; initial: stack_depth_3 = 0; # DEBUG stack_depth => stack_depth_3 stack[stack_depth_3] = arg_5(D); stack_depth_7 = stack_depth_3 + 1; # DEBUG stack_depth => stack_depth_7 # DEBUG instr0 => NULL # DEBUG /* DUP */ => NULL stack_depth_8 = stack_depth_7 + -1; # DEBUG stack_depth => stack_depth_8 x_9 = stack[stack_depth_8]; # DEBUG x => x_9 stack[stack_depth_8] = x_9; stack_depth_11 = stack_depth_8 + 1; # DEBUG stack_depth => stack_depth_11 stack[stack_depth_11] = x_9; stack_depth_13 = stack_depth_11 + 1; # DEBUG stack_depth => stack_depth_13 # DEBUG instr1 => NULL # DEBUG /* PUSH_CONST */ => NULL stack[stack_depth_13] = 2; /* etc; edited for brevity */ We can perhaps better see the code by turning off :c:macro:`GCC_JIT_BOOL_OPTION_DEBUGINFO` to suppress all those ``DEBUG`` statements, giving: .. code-block:: console $ less /tmp/libgccjit-1Hywc0/fake.c.016t.ssa .. code-block:: c ;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0) factorial (signed int arg) { signed int stack[8]; signed int stack_depth; signed int x; signed int y; _20; signed int _21; signed int _38; signed int _44; signed int _51; signed int _56; initial: stack_depth_3 = 0; stack[stack_depth_3] = arg_5(D); stack_depth_7 = stack_depth_3 + 1; stack_depth_8 = stack_depth_7 + -1; x_9 = stack[stack_depth_8]; stack[stack_depth_8] = x_9; stack_depth_11 = stack_depth_8 + 1; stack[stack_depth_11] = x_9; stack_depth_13 = stack_depth_11 + 1; stack[stack_depth_13] = 2; stack_depth_15 = stack_depth_13 + 1; stack_depth_16 = stack_depth_15 + -1; y_17 = stack[stack_depth_16]; stack_depth_18 = stack_depth_16 + -1; x_19 = stack[stack_depth_18]; _20 = x_19 < y_17; _21 = (signed int) _20; stack[stack_depth_18] = _21; stack_depth_23 = stack_depth_18 + 1; stack_depth_24 = stack_depth_23 + -1; x_25 = stack[stack_depth_24]; if (x_25 != 0) goto (instr9); else goto (instr4); instr4: /* DUP */: stack_depth_26 = stack_depth_24 + -1; x_27 = stack[stack_depth_26]; stack[stack_depth_26] = x_27; stack_depth_29 = stack_depth_26 + 1; stack[stack_depth_29] = x_27; stack_depth_31 = stack_depth_29 + 1; stack[stack_depth_31] = 1; stack_depth_33 = stack_depth_31 + 1; stack_depth_34 = stack_depth_33 + -1; y_35 = stack[stack_depth_34]; stack_depth_36 = stack_depth_34 + -1; x_37 = stack[stack_depth_36]; _38 = x_37 - y_35; stack[stack_depth_36] = _38; stack_depth_40 = stack_depth_36 + 1; stack_depth_41 = stack_depth_40 + -1; x_42 = stack[stack_depth_41]; _44 = factorial (x_42); stack[stack_depth_41] = _44; stack_depth_46 = stack_depth_41 + 1; stack_depth_47 = stack_depth_46 + -1; y_48 = stack[stack_depth_47]; stack_depth_49 = stack_depth_47 + -1; x_50 = stack[stack_depth_49]; _51 = x_50 * y_48; stack[stack_depth_49] = _51; stack_depth_53 = stack_depth_49 + 1; # stack_depth_1 = PHI instr9: /* RETURN */: stack_depth_54 = stack_depth_1 + -1; x_55 = stack[stack_depth_54]; _56 = x_55; stack ={v} {CLOBBER}; return _56; } Note in the above how all the :c:type:`gcc_jit_block` instances we created have been consolidated into just 3 blocks in GCC's internal representation: ``initial``, ``instr4`` and ``instr9``. Optimizing away stack manipulation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Recall our simple implementation of stack operations. Let's examine how the stack operations are optimized away. After a pass of constant-propagation, the depth of the stack at each opcode can be determined at compile-time: .. code-block:: console $ less /tmp/libgccjit-1Hywc0/fake.c.021t.ccp1 .. code-block:: c ;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0) factorial (signed int arg) { signed int stack[8]; signed int stack_depth; signed int x; signed int y; _20; signed int _21; signed int _38; signed int _44; signed int _51; initial: stack[0] = arg_5(D); x_9 = stack[0]; stack[0] = x_9; stack[1] = x_9; stack[2] = 2; y_17 = stack[2]; x_19 = stack[1]; _20 = x_19 < y_17; _21 = (signed int) _20; stack[1] = _21; x_25 = stack[1]; if (x_25 != 0) goto (instr9); else goto (instr4); instr4: /* DUP */: x_27 = stack[0]; stack[0] = x_27; stack[1] = x_27; stack[2] = 1; y_35 = stack[2]; x_37 = stack[1]; _38 = x_37 - y_35; stack[1] = _38; x_42 = stack[1]; _44 = factorial (x_42); stack[1] = _44; y_48 = stack[1]; x_50 = stack[0]; _51 = x_50 * y_48; stack[0] = _51; instr9: /* RETURN */: x_55 = stack[0]; x_56 = x_55; stack ={v} {CLOBBER}; return x_56; } Note how, in the above, all those ``stack_depth`` values are now just constants: we're accessing specific stack locations at each opcode. The "esra" pass ("Early Scalar Replacement of Aggregates") breaks out our "stack" array into individual elements: .. code-block:: console $ less /tmp/libgccjit-1Hywc0/fake.c.024t.esra .. code-block:: c ;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0) Created a replacement for stack offset: 0, size: 32: stack$0 Created a replacement for stack offset: 32, size: 32: stack$1 Created a replacement for stack offset: 64, size: 32: stack$2 Symbols to be put in SSA form { D.89 D.90 D.91 } Incremental SSA update started at block: 0 Number of blocks in CFG: 5 Number of blocks to update: 4 ( 80%) factorial (signed int arg) { signed int stack$2; signed int stack$1; signed int stack$0; signed int stack[8]; signed int stack_depth; signed int x; signed int y; _20; signed int _21; signed int _38; signed int _44; signed int _51; initial: stack$0_45 = arg_5(D); x_9 = stack$0_45; stack$0_39 = x_9; stack$1_32 = x_9; stack$2_30 = 2; y_17 = stack$2_30; x_19 = stack$1_32; _20 = x_19 < y_17; _21 = (signed int) _20; stack$1_28 = _21; x_25 = stack$1_28; if (x_25 != 0) goto (instr9); else goto (instr4); instr4: /* DUP */: x_27 = stack$0_39; stack$0_22 = x_27; stack$1_14 = x_27; stack$2_12 = 1; y_35 = stack$2_12; x_37 = stack$1_14; _38 = x_37 - y_35; stack$1_10 = _38; x_42 = stack$1_10; _44 = factorial (x_42); stack$1_6 = _44; y_48 = stack$1_6; x_50 = stack$0_22; _51 = x_50 * y_48; stack$0_1 = _51; # stack$0_52 = PHI instr9: /* RETURN */: x_55 = stack$0_52; x_56 = x_55; stack ={v} {CLOBBER}; return x_56; } Hence at this point, all those pushes and pops of the stack are now simply assignments to specific temporary variables. After some copy propagation, the stack manipulation has been completely optimized away: .. code-block:: console $ less /tmp/libgccjit-1Hywc0/fake.c.026t.copyprop1 .. code-block:: c ;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0) factorial (signed int arg) { signed int stack$2; signed int stack$1; signed int stack$0; signed int stack[8]; signed int stack_depth; signed int x; signed int y; _20; signed int _21; signed int _38; signed int _44; signed int _51; initial: stack$0_39 = arg_5(D); _20 = arg_5(D) <= 1; _21 = (signed int) _20; if (_21 != 0) goto (instr9); else goto (instr4); instr4: /* DUP */: _38 = arg_5(D) + -1; _44 = factorial (_38); _51 = arg_5(D) * _44; stack$0_1 = _51; # stack$0_52 = PHI instr9: /* RETURN */: stack ={v} {CLOBBER}; return stack$0_52; } Later on, another pass finally eliminated ``stack_depth`` local and the unused parts of the `stack`` array altogether: .. code-block:: console $ less /tmp/libgccjit-1Hywc0/fake.c.036t.release_ssa .. code-block:: c ;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0) Released 44 names, 314.29%, removed 44 holes factorial (signed int arg) { signed int stack$0; signed int mult_acc_1; _5; signed int _6; signed int _7; signed int mul_tmp_10; signed int mult_acc_11; signed int mult_acc_13; # arg_9 = PHI # mult_acc_13 = PHI <1(0)> initial: : # arg_4 = PHI # mult_acc_1 = PHI _5 = arg_4 <= 1; _6 = (signed int) _5; if (_6 != 0) goto (instr9); else goto (instr4); instr4: /* DUP */: _7 = arg_4 + -1; mult_acc_11 = mult_acc_1 * arg_4; goto ; # stack$0_12 = PHI instr9: /* RETURN */: mul_tmp_10 = mult_acc_1 * stack$0_12; return mul_tmp_10; } Elimination of tail recursion ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Another significant optimization is the detection that the call to ``factorial`` is tail recursion, which can be eliminated in favor of an iteration: .. code-block:: console $ less /tmp/libgccjit-1Hywc0/fake.c.030t.tailr1 .. code-block:: c ;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0) Symbols to be put in SSA form { D.88 } Incremental SSA update started at block: 0 Number of blocks in CFG: 5 Number of blocks to update: 4 ( 80%) factorial (signed int arg) { signed int stack$2; signed int stack$1; signed int stack$0; signed int stack[8]; signed int stack_depth; signed int x; signed int y; signed int mult_acc_1; _20; signed int _21; signed int _38; signed int mul_tmp_44; signed int mult_acc_51; # arg_5 = PHI # mult_acc_1 = PHI <1(0), mult_acc_51(3)> initial: _20 = arg_5 <= 1; _21 = (signed int) _20; if (_21 != 0) goto (instr9); else goto (instr4); instr4: /* DUP */: _38 = arg_5 + -1; mult_acc_51 = mult_acc_1 * arg_5; goto (initial); # stack$0_52 = PHI instr9: /* RETURN */: stack ={v} {CLOBBER}; mul_tmp_44 = mult_acc_1 * stack$0_52; return mul_tmp_44; }