Quickstart Guide to Hacking Gfortran
Starting to help develop an existing compiler can be a daunting task. This document aims to give a new developer a foothold in the many lines of code. It is still very preliminary. Feel free to ask any of the gfortran regulars (or the gfortran mailing list) for advice or help.
Gfortran is a front end to gcc. Its task is to parse Fortran code and to convert it to an intermediate form, which is then handed off to other parts of gcc (the so-called "middle end") which do further optimization and finally the translation to the assembly language ("back end").
This document is only about the gfortran front end; the other parts of gcc have their own documentation. It assumes you know how to build gcc, including gfortran.
You will need to know C to work on gfortran. The language used in the front end is actually C++, but most of the C++ idioms are hidden behind macros; otherwise, it is mostly written in the compatible subset of C and C++.
What does the front end do?
The front end's action can be grouped into four phases:
- Parsing
- This converts source code into a stream of tokens which describe the language. Because Fortran does not have reserved keywords, gfortran runs a series of matchers against the code, trying to find one that matches a statement. When a match fails, an error message may be queued and another matcher tried. If all attempts at matching fail, the error queue is dumped to the user.
- (parse.cc, scanner.cc and primary.cc)
- Resolution
- This resolves things left over from the parsing phase, such as the types of expressions, and performs compile-time simplification of constants. Many errors are issued in this phase. At the end of this phase, the abstract syntax tree is finished.
- (resolve.cc and (for intrinsics) iresolve.cc, expr.cc, array.cc, interface.cc and simplify.cc)
- Front-end optimization
- This performs some optimizations, since the Fortran language carries information that cannot easily be handled by the later stages.
- (frontend-passes.cc)
- Translation
- This translates the Fortran abstract syntax tree into a tree structure suitable for the middle end.
- (trans-*.cc)
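The matcher strategy described under Parsing can be illustrated with a small, self-contained sketch. This is not gfortran code: the names MatchResult, Matcher, ParseOutcome and parse_statement are hypothetical and were chosen only for this illustration. It shows a list of matchers being tried in turn, with error messages queued and reported only if no matcher succeeds.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical result type for this sketch; not gfortran's own type.
enum class MatchResult { Yes, No, Error };

// A matcher inspects a statement. It returns Yes on success, No if the
// statement is something else, and Error (after filling `message`) if the
// statement looked like a candidate but was malformed.
struct Matcher {
  std::string name;
  std::function<MatchResult(const std::string &stmt, std::string &message)>
      try_match;
};

struct ParseOutcome {
  std::string matched_by;           // empty if no matcher succeeded
  std::vector<std::string> errors;  // filled only if every matcher failed
};

// Try each matcher in order. Errors are queued, not reported immediately,
// because a later matcher may still recognize the statement.
ParseOutcome parse_statement(const std::string &stmt,
                             const std::vector<Matcher> &matchers) {
  std::vector<std::string> queued;
  for (const Matcher &m : matchers) {
    std::string message;
    switch (m.try_match(stmt, message)) {
    case MatchResult::Yes:
      return {m.name, {}};  // success: the queued errors are discarded
    case MatchResult::Error:
      queued.push_back(message);
      break;
    case MatchResult::No:
      break;
    }
  }
  return {"", queued};  // nothing matched: hand the queue to the caller
}
```

The point of the design is that an error from one matcher is not final. It only reaches the user once every alternative has been exhausted.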
Examining gfortran data structures
There are a few useful options to look at gfortran internals. Compiling a file with gfortran -fdump-fortran-original foo.f90 dumps the internal representation of the Fortran abstract syntax tree to standard output. The code which generates output for this option can be found in dump-parse-tree.cc, which can serve as a good starting point for examining gfortran's data structures.
Using gfortran -fdump-tree-original foo.f90 will generate a file named a-foo.f90.004t.original which contains a C-like representation of what the compiler handed off to the middle end. Most code errors can be found from examining this file.
Another interesting option is -fdump-ipa-cgraph, which will dump information about the middle end's symbol table to a-foo.f90.000i.cgraph.
Some documentation on the data files can be found in the GNU Fortran Compiler Internals document. Additions to that document are highly welcome.
Using a debugger on the gfortran compiler
You need to run the debugger (usually gdb) on the f951 executable. This can be found in your gcc build directory. Assuming that this is ~/gcc-bin, the executable is in ~/gcc-bin/gcc/f951.
A good starting point is to run gdb with
$ gdb ~/gcc-bin/gcc/f951
(gdb) break show_expr
(gdb) run -fdump-fortran-original hello.f90
and then examine the expressions there. You can find some documentation on the gfc_code and gfc_expr structures you will encounter in the gfc-internals.texi file in the gfortran source directory.
Another interesting variable to look at is gfc_current_ns. It contains the code found under gfc_current_ns->code and symbols (i.e. variable names, functions etc.) found under gfc_current_ns->sym_root. This is a gfc_symtree pointer. Looking at the first symbol will require you to look at *(gfc_current_ns->sym_root->n.sym).
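To make the shape of the symbol tree easier to picture, here is a minimal sketch of an in-order walk over such a tree. The types mock_symbol, mock_payload and mock_symtree are hypothetical, heavily simplified stand-ins for the real structures. They assume only what is described above (a tree node that reaches its symbol through n.sym) plus left and right child pointers, which this sketch calls left and right.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the real gfortran structures.
struct mock_symbol {
  std::string name;
};

struct mock_payload {
  mock_symbol *sym;  // mirrors the n.sym access path
};

struct mock_symtree {
  mock_symtree *left;
  mock_symtree *right;
  mock_payload n;
};

// Visit the tree in order (left subtree, node, right subtree) and collect
// the name of every symbol that is present.
void collect_symbol_names(const mock_symtree *node,
                          std::vector<std::string> &out) {
  if (node == nullptr)
    return;
  collect_symbol_names(node->left, out);
  if (node->n.sym != nullptr)
    out.push_back(node->n.sym->name);
  collect_symbol_names(node->right, out);
}
```

In a debugger session, the same idea applies by hand: start at the root, print the symbol, then follow the child pointers.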
If you are looking for the source of a particular error, you can set a breakpoint in gfc_error. Be prepared for a large number of false positives, because the parser calls gfc_error frequently for constructs that it may recognize later. It may be a better idea to grep for the error message in the gfortran source files, and then set a breakpoint there.
If you want to inspect a particular internal data structure which is pointed to by a pointer p, a good first try is to use
(gdb) call debug(p)
on it. There are also some special functions like gfc_debug_code and gfc_debug_expr which you can also call. These functions can also be found in dump-parse-tree.cc.
If you want to extend the debugging facilities in dump-parse-tree.cc, feel free. This code has no user impact, so it can be extended easily.
Function inlining in the compiler can sometimes make the code hard to follow. One way to deal with that is to go into the gcc subdirectory and to edit the Makefile there to set all code to -O0 instead of -O2. Touch gfortran.h and recompile from there; the resulting compiler will be much easier to debug.
If you use -O0, don't forget to run your regression tests with normal optimization turned on, because the compiler will only flag some warnings for optimization levels above -O0.
If you use gdb, you can also replace the -g option by -ggdb3; this will make gdb recognize macros, which can be extremely useful (see below).
Examining tree structures
Let's say you are looking at the middle end code generated in a tree variable named stmt somewhere in trans-*.cc. The best way to look at this is to try
(gdb) call debug(stmt)
which will dump the code using the same internal representation as the -fdump-tree-original option. If the tree you are looking at contains a declaration, this will return an empty line. In this case, you can use
(gdb) call debug_tree(stmt)
which will return a complete, but somewhat hard to read, representation of the declaration.
If you have compiled the compiler with -ggdb3, you can also use the pre-defined macros to examine the data structures. Useful things to look at are TREE_CODE (foo), TREE_TYPE (foo) and, combined, TREE_CODE (TREE_TYPE (foo)). The name of a variable is IDENTIFIER_POINTER (DECL_NAME (foo)), but you need to look at DECL_NAME (foo) first to make sure that this is not a NULL pointer.
You can look at the unique identifier of a declaration via DECL_UID (foo).
Breaking on an internal error
If you want to look at an internal error, try setting a breakpoint in fancy_abort. Stepping up from this will lead you to the gcc_unreachable () call where something went wrong.
Using valgrind
Using valgrind with the compiler itself
You may get better valgrind results if you use --enable-valgrind-annotations for the bootstrap.
If you want to use valgrind on the compiler itself, start it (for example) with
$ valgrind --expensive-definedness-checks=yes ~/gcc-bin/gcc/f951 foo.f90
After the gfortran front end itself has finished, there will probably be a lot of messages about sparseset_p. This is a known false positive; you can then stop the run. --expensive-definedness-checks=yes is needed because there is at least one false positive without it; see PR 89747.
Using valgrind with test cases
Make sure you compile with -g, so that error messages will be more informative.
Debugging tree code generated by gfortran
The -fdebug-aux-vars option will allow you to debug the auxiliary variables generated by gfortran. You will have to use stepi because normal debugging steps will take you past the rather complex statements generated by the front end.
Debugging assembly code annotated via -fverbose-asm can also help a lot. Assume you would like to debug a simple program in a file foo.f90 with the contents
real :: a(2,2)
call random_number(a)
print *,minloc(a)
end
The command
$ gfortran -S -fverbose-asm -fdump-tree-original-uid -fdump-tree-optimized-uid foo.f90
generates an assembly file foo.s which contains the statements from foo.f90.*t.optimized as comments. In this case, it is important to leave out the -g option. The file foo.f90.*t.optimized then contains GIMPLE statements like
parm.0.dim[1].lbound = 1;
parm.0.dim[1].ubound = 2;
parm.0.dim[1].stride = 2;
parm.0.data = &a[0];
parm.0.offset = -3;
_gfortran_arandom_r4 (&parm.0)
and the assembly file foo.s has the GIMPLE statements as comments, like this:
movq $1, -504(%rbp) #, MEM[(struct array02_real(kind=4) *)_62].dim[1].lbound
movq $2, -496(%rbp) #, MEM[(struct array02_real(kind=4) *)_62].dim[1].ubound
movq $2, -512(%rbp) #, MEM[(struct array02_real(kind=4) *)_62].dim[1].stride
leaq -32(%rbp), %rax #, tmp86
movq %rax, -576(%rbp) # tmp86, MEM[(struct array02_real(kind=4) *)_62].data
movq $-3, -568(%rbp) #, MEM[(struct array02_real(kind=4) *)_62].offset
leaq -576(%rbp), %rax #, tmp87
movq %rax, %rdi # tmp87,
movl $0, %eax #,
call _gfortran_arandom_r4 #
The -uid part of the options makes sure that you can recognize the variables from different passes.
Translating foo.s with
$ gfortran -g foo.s
and using the debugger on ./a.out, breaking on MAIN__ and single-stepping through the assembler file as if it were the source file, with the optimized dump file at hand for comparison, allows fairly good debugging.
A note on fn spec
If you build a new library function, you may want to use gfc_build_library_function_decl_with_spec to build a TREE representation of your function. The meaning of the individual letters in the spec argument (which looks like ". W R w r ") is:
". " means that the argument can "escape" (that it can be stored somewhere). Any other letter means that this cannot happen.
"r " means that the argument is read by the function and that its components can be accessed recursively (for example, for an array descriptor).
"R " means that the argument is read by the function and that it is not accessed recursively (for example, when passing a scalar INTENT(IN) argument).
"w " means that the argument can be read and written by the function and that its components can be accessed recursively.
"W " means that the argument is written and its components cannot be accessed recursively.
The first letter of the string denotes the return value (even if the function does not return a value). '1' to '4' can be used to indicate that the function returns its first to fourth parameter unchanged, and 'm' denotes that the function returns newly allocated memory that is uninitialized. The second letter for each argument carries some additional information.
The complete documentation can be found in attr-fnspec.h.
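As a reading aid, the following sketch decodes a spec string into one description per two-character group, using only the meanings listed in this section. The function names describe_return, describe_argument and describe_fnspec are hypothetical helpers written for this illustration and are not part of gcc. Letters not covered above, and the second character of each group, are deliberately left undecoded.

```cpp
#include <string>
#include <vector>

// Describe the first letter of the spec (the return value), as far as this
// guide documents it.
std::string describe_return(char letter) {
  if (letter >= '1' && letter <= '4')
    return std::string("returns argument ") + letter + " unchanged";
  if (letter == 'm')
    return "returns newly allocated, uninitialized memory";
  return "see attr-fnspec.h";
}

// Describe the first letter of one argument group, following the wording
// of this guide.
std::string describe_argument(char letter) {
  switch (letter) {
  case '.': return "may escape";
  case 'r': return "read, components accessed recursively";
  case 'R': return "read, not accessed recursively";
  case 'w': return "read and written, components accessed recursively";
  case 'W': return "written, components not accessed recursively";
  default:  return "see attr-fnspec.h";
  }
}

// Split the spec into two-character groups. Group 0 belongs to the return
// value; the remaining groups describe the arguments in order. A trailing
// incomplete group is ignored.
std::vector<std::string> describe_fnspec(const std::string &spec) {
  std::vector<std::string> out;
  for (std::size_t i = 0; i + 1 < spec.size(); i += 2) {
    if (i == 0)
      out.push_back("return: " + describe_return(spec[i]));
    else
      out.push_back("arg " + std::to_string(i / 2) + ": " +
                    describe_argument(spec[i]));
  }
  return out;
}
```

Calling describe_fnspec on the example string ". W R w r " yields five entries: one for the return value and one for each of the four arguments.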
What kind of PR to start with
All currently open bug reports (called PRs) can be found in the gcc bug tracker, called bugzilla, if you set the product to fortran.
Traditionally, internal compiler errors on invalid code (gcc bugzilla keyword ice-on-invalid-code) have been considered relatively easy. But you may always find a hard one...
Happy hacking!