Seeking advice on front ends

Yannick Gingras ygingras@ygingras.net
Sun Feb 4 22:36:00 GMT 2007


Hi, 
  I'd like to write a small language.  This would be my first language
and I'm faced with several implementation options.  I have a hard time
ranking them so I'd like to ask advice from the GCC users and
developers.

The language will feature dynamic typing, lexical closures and garbage
collection.  Most languages falling into this category are
implemented with a bytecode interpreter but there are also a few
notable GCC front ends.  GCC front ends are usually patches to the GCC
tree.  The front ends tend to be self-contained in a single language
directory with only a few modifications to the build system.  This
architecture is nice and clean but it presents problems for the
distribution of the front end.  Either a fully patched GCC source tree
is distributed or instructions on how to patch GCC must be supplied.
Alternatively, a front end can translate its language to C.  GCC
supports many annotations and extensions to C that enable efficient
translation to machine code of programming constructs common in
languages quite different from C.
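
As an illustration of the kind of annotation meant here, the sketch
below uses two documented GCC extensions: __attribute__((noreturn))
and __builtin_expect.  The value layout and the names dyn_add and
type_error are hypothetical, made up for this example only.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* GCC attribute: tells the optimizer that this function never returns. */
__attribute__((noreturn))
void type_error(const char *what)
{
    assert(what != NULL);
    fprintf(stderr, "type error: %s\n", what);
    abort();
}

enum tag { T_INT, T_OTHER };   /* hypothetical runtime type tags */

struct value {
    enum tag tag;
    long i;
};

/* Dynamically typed addition with the all-integer case marked as likely. */
struct value dyn_add(struct value a, struct value b)
{
    struct value r;

    /* __builtin_expect(expr, 0) hints that the condition is rarely true. */
    if (__builtin_expect(a.tag != T_INT || b.tag != T_INT, 0))
        type_error("operands of + must be integers");

    r.tag = T_INT;
    r.i = a.i + b.i;
    return r;
}
```

Both hints only guide code layout and optimization.  The type check
itself still runs on every call.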

My options for code generation are more or less:

  1) Code an interpreter

  2) Build the parse tree in GCC's native format and let GCC generate
     the code
     
  3) Generate annotated C and call GCC on that.

I think that option 1 would represent the least work but I doubt that
it can be made efficient without major contortions.  I will probably
go that way for the first prototype but I'm afraid that I will need
something else for a production release.  
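
For a sense of scale, a minimal sketch of the core of option 1 could
look like the stack-based bytecode loop below.  The opcodes and the
run function are hypothetical, and a real interpreter for a dynamic
language would operate on tagged values rather than plain longs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for illustration; not taken from any real VM. */
enum op { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Run a program and return the value on top of the stack at OP_HALT. */
long run(const long *code)
{
    long stack[64];
    size_t sp = 0;   /* number of values currently on the stack */
    size_t pc = 0;   /* index of the next instruction */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:
            assert(sp < 64);
            stack[sp++] = code[pc++];   /* the operand follows the opcode */
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

Every instruction pays for a fetch and a switch dispatch, which is
the main source of the overhead mentioned above.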

Option 2 sounds like a good deal.  The parse tree needs to be built
anyway, building it in GCC's native format in the first place makes a
lot of sense.  The only problem seems to be with the distribution of
the resulting front end.  Is it possible to build such a front end by
only linking to libgcc or something like that?

Finally, option 3 solves the distribution problems of option 2 but
generating good C code doesn't sound like a trivial problem.
Compilation under that option is probably slow since each file must
be parsed twice...  Is it possible to produce machine code as
efficient as with option 2 when generating C code?
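
To make the difficulty concrete, here is a sketch of the C that a
translator might emit for a lexical closure in a dynamically typed
language, taking a source function make_adder(n) that returns a
function adding n to its argument.  Every name and the value layout
are assumptions made for illustration, and malloc stands in for
whatever garbage-collected allocator the runtime would provide.

```c
#include <assert.h>
#include <stdlib.h>

/* Dynamically typed value: a tag plus a payload. */
enum tag { T_INT, T_CLOSURE };

struct closure;

struct value {
    enum tag tag;
    union {
        long i;
        struct closure *fn;
    } u;
};

/* A closure pairs a code pointer with its captured variables. */
struct closure {
    struct value (*code)(struct closure *self, struct value arg);
    struct value env[1];   /* captured variables; one slot suffices here */
};

struct value make_int(long i)
{
    struct value v;
    v.tag = T_INT;
    v.u.i = i;
    return v;
}

/* Body of the inner function: x + n, with n read from the environment. */
struct value adder_code(struct closure *self, struct value x)
{
    assert(x.tag == T_INT && self->env[0].tag == T_INT);   /* run-time type check */
    return make_int(x.u.i + self->env[0].u.i);
}

/* make_adder(n): allocate a closure that captures n. */
struct value make_adder(struct value n)
{
    struct closure *c = malloc(sizeof *c);   /* a real runtime would use its GC */
    struct value v;

    assert(c != NULL);
    c->code = adder_code;
    c->env[0] = n;
    v.tag = T_CLOSURE;
    v.u.fn = c;
    return v;
}

/* Call a value, checking at run time that it really is a closure. */
struct value call1(struct value f, struct value arg)
{
    assert(f.tag == T_CLOSURE);
    return f.u.fn->code(f.u.fn, arg);
}
```

Each call goes through a function pointer and each operation checks
tags, so the quality of the final machine code depends on how much of
this the translator can eliminate before handing the C to GCC.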

Did I miss anything?  What are the relative advantages of each
solution?  Do you think that I overlooked other options?  Would using
an existing virtual machine be a good option?  Except for Nice, this
option doesn't seem to be popular; there must be a catch.

Regarding parsing, I would like to support Unicode literals and
identifiers.  I would not mind if the input encoding was restricted to
UTF-8.  I think that Flex has very limited Unicode support.  Is there
a good lexer out there with good Unicode support?  Would Unicode break
anything if I use GCC as my back end?
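
If the input is restricted to UTF-8, the decoding step that a
hand-written lexer needs is fairly small.  The sketch below decodes
one code point at a time and rejects malformed input; the function
name utf8_decode and its interface are assumptions for illustration,
not part of any existing lexer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode one UTF-8 sequence starting at s (at most n bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is truncated or malformed.
 */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    /* Smallest code point that legitimately needs each sequence length. */
    static const unsigned long min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    unsigned long c;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                   /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, UTF-16 surrogates and out-of-range values. */
    if (c < min_for_len[len] || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return 0;

    *cp = c;
    return len;
}
```

A lexer would call this in a loop and classify each code point as an
identifier character or not.  Which code points qualify is a language
design decision that this sketch leaves open.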

Thanks for your time, 

-- 
Yannick Gingras


