Traditional compiler construction tools such as lex and yacc focus on the lexical analysis and parsing phases of compilation, but they provide little support for semantic analysis and code generation.
Yacc allows grammar rules to be tagged with semantic actions and values, but it doesn't provide any routines that assist in the process of tree building, semantic analysis, or code generation. Because those processes are language-specific, yacc leaves the details to the programmer.
Support for semantic analysis was also much simpler in the languages that were prevalent when lex and yacc were devised. C and Pascal require declaration before use, which allows the semantic information about a statement to be determined within the parser at the point of use.(1) If extensive optimization is not required, then code generation can also be performed within the grammar, leading to a simple one-pass compiler structure.
Modern languages allow deferred declaration of methods, fields, and types. For example, Java allows a method to refer to a field that is declared further down the .java source file. A field can be declared with a type whose class definition has not yet been parsed.
Hence, most of the semantic analysis that used to be performed inline within a yacc grammar must now be performed after the entire program has been parsed. Tree building and walking are now more important than they were for older declare-before-use languages.
Building parse tree data structures and walking them is not terribly difficult, but it is extremely time-consuming and error-prone. A modern programming language may have hundreds of node types, divided into categories for statements, expressions, types, declarations, etc. When a new programming language is being devised, new node types may be added quite frequently. This has ramifications in trying to manage the code's complexity.(2)
For example, consider nodes that correspond to programming language types in a C-like language. There will be node types for integer types, floating-point types, pointers, structures, functions, etc. There will be semantic analysis routines for testing types for equality, comparing types for coercions and casts, evaluating the size of a type for memory layout purposes, determining if the type falls into a general category such as "integer" or "pointer", etc.
Let's say we wanted to add a new "128-bit integer" type to this language. Adding a new node type is fairly straightforward. But we also need to track down every place in the code where the compiler walks a type or deals with integers and add an appropriate case for the new type. This is very error-prone. Such code is likely to be split over many files, and good coding practices only help to a certain extent.
This problem gets worse when new kinds of expressions and statements are added to the language. The change not only affects semantic analysis, but also optimization and code generation. Some compilers use multiple passes over the tree to perform optimization, with different algorithms used in each pass. Code generation may use a number of different strategies, depending upon how an expression or statement is used. If even one of these places is missed when the new node type is added, then there is the potential for a very nasty bug that may go unnoticed for months or years.
Object-oriented languages such as C++ can help a bit in constructing robust tree structures. The base class can declare abstract methods for any semantic analysis, optimization, or code generation routine that needs to be implemented for all members of the node category. But another code maintenance problem arises. What happens when we want to add a new optimization pass in the future? We must go into hundreds of classes and implement the methods.
To avoid changing hundreds of classes, texts on Design Patterns suggest using a Visitor pattern. Then the new optimization pass can be encapsulated in a visitor. This would work, except for the following drawback of visitor patterns, as described in Gamma, et al:
The Visitor pattern makes it hard to add new subclasses of Element. Each new ConcreteElement gives rise to a new abstract operation on Visitor and a corresponding implementation in every ConcreteVisitor class.
... The Visitor class hierarchy can be difficult to maintain when new ConcreteElement classes are added frequently. In such cases, it's probably easier just to define operations on the classes that make up the structure.
That is, if we add a new node type in the future, we have a large maintenance problem on our hands. The solution is to scatter the implementation throughout every class, which is the situation we were trying to avoid by using the Visitor pattern.
Because compiler construction deals with a large set of rapidly changing node types and operations, neither of the usual approaches works very well.
The ideal programming language for designing compilers needs to have some way to detect when the programmer forgets to implement an operation for a new node type, and to ensure that a new operation covers all existing node types adequately. Existing OO languages do not perform this kind of global error checking. What few checking procedures they have change the maintenance problem into a different problem of similar complexity.
A new field of language design, called "Aspect-Oriented Programming" (AOP), has emerged in recent years. A good review of the field can be found in the October 2001 issue of the Communications of the ACM, and on the AspectJ Web site, http://www.aspectj.org/.
The following excerpt from the introduction to the AOP section in the CACM issue describes the essential aspects of AOP, and the difference between OOP and AOP:
AOP is based on the idea that computer systems are better programmed by separately specifying the various concerns (properties or areas of interest) of a system and some description of their relationships, and then relying on mechanisms in the underlying AOP environment to weave or compose them together into a coherent program. ... While the tendency in OOP is to find commonality among classes and push it up the inheritance tree, AOP attempts to realize scattered concerns as first-class elements, and eject them horizontally from the object structure.
Aspect-orientation gives us some hope of solving our compiler complexity problems. We can view each operation on node types (semantic analysis, optimization, code generation, etc) as an "aspect" of the compiler's construction. The AOP language weaves these aspects with the node types to create the final compiler.
We don't really want to implement a new programming language just for compiler construction, especially since the new language's implementation would have all of the problems described above and would therefore be just as difficult to debug and maintain.
The approach that we take with "treecc" is similar to that used by "yacc". A simple rule-based language is devised that is used to describe the intended behaviour declaratively. Embedded code is used to provide the specific implementation details. A translator then converts the input into source code that can be compiled in the usual fashion.
The translator is responsible for generating the tree building and walking code, and for checking that all relevant operations have been implemented on the node types. Functions are provided that make it easier to build and walk the tree data structures from within a "yacc" grammar and other parts of the compiler.
Consider the following yacc grammar for a simple expression language:
%token INT FLOAT
%%
expr: INT
    | FLOAT
    | '(' expr ')'
    | expr '+' expr
    | expr '-' expr
    | expr '*' expr
    | expr '/' expr
    | '-' expr
    ;
(We will ignore the problems of precedence and associativity and assume that the reader is familiar with how to resolve such issues in yacc grammars).
There are 7 types of nodes for this grammar: `intnum', `floatnum', `plus', `minus', `multiply', `divide', and `negate'. They are defined in treecc as follows:
%node expression %abstract %typedef

%node binary expression %abstract =
{
    expression *expr1;
    expression *expr2;
}

%node unary expression %abstract =
{
    expression *expr;
}

%node intnum expression =
{
    int num;
}

%node floatnum expression =
{
    float num;
}

%node plus binary
%node minus binary
%node multiply binary
%node divide binary
%node negate unary
We have introduced three extra node types that refer to any expression, binary expressions, and unary expressions. These can be seen as superclasses in an OO-style framework. We have declared these node types as `abstract' because the yacc grammar will not be permitted to create instances of these classes directly.
The `binary', `unary', `intnum', and `floatnum' node types have field definitions associated with them. These have a similar syntax to C struct declarations.
The yacc grammar is augmented as follows to build the parse tree:
%union {
    expression *node;
    int inum;
    float fnum;
}
%token INT FLOAT
%type <node> expr
%type <inum> INT
%type <fnum> FLOAT
%%
expr: INT             { $$ = intnum_create($1); }
    | FLOAT           { $$ = floatnum_create($1); }
    | '(' expr ')'    { $$ = $2; }
    | expr '+' expr   { $$ = plus_create($1, $3); }
    | expr '-' expr   { $$ = minus_create($1, $3); }
    | expr '*' expr   { $$ = multiply_create($1, $3); }
    | expr '/' expr   { $$ = divide_create($1, $3); }
    | '-' expr        { $$ = negate_create($2); }
    ;
The treecc translator generates the `*_create' functions so that the rest of the compiler can build the necessary data structures on demand. The parameters to the `*_create' functions are identical in type and order to the members of the structure for that node type.
Because `expression', `binary', and `unary' are abstract, there will be no `*_create' functions associated with them. This will help the programmer catch certain kinds of errors.
The type that is returned from a `*_create' function is the first superclass of the node that has a `%typedef' keyword associated with it; `expression *' in this case.
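The generated `*_create' functions can also be called directly from other parts of the compiler. The following fragment is an illustrative sketch, assuming the non-reentrant C output; it builds the tree for the expression `(1 + 2.5) * 3' by hand:

/* Each call returns an `expression *', so the results nest directly. */
expression *e =
    multiply_create(plus_create(intnum_create(1), floatnum_create(2.5f)),
                    intnum_create(3));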
Normally we will want to store extra information with a node beyond that which is extracted by the yacc grammar. In our expression example, we probably want to store type information in the nodes so that we can determine if the whole expression is integer or floating point during semantic analysis. We can add type information to the `expression' node type as follows:
%node expression %abstract %typedef =
{
    %nocreate type_code type;
}
The `%nocreate' flag indicates that the field should not be passed to the `*_create' functions as a parameter. i.e. it provides semantic information that isn't present in the grammar. When nodes are created, any fields that are declared as `%nocreate' will be undefined in value. A default value can be specified as follows:
%node expression %abstract %typedef =
{
    %nocreate type_code type = {int_type};
}
Default values must be enclosed in `{' and `}' because they are pieces of code in the underlying source language (C, C++, etc), instead of tokens in the treecc syntax. Any legitimate expression in the underlying source language may be used.
We also need to arrange for `type_code' to be declared. One way to do this is by adding a `%decls' section to the front of the treecc input file:
%decls %{

typedef enum
{
    int_type,
    float_type
} type_code;

%}
We could have introduced the definition by placing a `#include' directive into the `%decls' section instead, or by defining a treecc enumerated type:
%enum type_code = { int_type, float_type }
Now that we have these definitions, type-inferencing can be implemented as follows:
%operation void infer_type(expression *e)

infer_type(binary)
{
    infer_type(e->expr1);
    infer_type(e->expr2);
    if(e->expr1->type == float_type || e->expr2->type == float_type)
    {
        e->type = float_type;
    }
    else
    {
        e->type = int_type;
    }
}

infer_type(unary)
{
    infer_type(e->expr);
    e->type = e->expr->type;
}

infer_type(intnum)
{
    e->type = int_type;
}
This example demonstrates using the abstract node types `binary' and `unary' to define operations on all subclasses. The treecc translator will generate code for a full C function called `infer_type' that incorporates all of the cases.
But hang on a second! What happened to `floatnum'? Where did it go? It turns out that treecc will catch this: it will report an error to the effect that node type `floatnum' is not handled in operation `infer_type'. The missing case is defined as follows:
infer_type(floatnum)
{
    e->type = float_type;
}
As we can see, treecc has just caught a bug in the language implementation and reported it to us as soon as we introduced it.
Let's now extend the language with a `power' operator:
yacc:

expr: expr '^' expr   { $$ = power_create($1, $3); }
    ;

treecc:

%node power binary
That's all there is to it! When treecc re-translates the input file, it will modify the definition of `infer_type' to include the extra case for `power' nodes. Because `power' is a subclass of `binary', treecc already knows how to perform type inferencing for the new node and it doesn't warn us about a missing declaration.
What if we wanted to restrict the second argument of `power' to be an integer value? We can add the following case to `infer_type':
infer_type(power)
{
    infer_type(e->expr1);
    infer_type(e->expr2);
    if(e->expr2->type != int_type)
    {
        error("second argument to `^' is not an integer");
    }
    e->type = e->expr1->type;
}
The translator now notices that there is a more specific implementation of `infer_type' for `power', and won't use the `binary' case for it.
The most important thing to realise here is that the translator always checks that there are sufficient cases for `infer_type' to cover all relevant node types. If it detects a missing case, it immediately reports an error to the user. This allows tree coverage problems to be found much sooner than with the traditional approach.
See section Full expression example code, for a complete listing of the above example files.
The general form of treecc's command-line syntax is as follows:
treecc [OPTIONS] INPUT ...
Treecc accepts the following command-line options:
-o FILE, --output FILE

-h FILE, --header FILE

-d DIR, --output-dir DIR

-s DIR, --skeleton-dir DIR

-e EXT, --extension EXT

-f, --force-create

-n, --no-output

--help

-v, --version

--
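For example, a typical invocation might look like the following (the file names are illustrative only):

treecc -o expr_tree.c -h expr_tree.h expr_defs.tc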
Treecc input files consist of zero or more declarations that define nodes, operations, options, etc. The following sections describe each of these elements.
Node types are defined using the `node' keyword in input files. The general form of the declaration is:
%node NAME [ PNAME ] [ FLAGS ] [ = FIELDS ]
The `FIELDS' part of a node declaration defines the fields that make up the node type. Each field has the following general form:
[ %nocreate ] TYPE FNAME [ = VALUE ] ';'
When the output language is C, treecc creates a struct-based type called `NAME' that contains the fields for `NAME' and all of its ancestor classes. The type also contains some house-keeping fields that are used internally by the generated code. The following is an example:
typedef struct binary__ binary;
struct binary__
{
    const struct binary_vtable__ *vtable__;
    int kind__;
    char *filename__;
    long linenum__;
    type_code type;
    expression * expr1;
    expression * expr2;
};
The programmer should avoid using any identifier that ends with `__', because it may clash with house-keeping identifiers that are generated by treecc.
When the output language is C++, Java, or C#, treecc creates a class called `NAME', that inherits from the class `PNAME'. The field definitions for `NAME' are converted into public members in the output.
Types that are used in field and parameter declarations have a syntax that is a subset of the features found in C, C++, and Java:
TypeAndName ::= Type [ IDENTIFIER ]
Type        ::= TypeName
              | Type '*'
              | Type '&'
              | Type '[' ']'
TypeName    ::= IDENTIFIER { IDENTIFIER }
Types are usually followed by an identifier that names the field or parameter. The name is required for fields and is optional for parameters: for example, both `int' and `int x' are valid parameter declarations.
The following are some examples of using types:
int
int x
const char *str
expression *expr
Element[][] array
Item&
unsigned int y
const Element
The grammar used by treecc is slightly ambiguous. The last example above declares a parameter called `Element' that has type `const'. The programmer probably intended to declare an anonymous parameter with type `const Element' instead.
This ambiguity is unavoidable given that treecc is not fully aware of the underlying language's type system. When treecc sees a type that ends in a sequence of identifiers, it will always interpret the last identifier as the field or parameter name. Thus, the programmer must write the following instead:
const Element e
Treecc cannot declare types using the full power of C's type system. The most common forms of declarations are supported, and the rest can usually be obtained by defining a `typedef' within a literal code block. See section Literal code declarations, for more information on literal code blocks.
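For example, a field whose natural declaration in C would be a function pointer can be given a simple type name first. The following is an illustrative sketch; the `eval_fn' typedef and the `builtin' node type are hypothetical and not part of the expression example:

%decls %{
/* Hypothetical typedef so that the field below has a one-word type name. */
typedef int (*eval_fn)(int value);
%}

%node builtin expression =
{
    eval_fn func;
}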
It is the responsibility of the programmer to use type constructs that are supported by the underlying programming language. Types such as `const char *' will give an error when the output is compiled with a Java compiler, for example.
Enumerated types are a special kind of node type that can be used by the programmer for simple values that don't require a full abstract syntax tree node. The following is an example of defining a list of the primitive machine types used in a Java virtual machine:
%enum JavaType =
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
}
Enumerations are useful when writing code generators and type inferencing routines. The general form is:
%enum NAME = { VALUES }
Logically, each enumerated value is a special node type that inherits from a parent node type corresponding to the enumerated type `NAME'.
When the output language is C or C++, treecc generates an enumerated typedef for `NAME' that contains the enumerated values in the same order as was used in the input file. The typedef name can be used elsewhere in the code as the type of the enumeration.
When the output language is Java, treecc generates a class called `NAME' that contains the enumerated values as integer constants. Elsewhere in the code, the type `int' must be used to declare variables of the enumerated type. Enumerated values are referred to as `NAME.VALUE'. If the enumerated type is used as a trigger parameter, then `NAME' must be used instead of `int': treecc will convert the type when the Java code is output.
When the output language is C#, treecc generates an enumerated value type called `NAME' that contains the enumerated values as members. The C# type `NAME' can be used elsewhere in the code as the type of the enumeration. Enumerated values are referred to as `NAME.VALUE'.
Operations are declared in two parts: the declaration, and the cases. The declaration part defines the prototype for the operation and the cases define how to handle specific kinds of nodes for the operation.
Operations are defined over one or more trigger parameters. Each trigger parameter specifies a node type or an enumerated type that is selected upon to determine what course of action to take. The following are some examples of operation declarations:
%operation void infer_type(expression *e)

%operation type_code common_type([type_code t1], [type_code t2])
Trigger parameters are specified by enclosing them in square brackets. If none of the parameters are enclosed in square brackets, then treecc assumes that the first parameter is the trigger.
The general form of an operation declaration is as follows:
%operation { %virtual | %inline | %split } RTYPE [CLASS::]NAME(PARAMS)
The `%virtual', `%inline', and `%split' flags modify how the operation's code is generated. In particular, `%split' is typically used together with `%inline' operations to avoid compiler limits on function size.
Once an operation has been declared, the programmer can specify its cases anywhere in the input files. It is not necessary that the cases appear after the operation, or that they be contiguous within the input files. This permits the programmer to place operation cases where they are logically required for maintenance reasons.
There must be sufficient operation cases defined to cover every possible combination of node types and enumerated values that inherit from the specified trigger types. An operation case has the following general form:
NAME(TRIGGERS) [, NAME(TRIGGERS2) ...] { CODE }
Multiple trigger combinations can be associated with a single block of code, by listing them all, separated by commas. For example:
common_type(int_type, int_type)
{
    return int_type;
}

common_type(int_type, float_type),
common_type(float_type, int_type),
common_type(float_type, float_type)
{
    return float_type;
}
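An operation such as `common_type' is then called like an ordinary function from other code. As an illustrative sketch (assuming the C output language), the `infer_type(binary)' case from the earlier example could be rewritten in terms of it:

infer_type(binary)
{
    /* Infer the operand types first, then combine them. */
    infer_type(e->expr1);
    infer_type(e->expr2);
    e->type = common_type(e->expr1->type, e->expr2->type);
}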
"(*)" is used below to indicate an option that is enabled by default.
"C"
, "C++"
,
"Java"
, or "C#"
. The default is "C"
.
#line
directives.
i.e. strip off the directory component. This can be helpful in
combination with the %include %readonly
command when
treecc input files may processed from different directories,
causing common output files to change unexpectedly.
#line
directives. (*)
Sometimes it is necessary to embed literal code within output `.h' and source files. Usually this is to `#include' definitions from other files, or to define functions that cannot be easily expressed as operations.
A literal code block is specified by enclosing it in `%{' and `%}'. The block can also be prefixed with the `%decls', `%both', and `%end' flags.
Another form of literal code block is one which begins with `%%' and extends to the end of the current input file. This form implicitly has the `%end' flag.
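For instance, the `error' function called from the `infer_type(power)' case earlier is not part of the generated API, and could be supplied in a literal code block. The following sketch is purely illustrative:

%%

#include <stdio.h>

/* Illustrative definition of the `error' helper used by the
   expression example. */
void error(const char *msg)
{
    fprintf(stderr, "error: %s\n", msg);
}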
Most treecc compiler definitions will be too large to be manageable in a single input file. They will also be too large to write to a single output file, because doing so may overload the source language compiler.
Multiple input files can be specified on the command-line, or they can be explicitly included by other input files using the `%include' declaration, which may optionally be marked `%readonly'.
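As an illustrative sketch (the file names here are hypothetical):

%include "nodes.tc"
%include %readonly "common-ops.tc"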
Output files can be changed using the `%header' and `%output' declarations, which set the names of the header and source files that subsequent declarations are written to.
When treecc generates the output source code, it must insert several common house-keeping functions and classes into the code. By default, these are written to the first header and source files. This can be changed with the `%common' declaration:
%header "common.h" %output "common.c" %common
When compilers emit error messages to the programmer, it is generally a good idea to indicate which file and which line gave rise to the error. Syntax errors can be emitted fairly easily because the parser usually has access to the current line number. However, semantic errors are harder to report because the parser may no longer be active when the error is detected.
Treecc can generate code that automatically keeps track of what line in the source file was active when a node is created. Every node has two extra private fields that specify the name of the file and the line number. Semantic analysis routines can query this information when reporting errors.
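For example, an error reporting routine can query the tracked position using the `yygetfilename' and `yygetlinenum' functions described later. The following is an illustrative sketch for the C output; the `error_at' helper is hypothetical:

#include <stdio.h>

/* Hypothetical helper: report a semantic error against the source
   position that was recorded when the node was created. */
static void error_at(expression *node, const char *msg)
{
    fprintf(stderr, "%s:%ld: %s\n",
            yygetfilename(node), yygetlinenum(node), msg);
}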
Because treecc is not aware of how to obtain this information, the programmer must supply some additional functions. See section API's available in the generated output, for more information.
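For the C output, the position attached to new nodes typically comes from the `yycurrfilename' and `yycurrlinenum' functions listed in the next section, which the programmer supplies. The following is a minimal sketch, assuming a non-reentrant compiler and hypothetical `current_filename' and `current_line' globals maintained by the lexer:

extern char *current_filename;  /* hypothetical globals updated by the lexer */
extern long current_line;

/* Return the position that newly created nodes should be tagged with. */
char *yycurrfilename(void)
{
    return current_filename;
}

long yycurrlinenum(void)
{
    return current_line;
}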
The source code that is generated by treecc exports a number of application programmer interfaces (API's) to the programmer. These can be used elsewhere in the compiler implementation to manipulate abstract syntax trees. The following sections describe the API's for each of the output languages.
In the C output language, each node type is converted into a `typedef' that contains the node's fields, and the fields of its ancestor node types. The following example demonstrates how treecc node declarations are converted into C source code:
%node expression %abstract %typedef =
{
    %nocreate type_code type;
}

%node binary expression %abstract =
{
    expression *expr1;
    expression *expr2;
}

%node plus binary
becomes:
typedef struct expression__ expression;
typedef struct binary__ binary;
typedef struct plus__ plus;

struct expression__
{
    const struct expression_vtable__ *vtable__;
    int kind__;
    char *filename__;
    long linenum__;
    type_code type;
};

struct binary__
{
    const struct binary_vtable__ *vtable__;
    int kind__;
    char *filename__;
    long linenum__;
    type_code type;
    expression * expr1;
    expression * expr2;
};

struct plus__
{
    const struct plus_vtable__ *vtable__;
    int kind__;
    char *filename__;
    long linenum__;
    type_code type;
    expression * expr1;
    expression * expr2;
};
Programmers should avoid using any identifiers that end in `__'. Such identifiers are reserved for internal use by treecc and its support routines.
For each non-abstract node type called `NAME', treecc generates a function called `NAME_create' that creates nodes of that type. The general form of the function's prototype is as follows:
TYPE *NAME_create([YYNODESTATE *state,] PARAMS)
For example, the `plus' node type results in the following prototype:

expression *plus_create(expression * expr1, expression * expr2);
Enumerated types are converted into a C `typedef' with the same name and values:
%enum JavaType =
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
}
becomes:
typedef enum
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
} JavaType;
Virtual operations are converted into C macros that invoke the correct vtable entry on a node type:
%operation %virtual void infer_type(expression *e)
becomes:
#define infer_type(this__) \
    ((*(((struct expression_vtable__ *) \
        ((this__)->vtable__))->infer_type_v__)) \
            ((expression *)(this__)))
Calls to `infer_type' can then be made with `infer_type(node)'.
Non-virtual operations are converted into C functions:
%operation void infer_type(expression *e)
becomes:
extern void infer_type(expression *e);
Because virtual and non-virtual operations use a similar call syntax, it is very easy to convert a virtual operation into a non-virtual operation when the output language is C. This isn't possible with the other output languages.
Other house-keeping tasks are performed by the following functions and macros. Some of these must be supplied by the programmer. The `state' parameter is required only if a reentrant compiler is being built.
int yykind(ANY *node)
const char *yykindname(ANY *node)
int yyisa(ANY *node, type)
char *yygetfilename(ANY *node)
long yygetlinenum(ANY *node)
void yysetfilename(ANY *node, char *value)
void yysetlinenum(ANY *node, long value)
char *yycurrfilename([YYNODESTATE *state])
long yycurrlinenum([YYNODESTATE *state])
void yynodeinit([YYNODESTATE *state])
void *yynodealloc([YYNODESTATE *state,] unsigned int size)

Allocates a block of `size' bytes within the node memory manager and returns a pointer to it. This function returns NULL if the system is out of memory, or if `size' is too large to be allocated within the node memory manager. If the system is out of memory, then `yynodealloc' will call `yynodefailed' prior to returning NULL.

int yynodepush([YYNODESTATE *state])

Pushes the current node memory manager position. The next time `yynodepop' is called, the node memory manager will reset to the pushed position. This function returns zero if the system is out of memory.

void yynodepop([YYNODESTATE *state])

Pops the node memory manager position that was saved by a previous call to `yynodepush'. This function has no effect if `yynodepush' was not called previously.

The `yynodepush' and `yynodepop' functions can be used to perform a simple kind of garbage collection on nodes. When the parser enters a scope, it pushes the node memory manager position. After all definitions in the scope have been dealt with, the parser pops the node memory manager to reclaim all of the memory used.

void yynodeclear([YYNODESTATE *state])

Clears the node memory manager and returns it to the state it was in just after `yynodeinit'. This is typically used upon program shutdown to free all remaining node memory.

void yynodefailed([YYNODESTATE *state])

Called when `yynodealloc' or `yynodepush' detects that the system is out of memory. This function must be supplied by the programmer. The programmer may choose to exit the program when the system is out of memory; in which case `yynodealloc' will never return NULL.
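The following fragment is an illustrative sketch of how these functions might fit together in a non-reentrant compiler built from the expression example. The header name, the `yyparse' entry point, and the `parse_result' variable are assumptions for the sketch, not part of treecc's output:

#include <stdio.h>
#include <stdlib.h>
#include "expr_tree.h"            /* hypothetical name of the generated header */

extern int yyparse(void);
extern expression *parse_result;  /* hypothetical: set by the yacc grammar */

/* Must be supplied by the programmer: called when memory is exhausted. */
void yynodefailed(void)
{
    fputs("out of memory\n", stderr);
    exit(1);
}

/* yycurrfilename and yycurrlinenum must also be supplied, as sketched
   in the previous section. */

int main(void)
{
    yynodeinit();                  /* initialise the node memory manager */
    if(yyparse() == 0)
    {
        infer_type(parse_result);  /* run the type-inferencing operation */
        printf("kind %s, type %s\n",
               yykindname(parse_result),
               parse_result->type == float_type ? "float" : "int");
    }
    yynodeclear();                 /* release all node memory on shutdown */
    return 0;
}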
In the C++ output language, each node type is converted into a `class' that contains the node's fields, virtual operations, and other house-keeping definitions. The following example demonstrates how treecc node declarations are converted into C++ source code:
%node expression %abstract %typedef = { %nocreate type_code type; } %node binary expression %abstract = { expression *expr1; expression *expr2; } %node plus binary
becomes:
class expression
{
protected:

    int kind__;
    char *filename__;
    long linenum__;

public:

    int getKind() const { return kind__; }
    const char *getFilename() const { return filename__; }
    long getLinenum() const { return linenum__; }
    void setFilename(char *filename) { filename__ = filename; }
    void setLinenum(long linenum) { linenum__ = linenum; }

    void *operator new(size_t);
    void operator delete(void *, size_t);

protected:

    expression();

public:

    type_code type;

    virtual int isA(int kind) const;
    virtual const char *getKindName() const;

protected:

    virtual ~expression();
};

class binary : public expression
{
protected:

    binary(expression * expr1, expression * expr2);

public:

    expression * expr1;
    expression * expr2;

    virtual int isA(int kind) const;
    virtual const char *getKindName() const;

protected:

    virtual ~binary();
};

class plus : public binary
{
public:

    plus(expression * expr1, expression * expr2);

    virtual int isA(int kind) const;
    virtual const char *getKindName() const;

protected:

    virtual ~plus();
};
The following standard methods are available on every node type:
int getKind()
virtual const char *getKindName()
virtual int isA(int kind)
const char *getFilename()
long getLinenum()
void setFilename(char *value)
void setLinenum(long value)
If the generated code is non-reentrant, then the constructor for the class can be used to construct nodes of the specified node type. The constructor parameters are the same as the fields within the node type's definition, except for `%nocreate' fields.
If the generated code is reentrant, then nodes cannot be constructed using the C++ `new' operator. The `*Create' methods on the `YYNODESTATE' factory class must be used instead.
The `YYNODESTATE' class contains a number of house-keeping methods that are used to manage nodes:
static YYNODESTATE *getState()
void *alloc(size_t size)

Allocates a block of `size' bytes within the node memory manager and returns a pointer to it. This method returns NULL if the system is out of memory, or if `size' is too large to be allocated within the node memory manager. If the system is out of memory, then `alloc' will call `failed' prior to returning NULL.

int push()

Pushes the current node memory manager position. The next time `pop' is called, the node memory manager will reset to the pushed position. This method returns zero if the system is out of memory.

void pop()

Pops the node memory manager position that was saved by a previous call to `push'. This method has no effect if `push' was not called previously.

The `push' and `pop' methods can be used to perform a simple kind of garbage collection on nodes. When the parser enters a scope, it pushes the node memory manager position. After all definitions in the scope have been dealt with, the parser pops the node memory manager to reclaim all of the memory used.

void clear()

Clears the node memory manager and frees all memory that has been used for nodes.

virtual void failed()

Called when `alloc' or `push' detects that the system is out of memory. This method is typically overridden by the programmer in subclasses. The programmer may choose to exit the program when the system is out of memory; in which case `alloc' will never return NULL.
virtual char *currFilename()
virtual long currLinenum()
The programmer will typically subclass `YYNODESTATE' to provide additional functionality, and then create an instance of this class to act as the node memory manager and node creation factory.
In the Java output language, each node type is converted into a `class' that contains the node's fields, virtual operations, and other house-keeping definitions. The following example demonstrates how treecc node declarations are converted into Java source code:
%node expression %abstract %typedef =
{
    %nocreate type_code type;
}

%node binary expression %abstract =
{
    expression expr1;
    expression expr2;
}

%node plus binary
becomes:
public class expression
{
    protected int kind__;
    protected String filename__;
    protected long linenum__;

    public int getKind() { return kind__; }
    public String getFilename() { return filename__; }
    public long getLinenum() { return linenum__; }
    public void setFilename(String filename) { filename__ = filename; }
    public void setLinenum(long linenum) { linenum__ = linenum; }

    public static final int KIND = 1;

    public type_code type;

    protected expression()
    {
        this.kind__ = KIND;
        this.filename__ = YYNODESTATE.getState().currFilename();
        this.linenum__ = YYNODESTATE.getState().currLinenum();
    }

    public int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return 0;
    }

    public String getKindName()
    {
        return "expression";
    }
}

public class binary extends expression
{
    public static final int KIND = 2;

    public expression expr1;
    public expression expr2;

    protected binary(expression expr1, expression expr2)
    {
        super();
        this.kind__ = KIND;
        this.expr1 = expr1;
        this.expr2 = expr2;
    }

    public int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return super.isA(kind);
    }

    public String getKindName()
    {
        return "binary";
    }
}

public class plus extends binary
{
    public static final int KIND = 3;

    public plus(expression expr1, expression expr2)
    {
        super(expr1, expr2);
        this.kind__ = KIND;
    }

    public int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return super.isA(kind);
    }

    public String getKindName()
    {
        return "plus";
    }
}
The following standard members are available on every node type:
int KIND
int getKind()
String getKindName()
int isA(int kind)
String getFilename()
long getLinenum()
void setFilename(String value)
void setLinenum(long value)
If the generated code is non-reentrant, then the constructor for the class can be used to construct nodes of the specified node type. The constructor parameters are the same as the fields within the node type's definition, except for `%nocreate' fields.
If the generated code is reentrant, then nodes cannot be constructed using the Java `new' operator. The `*Create' methods on the `YYNODESTATE' factory class must be used instead.
Enumerated types are converted into a Java `class':
%enum JavaType =
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
}
becomes:
public class JavaType
{
    public static final int JT_BYTE       = 0;
    public static final int JT_SHORT      = 1;
    public static final int JT_CHAR       = 2;
    public static final int JT_INT        = 3;
    public static final int JT_LONG       = 4;
    public static final int JT_FLOAT      = 5;
    public static final int JT_DOUBLE     = 6;
    public static final int JT_OBJECT_REF = 7;
}
References to enumerated types in fields and operation parameters are replaced with the type `int'.
Virtual operations are converted into public methods on the Java node classes.
Non-virtual operations are converted into a static method within a class named for the operation. For example,
%operation void InferType::infer_type(expression e)
becomes:
public class InferType
{
    public static void infer_type(expression e)
    {
        ...
    }
}
If the class name (`InferType' in the above example) is omitted, then the name of the operation is used as both the class name and the method name.
The `YYNODESTATE' class contains a number of house-keeping methods that are used to manage nodes:
static YYNODESTATE getState()
String currFilename()
long currLinenum()
The programmer will typically subclass `YYNODESTATE' to provide additional functionality, and then create an instance of this class to act as the global state and node creation factory.
In the C# output language, each node type is converted into a `class' that contains the node's fields, virtual operations, and other house-keeping definitions. The following example demonstrates how treecc node declarations are converted into C# source code:
%node expression %abstract %typedef =
{
    %nocreate type_code type;
}

%node binary expression %abstract =
{
    expression expr1;
    expression expr2;
}

%node plus binary
becomes:
public class expression
{
    protected int kind__;
    protected String filename__;
    protected long linenum__;

    public int getKind() { return kind__; }
    public String getFilename() { return filename__; }
    public long getLinenum() { return linenum__; }
    public void setFilename(String filename) { filename__ = filename; }
    public void setLinenum(long linenum) { linenum__ = linenum; }

    public const int KIND = 1;

    public type_code type;

    protected expression()
    {
        this.kind__ = KIND;
        this.filename__ = YYNODESTATE.getState().currFilename();
        this.linenum__ = YYNODESTATE.getState().currLinenum();
    }

    public virtual int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return 0;
    }

    public virtual String getKindName()
    {
        return "expression";
    }
}

public class binary : expression
{
    public const int KIND = 2;

    public expression expr1;
    public expression expr2;

    protected binary(expression expr1, expression expr2)
        : base()
    {
        this.kind__ = KIND;
        this.expr1 = expr1;
        this.expr2 = expr2;
    }

    public override int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return base.isA(kind);
    }

    public override String getKindName()
    {
        return "binary";
    }
}

public class plus : binary
{
    public const int KIND = 5;

    public plus(expression expr1, expression expr2)
        : base(expr1, expr2)
    {
        this.kind__ = KIND;
    }

    public override int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return base.isA(kind);
    }

    public override String getKindName()
    {
        return "plus";
    }
}
The following standard members are available on every node type:
const int KIND
int getKind()
virtual String getKindName()
virtual int isA(int kind)
String getFilename()
long getLinenum()
void setFilename(String value)
void setLinenum(long value)
If the generated code is non-reentrant, then the constructor for the class can be used to construct nodes of the specified node type. The constructor parameters are the same as the fields within the node type's definition, except for `%nocreate' fields.
If the generated code is reentrant, then nodes cannot be constructed using the C# `new' operator. The `*Create' methods on the `YYNODESTATE' factory class must be used instead.
Enumerated types are converted into a C# `enum' definition:
%enum JavaType =
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
}
becomes:
public enum JavaType
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF,
}
Virtual operations are converted into public virtual methods on the C# node classes.
Non-virtual operations are converted into a static method within a class named for the operation. For example,
%operation void InferType::infer_type(expression e)
becomes:
public class InferType
{
    public static void infer_type(expression e)
    {
        ...
    }
}
If the class name (`InferType' in the above example) is omitted, then the name of the operation is used as both the class name and the method name.
The `YYNODESTATE' class contains a number of house-keeping methods that are used to manage nodes:
static YYNODESTATE getState()
virtual String currFilename()
virtual long currLinenum()
The programmer will typically subclass `YYNODESTATE' to provide additional functionality, and then create an instance of this class to act as the global state and node creation factory.
The full treecc input file for the expression example is as follows:
%enum type_code =
{
    int_type,
    float_type
}

%node expression %abstract %typedef =
{
    %nocreate type_code type = {int_type};
}

%node binary expression %abstract =
{
    expression *expr1;
    expression *expr2;
}

%node unary expression %abstract =
{
    expression *expr;
}

%node intnum expression =
{
    int num;
}

%node floatnum expression =
{
    float num;
}

%node plus binary
%node minus binary
%node multiply binary
%node divide binary
%node power binary
%node negate unary

%operation void infer_type(expression *e)

infer_type(binary)
{
    infer_type(e->expr1);
    infer_type(e->expr2);
    if(e->expr1->type == float_type || e->expr2->type == float_type)
    {
        e->type = float_type;
    }
    else
    {
        e->type = int_type;
    }
}

infer_type(unary)
{
    infer_type(e->expr);
    e->type = e->expr->type;
}

infer_type(intnum)
{
    e->type = int_type;
}

infer_type(floatnum)
{
    e->type = float_type;
}

infer_type(power)
{
    infer_type(e->expr1);
    infer_type(e->expr2);
    if(e->expr2->type != int_type)
    {
        error("second argument to `^' is not an integer");
    }
    e->type = e->expr1->type;
}
The full yacc grammar is as follows:
%union {
    expression *node;
    int inum;
    float fnum;
}
%token INT FLOAT
%type <node> expr
%type <inum> INT
%type <fnum> FLOAT
%%
expr: INT             { $$ = intnum_create($1); }
    | FLOAT           { $$ = floatnum_create($1); }
    | '(' expr ')'    { $$ = $2; }
    | expr '+' expr   { $$ = plus_create($1, $3); }
    | expr '-' expr   { $$ = minus_create($1, $3); }
    | expr '*' expr   { $$ = multiply_create($1, $3); }
    | expr '/' expr   { $$ = divide_create($1, $3); }
    | expr '^' expr   { $$ = power_create($1, $3); }
    | '-' expr        { $$ = negate_create($2); }
    ;
The EBNF syntax for treecc input files uses the following lexical tokens:
IDENTIFIER    ::= <A-Za-z_> { <A-Za-z0-9_> }
STRING        ::= '"' <anything that does not include '"'> '"'
                | "'" <anything that does not include "'"> "'"
LITERAL_DEFNS ::= "%{" <anything except "%}"> "%}"
LITERAL_END   ::= "%%" <any character sequence until EOF>
LITERAL_CODE  ::= '{' <anything with matched '{' and '}'> '}'
In addition, anything that begins with "%" in the following syntax is a lexical keyword.
The EBNF syntax is as follows:
File          ::= { Declaration }

Declaration   ::= Node
                | Operation
                | OperationCase
                | Option
                | Enum
                | Literal
                | Header
                | Output
                | Common
                | Include

Node          ::= %node IDENTIFIER [ IDENTIFIER ] { NodeFlag } [ '=' Fields ]
NodeFlag      ::= %abstract | %typedef
Fields        ::= '{' { Field } '}'
Field         ::= [ %nocreate ] TypeAndName [ '=' LITERAL_CODE ] ';'
TypeAndName   ::= Type [ IDENTIFIER ]
Type          ::= TypeName | Type '*' | Type '&' | Type '[' ']'
TypeName      ::= IDENTIFIER { IDENTIFIER }

Operation     ::= %operation { OperFlag } Type [ ClassName ] IDENTIFIER
                      '(' [ Params ] ')' [ '=' LITERAL_CODE ] [ ';' ]
OperFlag      ::= %virtual | %inline | %split
ClassName     ::= IDENTIFIER "::"
Params        ::= Param { ',' Param }
Param         ::= TypeAndName | '[' TypeAndName ']'

OperationCase ::= OperationHead { ',' OperationHead } LITERAL_CODE
OperationHead ::= IDENTIFIER '(' [ TypeList ] ')'
TypeList      ::= IDENTIFIER { ',' IDENTIFIER }

Option        ::= %option IDENTIFIER [ '=' Value ]
Value         ::= IDENTIFIER | STRING

Enum          ::= %enum IDENTIFIER '=' '{' EnumBody [ ',' ] '}'
EnumBody      ::= IDENTIFIER { ',' IDENTIFIER }

Literal       ::= { LiteralFlag } (LITERAL_DEFNS | LITERAL_END)
LiteralFlag   ::= %both | %decls | %end

Header        ::= %header STRING
Output        ::= %output STRING
Common        ::= %common
Include       ::= %include [ %readonly ] STRING