Traditional compiler construction tools such as lex and yacc focus on the lexical analysis and parsing phases of compilation, but they provide little support for semantic analysis and code generation.
Yacc allows grammar rules to be tagged with semantic actions and values, but it doesn't provide any routines that assist in the process of tree building, semantic analysis, or code generation. Because those processes are language-specific, yacc leaves the details to the programmer.
Support for semantic analysis was also much simpler in the languages that were prevalent when lex and yacc were devised. C and Pascal require declaration before use, which allows the semantic information about a statement to be determined within the parser at the point of use.(1) If extensive optimization is not required, then code generation can also be performed within the grammar, leading to a simple one-pass compiler structure.
Modern languages allow deferred declaration of methods, fields, and types. For example, Java allows a method to refer to a field that is declared further down the .java source file. A field can be declared with a type whose class definition has not yet been parsed.
Hence, most of the semantic analysis that used to be performed inline within a yacc grammar must now be performed after the entire program has been parsed. Tree building and walking are now more important than they were for older declare-before-use languages.
Building parse tree data structures and walking them is not terribly difficult, but it is extremely time-consuming and error-prone. A modern programming language may have hundreds of node types, divided into categories for statements, expressions, types, declarations, etc. When a new programming language is being devised, new node types may be added quite frequently. This has ramifications in trying to manage the code's complexity.(2)
For example, consider nodes that correspond to programming language types in a C-like language. There will be node types for integer types, floating-point types, pointers, structures, functions, etc. There will be semantic analysis routines for testing types for equality, comparing types for coercions and casts, evaluating the size of a type for memory layout purposes, determining if the type falls into a general category such as "integer" or "pointer", etc.
Let's say we wanted to add a new "128-bit integer" type to this language. Adding a new node type is fairly straightforward. But we also need to track down every place in the code where the compiler walks a type or deals with integers and add an appropriate case for the new type. This is very error-prone. Such code is likely to be split over many files, and good coding practices only help to a certain extent.
This problem gets worse when new kinds of expressions and statements are added to the language. The change not only affects semantic analysis, but also optimization and code generation. Some compilers use multiple passes over the tree to perform optimization, with different algorithms used in each pass. Code generation may use a number of different strategies, depending upon how an expression or statement is used. If even one of these places is missed when the new node type is added, then there is the potential for a very nasty bug that may go unnoticed for months or years.
Object-oriented languages such as C++ can help a bit in constructing robust tree structures. The base class can declare abstract methods for any semantic analysis, optimization, or code generation routine that needs to be implemented for all members of the node category. But another code maintenance problem arises. What happens when we want to add a new optimization pass in the future? We must go into hundreds of classes and implement the methods.
To avoid changing hundreds of classes, texts on Design Patterns suggest using a Visitor pattern. Then the new optimization pass can be encapsulated in a visitor. This would work, except for the following drawback of visitor patterns, as described in Gamma, et al:
The Visitor pattern makes it hard to add new subclasses of Element. Each new ConcreteElement gives rise to a new abstract operation on Visitor and a corresponding implementation in every ConcreteVisitor class.
... The Visitor class hierarchy can be difficult to maintain when new ConcreteElement classes are added frequently. In such cases, it's probably easier just to define operations on the classes that make up the structure.
That is, if we add a new node type in the future, we have a large maintenance problem on our hands. The solution is to scatter the implementation throughout every class, which is the situation we were trying to avoid by using the Visitor pattern.
Because compiler construction deals with a large set of rapidly changing node types and operations, neither of the usual approaches works very well.
The ideal programming language for designing compilers needs to have some way to detect when the programmer forgets to implement an operation for a new node type, and to ensure that a new operation covers all existing node types adequately. Existing OO languages do not perform this kind of global error checking. What few checking procedures they have change the maintenance problem into a different problem of similar complexity.
A new field of language design, called "Aspect-Oriented Programming" (AOP), has emerged in recent years. A good review of the field can be found in the October 2001 issue of the Communications of the ACM, and on the AspectJ Web site, http://www.aspectj.org/.
The following excerpt from the introduction to the AOP section in the CACM issue describes the essential aspects of AOP, and the difference between OOP and AOP:
AOP is based on the idea that computer systems are better programmed by separately specifying the various concerns (properties or areas of interest) of a system and some description of their relationships, and then relying on mechanisms in the underlying AOP environment to weave or compose them together into a coherent program. ... While the tendency in OOP is to find commonality among classes and push it up the inheritance tree, AOP attempts to realize scattered concerns as first-class elements, and eject them horizontally from the object structure.
Aspect-orientation gives us some hope of solving our compiler complexity problems. We can view each operation on node types (semantic analysis, optimization, code generation, etc) as an "aspect" of the compiler's construction. The AOP language weaves these aspects with the node types to create the final compiler.
We don't really want to implement a new programming language just for compiler construction, especially since the new language's implementation would have all of the problems described above and would therefore be just as difficult to debug and maintain.
The approach that we take with "treecc" is similar to that used by "yacc". A simple rule-based language is devised that is used to describe the intended behaviour declaratively. Embedded code is used to provide the specific implementation details. A translator then converts the input into source code that can be compiled in the usual fashion.
The translator is responsible for generating the tree building and walking code, and for checking that all relevant operations have been implemented on the node types. Functions are provided that make it easier to build and walk the tree data structures from within a "yacc" grammar and other parts of the compiler.
Consider the following yacc grammar for a simple expression language:
%token INT FLOAT
%%
expr: INT
    | FLOAT
    | '(' expr ')'
    | expr '+' expr
    | expr '-' expr
    | expr '*' expr
    | expr '/' expr
    | '-' expr
    ;
(We will ignore the problems of precedence and associativity and assume that the reader is familiar with how to resolve such issues in yacc grammars).
There are 7 types of nodes for this grammar: `intnum', `floatnum', `plus', `minus', `multiply', `divide', and `negate'. They are defined in treecc as follows:
%node expression %abstract %typedef

%node binary expression %abstract =
{
    expression *expr1;
    expression *expr2;
}

%node unary expression %abstract =
{
    expression *expr;
}

%node intnum expression =
{
    int num;
}

%node floatnum expression =
{
    float num;
}

%node plus binary
%node minus binary
%node multiply binary
%node divide binary
%node negate unary
We have introduced three extra node types that refer to any expression, binary expressions, and unary expressions. These can be seen as superclasses in an OO-style framework. We have declared these node types as `abstract' because the yacc grammar will not be permitted to create instances of these classes directly.
The `binary', `unary', `intnum', and `floatnum' node types have field definitions associated with them. These have a similar syntax to C struct declarations.
The yacc grammar is augmented as follows to build the parse tree:
%union {
    expression *node;
    int inum;
    float fnum;
}
%token INT FLOAT
%type <node> expr
%type <inum> INT
%type <fnum> FLOAT
%%
expr: INT             { $$ = intnum_create($1); }
    | FLOAT           { $$ = floatnum_create($1); }
    | '(' expr ')'    { $$ = $2; }
    | expr '+' expr   { $$ = plus_create($1, $3); }
    | expr '-' expr   { $$ = minus_create($1, $3); }
    | expr '*' expr   { $$ = multiply_create($1, $3); }
    | expr '/' expr   { $$ = divide_create($1, $3); }
    | '-' expr        { $$ = negate_create($2); }
    ;
The treecc translator generates the `*_create' functions so that the rest of the compiler can build the necessary data structures on demand. The parameters to the `*_create' functions are identical in type and order to the members of the structure for that node type.
Because `expression', `binary', and `unary' are abstract, there will be no `*_create' functions associated with them. This will help the programmer catch certain kinds of errors.
The type that is returned from a `*_create' function is the first superclass of the node that has a `%typedef' keyword associated with it; `expression *' in this case.
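The generated `*_create' functions can also be called directly from other parts of the compiler. The following fragment is an illustrative sketch, assuming the non-reentrant C output; it builds the tree for the expression `(1 + 2.5) * 3' by hand:

/* Each call returns an `expression *', so the results nest directly. */
expression *e =
    multiply_create(plus_create(intnum_create(1), floatnum_create(2.5f)),
                    intnum_create(3));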
Normally we will want to store extra information with a node beyond that which is extracted by the yacc grammar. In our expression example, we probably want to store type information in the nodes so that we can determine if the whole expression is integer or floating point during semantic analysis. We can add type information to the `expression' node type as follows:
%node expression %abstract %typedef =
{
    %nocreate type_code type;
}
The `%nocreate' flag indicates that the field should not be passed to the `*_create' functions as a parameter. i.e. it provides semantic information that isn't present in the grammar. When nodes are created, any fields that are declared as `%nocreate' will be undefined in value. A default value can be specified as follows:
%node expression %abstract %typedef =
{
    %nocreate type_code type = {int_type};
}
Default values must be enclosed in `{' and `}' because they are pieces of code in the underlying source language (C, C++, etc), instead of tokens in the treecc syntax. Any legitimate expression in the underlying source language may be used.
We also need to arrange for `type_code' to be declared. One way to do this is by adding a `%decls' section to the front of the treecc input file:
%decls %{

typedef enum
{
    int_type,
    float_type
} type_code;

%}
We could have introduced the definition by placing a `#include' directive into the `%decls' section instead, or by defining a treecc enumerated type:
%enum type_code = { int_type, float_type }
Now that we have these definitions, type-inferencing can be implemented as follows:
%operation void infer_type(expression *e)

infer_type(binary)
{
    infer_type(e->expr1);
    infer_type(e->expr2);
    if(e->expr1->type == float_type || e->expr2->type == float_type)
    {
        e->type = float_type;
    }
    else
    {
        e->type = int_type;
    }
}

infer_type(unary)
{
    infer_type(e->expr);
    e->type = e->expr->type;
}

infer_type(intnum)
{
    e->type = int_type;
}
This example demonstrates using the abstract node types `binary' and `unary' to define operations on all subclasses. The treecc translator will generate code for a full C function called `infer_type' that incorporates all of the cases.
But hang on a second! What happened to `floatnum'? Where did it go? It turns out that treecc will catch this: it will report an error to the effect that node type `floatnum' is not handled in operation `infer_type'. The missing case is defined as follows:
infer_type(floatnum)
{
    e->type = float_type;
}
As we can see, treecc has just caught a bug in the language implementation and reported it to us as soon as we introduced it.
Let's now extend the language with a `power' operator:
yacc:

expr: expr '^' expr   { $$ = power_create($1, $3); }
    ;

treecc:

%node power binary
That's all there is to it! When treecc re-translates the input file, it will modify the definition of `infer_type' to include the extra case for `power' nodes. Because `power' is a subclass of `binary', treecc already knows how to perform type inferencing for the new node and it doesn't warn us about a missing declaration.
What if we wanted to restrict the second argument of `power' to be an integer value? We can add the following case to `infer_type':
infer_type(power)
{
    infer_type(e->expr1);
    infer_type(e->expr2);
    if(e->expr2->type != int_type)
    {
        error("second argument to `^' is not an integer");
    }
    e->type = e->expr1->type;
}
The translator now notices that there is a more specific implementation of `infer_type' for `power', and won't use the `binary' case for it.
The most important thing to realise here is that the translator always checks that there are sufficient cases for `infer_type' to cover all relevant node types. If it detects a missing case, it immediately reports an error to the user. This allows tree coverage problems to be found much sooner than with the traditional approach.
See section Full expression example code, for a complete listing of the above example files.
The general form of treecc's command-line syntax is as follows:
treecc [OPTIONS] INPUT ...
Treecc accepts the following command-line options:
-o FILE, --output FILE

-h FILE, --header FILE

-d DIR, --output-dir DIR

-s DIR, --skeleton-dir DIR

-e EXT, --extension EXT

-f, --force-create

-n, --no-output

--help

-v, --version

--
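For example, a typical invocation might look like the following (the file names are illustrative only):

treecc -o expr_tree.c -h expr_tree.h expr_defs.tc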
Treecc input files consist of zero or more declarations that define nodes, operations, options, etc. The following sections describe each of these elements.
Node types are defined using the `node' keyword in input files. The general form of the declaration is:
%node NAME [ PNAME ] [ FLAGS ] [ = FIELDS ]
The `FIELDS' part of a node declaration defines the fields that make up the node type. Each field has the following general form:
[ %nocreate ] TYPE FNAME [ = VALUE ] ';'
When the output language is C, treecc creates a struct-based type called `NAME' that contains the fields for `NAME' and all of its ancestor classes. The type also contains some house-keeping fields that are used internally by the generated code. The following is an example:
typedef struct binary__ binary;
struct binary__
{
    const struct binary_vtable__ *vtable__;
    int kind__;
    char *filename__;
    long linenum__;
    type_code type;
    expression * expr1;
    expression * expr2;
};
The programmer should avoid using any identifier that ends with `__', because it may clash with house-keeping identifiers that are generated by treecc.
When the output language is C++, Java, or C#, treecc creates a class called `NAME', that inherits from the class `PNAME'. The field definitions for `NAME' are converted into public members in the output.
Types that are used in field and parameter declarations have a syntax that is a subset of the features found in C, C++, and Java:
TypeAndName ::= Type [ IDENTIFIER ]
Type        ::= TypeName
              | Type '*'
              | Type '&'
              | Type '[' ']'
TypeName    ::= IDENTIFIER { IDENTIFIER }
Types are usually followed by an identifier that names the field or parameter. The name is required for fields and is optional for parameters: for example, both `int' and `int x' are valid parameter declarations.
The following are some examples of using types:
int
int x
const char *str
expression *expr
Element[][] array
Item&
unsigned int y
const Element
The grammar used by treecc is slightly ambiguous. The last example above declares a parameter called `Element' that has type `const'. The programmer probably intended to declare an anonymous parameter with type `const Element' instead.
This ambiguity is unavoidable given that treecc is not fully aware of the underlying language's type system. When treecc sees a type that ends in a sequence of identifiers, it will always interpret the last identifier as the field or parameter name. Thus, the programmer must write the following instead:
const Element e
Treecc cannot declare types using the full power of C's type system. The most common forms of declarations are supported, and the rest can usually be obtained by defining a `typedef' within a literal code block. See section Literal code declarations, for more information on literal code blocks.
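For example, a field whose natural declaration in C would be a function pointer can be given a simple type name first. The following is an illustrative sketch; the `eval_fn' typedef and the `builtin' node type are hypothetical and not part of the expression example:

%decls %{
/* Hypothetical typedef so that the field below has a one-word type name. */
typedef int (*eval_fn)(int value);
%}

%node builtin expression =
{
    eval_fn func;
}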
It is the responsibility of the programmer to use type constructs that are supported by the underlying programming language. Types such as `const char *' will give an error when the output is compiled with a Java compiler, for example.
Enumerated types are a special kind of node type that can be used by the programmer for simple values that don't require a full abstract syntax tree node. The following is an example of defining a list of the primitive machine types used in a Java virtual machine:
%enum JavaType =
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
}
Enumerations are useful when writing code generators and type inferencing routines. The general form is:
%enum NAME = { VALUES }
Logically, each enumerated value is a special node type that inherits from a parent node type corresponding to the enumerated type `NAME'.
When the output language is C or C++, treecc generates an enumerated typedef for `NAME' that contains the enumerated values in the same order as was used in the input file. The typedef name can be used elsewhere in the code as the type of the enumeration.
When the output language is Java, treecc generates a class called `NAME' that contains the enumerated values as integer constants. Elsewhere in the code, the type `int' must be used to declare variables of the enumerated type. Enumerated values are referred to as `NAME.VALUE'. If the enumerated type is used as a trigger parameter, then `NAME' must be used instead of `int': treecc will convert the type when the Java code is output.
When the output language is C#, treecc generates an enumerated value type called `NAME' that contains the enumerated values as members. The C# type `NAME' can be used elsewhere in the code as the type of the enumeration. Enumerated values are referred to as `NAME.VALUE'.
Operations are declared in two parts: the declaration, and the cases. The declaration part defines the prototype for the operation and the cases define how to handle specific kinds of nodes for the operation.
Operations are defined over one or more trigger parameters. Each trigger parameter specifies a node type or an enumerated type that is selected upon to determine what course of action to take. The following are some examples of operation declarations:
%operation void infer_type(expression *e)

%operation type_code common_type([type_code t1], [type_code t2])
Trigger parameters are specified by enclosing them in square brackets. If none of the parameters are enclosed in square brackets, then treecc assumes that the first parameter is the trigger.
The general form of an operation declaration is as follows:
%operation { %virtual | %inline | %split } RTYPE [CLASS::]NAME(PARAMS)
The `%virtual', `%inline', and `%split' flags modify how the operation's code is generated. In particular, `%split' is typically used together with `%inline' operations to avoid compiler limits on function size.
Once an operation has been declared, the programmer can specify its cases anywhere in the input files. It is not necessary that the cases appear after the operation, or that they be contiguous within the input files. This permits the programmer to place operation cases where they are logically required for maintenance reasons.
There must be sufficient operation cases defined to cover every possible combination of node types and enumerated values that inherit from the specified trigger types. An operation case has the following general form:
NAME(TRIGGERS) [, NAME(TRIGGERS2) ...] { CODE }
Multiple trigger combinations can be associated with a single block of code, by listing them all, separated by commas. For example:
common_type(int_type, int_type)
{
    return int_type;
}

common_type(int_type, float_type),
common_type(float_type, int_type),
common_type(float_type, float_type)
{
    return float_type;
}
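An operation such as `common_type' is then called like an ordinary function from other code. As an illustrative sketch (assuming the C output language), the `infer_type(binary)' case from the earlier example could be rewritten in terms of it:

infer_type(binary)
{
    /* Infer the operand types first, then combine them. */
    infer_type(e->expr1);
    infer_type(e->expr2);
    e->type = common_type(e->expr1->type, e->expr2->type);
}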
"(*)" is used below to indicate an option that is enabled by default.
"C"
, "C++"
,
"Java"
, or "C#"
. The default is "C"
.
#line
directives.
i.e. strip off the directory component. This can be helpful in
combination with the %include %readonly
command when
treecc input files may processed from different directories,
causing common output files to change unexpectedly.
#line
directives. (*)
Sometimes it is necessary to embed literal code within output `.h' and source files. Usually this is to `#include' definitions from other files, or to define functions that cannot be easily expressed as operations.
A literal code block is specified by enclosing it in `%{' and `%}'. The block can also be prefixed with the `%decls', `%both', and `%end' flags.
Another form of literal code block is one which begins with `%%' and extends to the end of the current input file. This form implicitly has the `%end' flag.
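For instance, the `error' function called from the `infer_type(power)' case earlier is not part of the generated API, and could be supplied in a literal code block. The following sketch is purely illustrative:

%%

#include <stdio.h>

/* Illustrative definition of the `error' helper used by the
   expression example. */
void error(const char *msg)
{
    fprintf(stderr, "error: %s\n", msg);
}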
Most treecc compiler definitions will be too large to be manageable in a single input file. They will also be too large to write to a single output file, because doing so may overload the source language compiler.
Multiple input files can be specified on the command-line, or they can be explicitly included by other input files using the `%include' declaration, which may optionally be marked `%readonly'.
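As an illustrative sketch (the file names here are hypothetical):

%include "nodes.tc"
%include %readonly "common-ops.tc"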
Output files can be changed using the `%header' and `%output' declarations, which set the names of the header and source files that subsequent declarations are written to.
When treecc generates the output source code, it must insert several common house-keeping functions and classes into the code. By default, these are written to the first header and source files. This can be changed with the `%common' declaration:
%header "common.h" %output "common.c" %common
When compilers emit error messages to the programmer, it is generally a good idea to indicate which file and which line gave rise to the error. Syntax errors can be emitted fairly easily because the parser usually has access to the current line number. However, semantic errors are harder to report because the parser may no longer be active when the error is detected.
Treecc can generate code that automatically keeps track of what line in the source file was active when a node is created. Every node has two extra private fields that specify the name of the file and the line number. Semantic analysis routines can query this information when reporting errors.
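For example, an error reporting routine can query the tracked position using the `yygetfilename' and `yygetlinenum' functions described later. The following is an illustrative sketch for the C output; the `error_at' helper is hypothetical:

#include <stdio.h>

/* Hypothetical helper: report a semantic error against the source
   position that was recorded when the node was created. */
static void error_at(expression *node, const char *msg)
{
    fprintf(stderr, "%s:%ld: %s\n",
            yygetfilename(node), yygetlinenum(node), msg);
}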
Because treecc is not aware of how to obtain this information, the programmer must supply some additional functions. See section API's available in the generated output, for more information.
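For the C output, the position attached to new nodes typically comes from the `yycurrfilename' and `yycurrlinenum' functions listed in the next section, which the programmer supplies. The following is a minimal sketch, assuming a non-reentrant compiler and hypothetical `current_filename' and `current_line' globals maintained by the lexer:

extern char *current_filename;  /* hypothetical globals updated by the lexer */
extern long current_line;

/* Return the position that newly created nodes should be tagged with. */
char *yycurrfilename(void)
{
    return current_filename;
}

long yycurrlinenum(void)
{
    return current_line;
}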
The source code that is generated by treecc exports a number of application programmer interfaces (API's) to the programmer. These can be used elsewhere in the compiler implementation to manipulate abstract syntax trees. The following sections describe the API's for each of the output languages.
In the C output language, each node type is converted into a `typedef' that contains the node's fields, and the fields of its ancestor node types. The following example demonstrates how treecc node declarations are converted into C source code:
%node expression %abstract %typedef =
{
    %nocreate type_code type;
}

%node binary expression %abstract =
{
    expression *expr1;
    expression *expr2;
}

%node plus binary
becomes:
typedef struct expression__ expression;
typedef struct binary__ binary;
typedef struct plus__ plus;

struct expression__
{
    const struct expression_vtable__ *vtable__;
    int kind__;
    char *filename__;
    long linenum__;
    type_code type;
};

struct binary__
{
    const struct binary_vtable__ *vtable__;
    int kind__;
    char *filename__;
    long linenum__;
    type_code type;
    expression * expr1;
    expression * expr2;
};

struct plus__
{
    const struct plus_vtable__ *vtable__;
    int kind__;
    char *filename__;
    long linenum__;
    type_code type;
    expression * expr1;
    expression * expr2;
};
Programmers should avoid using any identifiers that end in `__'. Such identifiers are reserved for internal use by treecc and its support routines.
For each non-abstract node type called `NAME', treecc generates a function called `NAME_create' that creates nodes of that type. The general form of the function's prototype is as follows:
TYPE *NAME_create([YYNODESTATE *state,] PARAMS)
For example, the `plus' node type results in the following prototype:

expression *plus_create(expression * expr1, expression * expr2);
Enumerated types are converted into a C `typedef' with the same name and values:
%enum JavaType =
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
}
becomes:
typedef enum
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
} JavaType;
Virtual operations are converted into C macros that invoke the correct vtable entry on a node type:
%operation %virtual void infer_type(expression *e)
becomes:
#define infer_type(this__) \
    ((*(((struct expression_vtable__ *) \
        ((this__)->vtable__))->infer_type_v__)) \
            ((expression *)(this__)))
Calls to `infer_type' can then be made with `infer_type(node)'.
Non-virtual operations are converted into C functions:
%operation void infer_type(expression *e)
becomes:
extern void infer_type(expression *e);
Because virtual and non-virtual operations use a similar call syntax, it is very easy to convert a virtual operation into a non-virtual operation when the output language is C. This isn't possible with the other output languages.
Other house-keeping tasks are performed by the following functions and macros. Some of these must be supplied by the programmer. The `state' parameter is required only if a reentrant compiler is being built.
int yykind(ANY *node)
const char *yykindname(ANY *node)
int yyisa(ANY *node, type)
char *yygetfilename(ANY *node)
long yygetlinenum(ANY *node)
void yysetfilename(ANY *node, char *value)
void yysetlinenum(ANY *node, long value)
char *yycurrfilename([YYNODESTATE *state])
long yycurrlinenum([YYNODESTATE *state])
void yynodeinit([YYNODESTATE *state])
void *yynodealloc([YYNODESTATE *state,] unsigned int size)

Allocates a block of `size' bytes within the node memory manager and returns a pointer to it. This function returns NULL if the system is out of memory, or if `size' is too large to be allocated within the node memory manager. If the system is out of memory, then `yynodealloc' will call `yynodefailed' prior to returning NULL.

int yynodepush([YYNODESTATE *state])

Pushes the current node memory manager position. The next time `yynodepop' is called, the node memory manager will reset to the pushed position. This function returns zero if the system is out of memory.

void yynodepop([YYNODESTATE *state])

Pops the node memory manager position that was saved by a previous call to `yynodepush'. This function has no effect if `yynodepush' was not called previously.

The `yynodepush' and `yynodepop' functions can be used to perform a simple kind of garbage collection on nodes. When the parser enters a scope, it pushes the node memory manager position. After all definitions in the scope have been dealt with, the parser pops the node memory manager to reclaim all of the memory used.

void yynodeclear([YYNODESTATE *state])

Clears the node memory manager and returns it to the state it was in just after `yynodeinit'. This is typically used upon program shutdown to free all remaining node memory.

void yynodefailed([YYNODESTATE *state])

Called when `yynodealloc' or `yynodepush' detects that the system is out of memory. This function must be supplied by the programmer. The programmer may choose to exit the program when the system is out of memory; in which case `yynodealloc' will never return NULL.
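The following fragment is an illustrative sketch of how these functions might fit together in a non-reentrant compiler built from the expression example. The header name, the `yyparse' entry point, and the `parse_result' variable are assumptions for the sketch, not part of treecc's output:

#include <stdio.h>
#include <stdlib.h>
#include "expr_tree.h"            /* hypothetical name of the generated header */

extern int yyparse(void);
extern expression *parse_result;  /* hypothetical: set by the yacc grammar */

/* Must be supplied by the programmer: called when memory is exhausted. */
void yynodefailed(void)
{
    fputs("out of memory\n", stderr);
    exit(1);
}

/* yycurrfilename and yycurrlinenum must also be supplied, as sketched
   in the previous section. */

int main(void)
{
    yynodeinit();                  /* initialise the node memory manager */
    if(yyparse() == 0)
    {
        infer_type(parse_result);  /* run the type-inferencing operation */
        printf("kind %s, type %s\n",
               yykindname(parse_result),
               parse_result->type == float_type ? "float" : "int");
    }
    yynodeclear();                 /* release all node memory on shutdown */
    return 0;
}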
In the C++ output language, each node type is converted into a `class' that contains the node's fields, virtual operations, and other house-keeping definitions. The following example demonstrates how treecc node declarations are converted into C++ source code:
%node expression %abstract %typedef = { %nocreate type_code type; } %node binary expression %abstract = { expression *expr1; expression *expr2; } %node plus binary
becomes:
class expression
{
protected:

    int kind__;
    char *filename__;
    long linenum__;

public:

    int getKind() const { return kind__; }
    const char *getFilename() const { return filename__; }
    long getLinenum() const { return linenum__; }
    void setFilename(char *filename) { filename__ = filename; }
    void setLinenum(long linenum) { linenum__ = linenum; }

    void *operator new(size_t);
    void operator delete(void *, size_t);

protected:

    expression();

public:

    type_code type;

    virtual int isA(int kind) const;
    virtual const char *getKindName() const;

protected:

    virtual ~expression();
};

class binary : public expression
{
protected:

    binary(expression * expr1, expression * expr2);

public:

    expression * expr1;
    expression * expr2;

    virtual int isA(int kind) const;
    virtual const char *getKindName() const;

protected:

    virtual ~binary();
};

class plus : public binary
{
public:

    plus(expression * expr1, expression * expr2);

    virtual int isA(int kind) const;
    virtual const char *getKindName() const;

protected:

    virtual ~plus();
};
The following standard methods are available on every node type:
int getKind()
virtual const char *getKindName()
virtual int isA(int kind)
const char *getFilename()
long getLinenum()
void setFilename(char *value)
void setLinenum(long value)
If the generated code is non-reentrant, then the constructor for the class can be used to construct nodes of the specified node type. The constructor parameters are the same as the fields within the node type's definition, except for `%nocreate' fields.
If the generated code is reentrant, then nodes cannot be constructed using the C++ `new' operator. The `*Create' methods on the `YYNODESTATE' factory class must be used instead.
The `YYNODESTATE' class contains a number of house-keeping methods that are used to manage nodes:
static YYNODESTATE *getState()
void *alloc(size_t size)

Allocates a block of `size' bytes within the node memory manager and returns a pointer to it. This method returns NULL if the system is out of memory, or if `size' is too large to be allocated within the node memory manager. If the system is out of memory, then `alloc' will call `failed' prior to returning NULL.

int push()

Pushes the current node memory manager position. The next time `pop' is called, the node memory manager will reset to the pushed position. This method returns zero if the system is out of memory.

void pop()

Pops the node memory manager position that was saved by a previous call to `push'. This method has no effect if `push' was not called previously.

The `push' and `pop' methods can be used to perform a simple kind of garbage collection on nodes. When the parser enters a scope, it pushes the node memory manager position. After all definitions in the scope have been dealt with, the parser pops the node memory manager to reclaim all of the memory used.

void clear()

Clears the node memory manager and frees all memory that has been used for nodes.

virtual void failed()

Called when `alloc' or `push' detects that the system is out of memory. This method is typically overridden by the programmer in subclasses. The programmer may choose to exit the program when the system is out of memory; in which case `alloc' will never return NULL.
virtual char *currFilename()
virtual long currLinenum()
The programmer will typically subclass `YYNODESTATE' to provide additional functionality, and then create an instance of this class to act as the node memory manager and node creation factory.
In the Java output language, each node type is converted into a `class' that contains the node's fields, virtual operations, and other house-keeping definitions. The following example demonstrates how treecc node declarations are converted into Java source code:
%node expression %abstract %typedef =
{
    %nocreate type_code type;
}

%node binary expression %abstract =
{
    expression expr1;
    expression expr2;
}

%node plus binary
becomes:
public class expression
{
    protected int kind__;
    protected String filename__;
    protected long linenum__;

    public int getKind() { return kind__; }
    public String getFilename() { return filename__; }
    public long getLinenum() { return linenum__; }
    public void setFilename(String filename) { filename__ = filename; }
    public void setLinenum(long linenum) { linenum__ = linenum; }

    public static final int KIND = 1;

    public type_code type;

    protected expression()
    {
        this.kind__ = KIND;
        this.filename__ = YYNODESTATE.getState().currFilename();
        this.linenum__ = YYNODESTATE.getState().currLinenum();
    }

    public int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return 0;
    }

    public String getKindName()
    {
        return "expression";
    }
}

public class binary extends expression
{
    public static final int KIND = 2;

    public expression expr1;
    public expression expr2;

    protected binary(expression expr1, expression expr2)
    {
        super();
        this.kind__ = KIND;
        this.expr1 = expr1;
        this.expr2 = expr2;
    }

    public int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return super.isA(kind);
    }

    public String getKindName()
    {
        return "binary";
    }
}

public class plus extends binary
{
    public static final int KIND = 3;

    public plus(expression expr1, expression expr2)
    {
        super(expr1, expr2);
        this.kind__ = KIND;
    }

    public int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return super.isA(kind);
    }

    public String getKindName()
    {
        return "plus";
    }
}
The following standard members are available on every node type:
int KIND
int getKind()
String getKindName()
int isA(int kind)
String getFilename()
long getLinenum()
void setFilename(String value)
void setLinenum(long value)
If the generated code is non-reentrant, then the constructor for the class can be used to construct nodes of the specified node type. The constructor parameters are the same as the fields within the node type's definition, except for `%nocreate' fields.
If the generated code is reentrant, then nodes cannot be constructed using the Java `new' operator. The `*Create' methods on the `YYNODESTATE' factory class must be used instead.
Enumerated types are converted into a Java `class':
%enum JavaType =
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
}
becomes:
public class JavaType
{
    public static final int JT_BYTE       = 0;
    public static final int JT_SHORT      = 1;
    public static final int JT_CHAR       = 2;
    public static final int JT_INT        = 3;
    public static final int JT_LONG       = 4;
    public static final int JT_FLOAT      = 5;
    public static final int JT_DOUBLE     = 6;
    public static final int JT_OBJECT_REF = 7;
}
References to enumerated types in fields and operation parameters are replaced with the type `int'.
Virtual operations are converted into public methods on the Java node classes.
Non-virtual operations are converted into a static method within a class named for the operation. For example,
%operation void InferType::infer_type(expression e)
becomes:
public class InferType
{
    public static void infer_type(expression e)
    {
        ...
    }
}
If the class name (`InferType' in the above example) is omitted, then the name of the operation is used as both the class name and the method name.
The `YYNODESTATE' class contains a number of house-keeping methods that are used to manage nodes:
static YYNODESTATE getState()
String currFilename()
long currLinenum()
The programmer will typically subclass `YYNODESTATE' to provide additional functionality, and then create an instance of this class to act as the global state and node creation factory.
In the C# output language, each node type is converted into a `class' that contains the node's fields, virtual operations, and other house-keeping definitions. The following example demonstrates how treecc node declarations are converted into C# source code:
%node expression %abstract %typedef =
{
    %nocreate type_code type;
}

%node binary expression %abstract =
{
    expression expr1;
    expression expr2;
}

%node plus binary
becomes:
public class expression
{
    protected int kind__;
    protected String filename__;
    protected long linenum__;

    public int getKind() { return kind__; }
    public String getFilename() { return filename__; }
    public long getLinenum() { return linenum__; }
    public void setFilename(String filename) { filename__ = filename; }
    public void setLinenum(long linenum) { linenum__ = linenum; }

    public const int KIND = 1;

    public type_code type;

    protected expression()
    {
        this.kind__ = KIND;
        this.filename__ = YYNODESTATE.getState().currFilename();
        this.linenum__ = YYNODESTATE.getState().currLinenum();
    }

    public virtual int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return 0;
    }

    public virtual String getKindName()
    {
        return "expression";
    }
}

public class binary : expression
{
    public const int KIND = 2;

    public expression expr1;
    public expression expr2;

    protected binary(expression expr1, expression expr2)
        : base()
    {
        this.kind__ = KIND;
        this.expr1 = expr1;
        this.expr2 = expr2;
    }

    public override int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return base.isA(kind);
    }

    public override String getKindName()
    {
        return "binary";
    }
}

public class plus : binary
{
    public const int KIND = 5;

    public plus(expression expr1, expression expr2)
        : base(expr1, expr2)
    {
        this.kind__ = KIND;
    }

    public override int isA(int kind)
    {
        if(kind == KIND)
            return 1;
        else
            return base.isA(kind);
    }

    public override String getKindName()
    {
        return "plus";
    }
}
The following standard members are available on every node type:
const int KIND
int getKind()
virtual String getKindName()
virtual int isA(int kind)
String getFilename()
long getLinenum()
void setFilename(String value)
void setLinenum(long value)
If the generated code is non-reentrant, then the constructor for the class can be used to construct nodes of the specified node type. The constructor parameters are the same as the fields within the node type's definition, except for `%nocreate' fields.
If the generated code is reentrant, then nodes cannot be constructed using the C# `new' operator. The `*Create' methods on the `YYNODESTATE' factory class must be used instead.
Enumerated types are converted into a C# `enum' definition:
%enum JavaType =
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
}
becomes:
public enum JavaType
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF,
}
Virtual operations are converted into public virtual methods on the C# node classes.
Non-virtual operations are converted into a static method within a class named for the operation. For example,
%operation void InferType::infer_type(expression e)
becomes:
public class InferType
{
    public static void infer_type(expression e)
    {
        ...
    }
}
If the class name (`InferType' in the above example) is omitted, then the name of the operation is used as both the class name and the method name.
The `YYNODESTATE' class contains a number of house-keeping methods that are used to manage nodes:
static YYNODESTATE getState()
virtual String currFilename()
virtual long currLinenum()
The programmer will typically subclass `YYNODESTATE' to provide additional functionality, and then create an instance of this class to act as the global state and node creation factory.
The full treecc input file for the expression example is as follows:
%enum type_code =
{
    int_type,
    float_type
}

%node expression %abstract %typedef =
{
    %nocreate type_code type = {int_type};
}

%node binary expression %abstract =
{
    expression *expr1;
    expression *expr2;
}

%node unary expression %abstract =
{
    expression *expr;
}

%node intnum expression =
{
    int num;
}

%node floatnum expression =
{
    float num;
}

%node plus binary
%node minus binary
%node multiply binary
%node divide binary
%node power binary
%node negate unary

%operation void infer_type(expression *e)

infer_type(binary)
{
    infer_type(e->expr1);
    infer_type(e->expr2);
    if(e->expr1->type == float_type || e->expr2->type == float_type)
    {
        e->type = float_type;
    }
    else
    {
        e->type = int_type;
    }
}

infer_type(unary)
{
    infer_type(e->expr);
    e->type = e->expr->type;
}

infer_type(intnum)
{
    e->type = int_type;
}

infer_type(floatnum)
{
    e->type = float_type;
}

infer_type(power)
{
    infer_type(e->expr1);
    infer_type(e->expr2);
    if(e->expr2->type != int_type)
    {
        error("second argument to `^' is not an integer");
    }
    e->type = e->expr1->type;
}
The full yacc grammar is as follows:
%union {
    expression *node;
    int inum;
    float fnum;
}
%token INT FLOAT
%type <node> expr
%type <inum> INT
%type <fnum> FLOAT
%%
expr: INT             { $$ = intnum_create($1); }
    | FLOAT           { $$ = floatnum_create($1); }
    | '(' expr ')'    { $$ = $2; }
    | expr '+' expr   { $$ = plus_create($1, $3); }
    | expr '-' expr   { $$ = minus_create($1, $3); }
    | expr '*' expr   { $$ = multiply_create($1, $3); }
    | expr '/' expr   { $$ = divide_create($1, $3); }
    | expr '^' expr   { $$ = power_create($1, $3); }
    | '-' expr        { $$ = negate_create($2); }
    ;
The EBNF syntax for treecc input files uses the following lexical tokens:
IDENTIFIER    ::= <A-Za-z_> { <A-Za-z0-9_> }
STRING        ::= '"' <anything that does not include '"'> '"'
                | "'" <anything that does not include "'"> "'"
LITERAL_DEFNS ::= "%{" <anything except "%}"> "%}"
LITERAL_END   ::= "%%" <any character sequence until EOF>
LITERAL_CODE  ::= '{' <anything with matched '{' and '}'> '}'
In addition, anything that begins with "%" in the following syntax is a lexical keyword.
The EBNF syntax is as follows:
File          ::= { Declaration }

Declaration   ::= Node
                | Operation
                | OperationCase
                | Option
                | Enum
                | Literal
                | Header
                | Output
                | Common
                | Include

Node          ::= %node IDENTIFIER [ IDENTIFIER ] { NodeFlag } [ '=' Fields ]
NodeFlag      ::= %abstract | %typedef
Fields        ::= '{' { Field } '}'
Field         ::= [ %nocreate ] TypeAndName [ '=' LITERAL_CODE ] ';'
TypeAndName   ::= Type [ IDENTIFIER ]
Type          ::= TypeName | Type '*' | Type '&' | Type '[' ']'
TypeName      ::= IDENTIFIER { IDENTIFIER }

Operation     ::= %operation { OperFlag } Type [ ClassName ] IDENTIFIER
                      '(' [ Params ] ')' [ '=' LITERAL_CODE ] [ ';' ]
OperFlag      ::= %virtual | %inline | %split
ClassName     ::= IDENTIFIER "::"
Params        ::= Param { ',' Param }
Param         ::= TypeAndName | '[' TypeAndName ']'

OperationCase ::= OperationHead { ',' OperationHead } LITERAL_CODE
OperationHead ::= IDENTIFIER '(' [ TypeList ] ')'
TypeList      ::= IDENTIFIER { ',' IDENTIFIER }

Option        ::= %option IDENTIFIER [ '=' Value ]
Value         ::= IDENTIFIER | STRING

Enum          ::= %enum IDENTIFIER '=' '{' EnumBody [ ',' ] '}'
EnumBody      ::= IDENTIFIER { ',' IDENTIFIER }

Literal       ::= { LiteralFlag } (LITERAL_DEFNS | LITERAL_END)
LiteralFlag   ::= %both | %decls | %end

Header        ::= %header STRING
Output        ::= %output STRING
Common        ::= %common
Include       ::= %include [ %readonly ] STRING