Overview of the Compiler¶
Introduction¶
The compiler takes a DIESEL program as input and produces an Intel assembler file. This is done in six phases:
Scanner – manages lexical analysis.
Parser – manages syntactic analysis, builds internal form.
Semantics – checks static semantics.
Optimization – optimizes the internal form.
Quads – generates quadruples from the internal form
Codegen – expands quadruples to assembler.
All these phases use a symbol manager, the symbol table, where information is gathered on variable names and memory requirements, function parameters, etc.
While reading the description of the phases below, it is worthwhile to
study the appropriate source code file. Each step above, plus the symbol
table, is made up of two files - one header file (with a .hh
suffix)
containing class declarations, and a code file (with a .cc
suffix)
containing the method bodies. The exceptions are the scanner, which is
contained in a file named scanner.l
, and the parser, which is
contained in a file named parser.y
. These files are input files to
the compiler-generating tools flex
and bison
, respectively. In
some cases the implementation for a header file is spread out over
several .cc
files, most notably in the case of abstract syntax
trees. This will be pointed out in the text where needed.
The scanner¶
While it is an interesting exercise to hand-write a scanner, it is not
very complicated, and rather tedious. There exist several useful tools
to help automate this process. In this lab, you will be using flex
,
which takes as input a file (scanner.l
) containing regular
expressions, and which as output generates a C file which implements a
scanner recognizing strings described by the regular expressions you
declared.
To be able to complete Lab 1: Lexical Analysis, you will have to learn how to use the
flex
tool. A complete manual is included
in the instructions; here we will only give a very brief overview of the tool does.
Every time it is called, the scanner reads characters from the terminal
until it can return a token to the parser. A token is simply an integer
representing a certain terminal symbol. The next token 1 is read by
calling the scanner’s yylex()
procedure. Some tokens require
additional attributes to be stored (for later use by the parser), such
as the value of an integer or real number, or the string table index of
an identifier. The variable yylval
is used for this. To be able to
generate good error messages, the line and column number for each token
read is stored in the yylloc
variable. The yylval
variable is of
type YYSTYPE
, a union of all allowed token attributes, while
yylloc
is of type YYLTYPE
, a struct of position data. A list of
all tokens in DIESEL, as well as definitions of the above-mentioned
types, are contained in the file scanner.hh
which you will use in
lab 1 and 2.
Example: Assume we have the following DIESEL fragment:
program main;
const version = 'version 1.1';
var x : integer;
begin
x := 0;
shows the token values returned by yylex()
as this code fragment is
fed to the scanner, and also what the yylval
variable is set to for
each call to yylex()
.
MAIN |
|||
VERSION |
|||
version 1.1 |
|||
X |
|||
INTEGER |
|||
X |
|||
0 |
|||
The parser¶
The tokens that the scanner
produces form the input to the
parser
.
The parser
, which is of the LR type, checks whether the
program is syntactically correct. The syntax is described using a
context-free grammar which for DIESEL consists of a number of
productions of which the four first are:
<program> ::= <prog_decl> <subprog_part> <comp_stmt> ’.’ ;
<prog_decl> ::= <proghead> ’;’ <const_part> <variable_part> ;
<proghead> ::= ’PROGRAM’ ’IDENT’ ;
<const_part> ::= ’CONST’ <const_decls>
| ;
To complete this lab, you will have to learn how to use the parser
generator tool bison
. A complete manual can be found on the course
home page. We will only give a brief overview of how it works here.
In this lab, the grammar is fed into bison
, a so-called parser
generator which generates C code from the grammar input file,
parser.y
. The parser is, essentially, a function which loops through
the stream of tokens it gets as input, from the scanner, until it
encounters fatal errors or an end-of-file
token is read. The aim is
to reduce the flow of tokens to the start symbol <program>
. If this
is achieved, the program is syntactically correct. In this lab, you will
be given the complete grammar. However, you will have to add semantic
actions to the input file (more about this in the semantic lab). It is
also your job to try to improve the parser’s error handling, which is
rather non-existing in the bare skeleton grammar file.
Reducing tokens to the start symbol is, of course, not accomplished in one step. Example: for row
x := 0
and the following reductions:
1. <stmt_list> ::= <stmt>
2. <stmt> ::= <variable> ’:=’ <expr>
3. <variable> ::= <var_id>
4. <var_id> ::= <id>
5. <id> ::= T_IDENT
6. <expr> ::= <simple_expr>
7. <simple_expr> ::= <term>
8. <term> ::= <factor>
9. <factor> ::= T_INTNUM
we get the following sequence of reductions:
x := 0 <= (production 5)
rm
<id> := 0 <= (production 4)
rm
<var_id> := 0 <= (production 3)
rm
<variable> := 0 <= (production 9)
rm
<variable> := <factor> <= (production 8)
rm
<variable> := <term> <= (production 7)
rm
<variable> := <simple_expr> <= (production 6)
rm
<variable> := <expr> <= (production 2)
rm
<stmt> <= (production 1)
rm
<stmt_list>
Before a reduction can take place, the whole of the right-hand side must be completed, e.g. for:
2. <stmt> ::= <variable> ’:=’ <expr>
<variable>
must first have been reduced, :=
scanned and
<expr>
reduced before you can reduce to <stmt>
. The right-hand
side is stored in a parse stack, which bison will keep track of for you,
until a reduction can be performed.
Each time a production is reduced, the right-hand side is replaced on
the parse stack by the left-hand side and a semantic action is
performed. See the file parser.y
for how this is done.
Semantics¶
The majority of productions have a semantic action belonging to them. A semantic action is nothing more than a few lines of code that are executed when the parser reduces with the associated production. Note that a compiler can work in one or several phases. The one we are writing here will only have one phase, which means that we will not have to write any intermediary forms to file. However, certain things are not done every time a production is reduced, but rather every time a block (e.g., a function, procedure or program) has been reduced successfully. These things include type checking, optimization, parameter checking, quad generation and code generation.
The semantic actions¶
update the symbol table. This occurs, of course, only when we parse declarations and in the lab will be realised as calls to a family of routines (
enter_variable()
,enter_array()
,enter_function()
,enter_procedure()
etc.) with suitable parameters. This is an important distinction: we first parse declarations and collect information about the current code block, and then we start building an internal form using that information.check static semantics. for example checks that the number of parameters at a procedure call is correct or that the operands of an integer division really are integers. This is done every time we are done parsing a whole block of code, by recursive calls propagating down through the abstract syntax tree representing the current program block. We will be consulting the symbol table heavily during this step, using the information we entered while parsing the block.
construct an internal form of the program. We have chosen abstract syntax trees (AST) as the internal form where the leaves are identifiers or numbers and the nodes operators, e.g. arithmetic operations or control structures. Each block is represented as an AST. Note that the tree only is made up of the procedure or function body, not the declarations. For example,
procedure goofy; var x : integer; begin while (x > 0) do x := x - 1; end; end;
has the abstract syntax tree representation from :
Each terminal and non-terminal on the parse stack has an attribute which the semantic actions can utilise. The attributes that each tree node has vary with the node type. In this lab course, each node type is a class of its own, i.e, we are using abstract syntax classes. The complete declarations for these classes can be found in the file
ast.hh
. For a terminal symbol, the attributes are the contents of the variableyylval
(see the Scanner section). For non-terminals, the attributes are pointers to other abstract syntax trees which have been built during previous reductions. A semantic action for the production:<stmt> ::= T_WHILE <expr> T_DO <stmt_list> T_END
is to create the tree shown in the diagram above. This is a rather trivial operation as the children of the
WHILE
-node are available as attributes of<expr>
and<stmt_list>
. The beauty of using abstract syntax classes is that when we reduce and want to create a new node, we simply pass on the various attributes on the right-hand side as parameters to the new node’s constructor.
In conclusion we can say that when we parse procedure/function:
declarations – we update the symbol table
statements – we construct an abstract syntax tree., and when the whole procedure/function is reduced, we check static semantics.
Optimization¶
After we have built an abstract syntax tree for a program block, we have
an excellent opportunity to perform some optimization on it.
Optimization is a whole field in itself, and we will only implement a
very simple type in this lab, called constant folding. This means that
we try to identify expressions on the form constant operator constant
(e.g., 4 + 5
), and evaluate them at compile time, replacing the
operator node in the AST with the result (the integer 9
, in our
example). This takes place after type checking has been done on the
block, and will be implemented pretty much the same way – through
recursive calls propagating downwards in the AST. A more graphic example
is shown in Figure 1.
Figure 1 Simple example of constant folding x := 3 - 1
to x := 2
.¶
Quadruples¶
As we’ve seen above, for each function- or procedure block an abstract syntax tree is constructed.
After it has been type checked and optimized, it is traversed once more, this time in order to generate a list of quadruples.
All in all there are about forty different quadruples – see the file quads.hh
.
As we have to break down structured programming (if
, while
etc.) to linear code, we must introduce jumps and labels.
Thus there will be quadruples which serve only as labels.
Later when we expand quadruples to assembler code, they will be translated to assembler labels.
In the symbol table manager there is a function get_next_label()
which returns unique labels. The label number for a function/procedure
block is stored in the symbol table so we can make a call using the
q_call
quadruple. It also makes sense to have a common exit from
each procedure/function, as in assembler you have to clear the runtime
stack. This can be performed in the last instructions before the return
jump. The quad list’s last_label
variable is assumed to contain the
number of the current procedure/function exit label. Example:
function foo(x : integer) : integer;
begin
if (x < 0) then
return -2*x;
else
return 2*x;
end;
end;
generates the following quadruples:
q_iload 0 $1 ($1 = 0)
q_ilt X $1 $2 ($2 = X < $1)
q_jmpf 6 $2 (if not $2 then jump to 8)
q_iload 2 $3 ($3 = 2)
q_imult $3 X $4 ($4 = X * $3)
q_iuminus $4 $5 ($5 = - $4)
(return $5)
q_ireturn 5 $5
(jump to label 7)
q_jmp 7
(can be removed by optimization)
q_labl 6
q_iload 2 $6
q_imult $6 X $7
(return $7)
q_ireturn 5 $7
(end of if-statement)
q_labl 7
(foo’s exit)
q_labl 5
($n
are temporary variables which are used to break down arithmetic
expressions).
Code generation¶
Code generation in this compiler is naive. Each quadruple is expanded to one or more assembler instructions. Before the code for each block you have to include some assembler instructions which reserve space on the stack for local and temporary variables (activation records). The amount of space needed is stored in the symbol table for the current procedure/function. After each block we include code for restoring the stack. This will be more explained in the code generation lab.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | L4: ; FOO ; PROLOGUE (FOO) push rbp mov rcx, rsp push [rbp-8] push rcx mov rbp, rcx sub rsp, 56 ; QUAD 1: q_iload 0 - $1 mov rax, 0 mov rcx, [rbp-16] mov [rcx-24], rax ; QUAD 2: q_ilt X $1 $2 mov rcx, [rbp-16] mov rax, [rcx+16] mov rcx, [rbp-16] mov rcx, [rcx-24] cmp rax, rcx jl L8 mov rax, 0 jmp L9 L8: mov rax, 1 L9: mov rcx, [rbp-16] mov [rcx-32], rax ; QUAD 3: q_jmpf 6 $2 - mov rcx, [rbp-16] mov rax, [rcx-32] cmp rax, 0 je L6 ; QUAD 4: q_iload 2 - $3 mov rax, 2 mov rcx, [rbp-16] mov [rcx-40], rax ; QUAD 5: q_imult $3 X $4 mov rcx, [rbp-16] mov rax, [rcx-40] mov rcx, [rbp-16] mov rcx, [rcx+16] imul rax, rcx mov rcx, [rbp-16] mov [rcx-48], rax ; QUAD 6: q_iuminus $4 - $5 mov rcx, [rbp-16] mov rax, [rcx-48] neg rax mov rcx, [rbp-16] mov [rcx-56], rax ; QUAD 7: q_ireturn 5 $5 - mov rcx, [rbp-16] mov rax, [rcx-56] jmp L5 ; QUAD 8: q_jmp 7 - - jmp L7 L6: ; QUAD 9: q_labl 6 - - ; QUAD 10: q_iload 2 - $6 mov rax, 2 mov rcx, [rbp-16] mov [rcx-64], rax ; QUAD 11: q_imult $6 X $7 mov rcx, [rbp-16] mov rax, [rcx-64] mov rcx, [rbp-16] mov rcx, [rcx+16] imul rax, rcx mov rcx, [rbp-16] mov [rcx-72], rax ; QUAD 12: q_ireturn 5 $7 - mov rcx, [rbp-16] mov rax, [rcx-72] jmp L5 L7: ; QUAD 13: q_labl 7 - - L5: ; QUAD 14: q_labl 5 - - ; EPILOGUE (FOO) leave ret |
The symbol table¶
The symbol table stores information about the identifiers that appear in
the program. The sym_type
type contains the objects that can occur
in DIESEL. Each symbol type is represented by a class. Each class
contains a number of attributes. Certain attributes are common to all
symbol types, and are therefore placed in the abstract symbol
superclass, which all the concrete symbols classes inherit:
-
class
symbol
Subclassed by array_symbol, constant_symbol, function_symbol, nametype_symbol, parameter_symbol, procedure_symbol, variable_symbol
Public Members
-
pool_index
id
-
sym_type
tag
-
sym_index
type
-
sym_index
hash_link
-
hash_index
back_link
-
block_level
level
-
int
offset
-
pool_index
Apart from the shared attributes listed above, some of the symbol classes have additional individual attributes which are summarized below:
-
class
constant_symbol
: public symbol Public Members
-
constant_value
const_value
-
constant_value
-
class
array_symbol
: public symbol
-
class
parameter_symbol
: public symbol
-
class
procedure_symbol
: public symbol
-
class
function_symbol
: public symbol
For example, for constants
we need only to remember which block they were declared in (another variable with the same name can “hide” the constant), what its type is (string, integer or floating-point) and, of course, its value.
For a variable
one must know where in the activation record the variable will be stored.
The offset
attribute contains this information.
For procedures and functions all the formal parameters have to be linked together.
When we then parse a call, we can check the type of the actual parameters with the formal parameters by looking up the procedure or function in the symbol table and following the exiting parameter link.
Some elements in the symbol table are predefined. For example the name
types VOID
, INTEGER
and REAL
are already installed in the
symbol table 2. The same goes for the standard functions READ
and
TRUNC
, and also the standard WRITE
procedure.