Lab 1: Lexical Analysis¶
In this lab, you are to do three things:
Describe DIESEL identifiers, integers, reals, comments and string constants with regular expressions (Assignment 1: Regular expressions).
Choose some of the tokens and draw a DFA showing how the scanner is going to identify them (Assignment 2: Deterministic finite state automation for DIESEL).
Complete an input file to the scanner generator program
flex
, calledscanner.l
(Implementation).
Some years ago, most scanners were written by hand. Writing a scanner is
one of the less complex parts of constructing a compiler, but
nevertheless one that can be very tedious and time-consuming. Soon the
need for automated tools became evident, and several have been developed
over time. The one we will use is called flex
, and is freely
available under a BSD-style license. It is a free version of the unix
tool lex
, which simply stands for "lexical <something>"
.
However, before we go into details about how it works, we will look at
some general scanner concepts.
Lexical scanners¶
Figure 2 A lexical scanner forms the interface between the source code and the syntax analysis part of the parser.¶
Input¶
Sequence of characters, e.g.:
IF X < 10 THEN Y := 15
Output¶
Tokens (so-called basic symbols, groups of successive characters which belong together). Other information is attached to tokens, e.g. the gathered characters and the value of constants. We will show, later on, exactly what this additional information consists of and how it is represented in our program.
Scanner functionality¶
These are the general steps to follow if you wish to write your own scanner:
Recognize tokens: variable names, reserved words, operators and special symbols.
Save identifiers in a string table and return references to them which are then used in other phases of the compiler.
Remove
comments
,blanks
,TAB
etc.Convert strings to values, for instance
'10'
to10
.Discover lexical errors, wrongly formatted floating point numbers, illegal characters such as
%
and$
, etc.Count rows, managing
#include-commands
etc., if this is not done by a preprocessor. The position information is used in later compiler phases to help generate useful error messages.
Constructing scanners¶
When constructing scanners, one would normally start by describing the tokens using regular expressions. Then, the aim is to generate a DFA (Deterministic Finite Automaton) – a graphic description of legal sequences of characters which can easily be implemented in a program. If the language is not too complicated, you can manually generate a DFA directly using the regular expressions. If you are a little more interested, or if you have a large language, you can first generate an NFA (Non-deterministic Finite Automaton), and then use the algorithm for converting an NFA to a DFA, that you find in the course textbook. However, this will not be necessary.
For this assignment we will use a program that automatically generates a
scanner for regular expressions (e.g. lex
, flex
).
For those who are interested, some summary information on how to do it “the hard way” is also provided.
A DFA has no edges with the empty string
.
In any arbitrary state in a DFA there may not be several edges with the same symbol.
Figure 3 Example of a small DFA¶
In this way a computer program which simulates a DFA can always decide what the next stage is after reading a character. In principle there are two ways of realising a DFA in a program:
Encode the DFA as a table. The scanner will thus be data-driven and easier to modify:
Table 4 Example of a small DFA in table representation¶ Blank
Digit
“D”
“O”
“.”
1
1
2
5
-
3
2
-
2
-
-
-
3
-
-
-
-
4
4
-
-
-
-
-
5
-
-
-
6
-
6
-
-
-
-
-
We suggest that the analysis progresses in this way: a character is read and the table is indexed with the current state and the character. This is repeated until the content in the table is not defined (-). If this is in an accepting state, the string is accepted and a token is returned; otherwise a lexical error is reported.
Encode the DFA as a program. This can be done using
switch
-case
constructions (or the infamousgoto
statement), where for each state, there is some code which reads in a character and decides which state to enter next. What you gain in flexibility (easy to add extra code), you lose in overview and maintenance.
As you can see, it is normal to divide the incoming characters into
classes such as letter, digit, etc. so that the tables/programs
don’t grow too large. There is, thus, a sort of scanner in the
scanner.1
1.
Tokens in DIESEL¶
The tokens are what makes up a DIESEL program. In this section, we will
go through all the various tokens in DIESEL, and show some examples of
valid/invalid ones. You will have to write up regular expressions for
them (see below), and later on add them (most of which will be trivial)
to the scanner.l
file. However, some of them are more complicated,
e.g., integers, reals, identifiers, strings and comments, and you should
make sure you get them right as well.
Numerical constants¶
A numerical constant is either an integer or a floating-point number. Exponents are allowed in floating-point numbers. No character may precede the constant as unary operators are managed by the parser.
Integers (non-negative)¶
Integers are only made from a sequence of digits as characters are dealt
with by the parser. For example, -39
will result in two tokens. The
parser decides afterwards whether the minus character is a unary
operator (e.g. A := -39
) or a binary (e.g. A := 17 - 39
).
|
OK |
|
OK |
|
OK |
|
Non-positive integer. |
|
Not an integer but an floating-point number. |
Floating-point numbers¶
Floating-point numbers are integers or decimal numbers followed by an
exponent. A decimal number is a sequence of numbers where a decimal
point is placed at an arbitrary place in the sequence. The exponent
consists of an E
or an e
followed by an integer, with or
without a sign.
|
OK |
|
OK |
|
OK |
|
OK |
|
OK |
|
OK |
|
OK |
|
Not floating-point, but integer. |
|
Not floating-point, but identifier. |
|
Not floating-point, lacks numbers. |
|
Not floating-point, lacks exponent. |
|
Not floating-point, because negative. |
String constants¶
A string constant starts and finishes with an apostrophe. Between these
there can be zero or several characters. They can not contain line
breaks. If you want an apostrophe in the string, you write it twice,
e.g. 'don''t care'
. The internal representation in the compiler will
then be don't care
.
Reserved words¶
All the reserved words in DIESEL have their own token, and you will need
to enter regular expressions describing them into the scanner.l
file. Most of these regular expressions will be trivial, since each
reserved word is a token of its own, e.g.: BEGIN
.
|
Reserved word |
|
Reserved word |
|
Reserved word |
|
Not a reserved word, but identifier |
|
Not a reserved word, but illegal token |
|
Not a reserved word, but illegal token |
Identifiers¶
An identifier always starts with a letter which can be followed by
letters/digits. Letters include A - Z,a - z
and also _
(underscore). We thus allow an identifier to begin with _
. Note that
reserved words and identifiers are case insensitive.
|
Identifier |
|
Identifier |
|
Identifier |
|
Identifier |
|
Identifier |
|
Not an identifier, but illegal token. |
Special characters¶
The following are the allowed special characters in DIESEL:
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
|
|
|
Separators¶
The separators are blank
, TAB
, and return
.
Theory assignment¶
These assignments are mandatory, and must be handed in to your lab assistant. Once you have finished these assignments the implementation part of the lab will be much easier to do.
Do not spend too much time doing this assignments though, any sketch on a paper will do. Save your time for the implementation of the future labs, you will need it.
Assignment 1: Regular expressions¶
Describe identifiers, constants (integers, reals and strings) and
special characters using regular expressions. You can write them
directly in the scanner.l
file.
Assignment 2: Deterministic finite state automation for DIESEL¶
Design a deterministic finite state automaton for a subset of DIESEL(but make sure to include floating-point numbers). Describe the automaton graphically. Show clearly which are start and final states. Mark the edges with the character causing the transition. You will need an error state; edges leading to the error state do not have to be drawn. It is implicitly assumed that all input which does not match one of your edges causes the DFA to go into an error state.
Implementation¶
Complete the input file to flex
, scanner.l
, using the
information given above and below. You will have to use the regular
expressions you defined earlier. You must add token-specific information
into the yylval
union where appropriate (integers, reals, string
constants and identifiers), and position information in the yylloc
struct for all tokens.
More specifically, your scanner should do the following:
Keep track of position information. This is done using the special variable
yylloc
, which is really a struct:-
struct
YYLTYPE
For every token, these fields should be set to the correct position in the source code file.
Public Members
-
int
first_line
-
int
first_column
-
int
last_line
-
int
last_column
-
int
Note
flex
helps you keep track of the line number by use of the yylineno option, but you will have to implement your own column counter and update it as necessary when reading tokens.-
struct
When you find an identifier, you have to install in the string pool, which resides in the
symbol_table
class. To this end, the methodspool_install()
,pool_forget()
,lookup_symbol()
andget_symbol_id()
are all (at least some, see below) useful. They are described in more detail below. Note that the symbol table is a global variable, calledsym_tab
. See the filesymtab.hh
for more information.Enter required information for integers, reals, identifiers and string constants into the
yylval
union.Make sure that DIESEL is case-insensitive, i.e., the identifiers
foo
andFoo
should both internally be represented asFOO
.
The yylval
variable is represented by a union:
-
union
YYSTYPE
- #include <scanner.hh>
Public Members
-
long
ival
Value of an integer
T_INTNUM
-
double
rval
Value of a real
T_REALNUM
-
pool_index
str
Value of a string
T_STRINGCONST
-
pool_index
pool_p
Value of an identifier
T_IDENT
-
long
IF FOO < 10 THEN BAR := 15.0
token |
pool_p |
ival |
rval |
str |
---|---|---|---|---|
|
||||
|
||||
10 |
||||
|
||||
|
||||
15 |
Note
A union
in C simply shares the memory location of the members.
It is memory-efficient, but if you are not careful you may introduce
weird bugs.
yylval.ival = 13;
yylval.rval = 1.0;
// Gives "garbage" - the binary representation of 1.0
x = yylval.ival;
Flex manual¶
For flex-specific information, see the manual.
You will have to study this manual in order to be able to complete this
lab successfully!
After having read this, the information in the skeleton scanner.l
file
coupled with the information below should be enough to get you started.
The files that you need to change¶
- scanner.l
is the
flex
input file, which you’re going to complete. This is the only file you will need to edit in this lab.
Other files of interest¶
These files you might have to read through, but they should not require editing at this point.
- scanner.hh
is a temporary include file used for scanner testing.
- scantest.cc
is an interactive test program for your scanner.
- symtab.hh
contains symbol table information, including string pool methods.
- symbol.cc
contains symbol implementations (will be edited in lab 2).
- symtab.cc
contains the symbol table implementation.
- error.hh and error.cc
contain debug and error message routines.
Declarations and routines available in scanner.hh, error.hh, and symtab.hh.¶
Constant declarations¶
The #defines
in the scanner.hh
file denote tokens. Using the
actual numbers is a very bad idea! They are not chosen randomly, but
rather by the bison
program, which we will look closer at later on
in this lab course. The declarations of the yylval
and yylloc
variables can also be found in this file. In total there are NR_SYMS
symbols(terminals) in the grammar.
The file error.hh
contains some handy error routines which are
globally available. Use yyerror()
to give error messages from the
scanner. Most of the other routines in this file will not need to be
used until later in the lab course. The file symtab.hh
is big, and
contains complete information on all symbol classes, as well as the
symbol table itself. You will not need to use most of them in this lab.
The ones of interest are described below.
Type declarations¶
In symtab.hh
, some types are defined to make the code more readable.
The only one you will need to use in this lab is the first one (the
second as well, if you want the symbol table to use shared strings):
typedef long pool_index; /* indices into the string pool */
typedef long hash_index; /* indices into the symbol table */
typedef long sym_index; /* indices into the hash table */
typedef int block_level; /* indices into the block table */
If this doesn’t make sense to you right now, don’t worry: You won’t need to understand what the three last ones are used for until the next lab.
Variable declarations¶
-
YYSTYPE
yylval
Contains the additional attributes needed to describe certain tokens (integers, reals, string constants, and identifiers).
-
YYLTYPE
yylloc
Contains variables for tracking position in the source code.
-
symbol_table *
sym_tab
A global pointer to the symbol_table.
You will use functions in the symbol_table to install strings (and, if you want to make your compiler use shared strings, to reduce memory requirements, to look-up and identify symbols, as well as forget strings).
flex
provides the variable yylineno
, which is guaranteed to
contain the current line number in the source code file. This
information should be entered in yylloc.first_line
for all tokens.
flex
also provides the variables yytext
, which contains the
string of the currently parsed token ("16"
, "foo"
, "BEGIN"
etc), and yyleng
, which is set to the length of yytext
.
The file error.hh
declares a variable called error_count
.
Ideally compilation should continue as long as possible, but if one or
more errors are found, no assembler code will be generated. Increase
this variable by one every time you find an error. If the error is a
fatal one, you can call the fatal()
method, which will abort
compilation with an error message.
Methods¶
The following methods might come in handy:
-
class
symbol_table
Public Functions
-
pool_index
pool_install
(char*) Install a string (an identifier or a string constant) in the string pool. Returns the index to the installed string.
-
char *
pool_lookup
(const pool_index) Given a
pool_index
into the string pool, returns the string it points to.
-
pool_index
pool_forget
(const pool_index) Remove last installed entry from the string pool. This will only be useful if you opt to implement shared strings.
-
char *
fix_string
(const char*) Remove double
''
in strings constants.Internalizes a string constant. You hopefully use this method in several labs.
-
char *
capitalize
(const char*) Capitalizes a string. You hopefully use this method in several labs.
-
pool_index
-
void
yyerror
(string) This must be defined (for
flex
), but usingerror(pos) << "foo"
is preferable.
Program specification¶
Lower- and upper-case characters should be treated in the same way, except for string constants.
When a numerical constant (this is not the same as a declared constant! we refer here to
"2"
,"4.0"
, etc.) is found, its value should be stored in theyylval
union.When an identifier is found, its index into the string table should be stored in the
yylval
union.When a string constant is found, an index into the string table, where its internalized form is stored, should be stored in the
yylval
union.Position information, both line and column, is to be kept for each token, using the
yylloc
variable.The scanner should only give up on truly fatal errors, such as a full string pool, un-terminated comments and the like. Otherwise, an error should be printed, but the compiler should continue reading tokens. The more errors printed in one compiler iteration, the better.
How do I know it’s correct?¶
When you compile your scanner using make
, a binary executable file
named scanner
will be created. When run, it will wait for keyboard
input and return the token found, its position, and any other
information associated with it. Type in all valid tokens and make sure
they are identified correctly. A list of all valid tokens can be found
in the scanner.hh
file. You can type in more than one token at a
time; the input will only be scanned after you hit <enter>
. End the
program by typing <enter>
on a new line, or by hitting Ctrl-C
. A
sample execution is provided at the end of this chapter.
Also note that the test program will get jammed pretty quickly if you
try to compile it using the bare-bones skeleton file scanner.l
.
Complete the scanner, and then run the test program.
Reporting your work¶
Hand in all the parts of the Theory assignment.
For the demonstration, show that your program works as expected on some example programs (also show that
scannertest1.d
produces the result shown inscannertest1.trace
by runningmake lab1
).For the implementation part, hand in your code as described in Handing in Results.
Example execution¶
Running make lab1
will take the sample input and compare it to the
sample output. You should test your code on more examples than this.
./scanner ../testpgm/scannertest1.d
Scanned T_PROGRAM 'program' (1, 0)
Scanned T_REALNUM '3.3' (2, 0) <yylval.rval = 3.3>
Scanned T_INTNUM '5' (2, 4) <yylval.ival = 5>
Scanned T_REALNUM '7.8e4' (2, 6) <yylval.rval = 78000>
Scanned T_REALNUM '4E10' (2, 12) <yylval.rval = 4e+10>
Scanned T_REALNUM '.1e-4' (2, 17) <yylval.rval = 1e-05>
Scanned T_REALNUM '1.e+3' (2, 23) <yylval.rval = 1000>
Scanned T_ADD '+' (4, 0)
Scanned T_SUB '-' (4, 2)
Scanned T_MUL '*' (4, 4)
Scanned T_RDIV '/' (4, 6)
Scanned T_LESSTHAN '<' (4, 8)
Scanned T_GREATERTHAN '>' (4, 10)
Scanned T_EQ '=' (4, 12)
Scanned T_NOTEQ '<>' (4, 14)
Scanned T_ASSIGN ':=' (4, 17)
Scanned T_LEFTPAR '(' (4, 20)
Scanned T_RIGHTPAR ')' (4, 22)
Scanned T_LEFTBRACKET '[' (4, 24)
Scanned T_RIGHTBRACKET ']' (4, 26)
Scanned T_COMMA ',' (4, 28)
Scanned T_DOT '.' (4, 30)
Scanned T_COLON ':' (4, 32)
Scanned T_SEMICOLON ';' (4, 34)
Scanned T_IF 'if' (5, 0)
Scanned T_LEFTPAR '(' (5, 3)
Scanned T_IDENT 'x' (5, 4) <yylval.pool_p = X>
Scanned T_LEFTBRACKET '[' (5, 5)
Scanned T_IDENT 'y' (5, 6) <yylval.pool_p = Y>
Scanned T_RIGHTBRACKET ']' (5, 7)
Scanned T_GREATERTHAN '>' (5, 8)
Scanned T_INTNUM '3' (5, 9) <yylval.ival = 3>
Scanned T_RIGHTPAR ')' (5, 10)
Scanned T_THEN 'then' (5, 11)
Scanned T_IDENT 'doit' (5, 16) <yylval.pool_p = DOIT>
Scanned T_SEMICOLON ';' (5, 20)
Scanned T_STRINGCONST ''string'' (9, 0) <yylval.str = STRING>
Scanned T_STRINGCONST ''panic: can''t happen'' (10, 0) <yylval.str = PANIC: CAN'T HAPPEN>
Scanned T_STRINGCONST '''''' (11, 0) <yylval.str = '>
Scanned T_INTNUM '12345678910' (13, 0) <yylval.ival = 12345678910>
Error: line 15: Integer out of range
Scanned T_INTNUM '36893488147419103231' (15, 0) <yylval.ival = 9223372036854775807>
Scanned T_INTNUM '9' (17, 0) <yylval.ival = 9>
Scanned T_IDENT 'b' (17, 1) <yylval.pool_p = B>
Scanned T_DOT '.' (18, 0)
Scanned T_IDENT 'e3' (18, 1) <yylval.pool_p = E3>
Error: line 19: Illegal character
Error: line 21: Newline in string
Scanned T_IDENT 'newline' (21, 2) <yylval.pool_p = NEWLINE>
Scanned T_IDENT 'in' (21, 10) <yylval.pool_p = IN>
Scanned T_IDENT 'it' (21, 13) <yylval.pool_p = IT>
Error: line 22: Newline in string
Scanned T_IDENT 'foo' (23, 0) <yylval.pool_p = FOO>
Error: line 23: Illegal character
Scanned T_IDENT 'fie' (23, 3) <yylval.pool_p = FIE>
Scanned T_IDENT '_id' (24, 0) <yylval.pool_p = _ID>
Scanned T_IDENT 'id_1' (25, 0) <yylval.pool_p = ID_1>
Scanned T_INTNUM '1' (26, 0) <yylval.ival = 1>
Scanned T_IDENT 'e' (26, 1) <yylval.pool_p = E>
Scanned T_IDENT 'integer' (27, 0) <yylval.pool_p = INTEGER>
End of file
- 1
Actually one ought to distinguish between a scanner, which works on characters, and a lexical scanner which groups output from the scanner to tokens.
Comments¶
Comments start with
{
and finish with}
and can not be nested. Comments can contain a line-break. This means that{
in a comment can either result in an error or be ignored.To honor the move in implementation language made 1998-1999, from Pascal to C++, DIESEL now also allows C++ style, one-line comments starting with
//
. Furthermore, C/C++ style comments starting with/*
and ending with*/
(can span several lines) are supported. Both of these have already been implemented in thescanner.l
file, and might be helpful to look at when you provide lexical actions for your own regular expressions later on (hint: especially the old-style comments mentioned above, and strings).