Lab 1: Lexical Analysis

In this lab, you are to do three things:

  1. Describe DIESEL identifiers, integers, reals, comments and string constants with regular expressions (Assignment 1: Regular expressions).

  2. Choose some of the tokens and draw a DFA showing how the scanner is going to identify them (Assignment 2: Deterministic finite state automation for DIESEL).

  3. Complete an input file to the scanner generator program flex, called scanner.l (Implementation).

Some years ago, most scanners were written by hand. Writing a scanner is one of the less complex parts of constructing a compiler, but nevertheless one that can be very tedious and time-consuming. Soon the need for automated tools became evident, and several have been developed over time. The one we will use is called flex, and is freely available under a BSD-style license. It is a free version of the unix tool lex, which simply stands for "lexical <something>". However, before we go into details about how it works, we will look at some general scanner concepts.

Lexical scanners

Lexical scanner

Figure 2 A lexical scanner forms the interface between the source code and the syntax analysis part of the parser.

Input

Sequence of characters, e.g.:

IF X < 10 THEN Y := 15

Output

Tokens (so-called basic symbols, groups of successive characters which belong together). Other information is attached to tokens, e.g. the gathered characters and the value of constants. We will show, later on, exactly what this additional information consists of and how it is represented in our program.

Scanner functionality

These are the general steps to follow if you wish to write your own scanner:

  1. Recognize tokens: variable names, reserved words, operators and special symbols.

  2. Save identifiers in a string table and return references to them which are then used in other phases of the compiler.

  3. Remove comments, blanks, TAB etc.

  4. Convert strings to values, for instance '10' to 10.

  5. Discover lexical errors, wrongly formatted floating point numbers, illegal characters such as % and $, etc.

  6. Count rows, managing #include-commands etc., if this is not done by a preprocessor. The position information is used in later compiler phases to help generate useful error messages.

Constructing scanners

When constructing scanners, one would normally start by describing the tokens using regular expressions. Then, the aim is to generate a DFA (Deterministic Finite Automaton) – a graphic description of legal sequences of characters which can easily be implemented in a program. If the language is not too complicated, you can manually generate a DFA directly using the regular expressions. If you are a little more interested, or if you have a large language, you can first generate an NFA (Non-deterministic Finite Automaton), and then use the algorithm for converting an NFA to a DFA, that you find in the course textbook. However, this will not be necessary.

For this assignment we will use a program that automatically generates a scanner for regular expressions (e.g. lex, flex).

For those who are interested, some summary information on how to do it “the hard way” is also provided.

  1. A DFA has no edges with the empty string \epsilon.

  2. In any arbitrary state in a DFA there may not be several edges with the same symbol.

digraph lab1_small_dfa {
rankdir=LR;
node [shape="circle"];
start [label="",style="invis"];

start -> 1 [label="start"];
1 -> 1 [label="blank"];
1 -> 2 [label="digit"];
1 -> 3 [label="\".\""];
3 -> 4 [label="\".\""];
1 -> 5 [label="D"];
5 -> 6 [label="O"];

2 [shape="doublecircle"];
3 [shape="doublecircle"];
4 [shape="doublecircle"];
6 [shape="doublecircle"];

key [shape="none", label=< <table border="0">
  <tr><td align="left">2 → integer</td></tr>
  <tr><td align="left">3 → dot</td></tr>
  <tr><td align="left">4 → interval</td></tr>
  <tr><td align="left">6 → DO</td></tr>
  </table> >]

4 -> key [style="invis"];
}

Figure 3 Example of a small DFA

In this way a computer program which simulates a DFA can always decide what the next stage is after reading a character. In principle there are two ways of realising a DFA in a program:

  1. Encode the DFA as a table. The scanner will thus be data-driven and easier to modify:

    Table 4 Example of a small DFA in table representation

    Blank

    Digit

    “D”

    “O”

    “.”

    1

    1

    2

    5

    -

    3

    2

    -

    2

    -

    -

    -

    3

    -

    -

    -

    -

    4

    4

    -

    -

    -

    -

    -

    5

    -

    -

    -

    6

    -

    6

    -

    -

    -

    -

    -

    We suggest that the analysis progresses in this way: a character is read and the table is indexed with the current state and the character. This is repeated until the content in the table is not defined (-). If this is in an accepting state, the string is accepted and a token is returned; otherwise a lexical error is reported.

  2. Encode the DFA as a program. This can be done using switch-case constructions (or the infamous goto statement), where for each state, there is some code which reads in a character and decides which state to enter next. What you gain in flexibility (easy to add extra code), you lose in overview and maintenance.

As you can see, it is normal to divide the incoming characters into classes such as letter, digit, etc. so that the tables/programs don’t grow too large. There is, thus, a sort of scanner in the scanner.1 1.

Tokens in DIESEL

The tokens are what makes up a DIESEL program. In this section, we will go through all the various tokens in DIESEL, and show some examples of valid/invalid ones. You will have to write up regular expressions for them (see below), and later on add them (most of which will be trivial) to the scanner.l file. However, some of them are more complicated, e.g., integers, reals, identifiers, strings and comments, and you should make sure you get them right as well.

Numerical constants

A numerical constant is either an integer or a floating-point number. Exponents are allowed in floating-point numbers. No character may precede the constant as unary operators are managed by the parser.

Integers (non-negative)

Integers are only made from a sequence of digits as characters are dealt with by the parser. For example, -39 will result in two tokens. The parser decides afterwards whether the minus character is a unary operator (e.g. A := -39) or a binary (e.g. A := 17 - 39).

Table 5 Examples of integer numbers.

10

OK

0

OK

862107

OK

-7

Non-positive integer.

3E5

Not an integer but an floating-point number.

Floating-point numbers

Floating-point numbers are integers or decimal numbers followed by an exponent. A decimal number is a sequence of numbers where a decimal point is placed at an arbitrary place in the sequence. The exponent consists of an E or an e followed by an integer, with or without a sign.

Table 6 Examples of floating-point numbers.

3.3

OK

.3

OK

5E6

OK

.2e-14

OK

7E+3

OK

4.

OK

5.e2

OK

3

Not floating-point, but integer.

E9

Not floating-point, but identifier.

.e3

Not floating-point, lacks numbers.

7E

Not floating-point, lacks exponent.

-3.2

Not floating-point, because negative.

String constants

A string constant starts and finishes with an apostrophe. Between these there can be zero or several characters. They can not contain line breaks. If you want an apostrophe in the string, you write it twice, e.g. 'don''t care'. The internal representation in the compiler will then be don't care.

Reserved words

All the reserved words in DIESEL have their own token, and you will need to enter regular expressions describing them into the scanner.l file. Most of these regular expressions will be trivial, since each reserved word is a token of its own, e.g.: BEGIN.

Table 7 Examples of reserved words.

end

Reserved word

while

Reserved word

IntEger

Reserved word

x3b

Not a reserved word, but identifier

9b

Not a reserved word, but illegal token

foo%fie

Not a reserved word, but illegal token

Identifiers

An identifier always starts with a letter which can be followed by letters/digits. Letters include A - Z,a - z and also _ (underscore). We thus allow an identifier to begin with _. Note that reserved words and identifiers are case insensitive.

Table 8 Examples of identifiers.

x

Identifier

a7

Identifier

odd_num

Identifier

_integer

Identifier

x4b

Identifier

9b

Not an identifier, but illegal token.

Special characters

The following are the allowed special characters in DIESEL:

Table 9 Special characters

+

-

*

/

<

>

=

<>

:=

(

)

[

]

,

.

:

;

Comments

Comments start with { and finish with } and can not be nested. Comments can contain a line-break. This means that { in a comment can either result in an error or be ignored.

To honor the move in implementation language made 1998-1999, from Pascal to C++, DIESEL now also allows C++ style, one-line comments starting with //. Furthermore, C/C++ style comments starting with /* and ending with */ (can span several lines) are supported. Both of these have already been implemented in the scanner.l file, and might be helpful to look at when you provide lexical actions for your own regular expressions later on (hint: especially the old-style comments mentioned above, and strings).

Separators

The separators are blank, TAB, and return.

Theory assignment

These assignments are mandatory, and must be handed in to your lab assistant. Once you have finished these assignments the implementation part of the lab will be much easier to do.

Do not spend too much time doing this assignments though, any sketch on a paper will do. Save your time for the implementation of the future labs, you will need it.

Assignment 1: Regular expressions

Describe identifiers, constants (integers, reals and strings) and special characters using regular expressions. You can write them directly in the scanner.l file.

Assignment 2: Deterministic finite state automation for DIESEL

Design a deterministic finite state automaton for a subset of DIESEL(but make sure to include floating-point numbers). Describe the automaton graphically. Show clearly which are start and final states. Mark the edges with the character causing the transition. You will need an error state; edges leading to the error state do not have to be drawn. It is implicitly assumed that all input which does not match one of your edges causes the DFA to go into an error state.

Implementation

Complete the input file to flex, scanner.l, using the information given above and below. You will have to use the regular expressions you defined earlier. You must add token-specific information into the yylval union where appropriate (integers, reals, string constants and identifiers), and position information in the yylloc struct for all tokens.

More specifically, your scanner should do the following:

  • Keep track of position information. This is done using the special variable yylloc, which is really a struct:

    struct YYLTYPE

    For every token, these fields should be set to the correct position in the source code file.

    Public Members

    int first_line
    int first_column
    int last_line
    int last_column

    Note

    flex helps you keep track of the line number by use of the yylineno option, but you will have to implement your own column counter and update it as necessary when reading tokens.

  • When you find an identifier, you have to install in the string pool, which resides in the symbol_table class. To this end, the methods pool_install(), pool_forget(), lookup_symbol() and get_symbol_id() are all (at least some, see below) useful. They are described in more detail below. Note that the symbol table is a global variable, called sym_tab. See the file symtab.hh for more information.

  • Enter required information for integers, reals, identifiers and string constants into the yylval union.

  • Make sure that DIESEL is case-insensitive, i.e., the identifiers foo and Foo should both internally be represented as FOO.

The yylval variable is represented by a union:

union YYSTYPE
#include <scanner.hh>

Public Members

long ival

Value of an integer T_INTNUM

double rval

Value of a real T_REALNUM

pool_index str

Value of a string T_STRINGCONST

pool_index pool_p

Value of an identifier T_IDENT

IF FOO < 10 THEN BAR := 15.0
Table 10 Tokens returned after the parsing of a piece of code

token

pool_p

ival

rval

str

T_IF

T_IDENT

FOO

T_LESS

T_INTNUM

10

T_THEN

T_IDENT

BAR

T_ASSIGNMENT

T_REALNUM

15

Note

A union in C simply shares the memory location of the members. It is memory-efficient, but if you are not careful you may introduce weird bugs.

yylval.ival = 13;
yylval.rval = 1.0;
// Gives "garbage" - the binary representation of 1.0
x = yylval.ival;

Flex manual

For flex-specific information, see the manual.

You will have to study this manual in order to be able to complete this lab successfully! After having read this, the information in the skeleton scanner.l file coupled with the information below should be enough to get you started.

The files that you need to change

scanner.l

is the flex input file, which you’re going to complete. This is the only file you will need to edit in this lab.

Other files of interest

These files you might have to read through, but they should not require editing at this point.

scanner.hh

is a temporary include file used for scanner testing.

scantest.cc

is an interactive test program for your scanner.

symtab.hh

contains symbol table information, including string pool methods.

symbol.cc

contains symbol implementations (will be edited in lab 2).

symtab.cc

contains the symbol table implementation.

error.hh and error.cc

contain debug and error message routines.

Declarations and routines available in scanner.hh, error.hh, and symtab.hh.

Constant declarations

The #defines in the scanner.hh file denote tokens. Using the actual numbers is a very bad idea! They are not chosen randomly, but rather by the bison program, which we will look closer at later on in this lab course. The declarations of the yylval and yylloc variables can also be found in this file. In total there are NR_SYMS symbols(terminals) in the grammar.

The file error.hh contains some handy error routines which are globally available. Use yyerror() to give error messages from the scanner. Most of the other routines in this file will not need to be used until later in the lab course. The file symtab.hh is big, and contains complete information on all symbol classes, as well as the symbol table itself. You will not need to use most of them in this lab. The ones of interest are described below.

Type declarations

In symtab.hh, some types are defined to make the code more readable. The only one you will need to use in this lab is the first one (the second as well, if you want the symbol table to use shared strings):

typedef long pool_index; /* indices into the string pool */
typedef long hash_index; /* indices into the symbol table */
typedef long sym_index;  /* indices into the hash table */
typedef int block_level; /* indices into the block table */

If this doesn’t make sense to you right now, don’t worry: You won’t need to understand what the three last ones are used for until the next lab.

Variable declarations

YYSTYPE yylval

Contains the additional attributes needed to describe certain tokens (integers, reals, string constants, and identifiers).

YYLTYPE yylloc

Contains variables for tracking position in the source code.

symbol_table *sym_tab

A global pointer to the symbol_table.

You will use functions in the symbol_table to install strings (and, if you want to make your compiler use shared strings, to reduce memory requirements, to look-up and identify symbols, as well as forget strings).

flex provides the variable yylineno, which is guaranteed to contain the current line number in the source code file. This information should be entered in yylloc.first_line for all tokens.

flex also provides the variables yytext, which contains the string of the currently parsed token ("16", "foo", "BEGIN" etc), and yyleng, which is set to the length of yytext.

The file error.hh declares a variable called error_count. Ideally compilation should continue as long as possible, but if one or more errors are found, no assembler code will be generated. Increase this variable by one every time you find an error. If the error is a fatal one, you can call the fatal() method, which will abort compilation with an error message.

Methods

The following methods might come in handy:

class symbol_table

Public Functions

pool_index pool_install(char*)

Install a string (an identifier or a string constant) in the string pool. Returns the index to the installed string.

char *pool_lookup(const pool_index)

Given a pool_index into the string pool, returns the string it points to.

pool_index pool_forget(const pool_index)

Remove last installed entry from the string pool. This will only be useful if you opt to implement shared strings.

char *fix_string(const char*)

Remove double '' in strings constants.

Internalizes a string constant. You hopefully use this method in several labs.

char *capitalize(const char*)

Capitalizes a string. You hopefully use this method in several labs.

void yyerror(string)

This must be defined (for flex), but using error(pos) << "foo" is preferable.

Program specification

  1. Lower- and upper-case characters should be treated in the same way, except for string constants.

  2. When a numerical constant (this is not the same as a declared constant! we refer here to "2", "4.0", etc.) is found, its value should be stored in the yylval union.

  3. When an identifier is found, its index into the string table should be stored in the yylval union.

  4. When a string constant is found, an index into the string table, where its internalized form is stored, should be stored in the yylval union.

  5. Position information, both line and column, is to be kept for each token, using the yylloc variable.

  6. The scanner should only give up on truly fatal errors, such as a full string pool, un-terminated comments and the like. Otherwise, an error should be printed, but the compiler should continue reading tokens. The more errors printed in one compiler iteration, the better.

How do I know it’s correct?

When you compile your scanner using make, a binary executable file named scanner will be created. When run, it will wait for keyboard input and return the token found, its position, and any other information associated with it. Type in all valid tokens and make sure they are identified correctly. A list of all valid tokens can be found in the scanner.hh file. You can type in more than one token at a time; the input will only be scanned after you hit <enter>. End the program by typing <enter> on a new line, or by hitting Ctrl-C. A sample execution is provided at the end of this chapter.

Also note that the test program will get jammed pretty quickly if you try to compile it using the bare-bones skeleton file scanner.l. Complete the scanner, and then run the test program.

Reporting your work

  • Hand in all the parts of the Theory assignment.

  • For the demonstration, show that your program works as expected on some example programs (also show that scannertest1.d produces the result shown in scannertest1.trace by running make lab1).

  • For the implementation part, hand in your code as described in Handing in Results.

Example execution

Running make lab1 will take the sample input and compare it to the sample output. You should test your code on more examples than this.

./scanner ../testpgm/scannertest1.d
Listing 4 trace/scannertest1.trace
Scanned T_PROGRAM 'program' (1, 0)
Scanned T_REALNUM '3.3' (2, 0) <yylval.rval = 3.3>
Scanned T_INTNUM '5' (2, 4) <yylval.ival = 5>
Scanned T_REALNUM '7.8e4' (2, 6) <yylval.rval = 78000>
Scanned T_REALNUM '4E10' (2, 12) <yylval.rval = 4e+10>
Scanned T_REALNUM '.1e-4' (2, 17) <yylval.rval = 1e-05>
Scanned T_REALNUM '1.e+3' (2, 23) <yylval.rval = 1000>
Scanned T_ADD '+' (4, 0)
Scanned T_SUB '-' (4, 2)
Scanned T_MUL '*' (4, 4)
Scanned T_RDIV '/' (4, 6)
Scanned T_LESSTHAN '<' (4, 8)
Scanned T_GREATERTHAN '>' (4, 10)
Scanned T_EQ '=' (4, 12)
Scanned T_NOTEQ '<>' (4, 14)
Scanned T_ASSIGN ':=' (4, 17)
Scanned T_LEFTPAR '(' (4, 20)
Scanned T_RIGHTPAR ')' (4, 22)
Scanned T_LEFTBRACKET '[' (4, 24)
Scanned T_RIGHTBRACKET ']' (4, 26)
Scanned T_COMMA ',' (4, 28)
Scanned T_DOT '.' (4, 30)
Scanned T_COLON ':' (4, 32)
Scanned T_SEMICOLON ';' (4, 34)
Scanned T_IF 'if' (5, 0)
Scanned T_LEFTPAR '(' (5, 3)
Scanned T_IDENT 'x' (5, 4) <yylval.pool_p = X>
Scanned T_LEFTBRACKET '[' (5, 5)
Scanned T_IDENT 'y' (5, 6) <yylval.pool_p = Y>
Scanned T_RIGHTBRACKET ']' (5, 7)
Scanned T_GREATERTHAN '>' (5, 8)
Scanned T_INTNUM '3' (5, 9) <yylval.ival = 3>
Scanned T_RIGHTPAR ')' (5, 10)
Scanned T_THEN 'then' (5, 11)
Scanned T_IDENT 'doit' (5, 16) <yylval.pool_p = DOIT>
Scanned T_SEMICOLON ';' (5, 20)
Scanned T_STRINGCONST ''string'' (9, 0) <yylval.str = STRING>
Scanned T_STRINGCONST ''panic: can''t happen'' (10, 0) <yylval.str = PANIC: CAN'T HAPPEN>
Scanned T_STRINGCONST '''''' (11, 0) <yylval.str = '>
Scanned T_INTNUM '12345678910' (13, 0) <yylval.ival = 12345678910>
Error: line 15: Integer out of range
Scanned T_INTNUM '36893488147419103231' (15, 0) <yylval.ival = 9223372036854775807>
Scanned T_INTNUM '9' (17, 0) <yylval.ival = 9>
Scanned T_IDENT 'b' (17, 1) <yylval.pool_p = B>
Scanned T_DOT '.' (18, 0)
Scanned T_IDENT 'e3' (18, 1) <yylval.pool_p = E3>
Error: line 19: Illegal character
Error: line 21: Newline in string
Scanned T_IDENT 'newline' (21, 2) <yylval.pool_p = NEWLINE>
Scanned T_IDENT 'in' (21, 10) <yylval.pool_p = IN>
Scanned T_IDENT 'it' (21, 13) <yylval.pool_p = IT>
Error: line 22: Newline in string
Scanned T_IDENT 'foo' (23, 0) <yylval.pool_p = FOO>
Error: line 23: Illegal character
Scanned T_IDENT 'fie' (23, 3) <yylval.pool_p = FIE>
Scanned T_IDENT '_id' (24, 0) <yylval.pool_p = _ID>
Scanned T_IDENT 'id_1' (25, 0) <yylval.pool_p = ID_1>
Scanned T_INTNUM '1' (26, 0) <yylval.ival = 1>
Scanned T_IDENT 'e' (26, 1) <yylval.pool_p = E>
Scanned T_IDENT 'integer' (27, 0) <yylval.pool_p = INTEGER>
End of file
1

Actually one ought to distinguish between a scanner, which works on characters, and a lexical scanner which groups output from the scanner to tokens.