The Dragon Book describes Lex, a tool for creating lexical analyzers. We will use Flex, which is compatible with Lex, but is considerably faster.
Flex takes a program written in a combination of Flex and C, and it writes out a file (called lex.yy.c) that holds a definition of function yylex(), with the following prototype.
int yylex(void);
yylex reads from the file stored in variable yyin:
FILE* yyin;It is up to you to open a file for reading and store it into yyin before you call yylex.
Each time your program calls yylex, it returns the next token (an integer token code).
When yylex is finished, it call function yywrap(). If yywrap() returns 1, then yylex returns 0 to its caller. That means "end of file".
If yywrap returns 0, then yylex assumes that you have stored a different file into yyin, and it starts reading that file.
In addition to the usual regular expressions, Flex introduces some new notations.
[abcd]
[0-9]
[^abcd]
.
A+
A?
A/B
"string"
Warning: Flex is picky about spaces. Do not put any spaces in regular expressions, unless they are enclosed in double-quotes or square brackets.
A Flex program has 4 parts. The general structure is
%{ part 1 %} part 2 %% part 3 %% part 4
Part 1
Part 4
Part 2
letter [a-zA-Z] digit [0-9]To use one of those defined names, enclose it in braces. For example,
letter [a-zA-Z] digit [0-9] identifier {letter}({letter}|{digit})*defines an identifer to be a letter followed by a sequence of letters and digits.
Part 3
R Actionwhere R is a regular expression and the action is C code that tells what to do when a match for R is found.
In the action, say
return n;to cause yylex to return n as the token.
If you do not surround the action in braces then it must all be on one line. If you surround it in braces, then it will extend to the matching right brace. But the left brace must be on the same line as the regular expression.
If two or more different regular expressions match the same string, then the action associated with the first one is taken.
If your action does not return, then yylex will start over at its beginning, thowing away the characters that were matched. Use this for characters that are not part of a lexeme.
The token attribute (for tokens that have attributes) is stored into variable yylval, which has type YYSTYPE. You will want to define YYSTYPE. Also, state that YYSTYPE is defined. For example,
#define YYSTYPE_IS_DECLARED typedef union { const char* strval; int ival; } YYSTYPE;
Flex stores the lexeme that it found as a null-terminated string in variable yytext. The length of the lexeme is put into variable yyleng.
Here is a simple lexer written in flex.
%{ #include <stdio.h> #include "yystype.h" /* defines YYSTYPE */ #include "tokens.h" /* defines tokens */ YYSTYPE yylval; %} digit [0-9] letter [a-zA-Z] identifier {letter}+ comment "%".*"\n" %% [ \t] {} [\n] {line_num++; } {comment} {} "while" {return TOK_WHILE; } "(" {return TOK_LPAREN; } ")" {return TOK_RPAREN; } {digit}+ {yylval.ival = atoi(yytext); return TOK_INTCONST; } {identifier} {yylval.strval = intern(yytext); return TOK_IDENTIFIER; } %% int yywrap(void) { return 1; }
If your lexer is called lexer.lex, then command
flex lexer.lexwill run flex. If there are no errors, it will write file lex.yy.c.
The manual describes other things that a flex program can do. For a simple program, what has been described here should suffice. The course web page provides links to manuals.