3.4. Flex

The Dragon Book describes Lex, a tool for creating lexical analyzers. We will use Flex, which is compatible with Lex, but is considerably faster.

Flex takes a program written in a combination of Flex and C, and it writes out a file (called lex.yy.c) that holds a definition of function yylex(), with the following prototype.

  int yylex(void);


The file to read

yylex reads from the file stored in variable yyin:

  FILE* yyin;
It is up to you to open a file for reading and store it into yyin before you call yylex.

Each time your program calls yylex, it returns the next token (an integer token code).


Handling end of file

When yylex is finished, it call function yywrap(). If yywrap() returns 1, then yylex returns 0 to its caller. That means "end of file".

If yywrap returns 0, then yylex assumes that you have stored a different file into yyin, and it starts reading that file.


Flex regular expressions

In addition to the usual regular expressions, Flex introduces some new notations.

[abcd]

A bracketed expression describes a set of characters. Expression [abcd] is equivalent to (a|b|c|d).

[0-9]

In brackets, a dash indicates a range of characters. For example, [a-zA-Z] matches any single letter. If you want a dash as one of the characters, put it first.

[^abcd]

This indicates any character except a, b, c or d. For example, [^a-zA-Z] matches any nonletter.

.

Dot is the same as [^\n]. That is, it is any character except a newline character.

A+

A+ is equivalent to AA*, So it matches one or more members of A in a row.

A?

A? stands for zero or one occurrence of A. So it is an optional A.

A/B

A, but only when followed by B.

"string"

Any string in double-quotes is literally that string. You will need to quote special characters.

Warning: Flex is picky about spaces. Do not put any spaces in regular expressions, unless they are enclosed in double-quotes or square brackets.


The structure of a Flex program

A Flex program has 4 parts. The general structure is

  %{
  part 1
  %}
  part 2
  %%
  part 3
  %%
  part 4

Part 1

Part 1 is copied, character for character, to the beginning of lex.yy.c.

Part 4

Part 4 is copied, character for character, to the end of lex.yy.c.

Part 2

Part 2 contains definitions of named regular expressions. Each line has a name, some white space, then a regular expression. For example,
  letter      [a-zA-Z]
  digit       [0-9]
To use one of those defined names, enclose it in braces. For example,
  letter      [a-zA-Z]
  digit       [0-9]
  identifier  {letter}({letter}|{digit})*
defines an identifer to be a letter followed by a sequence of letters and digits.

Part 3

Part 3 tells how yylex works. It is a sequence of parts that look like this.
  R     Action
where R is a regular expression and the action is C code that tells what to do when a match for R is found.

In the action, say

  return n;
to cause yylex to return n as the token.

If you do not surround the action in braces then it must all be on one line. If you surround it in braces, then it will extend to the matching right brace. But the left brace must be on the same line as the regular expression.


Rule precedence

If two or more different regular expressions match the same string, then the action associated with the first one is taken.


Skipping inter-lexeme characters

If your action does not return, then yylex will start over at its beginning, thowing away the characters that were matched. Use this for characters that are not part of a lexeme.


Token attributes

The token attribute (for tokens that have attributes) is stored into variable yylval, which has type YYSTYPE. You will want to define YYSTYPE. Also, state that YYSTYPE is defined. For example,

#define YYSTYPE_IS_DECLARED
typedef union {
  const char* strval;
  int         ival;
}
YYSTYPE;


Getting the lexeme

Flex stores the lexeme that it found as a null-terminated string in variable yytext. The length of the lexeme is put into variable yyleng.


A sample flex file

Here is a simple lexer written in flex.

%{
#include <stdio.h>
#include "yystype.h" /* defines YYSTYPE */
#include "tokens.h"  /* defines tokens  */

YYSTYPE yylval;
%}

digit         [0-9]
letter        [a-zA-Z]
identifier    {letter}+
comment       "%".*"\n"

%%
[ \t]         {}

[\n]          {line_num++;
              }

{comment}     {}

"while"       {return TOK_WHILE;
              }

"("	      {return TOK_LPAREN;
              }

")"           {return TOK_RPAREN;
              }

{digit}+      {yylval.ival = atoi(yytext);
               return TOK_INTCONST;
              }

{identifier}  {yylval.strval = intern(yytext);
               return TOK_IDENTIFIER;
              }
%%

int yywrap(void)
{
  return 1;
}

Running flex

If your lexer is called lexer.lex, then command

  flex lexer.lex
will run flex. If there are no errors, it will write file lex.yy.c.


Additional trickery

The manual describes other things that a flex program can do. For a simple program, what has been described here should suffice. The course web page provides links to manuals.