3.5. Building Your Own Lexer

You can build your own lexer, without Flex, by designing and implementing a finite state machine. If you look at the code that Flex produces, this is what it does.

(The Flex finite state machine is encoded in tables though, making it more difficult to understand. The tables describes the states, the transitions and numbers for the actions. You will see a switch-statement where the action number leads to an action.)


How a finite state machine works.

For our purposes, a finite state machine has a set of states. The states are numbered from 0 to n for some n.

One of the states is designated as the start state, Some states have an associated action.

There are transitions between states, labeled by characters or sets of characters.

The idea is to start in the start state, read characters and follow transitions until you reach a state where no transition is available for the next character.

As you go, you remember the last state they you have reached that has an associated action. Back up to that state (putting characters that were read after reaching that state back into the input) and perform the action at that state.


A sample finite state machine

   start
     |
     v
   ->0
  |  |  (
  |  |-----> 1 return '(';
  |  |
  |  |  -         >
  |  |-----> 2 ------> 3 return TOK_ARROW;
  |  |    return '-';
  |  |
  |  |
  |  |                other      *
  |  |                ___       ___
  |  |               |   |     |   |
  |  |  /         *   \ /   *   \ /   /
  |   -----> 4 ------> 5 ------> 6 -----
  |                    ^         |      |
  |                    |         |      |
  |                     ---------       |
  |                       other         |
  |_____________________________________|

Implementing a finite state machine

Here is an example implementing the above finite state machine. Use lexer() to read and return the next token.

/*=======================================
 * Variables.  These are made global for
 * simplicity.
 *=======================================*/

char        lexeme[MAX_LEXEME_LENGTH];
int         lexemeLength           = 0;
static int  done;
static int  state;
static int  finalState             = -1;
static int  finalStateLexemeLength = -1;


/*=======================================
 * lexer().
 *=======================================*/

int lexer(void)
{
  int tok;

  done                   = 0;
  state                  = 0;
  finalState             = -1;
  finalStateLexemeLength = -1;
  
  while(!done) {
    step();
    setFinal();      
  }
  if(finalState < 0) {
    ... error
  }
  cleanLexeme();
  return action();       
}


/*=======================================
 * step() reads a character and, if
 * possible, 
 *   (a) adds the character to 'lexeme',
 *   (b) moves to the next state.
 *
 * If there is nowhere to go, step() sets
 * done = 1;
 *=======================================*/

void step(void)
{
  int ch = getchar();
  switch(state) {
    case 0:
      switch(ch) {
        case '(': 
          moveToState(1, ch);
          break;
        case '-': 
          moveToState(2, ch);
          break;
        case '/':
          moveToState(4, ch);
          break;
        default:
          done = 1;     
      }

    case 1: 
      done = 1;
      break;

    case 2:
      if(ch == '>') {
        moveToState(3, ch);
      }
      else {
        done = 1;
      }
      break;

    case 3:
      done = 1;
      break;

    ...
  }
}


/*=======================================
 * moveToState(st, ch) changes the current
 * state to st and adds ch to the lexeme. 
 *=======================================*/
 
void moveToState(int st, int ch)
{
  state = st;
  if(lexemeLength < MAX_LEXEME_LENGTH) {
    lexeme[lexemeLength] = (char) ch;
    lexemeLength++;
  }
}


/*=======================================
 * isFinalState[n] is 1 if state n has an
 * action associated with it.
 *=======================================*/

int isFinalState[] =
{
  /* 0 */  0,
  /* 1 */  1,
  /* 2 */  1,
  /* 3 */  1,
  /* 4 */  0,
  /* 5 */  0,
  /* 6 */  0
}


/*=======================================
 * setFinal() stores the current state
 * into finalState if the current state
 * has an action associated with it.
 *=======================================*/

void setFinal(void)
{
  if(isFinalState[state]) {
    finalState             = state;
    finalStateLexemeLength = lexemeLength;
  }
}


/*=======================================
 * cleanLexeme() removes any characters
 * from lexeme that occur after the last
 * final state and puts them back into 
 * the input.
 * 
 * Then it null-terminates lexeme.
 *=======================================*/

void cleanLexeme(void)
{
  int n = lexemeLength;
  while(n > finalStateLexemeLength) {
    n--;
    putback(lexeme[n]);  /* This needs to be written */
  }  
  lexeme[n] = '\0';
}


/*=======================================
 * action() performs the action indicated
 * by finalState and returns the token.
 *=======================================*/

int action(void)
{
  switch(finalState) {
    case 1:
      return '(';

    case 2:
      return '-';

    case 3:
      return TOK_ARROW;
  }
}

Comparing finite state machines and regular expressions

We have seen two ways of describing sets of lexemes: regular expressions and finite state machines. Which is better?

Certainly, regular expressions are simple and compact, and don't require states. So they are convenient.

But in terms of what you can do with them, they have exactly the same expressive power.

Theorem. Any set of strings that can be described by a regular expression can also be described by a finite state machine, and vice versa.

Flex converts a collection of regular expressions to a single finite state machine. The theorem tells us that it is always possible to do that.