3.1. Tokens, Lexemes and the Lexer


Lexemes

A programming language has a collections of words and symbols that are called lexemes.

For example, C has symbols (, ), ->, etc. Reserved words include if and while. A variable or function name is also considered a lexeme, as are numeric and string constants.


Breaking a program into lexemes

A lexical analyzer, also called a lexer, is responsible for breaking a program up into a sequence of lexemes.

Not all of the characters belong to lexemes. Spaces and comments, for example, are part of space between lexemes.

The usual rule that a lexer follows is:

  1. First, skip over characters that cannot begin a lexeme.

  2. Then find a longest prefix of what is left that is an allowed lexeme.

For example, given C code

  ifx = 12*54;
the first lexeme is ifx. Notice that it will not be if, since that is not the longest possible lexeme. The next few lexemes are =, 12, *, 54 and ;.

Notice that spaces are only needed in places where otherwise a longer lexeme would be chosen. For example, the first lexeme of

  if x = 12*54;
is if. (Of course, that is not allowed in a C program, but that has no bearing on lexical analysis.)


Tokens

The parser wants to work at a higher level that individual characters.

Instead, it works with tokens. Tokens have no structure to them. Token TOK_IF, for example, might the the token that corresponds to lexeme if.

After getting a lexeme, the lexer converts it to a token and passes the token on to the parser.