R1. Lexical Analysis


What is lexical analysis?

A lexeme is a word or symbol. For example, reserved words such as if and symbols such as -> are lexemes.

A token is a representation of a lexeme that usually has two parts: a token number (an integer) and a token attribute, which provides additional information about the lexeme that the token represents.

The lexical analyzer, or lexer, converts a sequence of characters into a sequence of tokens.

You should be able to describe the key ideas of lexical analysis, including the difference between lexemes and tokens and properties of tokens.


What steps does a lexical analyzer perform?

A lexical analyzer repeats the following steps.

  1. Skip over characters, such as spaces, that cannot begin a lexeme.

  2. Read a longest possible prefix of what is left that is an allowed lexeme.

  3. Convert the lexeme into a token.

You should be prepared to describe the major steps in lexical analysis.


Why are lexemes converted into tokens?

Parsers are typically based on a single-token lookahead. Without the lexer, a single-character lookahead would rarely be adequate for parsing.

Parsers are also much easier to describe in terms of token numbers instead of in terms of lexemes.

A parser can refer to an identifier without the need to say what an identifier looks like. It can put the token attribute aside until later.

You should be prepared to explain why lexical analyzers are used.


How are lexical analyzers typically defined?


Finite state machines

The standard tool for defining a lexical analyzer is a finite-state machine

The lexer starts in an initial state. As it reads each character, it follows transitions that takes it to other states.

The states and transitions can handle the entire process of lexical analysis.

At the initial state, there will usually be a transition labeled by a space that points back to the start state, so that the lexer will skip over initial spaces.

At the initial state, the lexer might have a transition labeled 'f' that can be used to handle lexemes for, from and identifiers such as frog. There is no need for a separate check for each word.

Some states indicate that, if the lexer reaches them and it cannot moves to another state, then the lexer should produce a particular token. For example, reading characters 'i' and 'f' might take it to a state where it will produce token TOK_IF, but only if no more characters can be read.

The code for a lexer uses a variable to keep track of the state, and follows the finite-state machine in the most direct manner possible.


Regular expressions

Regular expressions are a simple and compact notation for describing sets of lexemes.

The fundamental operations of regular expressions are:

Others are

See these examples of regular expressions.

You should be prepared to write a regular expression that describes a given set of strings — neither too many strings nor too few.


Flex

Flex is a tool that builds a lexical analyzer from a collection of regular expressions and associated actions.

See Flex regular expressions, the structure of a Flex file and this sample flex file.

You should be able to write a very small lexical analyzer in Flex.