3.3. Regular Expressions

How do we specify the set of lexemes that have a given token?

If the token has just one lexeme, such as the reserved word while, there is no problem.

But a token that has a lot of lexemes can be more complicated. The standard way to describe sets of lexemes is by regular expressions. If e is a regular expression, set(e) is the set of strings that e describes.

a

For any non-special symbol a,
set(a) = {a},
a singleton set holding a string that is only one character long.

A B

Juxtaposition is concatenation. The concatenation of two sets S and T of strings is defined to be
ST = {xy | xS and yT}.
The meaning of regular expression A B is
set(A B) = set(A) ⋅ set(B).

A | B

The union ST of sets S and T is defined
ST = {x | xS or xT}.
In a regular expression, | indicates union, so
set(A | B) = set(A) ∪ set(B).

A*

Postfix operator * indicates repetition 0 or more times. For any set of strings S,
S* = {x1xn | n ≥ 0 and xiS for i = 1, …, n}
and
set(A*) = set(A)*

Precedence

By convention, the precedence is as follows, from high to low.

*
justaposition
|

Examples of regular expressions

We need a name, ε, indicating the empty string.

horse

Regular expression horse describes the set {horse} of one string.

rabbit | bunny

This represents set {rabbit, bunny}.

(a|b)*

This represents the set {ε, a, b, aa, ab, ba, bb, aaa, …} of all strings of a's and b's.

a*b*

This represents the set {ε, a, b, aa, bb, ab, aaa, abb, …} of all strings that consist of 0 or more a's followed by 0 or more b's. Notice that it does not include ba.

a*| b*

This represents the set {ε, a, b, aa, bb, aaaa, bbb, …} of all strings that consist of 0 or more a's or 0 or more b's.

a(a|b)*

The represents any string of a's and b's that starts with an a.