Regular expressions

←	• HOME • UP •	→
	Top/Lexical/Regular expressions

3.3. Regular Expressions

How do we specify the set of lexemes that have a given token?

If the token has just one lexeme, such as the reserved word while, there is no problem.

But a token that has a lot of lexemes can be more complicated. The standard way to describe sets of lexemes is by regular expressions. If e is a regular expression, set(e) is the set of strings that e describes.

a: For any non-special symbol a,
set(a) = {a},
a singleton set holding a string that is only one character long.
A B: Juxtaposition is concatenation. The concatenation of two sets S and T of strings is defined to be
S ⋅ T = {xy | x ∈ S and y ∈ T}.
The meaning of regular expression A B is
set(A B) = set(A) ⋅ set(B).
A | B: The union S ∪ T of sets S and T is defined
S ∪ T = {x | x ∈ S or x ∈ T}.
In a regular expression, | indicates union, so
set(A | B) = set(A) ∪ set(B).
A*: Postfix operator * indicates repetition 0 or more times. For any set of strings S,
S* = {x₁…x_n | n ≥ 0 and x_i ∈ S for i = 1, …, n}
and
set(A*) = set(A)*

Precedence

By convention, the precedence is as follows, from high to low.

justaposition

Examples of regular expressions

We need a name, ε, indicating the empty string.

horse: Regular expression horse describes the set {horse} of one string.
rabbit | bunny: This represents set {rabbit, bunny}.
(a|b)*: This represents the set {ε, a, b, aa, ab, ba, bb, aaa, …} of all strings of a's and b's.
a*b*: This represents the set {ε, a, b, aa, bb, ab, aaa, abb, …} of all strings that consist of 0 or more a's followed by 0 or more b's. Notice that it does not include ba.
a*| b*: This represents the set {ε, a, b, aa, bb, aaaa, bbb, …} of all strings that consist of 0 or more a's or 0 or more b's.
a(a|b)*: The represents any string of a's and b's that starts with an a.

←	• HOME • UP •	→
	Top/Lexical/Regular expressions