How do we specify the set of lexemes that have a given token?
If the token has just one lexeme, such as the reserved word while, there is no problem.
But a token that has a lot of lexemes can be more complicated. The standard way to describe sets of lexemes is by regular expressions. If e is a regular expression, set(e) is the set of strings that e describes.
a
set(a) = {a},a singleton set holding a string that is only one character long.
A B
S ⋅ T = {xy | x ∈ S and y ∈ T}.The meaning of regular expression A B is
set(A B) = set(A) ⋅ set(B).
A | B
S ∪ T = {x | x ∈ S or x ∈ T}.In a regular expression, | indicates union, so
set(A | B) = set(A) ∪ set(B).
A*
S* = {x1…xn | n ≥ 0 and xi ∈ S for i = 1, …, n}and
set(A*) = set(A)*
By convention, the precedence is as follows, from high to low.
* |
justaposition |
| |
We need a name, ε, indicating the empty string.
horse
rabbit | bunny
(a|b)*
a*b*
a*| b*
a(a|b)*