9.11. Canonical LR(1) Parsers


LR(1) items

We need a way to bring the notion of following tokens much closer to the productions that use them.

An LR(1) item has the form [I, t] where I is an LR(0) item and t is a token.

As the dot moves through the right-hand side of I, token t remains attached to it. LR(1) item [A → α ·, t] calls for a reduce action when the lookahead is t.

The follow context is determined when an LR(1) item [A → · α, t] is created, and is carried along with it. Token t is a token that can follow A in the context in which this A occurs. This does not suffer from the difficulty of using FOLLOW sets to determine when to do reduce actions.


Closures of sets of LR(1) items

The closure of a set S of LR(1) items for augmented grammar G is computed as follows.

closure(S)
For each item [A → α · B β, t] in S,
  For each production B → γ in G,
    For each token b in FIRST(βt),
      Add [B → · γ, b] to S

You need to continue adding LR(1) items until no more can be added, just like for LR(0) closures.
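As an illustration, the closure computation might be coded as follows, representing an LR(1) item [A → α · β, t] as a tuple (head, body, dot position, lookahead). The encoding and helper names are illustrative choices, not from any library; the grammar shown is the example grammar used later in this section.

```python
# Illustrative sketch of LR(1) closure.  An LR(1) item
# [A -> alpha . beta, t] is encoded as the tuple (A, body, dot, t).

GRAMMAR = {
    "S'": [("S",)],
    "S":  [("C", "C")],
    "C":  [("c", "C"), ("d",)],
}
NONTERMINALS = set(GRAMMAR)

def first_of_seq(seq):
    # FIRST of a nonempty symbol sequence; this grammar has no
    # epsilon productions, so only the first symbol matters.
    sym = seq[0]
    if sym not in NONTERMINALS:
        return {sym}
    out = set()
    for body in GRAMMAR[sym]:
        out |= first_of_seq(body)
    return out

def closure(items):
    # Keep adding [B -> . gamma, b] for each [A -> alpha . B beta, t]
    # until no more items can be added.
    result = set(items)
    work = list(items)
    while work:
        head, body, dot, t = work.pop()
        if dot < len(body) and body[dot] in NONTERMINALS:
            beta_t = body[dot + 1:] + (t,)
            for gamma in GRAMMAR[body[dot]]:
                for b in first_of_seq(beta_t):
                    item = (body[dot], gamma, 0, b)
                    if item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)

# State 0 of the machine below is the closure of the kernel [S' -> . S, $].
state0 = closure({("S'", ("S",), 0, "$")})
```

Closing the single kernel item [S′ → · S, $] yields the six items of state 0 in the example below.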


Building the LR(1) finite-state machine

Suppose that G is an augmented grammar with augmenting production S′ → S. The initial state has kernel {[S′ → · S, $]}. Form its closure.

For each state, you need to build the other states (if not yet built) and add transitions to them as follows.

Suppose that state q contains set I of LR(1) items.

For each symbol s (either a token or a nonterminal) that immediately follows a dot in an LR(1) item [A → α · s β, t] in set I, let Is be the set of all LR(1) items in I where s immediately follows the dot.

Move the dot to the other side of s in each of them. That set becomes the kernel of state q′, and you make a transition from q to q′ on s. As usual, form the closure of the set of LR(1) items in state q′.
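The whole construction — initial state, dot-advancing goto, and a worklist over states not yet processed — can be sketched as follows. The encoding is the same illustrative tuple form as above, and a compact closure helper is included so the sketch runs on its own; note that the state numbers it assigns depend on iteration order and need not match the numbering used in the example below.

```python
# Illustrative sketch of building the LR(1) finite-state machine.

GRAMMAR = {
    "S'": [("S",)],
    "S":  [("C", "C")],
    "C":  [("c", "C"), ("d",)],
}
NONTERMINALS = set(GRAMMAR)

def first_of_seq(seq):
    sym = seq[0]                      # no epsilon productions here
    if sym not in NONTERMINALS:
        return {sym}
    out = set()
    for body in GRAMMAR[sym]:
        out |= first_of_seq(body)
    return out

def closure(items):
    result, work = set(items), list(items)
    while work:
        head, body, dot, t = work.pop()
        if dot < len(body) and body[dot] in NONTERMINALS:
            for gamma in GRAMMAR[body[dot]]:
                for b in first_of_seq(body[dot + 1:] + (t,)):
                    item = (body[dot], gamma, 0, b)
                    if item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)

def goto(state, s):
    # Kernel of the s-successor: items with the dot before s,
    # with the dot moved to the other side of s; then close it.
    kernel = {(h, b, d + 1, t) for (h, b, d, t) in state
              if d < len(b) and b[d] == s}
    return closure(kernel)

def build_machine():
    start = closure({("S'", ("S",), 0, "$")})
    states, transitions, work = [start], {}, [start]
    while work:
        q = work.pop()
        for s in {b[d] for (_, b, d, _) in q if d < len(b)}:
            q2 = goto(q, s)
            if q2 not in states:      # build the state if not yet built
                states.append(q2)
                work.append(q2)
            transitions[(states.index(q), s)] = states.index(q2)
    return states, transitions

states, transitions = build_machine()
```

For the example grammar below, this produces exactly ten states, matching states 0 through 9 of the machine shown next.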


Example (Page 262, Dragon Book)

For example, let G be the following grammar.

0. S′ → S
1. S → C C
2. C → c C
3. C → d

The LR(1) finite-state machine for G is as follows.

State 0.
[S′ → · S, $]
[S → · C C, $]
[C → · c C, c]
[C → · c C, d]
[C → · d, c]
[C → · d, d]
On S goto 1
On C goto 2
On c goto 3
On d goto 4

State 1.
[S′ → S ·, $]

State 2.
[S → C · C, $]
[C → · c C, $]
[C → · d, $]
On C goto 5
On c goto 6
On d goto 7


State 3.
[C → c · C, c]
[C → c · C, d]
[C → · c C, c]
[C → · c C, d]
[C → · d, c]
[C → · d, d]
On C goto 8
On c goto 3
On d goto 4

State 4.
[C → d ·, c]
[C → d ·, d]

State 5.
[S → C C ·, $]

State 6.
[C → c · C, $]
[C → · c C, $]
[C → · d, $]
On C goto 9
On c goto 6
On d goto 7

State 7.
[C → d ·, $]

State 8.
[C → c C ·, c]
[C → c C ·, d]

State 9.
[C → c C ·, $]

The LR(1) parsing table is as follows.

         Actions            Goto
       c      d      $     S    C
 0    s3     s4            1    2
 1                  acc
 2    s6     s7                 5
 3    s3     s4                 8
 4    r3     r3
 5                  r1
 6    s6     s7                 9
 7                  r3
 8    r2     r2
 9                  r2

An abbreviation

It is common to have sets of LR(1) items where several of the LR(1) items contain the same LR(0) item. For example, state 8 above is

State 8.
[C → c C ·, c]
[C → c C ·, d]

Notation [C → c C ·, c/d] abbreviates a combination of those two items. State 8 can be rewritten as

State 8.
[C → c C ·, c/d]

LALR(1) parsers

Now we are able to explain what an LALR(1) parser is. Start with the LR(1) finite-state machine.

Compare each pair of states by looking only at the LR(0) items that their LR(1) items contain.

If two states have exactly the same LR(0) items, combine those states into a single state by combining their LR(1) items.
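This merge step can be sketched as follows, again with the illustrative tuple encoding of items (head, body, dot position, lookahead). States are grouped by their LR(0) cores (the items with lookaheads dropped) and each group's items are unioned; the input here is a hand-coded excerpt of the machine above (states 4, 7, and 8).

```python
# Illustrative sketch of the LALR(1) merge step.

def core(state):
    # The LR(0) part of a state: its items without the lookahead tokens.
    return frozenset((h, b, d) for (h, b, d, _) in state)

def merge_states(states):
    groups = {}
    for q in states:
        # States with the same LR(0) core go into one combined state.
        groups.setdefault(core(q), set()).update(q)
    return [frozenset(q) for q in groups.values()]

# States 4, 7, and 8 of the LR(1) machine above:
state4 = frozenset({("C", ("d",), 1, "c"), ("C", ("d",), 1, "d")})
state7 = frozenset({("C", ("d",), 1, "$")})
state8 = frozenset({("C", ("c", "C"), 2, "c"), ("C", ("c", "C"), 2, "d")})

merged = merge_states([state4, state7, state8])
```

States 4 and 7 share the LR(0) core [C → d ·], so they collapse into a single state with lookaheads c, d, and $ (state (4,7) below), while state 8 stays separate.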

The LR(1) finite state machine above is changed to the following.

State 0.
[S′ → · S, $]
[S → · C C, $]
[C → · c C, c/d]
[C → · d, c/d]
On S goto 1
On C goto 2
On c goto (3,6)
On d goto (4,7)

State 1.
[S′ → S ·, $]

State 2.
[S → C · C, $]
[C → · c C, $]
[C → · d, $]
On C goto 5
On c goto (3,6)
On d goto (4,7)

State (3,6).
[C → c · C, c/d/$]
[C → · c C, c/d/$]
[C → · d, c/d/$]
On C goto (8,9)
On c goto (3,6)
On d goto (4,7)

State (4,7).
[C → d ·, c/d/$]

State 5.
[S → C C ·, $]

State (8,9).
[C → c C ·, c/d/$]

The LALR(1) parsing table is as follows.

             Actions                 Goto
          c        d        $       S      C
  0     s(3,6)   s(4,7)             1      2
  1                        acc
  2     s(3,6)   s(4,7)                    5
(3,6)   s(3,6)   s(4,7)                   (8,9)
(4,7)   r3       r3        r3
  5                        r1
(8,9)   r2       r2        r2

Notice that

  1. there are fewer states, and

  2. there are no conflicts.

The LALR(1) parser always has exactly the same states as the SLR(1) parser. But, because it does not use FOLLOW sets to decide when to reduce, it avoids some reduce actions that might cause conflicts.

Combining similar states does not always work. Sometimes, merging introduces parsing conflicts. That means that LR(1) parsers are more powerful than LALR(1) parsers.

But the state reduction is significant. For a typical programming language grammar, the LALR(1) finite-state machine can have an order of magnitude fewer states than the LR(1) finite-state machine.

An LALR(1) parser for a programming language like Java would have roughly 1,000 states, whereas the LR(1) table would have roughly 10,000 states.