R2. Syntactic Analysis


Context-free grammars

The syntax of a programming language is conveniently described using a context-free grammar.

A context-free grammar consists of

  1. A set of tokens,
  2. A set of nonterminals,
  3. A designated start nonterminal,
  4. A set of productions.

Each production has the form N → α where N is a nonterminal and α is a string of zero or more tokens and nonterminals.

By convention, we use upper case letters for nonterminals and lower case letters and special symbols for tokens.

If the right-hand side of a production is an empty string, that is conventionally shown as ε.

See these examples of context-free grammars.

You should be prepared to write a context-free grammar for a given set of strings.


Rewrites

A rewrite step using production N → α has the form

βNγ ⇒ βαγ

For example, if A → t B r is a production then n n A s r X ⇒ n n t B r s r X is an allowed rewrite step.

We say that α ⇒* β if there is a sequence of zero or more rewrite steps that start at α and end at β.


Derivations

A derivation is a sequence of rewrite steps that starts at S, where S is the start nonterminal, and that ends on a string that consists only of tokens.

For example, suppose the grammar is:

S → A B
A → a A
A → ε
B → B b
B → b

where the start nonterminal is S. The following is a derivation of abb.

S ⇒ A B
⇒ a A B
⇒ a B
⇒ a B b
⇒ a b b

A derivation is a leftmost derivation if, at each step, the leftmost nonterminal is replaced. The derivation above is leftmost.

A derivation is a rightmost derivation if, at each step, the rightmost nonterminal is replaced.
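
For comparison, here is a rightmost derivation of abb using the same grammar:

```
S ⇒ A B ⇒ A B b ⇒ A b b ⇒ a A b b ⇒ a b b
```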

You should be prepared to do a derivation of a given sequence of tokens using a given grammar.


Parse trees

A parse tree is a graphical form of derivation. A parse tree for abb using the above grammar is

                  S
                 / \
                /   \
               /     \
              A       B
             / \     / \
            a   A   B   b
                |   |
                ε   b

Remember that ε is not a token. If you read off the leaves from left to right, you get a b b.

Notice that the parse tree shows which steps were done in a derivation without showing the order in which they were done; the order only affects the order in which the tree's nodes get drawn.

A parse tree shows structure. It shows the subexpressions of an expression or the components of a statement.

You should be prepared to draw a parse tree for a particular string using a given grammar.


Ambiguity

A context-free grammar is ambiguous if there exists a sequence of tokens that can be derived by two different leftmost derivations.

Equivalently, a context-free grammar is ambiguous if there is a sequence of tokens that has two different parse trees.

To show that a grammar is ambiguous, it suffices to find a sequence of tokens α and to show two different parse trees or leftmost derivations of α. Be sure that they both derive the same string α.
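
For instance (a standard example, not necessarily the one linked below), the grammar with productions E → E + E and E → a is ambiguous: the string a + a + a has two different leftmost derivations, one grouping to the left and one to the right.

```
E ⇒ E + E ⇒ a + E ⇒ a + E + E ⇒ a + a + E ⇒ a + a + a

E ⇒ E + E ⇒ E + E + E ⇒ a + E + E ⇒ a + a + E ⇒ a + a + a
```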

Showing that a grammar is unambiguous takes more involved reasoning.

The difficulty with an ambiguous grammar is that it does not unambiguously describe the structure of every program.

You should be prepared to demonstrate that a given grammar is ambiguous. See this example.


The parsing problem

The parsing problem for a particular context-free grammar G is as follows.

Input. A sequence α of tokens.

Output. A derivation of α using the start nonterminal and productions of G.

A parser is an algorithm that solves the parsing problem for a particular context-free grammar.


The end marker $

Parsers typically look ahead one token to decide what to do. But what if there are no more tokens?

By convention, $ means the end of the input. The parsing problem is slightly modified to derive α$ where α is a sequence of tokens.

So $ acts like a token, and is usually included among tokens. But there can only be one occurrence of $, and it must come last.


Deterministic parsing

A deterministic parser is a parser that finds a derivation without the need to back up. At each step, it must know exactly what to do.

Normally, a deterministic parser requires an unambiguous grammar. The parser cannot know what to do at each step when there are two or more derivations of a particular sequence of tokens.

Sometimes, a deterministic parser can be constructed for ambiguous grammars by carefully choosing how to resolve conflicts. But information on resolving conflicts is external to the grammar.


FIRST and FOLLOW sets

Two notions that are useful in constructing a deterministic parser are FIRST and FOLLOW sets.

For a particular grammar G and a sequence of tokens and nonterminals α, FIRST(α) is the set of all tokens that can begin a sequence of tokens β where α ⇒* β. That is, it contains all tokens that can start something derived starting with α. Special symbol $ can never be in FIRST(α) for any α.

Additionally, FIRST(α) contains ε if α ⇒* ε.

See computing the FIRST sets for a simple algorithm to compute FIRST(α) for any string α and this example.

Suppose that S is the start nonterminal of G. If N is a nonterminal then FOLLOW(N) is the set of all tokens t (possibly including $) such that S ⇒* α N t β for some strings α and β. That is, FOLLOW(N) is the set of all tokens that can come immediately after N in a derivation.

See computing the FOLLOW sets for a simple algorithm to compute FOLLOW(N) for any nonterminal N and this example.

You should be prepared to compute the FIRST and FOLLOW sets for a given grammar.
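
The two linked algorithms can be sketched as a fixed-point iteration. The sketch below (in Python, with illustrative names, not the course's code) uses the grammar from the Derivations section: S → A B, A → a A | ε, B → B b | b.

```python
EPS = 'eps'   # stands for ε
grammar = {
    'S': [['A', 'B']],
    'A': [['a', 'A'], []],      # [] is the ε production
    'B': [['B', 'b'], ['b']],
}
nonterminals = set(grammar)

def first_of(seq, first):
    """FIRST of a sequence of tokens and nonterminals, given FIRST of each nonterminal."""
    out = set()
    for sym in seq:
        if sym not in nonterminals:     # a token begins everything derived from here
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    out.add(EPS)                        # every symbol in seq can derive ε
    return out

def compute_first():
    first = {n: set() for n in nonterminals}
    changed = True
    while changed:                      # repeat until no set grows
        changed = False
        for n, prods in grammar.items():
            for rhs in prods:
                f = first_of(rhs, first)
                if not f <= first[n]:
                    first[n] |= f
                    changed = True
    return first

def compute_follow(start, first):
    follow = {n: set() for n in nonterminals}
    follow[start].add('$')              # $ can follow the start nonterminal
    changed = True
    while changed:
        changed = False
        for n, prods in grammar.items():
            for rhs in prods:
                for i, sym in enumerate(rhs):
                    if sym not in nonterminals:
                        continue
                    f = first_of(rhs[i+1:], first)
                    add = (f - {EPS}) | (follow[n] if EPS in f else set())
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow
```

For this grammar the iteration gives FIRST(S) = {a, b}, FIRST(A) = {a, ε}, FIRST(B) = {b}, FOLLOW(A) = {b}, and FOLLOW(B) = {b, $}.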


Top-down predictive parsing

The most intuitively simple kind of deterministic parser is a top-down predictive parser, which constructs a leftmost derivation, and constructs the derivation in the natural forward order.

Typically, a top-down parser uses a single lookahead token to make each decision about which rewrite step to do next. In that case, the parser is called an LL(1) parser.


LL(1) parsing tables

The productions of a grammar are numbered.

An LL(1) parsing table stores information that allows a top-down parser to make decisions. The table has a row for each nonterminal and a column for each token (including $).

Suppose that t ≠ $. Then (row N, column t) of the table contains the numbers of all productions N → α such that t ∈ FIRST(α), together with all productions N → α such that ε ∈ FIRST(α) and t ∈ FOLLOW(N).

Additionally, (row N, column $) holds the numbers of all productions N → α where ε ∈ FIRST(α) and $ ∈ FOLLOW(N).

See this example.

You should be prepared to fill in an LL(1) parsing table for a given grammar.
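
As an illustration, consider the hypothetical grammar (not from the notes) with productions 1. S → a S b and 2. S → ε. The sketch below fills in the table using the standard rule: a production N → α is entered under each token in FIRST(α), and, when ε ∈ FIRST(α), under each token in FOLLOW(N) (including $).

```python
EPS = 'eps'   # stands for ε; '$' stands for the end marker
prods = {1: ('S', ('a', 'S', 'b')), 2: ('S', ())}
first = {('a', 'S', 'b'): {'a'}, (): {EPS}}   # FIRST of each right-hand side
follow = {'S': {'b', '$'}}                    # FOLLOW of each nonterminal

def build_ll1_table():
    table = {}   # (nonterminal, token) -> set of production numbers
    for num, (n, rhs) in prods.items():
        f = first[rhs]
        for t in f - {EPS}:
            table.setdefault((n, t), set()).add(num)
        if EPS in f:                 # ε-deriving production: predict on FOLLOW(N)
            for t in follow[n]:
                table.setdefault((n, t), set()).add(num)
    return table
```

Every cell ends up with exactly one production number, so this grammar is LL(1).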


Conflicts

A conflict occurs if at least one cell of the LL(1) parsing table contains more than one production number.

When there is a conflict, the grammar cannot be parsed by an LL(1) parser.

An ambiguous grammar will always have at least one conflict, but the presence of a conflict does not necessarily mean that the grammar is ambiguous, and other approaches might work for the grammar.


Limitations of LL(1) parsing

A production is left-recursive if it has the form N → N α, where N is a nonterminal.

An LL(1) parsing table always has conflicts when the grammar has a left-recursive production.

An LL(1) parsing table also has conflicts when the grammar contains two productions whose right-hand sides have a common nonempty prefix.

For example, a grammar having productions

A → ( A B )
A → ( C )

always has conflicts in its LL(1) parsing table. The parser cannot know what to do when it is trying to read an A and the lookahead is a left parenthesis.

You should be prepared to recognize a grammar that cannot be parsed using an LL(1) parser by noticing that it uses left-recursion or has two productions for the same nonterminal that share a common prefix on the right-hand side.


LL(1) grammars

A context-free grammar is an LL(1) grammar if its LL(1) parsing table has no conflicts.

Even if you can see how to resolve conflicts, the grammar is still not LL(1) when there are conflicts.

You should know the definition of an LL(1) grammar.


Recursive descent

Recursive descent is one way to implement a top-down predictive parser.

There is a function for each nonterminal. Function A reads a sequence of tokens that can be derived from nonterminal A.

The body of A uses the lookahead token t to select a production to use by looking in row A, column t of the parsing table.

Then it reads the right-hand side of the production. For example, to handle production A → ( B , C ), A does the following.

  match('(');
  B();
  match(',');
  C();
  match(')');

See this example of a recursive-descent parser.

You should be prepared to write a recursive-descent parser for a given grammar. You should be ready to add semantic actions to the parsing functions.
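
Here is a minimal sketch of a recursive-descent parser, in Python, for the hypothetical LL(1) grammar S → a S b | ε (the names parse, match, and lookahead are illustrative). Each alternative is selected by inspecting the lookahead token, just as the parsing table would dictate.

```python
def parse(tokens):
    tokens = list(tokens) + ['$']
    pos = 0

    def lookahead():
        return tokens[pos]

    def match(t):
        nonlocal pos
        if tokens[pos] != t:
            raise SyntaxError(f'expected {t!r}, saw {tokens[pos]!r}')
        pos += 1

    def S():
        if lookahead() == 'a':         # production S -> a S b
            match('a'); S(); match('b')
        # otherwise: production S -> ε, which reads nothing

    S()
    match('$')                         # all input must be consumed
    return True
```

For example, parse('aabb') succeeds, while parse('aab') raises a SyntaxError.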


Bottom-up parsing

A bottom-up parsing algorithm produces a rightmost derivation, but produces it backwards.

The parser starts with the sequence of tokens to be derived and works backwards toward the start symbol, using productions backwards: it uses production A → α to replace α by A.

There is a sense in which bottom-up parsing is natural: the input to the parsing problem is a sequence of tokens, and that is where the parser starts.

But producing the derivation backwards is less intuitive than producing it forward.

Bottom-up parsers can handle grammars that top-down parsers cannot. Of particular interest is that bottom-up parsers can handle left recursion.


Shift-reduce parsing

Shift-reduce parsing is one scheme for bottom-up parsing.

A right-sentential form is a sequence of tokens and nonterminals that can be derived by a rightmost derivation, starting at the start nonterminal.

The parser keeps two sequences: the remaining input, which is a sequence of tokens, and the stack, which contains tokens and nonterminals.

At any point during the parse, suppose that the stack contains sequence α, from bottom to top, and that the remaining input is β. Then the concatenation αβ is a right-sentential form that is part way through the derivation.

Initially, the stack is empty and β is the entire input. When the parse is finished, β is an empty string and the stack holds the start nonterminal.

There are four actions that a shift-reduce parser can perform.

  1. A shift action removes the first token from β and pushes it onto the stack.

  2. A reduce action uses a production, N → γ. It can only be used when the stack sequence α (bottom to top) ends on γ. It removes γ and pushes N.

    That is, if γ has length k, the reduce action pops k symbols from the stack and pushes nonterminal N.

  3. An accept action is only done when the remaining input is empty and the stack contains only the start nonterminal.

  4. An error action occurs when there is a syntax error.

See this example of a shift-reduce parse controlled by an SLR(1) parsing table.

You should be able to carry out a shift-reduce parse of a particular sequence of tokens using a given grammar and parsing table.
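
For example, using the grammar from the Derivations section (S → A B, A → a A | ε, B → B b | b), a shift-reduce parse of abb proceeds as follows; the choice of when to shift and when to reduce would in practice be dictated by a parsing table.

```
Stack (bottom to top)   Remaining input   Action
                        a b b $           shift a
a                       b b $             reduce A → ε
a A                     b b $             reduce A → a A
A                       b b $             shift b
A b                     b $               reduce B → b
A B                     b $               shift b
A B b                   $                 reduce B → B b
A B                     $                 reduce S → A B
S                       $                 accept
```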


LR parsers

LR parsers are a kind of shift-reduce parser, distinguished by the way in which their parsing tables are constructed.

An LR-parser requires an augmented grammar, where a new start nonterminal S′ is added. One production, S′ → S, is added, where S is the original start nonterminal.

The construction process involves creating a finite-state machine.

During parsing, the idea is to use that finite-state machine to read the content of the stack, from bottom to top. The state that it ends in tells which action to do.

In practice, it is not necessary to reread the stack each time. The state reached is stored in each cell of the stack.

The parser begins by pushing the start state onto the stack. There is no token or nonterminal associated with that stack cell.

The parsing table has a row for each state of the finite-state machine. The table has two parts.

  1. The goto part has a column for each nonterminal. Each cell contains a state of the finite-state machine.

  2. The action part has a column for each token (including $).

    The parser decides what to do next by getting the state q on top of the stack and the lookahead token t. It looks up action(q, t), and performs that action.

    Each shift action sq in the action table says to do a shift and to push state q with the shifted token onto the stack.

    Each reduce action rn says to do a reduce action using production number n.

    If production number n is A → α where α has length k, then the parser pops k things from the stack and pushes A. If the state on top of the stack after doing those pops is q, then state goto(q, A) is pushed with A on the stack.
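
The loop described above can be sketched as follows, using a hand-built SLR(1) table for the tiny hypothetical grammar 1. S → ( S )  2. S → x (augmented with S′ → S); all names and the table itself are illustrative. Here 's2' means shift and go to state 2, 'r1' means reduce by production 1, and 'acc' means accept.

```python
prods = {1: ('S', 3), 2: ('S', 1)}          # production number -> (lhs, rhs length)
action = {
    (0, '('): 's2', (0, 'x'): 's3',
    (1, '$'): 'acc',
    (2, '('): 's2', (2, 'x'): 's3',
    (3, ')'): 'r2', (3, '$'): 'r2',
    (4, ')'): 's5',
    (5, ')'): 'r1', (5, '$'): 'r1',
}
goto = {(0, 'S'): 1, (2, 'S'): 4}

def lr_parse(tokens):
    tokens = list(tokens) + ['$']
    stack = [0]                             # stack of states; 0 is the start state
    pos = 0
    while True:
        act = action.get((stack[-1], tokens[pos]))
        if act is None:
            return False                    # empty cell: error action
        if act == 'acc':
            return True
        if act[0] == 's':                   # shift: consume the token, push state q
            stack.append(int(act[1:]))
            pos += 1
        else:                               # reduce by production n: pop k, then goto
            lhs, k = prods[int(act[1:])]
            del stack[len(stack) - k:]
            stack.append(goto[(stack[-1], lhs)])
```

With this table, lr_parse('((x))') accepts and lr_parse('(x') reports an error.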


LR(0) finite-state machines

An LR(0) item is a production with one dot (•) somewhere in its right-hand side. (Note: There is only one LR(0) item for production A → ε, and it is A → •.)

The closure of a set of LR(0) items is formed by adding zero or more LR(0) items using the following rule.

If the set contains LR(0) item A → α • B β where B is a nonterminal, then add LR(0) item B → • γ to the set for each production B → γ in the grammar, provided that LR(0) item is not already in the set.

Each state of the LR(0) finite-state machine has a set of LR(0) items in it.

The start state contains the closure of the set {S′ → • S}.

Transitions are as follows. Suppose that q is a state.

  1. Find all symbols (tokens or nonterminals) σ that immediately follow a dot in an LR(0) item in q.

  2. For all such symbols σ, select all LR(0) items that have a dot immediately followed by σ.

    For each of those LR(0) items, move the dot to just after σ. Let W be the set of LR(0) items obtained that way.

  3. Make a new state q′ containing closure(W).

  4. Add a transition from q to q′ labeled by σ.

Note that, for each symbol σ, there can be only one transition from a state that is labeled σ.

See this example of an LR(0) finite-state machine.

You should be prepared to construct the LR(0) finite-state machine for a given grammar.
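
The closure rule above can be sketched as follows, with an LR(0) item represented as a (left-hand side, right-hand side, dot position) triple; the grammar S′ → S, S → ( S ) | x is a hypothetical example, and all names are illustrative.

```python
grammar = {
    "S'": [('S',)],
    'S': [('(', 'S', ')'), ('x',)],
}

def closure(items):
    items = set(items)
    changed = True
    while changed:                  # repeat until no new item can be added
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:   # dot just before nonterminal B
                for prod in grammar[rhs[dot]]:
                    item = (rhs[dot], prod, 0)           # add item B -> . γ
                    if item not in items:
                        items.add(item)
                        changed = True
    return items
```

Starting from {S′ → • S}, the closure adds S → • ( S ) and S → • x, giving the start state of the LR(0) machine.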


SLR(1) parsers

To build the SLR(1) parsing table for a grammar G, build the LR(0) finite-state machine for G.

  1. If there is a transition from state q to q′ that is labeled by a token t, then add shift action sq′ to the action table at row q, column t.

  2. Suppose state q contains an LR(0) item A → α •, where production A → α is production number n. For each token t in FOLLOW(A) (including $), add reduce action rn to the action table in row q, column t.

  3. If there is a transition from state q to q′ labeled by nonterminal A, put q′ in row q, column A of the goto table.

See this example.

You should be prepared to construct an SLR(1) parsing table for a given grammar.


Conflicts and SLR(1) grammars

Conflicts of two kinds can occur in an SLR(1) parsing table.

  1. A shift-reduce conflict occurs if a particular cell contains both a shift action and a reduce action.

  2. A reduce-reduce conflict occurs if a particular cell contains two or more different reduce actions.

A grammar is an SLR(1) grammar if its SLR(1) parsing table has no conflicts.

Sometimes it is possible to eliminate shift-reduce conflicts by selecting some of the actions (for example, using precedence) but that is external to the grammar, and, by definition, an SLR(1) grammar must not rely on removing any parsing conflicts.


LR(1) and LALR(1) parsers

LALR(1) and LR(1) parsers use ideas similar to those of SLR(1) parsers, but they are able to handle grammars that SLR(1) parsers cannot handle.

But even those kinds of parsers cannot handle all deterministically parsable grammars.

Like SLR(1) parsers, LALR(1) parsers use the LR(0) finite-state machine.

I will not ask questions about LALR(1) or LR(1) parsers that require more than a high level view of what they are capable of.


Bison

Bison is a tool that reads a context-free grammar and builds an LALR(1) parser for that grammar. Additionally, you can attach semantic actions to productions.

You should know enough about Bison to write a small Bison parser with simple semantic actions.

See this sample Bison file.
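
For a sense of the format, here is a minimal sketch of a Bison file (not the course's sample): a grammar for sums of integers, with semantic actions. The NUMBER token would come from a separately written lexer.

```yacc
/* sums.y: recognize NUMBER (+ NUMBER)* and print the sum */
%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *msg) { fprintf(stderr, "%s\n", msg); }
%}

%token NUMBER

%%

input : expr                  { printf("%d\n", $1); }
      ;

expr  : expr '+' NUMBER       { $$ = $1 + $3; }   /* left recursion is fine for LALR(1) */
      | NUMBER                { $$ = $1; }
      ;

%%
```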