6.1. Context-Free Grammars

Context-free grammars are the fundamental tool used to describe the syntax of a programming language. They are much more powerful than regular expressions, but, partly because of that power, require significantly more careful care and feeding than regular expressions.


Tokens and nonterminals

Having chosen a collection of tokens, the next step is to choose a collection of nonterminals. A nonterminal represents a syntactic idea, such as an expression, a list of expressions, a statement, etc.

For brevity in this description, I will use upper-case letters for nonterminals and lower-case letters for tokens. I will also use special symbols such as +, * and comma as tokens.

A greek letter stands for a string of zero or more tokens and nonterminals.


Productions

The next step in creating a context-free grammar is to write a collection of productions. A production has the form

N → α
where N is a nonterminal and α is a string of zero or more tokens and nonterminals.

A production is a rewrite rule. Production N → α says that N can be replaced by α.

For example, suppose that nonterminal E is intended to stand for an expression. Production

EE + E
says that an expression can have the form of an expression, then token +, then another expression.


Context-free grammars

In addition to a list of tokens and nontermials, a context-free grammar has a set of productions and a start nonterminal.

For example, here is a small context-free grammar for expressions. There is only one nonterminal, E, so it is obviously the start nonterminal. Token n stands for a number.

E n
E E + E
E E * E
E ( E )

In the last of those productions, ( and ) are tokens.

Other kinds of grammars, beyond context-free, exist, but in this course we will only use context-free grammars. So I drop the context-free designation and just call them grammars.


Abbreviating grammars

By convention, a group of productions that have the same left-hand side are shown with the left-hand side written only once. In all but the first production in the group, write | instead of →. For example, the above grammar for expressions is usually written as follows.

E n
  | E + E
  | E * E
  | ( E )