Syntax [Chapter 3]

Syntax refers to the form, or appearance, of a program. It is not concerned with meaning. Syntax can show structure, indicating that certain parts of a program are components of other parts, but it does not tell you the semantic significance of that structure.

Syntax design

Languages use a variety of different syntactic rules. Some are intended to look familiar, while others choose simplicity over familiarity. Here is the same thing written in three different languages.

Pascal

if x < y then begin w := y; z := 0 end

Ada

if x < y then w := y; z := 0; end if

C

if(x < y) { w = y; z = 0; end

Scheme

(if (< x y) (begin (set! 'w y) (set! 'z 0) ) () )

Form and function

A syntax should be designed so that the form of a program reflects its function. One way to help in doing that is to use indentation to show structure. That can be accomplished in two ways.

The language can be free form, treating a line break like a space, and treating several spaces in a row like one space. Then the programmer can indent as he or she sees fit.

Indentation can be significant to the syntax of the language, and can be used to indicate structure. This is called layout syntax, and is used by Haskell and Python.

Languages that do not use layout need some way to indicate structure. There are two common ways to do that.

Some languages employ a compound statement to indicate grouping. A compound statement in C has the form {...}. In Pascal, it has the form begin ... end.

Some languages make the grouping part of the form of statements and expressions, using full bracketing. An example is Ada, where each if-statement must end with end if.

Misleading programs

Of the indentation/grouping schemes, one of the most common is free-form with compound statements to show grouping. Unfortunately, that invites misleading programs. For example, C statement
while(f(x) > 0) x := x + 1; print(x);
appears to print x each time around the loop, if you use indentation as a cue for structure. But that is not what happens.

Lexical rules

Tokens

The "words" of a program are called tokens. For example, Pascal program
Program Test; begin writeln('This is a test') end.
is broken into a sequence of ten tokens:

Program

identifier

;

begin

identifier

(

string-constant

)

end

.

From the standpoint of syntax, you do not care what the name of the program is. It is just an identifier - a name for something.

Lexemes

The sequence of characters that stands for a token in a program is called the lexeme of that token. For example, one of the identifiers in the Pascal program has lexeme Test.

Some tokens have only one lexeme. (A semicolon is an example.) Some tokens, such as identifiers, have many possible lexemes.

Comments and spaces

Spaces and comments are not part of any lexemes at all. They are only present to space out or comment a program.

Sometimes a space is required between two lexemes just to separate them. For example,
the dog
is probably two identifiers in a row, while
thedog
is a single identifier.

Lexical rules

Lexical rules tell how to break a program up into lexemes, and then how to convert the lexemes into tokens. A typical rule for breaking a program into lexemes is a follows.
At each point, skip over any leading part that is not part of a lexeme, such as spaces. Then select the longest prefix of what is left that is the lexeme of some token. That is the next lexeme. Continue in this fashion.

Parse trees and program structure

A parse tree describes the structure of a part of a program, such as a statement or an expression. The leaves of a parse tree are tokens. The nonleaves are labeled by what kind of thing the tree describes, such as expression or statement.

The following parse tree describes expression 4 + 9 + 2. It assumes that 4, 9 and 2 are lexemes that describe the token number.

This tree shows structure: it indicates that 9+2 is done first, because that is a subtree.

Parse trees and program structure
A parse tree describes the structure of a part of a program, such as a statement or an expression. The leaves of a parse tree are tokens. The nonleaves are labeled by what kind of thing the tree describes, such as expression or statement. The following parse tree describes expression 4 + 9 + 2. It assumes that 4, 9 and 2 are lexemes that describe the token number. This tree shows structure: it indicates that 9+2 is done first, because that is a subtree.

Describing syntax and structure: Grammars

You can describe the syntax of a language by describing the parse trees that are allowed. Anyone creating a parse tree must follow the rules that you give.

The rules describe what is allowed locally in a tree. The example tree contains a node labeled expression with three children, labeled expression, + and expression, in that order. A rule that indicates this is allowed is

<expression> ::= <expression> + <expression>

In rules, labels of nonleaf nodes are usually written in angle brackets.

A collection of rules is called a grammar. A grammar that allows the parse tree shown above is

<expression> ::= <expression> + <expression>

<expression> ::= <expression> * <expression>

<expression> ::= number

If you have several rules in a row with the same thing on the left of ::=, you normally write | in each but the first rule. So the above grammar can be written

<expression> ::= <expression> + <expression>

| <expression> * <expression>

| number

Example grammar: statements

<statement> ::= <assignment>

| <if-statement>

| <while-statement>

| <procedure-call>

<assignment> ::= identifier := <expression>

<if-statement> ::= if <expression> then <statement>

| if <expression> then <statement> else <statement>

...

Repetition

You can use recursion to simulate repetition in a grammar. Here is a definition of a C compound statement, which consists of braces surrounding one or more statements.

<compound-statement> ::= { <statements> }

<statements> ::= (empty)

| <statement> <statements>

Here, (empty) indicates an empty sequence of tokens. It is only written so that you understand that it is deliberately empty.

Ambiguity

A grammar should, ideally, not only specify the order in which tokens can appear, but also the structure of a parse tree. So (ideally) any syntactically correct program or program part, such as an expression, should have just one possible parse tree.

Unfortunately, not all grammars are ideal.

Definition

A grammar is ambiguous if there exists a sequence of tokens that has two or more different parse trees.

The expression grammar shown above is ambiguous. Here are two different parse trees for number + number * number

Designing unambiguous grammars

An alternative expression grammar introduces asymmetry to avoid ambiguity. Here is an unambiguous expression grammar that allows binary operators +, - and *, numbers, and parentheses.

<expression> ::= <term>

| <expression> + <term>

| <expression> - <term>

<term> ::= <factor>

| <term> * <factor>

<factor> ::= number

| ( <expression> )

The following is the only parse tree for expression 2+3*4 using this grammar.

The following is a parse tree for expression 2+(4*5) using this grammar.

Precedence and associativity

The precedence level of a binary operator tells how it relates to other binary operators. Operators with higher precedence have higher binding power than operators of lower precedence. In the expression grammar above, * has higher precedence than + or -.

You can enforce precedence by having a symbol in the grammar that stands for each level of precedence. Our grammar has a symbol, <expression>, for the precedence level of + and -; a symbol, <term>, for the precedence level of *; and a symbol, <factor>, for parentheses, which have very high precedence.

Associativity tells how an operator interacts with other operators (including itself) that have the same precedence level. An operator is left-associative if it is parenthesized to the left, and is right-associative if it is parenthesized to the right. For example, if + is left-associative, then a+b+c is understood to mean the same thing as (a+b)+c. But if + is right-associative, then a+b+c is understood to be as if written a+(b+c).

The grammar above forces +, - and * to be left-associative. Notice that this means that operations of like precedence are done from left to right.

Precedence and associativity
The precedence level of a binary operator tells how it relates to other binary operators. Operators with higher precedence have higher binding power than operators of lower precedence. In the expression grammar above, * has higher precedence than + or `-`. You can enforce precedence by having a symbol in the grammar that stands for each level of precedence. Our grammar has a symbol, <expression>, for the precedence level of + and `-`; a symbol, <term>, for the precedence level of ; and a symbol, <factor>, for parentheses, which have very high precedence. Associativity* tells how an operator interacts with other operators (including itself) that have the same precedence level. An operator is left-associative if it is parenthesized to the left, and is right-associative if it is parenthesized to the right. For example, if + is left-associative, then a+b+c is understood to mean the same thing as (a+b)+c. But if + is right-associative, then a+b+c is understood to be as if written a+(b+c). The grammar above forces +, - and * to be left-associative. Notice that this means that operations of like precedence are done from left to right.

Extended BNF notation

There are some convenient abbreviations that are not part of standard BNF notation, but are frequently used in what is called extended BNF notation.

Square brackets indicate an optional thing. For example, [<expression>] indicates that an expression might or might not be here.

Curly braces indicate repetition. More precisely, {X} indicates zero or more copies of X.

Here is a grammar that uses EBNF to define some C statements. Note that the tokens { and } are quoted, to avoid confusion with braces used to indicate repetition.

<if-statement> ::= if ( <expression> ) <statement> [else <statement>]

<compound-statement> ::= '{' {<statement>} '}'

Syntax diagrams

Another common notation for describing syntax is the syntax diagram, a graphical form that is equivalent to BNF, but is often easier to understand. The following is a syntax diagram for a kind of expression.

Syntax diagrams
Another common notation for describing syntax is the syntax diagram, a graphical form that is equivalent to BNF, but is often easier to understand. The following is a syntax diagram for a kind of expression.

Terminology

Tokens are also called terminals.

Symbols such as <expression> that label nonleaves of parse trees are called nonterminals.

Grammars written in BNF notation are also called context-free grammars.