Implementation of SFL

The following describes a step by step plan for creating an implementation of SFL. Follow the plan.

1. Strings

You will need to use strings quite a bit in the implementation, and you will not want to worry about memory management for them. A good idea is to create a table with all of the strings that you have seen stored in it. Create such a table. Include a function intern(s) that looks up string s in the table, and returns the string stored for s. If s is not in the table, it should be inserted.

Choose a very simple implementation of the string table. For example, just use an array, and do a linear search of the array to perform an intern operation. Do not try to be fancy.

2. Lexical analysis

Look at the language description, and identify all of the lexemes. Decide on tokens. For lexemes that are one character, such as '(', you can use the character itself as the token. For more complicated lexemes, you will need to choose a name. For example, lexeme -> will need to have a name, such as TOK_ARROW. Create a file that defines numbers for token names. Choose numbers that are 256 or larger. For example, your file might be called token.h, and begin as follows.

#define TOK_ARROW       256
#define TOK_ID	        257

Include a definition of every token that has a name. Also include a definition of type YYSTYPE, as follows.

  typedef union {
    char* str;
  } 
  YYSTYPE;

(YYSTYPE is the type of a token attribute. This assumes that a token attribute is a string.)

Now create a lexical analyzer using Flex. See the Flex manual (unreadable pdf version). The lexer should return the tokens. Here is a start.

%{
#include <stdio.h>
#include "token.h"
YYSTYPE yylval;
%}

letter  [a-zA-Z]
digit   [0-9]
id      {letter}+

%%
def     {return TOK_DEF;
        }

'('     {return '(';
        }

"->" {return TOK_ARROW;
        }

{id}    {yylval.str = intern(yytext);
         return TOK_ID;
        }
%%
int yywrap()
{
  return 1;
}

int main()
{
  while(1) {
    int tok = yylex();
    if(tok == 0) break;
    else if(tok == TOK_ID) printf("ID, name = %s\n", yylval.str);
    else printf("Token %d\n");
  }
  return 0;
}

You will need to handle all of the lexemes and tokens. You will also need to include handling of spaces and comments. Before you do this, read about the parser. It has some suggestions for lexical analysis.

To compile a flex lexer that is in file lexer.lex, use command

  flex lexer.lex

Flex will create a file called lex.yy.c containing the lexer function, called yylex, written in C. Each time you call yylex( ), it returns the next token. yylex( ) returns 0 at the end of the file.

Now test the lexer. Write a small test SFL program. Write a main program that just runs the lexer over and over, printing out the tokens. Pay attention to what it says, and make sure thyat it works.

Test your lexer. Write a program that just runs the lexer repeatedly, printing each token and attribute. To compile a flex program called lexer.lex, use command

  flex lexer.lex

That creates file lex.yy.c, which defines function yylex( ), the lexer. So your main program will call yylex over and over. yylex( ) returns 0 when it encounters the end of file.

By default, yylex reads from the standard input. To get it to read from a file called filename, include the following in your main program.

  extern FILE* yyin;
  yyin = fopen(filename, "r");

Be sure to do this before you call yylex for the first time.

When you have the lexer working, submit it using the following command, if your lexer is called lexer.lex.

  ~karl/5220/bin/submit lexer lexer.lex stringtable.c

Deadline: Feb 5, 11:59pm

3. Parser

Write a context-free grammar for SFL. You will need, at a minimum, a nonterminal for Program, one for Definition, and one for each level of precedence for expressions. You will also need some others to help out. Keep the grammar small where you can. For example, there are expressions of the form isNull A, head A, etc. Instead of having a rule for each, why not have one token for a standard function, and let the lexical analyzer handle all of the different lexemes.

Once you think your grammar is correct, write it in Bison. See the Bison manual (unreadable pdf version). Do not include actions, just write the grammar. Put the rules for Program first, so that parsing will start with Program.

Before the grammar, list all of the tokens that have names. Your Bison file might look something like this, where ... indicates more of similar things.

%{
#define YYDEBUG
#define YYSTYPE_IS_DECLARED
typedef union {
  char* str;
}
YYSTYPE;
%}

%token <str>    TOK_ID
%token          TOK_ARROW
...

%%
Program	        : Definition
	        | Program  Definition
                ;

Definition      : TOK_DEF  TOK_ID  '='  Expression  TOK_END
                ;
...
%%
int main()
{
  yydebug = 1;
  yyparse();
  return 0;
}

Remove the main function from your lexer. Instead of including token.h, include y.tab.h, which is created by Bison. You will not need token.h any more. Remove the declaration of yylval from your lexer. That is now created automatically by Bison. To use yylval, just say

 extern YYSTYPE yylval;

To compile your Bison program, use the following command.

  bison -y -v -d parser.y

That will create three files: y.tab.c, defining a function yyparse that does the parsing; y.tab.h, which defines the tokens; and y.output, which shows the grammar and the states of the LALR(1) parser. If there are conflicts, use y.output to understand what is causing them, and make modifications to your grammar to get rid of them.

Test the parser. The way it is shown above (with YYDEBUG defined and yydebug set to 1) it will show a trace of the parsing process. To turn off the trace, just set yydebug to 0.

When you have the parser working, submit it using the following command (presuming given file names).

  ~karl/5220/bin/submit parser parser.y lexer.lex stringtable.c

Be sure to submit all of your source files. Do not assume that I already have them because you submitted them before.

Deadline: Mar. 4, 11:59pm

4. Abstract syntax trees

Create a type of abstract syntax trees that describe an expression. Look at the kinds of expressions that you need to model, and decide what the trees will look like.

Create a collection of functions for building particular forms of trees. For example, you might have a constructor tree_number(n) that builds a tree representing the integer n, and another tree_apply(A, B) that build a tree representing an application of A to B.

For reasons that will not be at all clear, but that are important, use the following form to handle function expressions, (x -> A). Build a tree that just has one subtree, representing the body of the function. Let's say that the root of such a tree is labeled FUNCTION. Do not store the name of the parameter in the FUNCTION node, or anywhere else. Inside the body (the subtree), use a node of kind PARAMETER, holding a nonnegative integer. For example, function tree_param(n) might create it, indicating the parameter of the function whose node is above this node, and found after skipping over n function nodes on the way up.

To create a tree for expression (x -> A), create a modified version A' of the tree for A. In the modified version, replace each occurrence of x by an appropriate PARAMETER node. You will find it convenient to write a function fix_param(T,x,n) that modifies tree T, replacing each occurrence of identifier x by a PARAMETER node. n is the number to use for that parameter node. Note that, each time fix_param descends into a FUNCTION node, it needs to increase n by 1.

Here are some suggestions about kinds of nodes.

An identifier: An identifier node holds the name of the identifier. It is written identifier(x) below, where x is the name.
A constant node: This is used for character, boolean and integer constants, as well as for the empty list. It should hold the kind of constant and the constant value. This kind of node is indicated by c below.
An operator node: An operator node is used for standard binary operators. It stores an indication of the operator, and has two subtrees, representing the parameters. This kind of node is indicated by op(a,b) below, where a and b are the subtrees.
A standard function node: A standard function node stands for one of the standard functions, such as head, applied to an argument. It has just one subtree, the argument, plus an indication of which function it runs. represents. This kind of node is indicated by fun(a) below,
A conditional node: This has three children: the condition, the true part and the false part. This kind of node is indicated by if(a,b,c) below.
An apply node: An apply node indicates a function application. It has two subtrees: the function and its argument. This kind of node is indicated by apply(a,b) below.
A parameter node: This holds a parameter number. It is written param(n) below.
A function expression node: This kind of node stands for an expression of the form x -> A, and is indicated by function(a) below, where a is the body. Remember that the name of the parameter, x, is not stored.

Write two functions that print syntax trees. One should show the details of the structure, and is used for debugging. The other writes the tree in a form that looks similar to the language syntax.

When you have the tree module working, submit it using the following command,

  ~karl/5220/bin/submit tree tree.c

Be sure to submit all of your source files. Do not assume that I already have them because you submitted them before.

Deadline: Feb. 19, 11:59pm

5. Make the parser produce abstract syntax trees

Modify your parser so that it builds abstract syntax trees for expressions. Modify the definition of YYSTYPE to the following, assuming that Node is the type of a tree node.

typedef union {
  char* str;
  Node* tree;
}
YYSTYPE;

In the same section where you define tokens using %token lines, add lines indicating types of all nonterminals that produce expressions. For example, write

%type <tree> Expression
%type <tree> SimpleExpresion

using the names of your expression nonterminals.

Now make each expression rule build a tree. For example, a rule that handles juxtaposition might look as follows.

 ApplyExpression : ApplyExpression  Term
                        {$$ = tree_apply($1, $2);
                        }

Modify your Definition rules so that they just print the definition by saying which identifier is being defined, and show the tree (in debug form) that it is equal to. The Definition rule would become something like this.

 Definition      : TOK_DEF  TOK_ID  '='  Expression  TOK_END
                        {printf("Define %s = \n", $2);
                         print_tree($4);
                         printf("\n");
                        }

6. Identifier table

Create a table that stores, with each identifier that is defined in a definition, the tree that describes its value. Keep the table simple. Modify the Definition rule so that, instead of printing the definition, it just inserts the definition into the table.

Test your program. When it is working, submit it, using command

  ~karl/5220/bin/submit builder parser.y tree.c table.c lexer.lex stringtable.c

Deadline: Mar 27, 11:59pm

7. Evaluation

Write an interpreter that evaluates an expression, where the expression is described by an abstract syntax tree. The interpreter should reduce the expression to its simplest form (where no more evaluations can be done) and return that simplified form.

The rules for evaluating expressions should be fairly obvious. (Ok, they look obvious to me.) Use call-by-value. To evaluate expression A B, first evaluate A, and check that its result is a function. Then evaluate B. Then perform a function application by doing a substitution. Here are basic rules for computing the simplest form or a tree, also called a normal form. nf(t) is the normal form of tree t. Node kinds described above.

perform-op indicates the result of performing operator op, and perform-fun indicates the result of performing a given function.

nf(c)	=	c [c a constant]
nf(identifier(x))	=	The tree associated with x in the name table, or error if there is none.
nf(op(a,b))	=	perform-op(nf(a), nf(b)) [when nf(a) and nf(b) are both constants]
nf(op(a,b))	=	op(nf(a), nf(b)) [when nf(a) and nf(b) are not both contants]
nf(fun(a))	=	perform-fun(nf(a))
nf(param(n))	=	param(n)
nf(if(a,b,c))	=	nf(b) [if nf(a) = true]
	=	nf(c) [if nf(a) = false]
	=	error [otherwise]
nf(apply(a, b))	=	nf(subst(nf(a), nf(b))) [when nf(a) is a FUNCTION node]
nf(apply(a, b))	=	apply(nf(a), nf(b)) [when nf(a) is not a FUNCTION node]
nf(function(a)	=	function(a)

subst(function(a), r)	=	subst1(a, r, 0)
subst1(c, r, n)	=	c [c a constant]
subst1(op(a,b), r, n)	=	op(subst1(a,r,n), subst1(b,r,n))
subst1(fun(a), r, n)	=	fun(subst1(a, r, n))
subst1(identifier(x), r, n)	=	identifier(x)
subst1(param(k), r, n)	=	r [if k = n]
subst1(param(k), r, n)	=	param(k) [if k =/= n]
subst1(if(a,b,c), r, n)	=	if(subst1(a,r,n), subst1(b,r,n), subst1(c,r,n))
subst1(apply(a, b), r, n)	=	apply(subst1(a,r,n), subst1(b,r,n))
subst1(function(a), r, n)	=	function(subst1(a, r, n+1))

Modify your parser so that the rule for Program looks up the value of "main", computes that tree's normal form, and prints the normal form in pretty form. Test your program. Try some things, and see if they work. Stress it.

When you have it working, submit it using the following command.

  ~karl/5220/bin/submit system parser.y lexer.lex tree.c table.c stringtable.c interpreter.c

Deadline: Apr 18, 11:59pm

8. Type checking [Graduate students only]

Add a type checking module that allows the compiler to report type errors. Assume that all members of a list are required to have the same type.

Representing types

First, you will need to create another kind of tree, representing a type. There are four kinds of types, represented by three kinds of nodes.

A type can be a basic type, either Integer or Boolean. In a tree, it is a basic type node with an indication of the type.
A function has a domain and a codomain. For example, a function that takes an integer and produces an integer has type Integer -> Integer. In a tree, this is a function-type node with two subtrees.
A list indicates the type of the members of the list. I will write the type of a list of integers as [Integer]. In general, [T] is the type of a list whose members have type T. In a tree, this is represented as a list node, with one subtree.
A variable stands for an arbitrary type. The variable node in the tree is marked as a variable, but it holds a pointer to a tree indicating the value of the variable. A variable with an unknown value has a NULL pointer for its value.

Define an appropriate collection of functions for creating these.

Unification

You will need to implement the unification algorithm. The idea is that unify(s,t) binds variables in trees s and t to make them equal. It performs as few bindings as possible. It returns true if the unification can be done, and false if not. The rules are as follows.

unify(name(x), name(y))	=	true if x = y, false if not
unify(list(a), list(b))	=	unify(a, b)
unify(function(a, b), function(c, d))	=	unify(a, c) and unify(b, d)
unify(var x, var x)	=	true
unify(var x, var y)	=	true (make x = y)
unify(var x, T)	=	true (make x = T) provided x does not occur in T
unify(var x, T)	=	false if x occur in T
unify(T, var x)	=	unify(var x, T)

(T is a type that is not a variable.) In all cases not mentioned, unify(a,b) = false.

Important. Unification can bind variables. Any time you look at a type, you need to skip through bound variables. Write a function skipbv(T) that produces what tree T is bound to. (If T is not a bound variable, then skipbv(T) = T.) Use skipbv before you look at any type tree.

Type inference

Add a new field to each node in an expression tree indicating the type of the expression. When an expression node is created, fill in the type with a new, unbound variable.

Now write a function that takes an expression and performs type inference and checking on it. Say that infer(A) returns true if type checking is successful, and false if not. Here are some of the rules for infer. type(A) is the type field of the root of tree A. You should be able to figure out the remaining rules.

infer(c)	=	unify(type(c), Integer) if c is an integer constant
infer(c)	=	unify(type(c), Boolean) if c is a boolean constant
infer(identifier(x))	=	unify(type(identifier(x)), T) where T is the type of the value of x
infer(param(n))	=	true
infer(op(a, b)	=	unify(type(a, Integer) and unify(type(b), Integer) and unify(type(op(a, b), Integer)) for an integer operation
infer(op(a, b)	=	unify(type(a, Boolean) and unify(type(b), Boolean) and unify(type(op(a, b), Boolean)) for a boolean operation
infer(head(a)	=	unify(type(a, list(type(head(a)))))
infer(if(a, b, c))	=	unify(type(a, Boolean) and unify(type(b), type(c)) and unify(type(if(a,b,c)), type(b))
infer(apply(a, b))	=	unify(type(a, function-type(type(b), type(apply(a, b))))

To handle a node of the form function(a), infer should create a new type variable v. It should unify the types of all occurrences of this function's parameter in a with v. Then it should unify the type of this function with function-type(v, type(a)). To do this, you will need a function that finds all of the parameters that belong to this function, and unifies them all with a given type.

Reporting types

After building an expression at a definition, run infer on it. If infer returns false, report that the function is poorly typed, but move on as usual. (Consider it a warning.) If infer produces true, then print the type in a reasonable form. For example, you might print Integer -> [Integer].

To print variables, go through the tree and number the variables. Then print variable number 1 as t1, variable 2 as t2, etc.

When you have type checking working, submit it using the following command (presuming given file names).

  ~karl/5220/bin/submit types type.c parser.y lexer.lex tree.c table.c stringtable.c interpreter.c

Be sure to submit all of your source files. Do not assume that I already have them because you submitted them before.

Deadline: Apr. 29, 11:59pm