Implementation of SFL

The following describes a step by step plan for creating an implementation of SFL. Follow the plan.

1. Strings

You will need to use strings quite a bit in the implementation, and you will not want to worry about memory management for them. A good idea is to create a table with all of the strings that you have seen stored in it. Create such a table. Include a function intern(s) that looks up string s in the table and returns the string stored for s. If s is not in the table, it should be inserted.

Choose a very simple implementation of the string table. For example, just use an array, and do a linear search of the array to perform an intern operation. Do not try to be fancy.

When you have the lexer working, submit it using the following command, if your string table files are stringtable.c and stringtable.h.

  ~abrahamsonk/5220/bin/submit strings stringtable.c stringtable.h

(The first word, strings, is the name of the assignment. After that are names of files to submit.)

Deadline: February 2, 11:59pm

2. Lexical analysis

Look at the language description.

Identify all of the reserved words. Write them down.
Identify all of the symbols. Write them down.
Look for other kinds of lexemes that a program can contain. Write them down.

Now you have a list of kinds of lexemes. Decide on a token name for each kind. For lexemes that are one character, such as '(', you can use the character itself as the token. For more complicated lexemes, you will need to choose a name. For example, lexeme -> will need to have a name, such as TOK_ARROW. Create a file that defines numbers for token names. Choose numbers that are 256 or larger. For example, your file might be called token.h, and begin as follows.

#define TOK_ARROW       256
#define TOK_ID	        257
...

Include a token for every reserved word. Do not try to treat reserved words as identifiers. Also include a definition of type YYSTYPE, as follows.

  typedef union {
    char* str;
  } 
  YYSTYPE;

(YYSTYPE is the type of a token attribute. This assumes that a token attribute is a string.)

Now create a lexical analyzer using Flex. See the Flex manual (pdf version). The lexer should return the tokens. Here is a start.

%{
#include <stdio.h>
#include "token.h"
YYSTYPE yylval;
%}

letter  [a-zA-Z]
digit   [0-9]
id      {letter}+

%%
def     {return TOK_DEF;
        }

'('     {return '(';
        }

"->" {return TOK_ARROW;
        }

{id}    {yylval.str = intern(yytext);
         return TOK_ID;
        }
%%
int yywrap()
{
  return 1;
}

int main(int argc, int argv)
{
  if(argc == 2) {
    yyin = fopen(argv[1], "r");
    if(yyin == NULL) return 1;
    while(1) {
      int tok = yylex();
      if(tok == 0) break;
      else if(tok == TOK_ID) printf("ID, name = %s\n", yylval.str);
      else printf("Token %d\n");
    }
    return 0;
  }
}

You will need to handle all of the lexemes and tokens. You will also need to include handling of spaces and comments. Before you do this, read about the parser. It has some suggestions for lexical analysis.

To compile a flex lexer that is in file lexer.lex, use command

  flex lexer.lex

Flex will create a file called lex.yy.c containing the lexer function, called yylex, written in C. Each time you call yylex( ), it returns the next token. yylex( ) returns 0 at the end of the file.

Now test the lexer. Write a small test SFL program. Modify the given main program so that it runs the lexer over and over, printing out the tokens and their attributes, when they have attributes Pay attention to what it says, and make sure that it works.

By default, yylex reads from the standard input. To get it to read from a file called filename, include the following in your main program.

  extern FILE* yyin;
  yyin = fopen(filename, "r");

Be sure to do this before you call yylex for the first time. That is done in the main program above, so that the name of the program is written on the command line.

When you have the lexer working, submit it using the following command, if your lexer is called lexer.lex.

  ~abrahamsonk/5220/bin/submit lexer lexer.lex tokens.h stringtable.c stringtable.h

Deadline: February 9, 11:59pm

3. Abstract syntax trees

Create a type of abstract syntax trees that describe an expression or value. (You will use them both for describing expressions and for describing values of expressions.) Look at the kinds of expressions that you need to model, and decide what the trees will look like. That should be a matter of reading the expressions section.

Create a collection of functions for building particular forms of trees. For example, you might have a constructor numberNode(n) that builds a tree representing the integer n, and another applyNode(A, B) that builds a tree representing an application of A to B.

Keep the number of different kinds of nodes as small as you reasonably can. Instead of building a separate kind of node for each basic function, just have one that has a function number stored in it. For example, basicFunNode(BASICFUN_ISNULL, A) might build a node that stands for expression isNull A. Just define BASICFUN_ISNULL to be a particular integer.

Some kinds of expressions do not need separate kinds of nodes. For example, expession let x = A in B end is equivalent to ((x -> B) A). Just use a function and apply node to build it. There is no 'let' node.

Instead of a 'case' node with a child for every case, just use an 'if' node with three children (a condition, a 'then' expression and an 'else' condition. Translate a case expression into if nodes.

For reasons that will not be at all clear, but that are important, use the following form to handle function expressions, (x -> A). Build a tree that just has one subtree, representing the body of the function. Let's say that the root of such a tree is labeled FUNCTION. Do not store the name of the parameter in the FUNCTION node, or anywhere else. Inside the body (the subtree), use a node of kind PARAMETER, holding a nonnegative integer. A PARAM node holding n stands for the (single) parameter of the FUNCTION node that is found by walking up the tree from the PARAMETER node and skipping over n FUNCTION nodes before reaching the desired FUNCTION node.

To create a tree for expression (x -> A), create a modified version A' of the tree for A. In the modified version, replace each occurrence of x by an appropriate PARAMETER node. You will find it convenient to write a function fix_param(T,x,n) that modifies tree T, replacing each occurrence of identifier x by a PARAMETER node, where n is the number to use for that PARAMETER node. Note that, each time fix_param descends past a FUNCTION node, it needs to increase n by 1.

Here are some suggestions about kinds of nodes.

An identifier: An identifier node holds the name of the identifier. It is written identifierNode(x) below, where x is the name.
A constant node: This is used for character, boolean and integer constants, as well as for the empty list. It should hold the kind of constant and the constant value.
A cons node: has two subtrees, and stands for expression A : B. It is really a linked list node. It is written consNode(A, B) below.
An operator node: An operator node is used for standard binary operators, excluding the colon. It stores an indication of the operator and has two subtrees, representing the parameters. This kind of node is indicated by opNode(op, a,b) below, where a and b are the subtrees.
A standard function node: A standard function node stands for one of the standard functions, such as head, applied to an argument. It has just one subtree, the argument, plus an indication of which function it runs. represents. This kind of node is indicated by funNode(fun,a) below,
A conditional node: This has three children: the condition, the true part and the false part. This kind of node is indicated by ifNode(a,b,c) below. (Note that any case construct can be translated into uses of the simpler concept of an if node.)
An apply node: An apply node indicates a function application. It has two subtrees: the function and its argument. This kind of node is indicated by applyNode(a,b) below.
A parameter node: This holds a parameter number. It is written paramNode(n) below.
A function expression node: This kind of node stands for an expression of the form x -> A, and is indicated by functionNode(a) below, where a is the body. Remember that the name of the parameter, x, is not stored.
A reader node: This kind of node stands for readChar or readInt. It is written readerNode(fun) below, where fun indicates the function (readChar or readInt).
An action node with a child: This kind of node stands for print A or produce A, where A is its subtree. It is written actionNode(fun, A) below.
A sequence node: This has two children, A and B, and stands for expression A ~> B. It is written sequenceNode(A, B) below. (Why don't you need a node kind for A ; B?)

Write a function that prints a syntax tree for debugging purposes. Then test your trees, by themselves. Write a main program that builds and prints a tree. Did you get what you expected?

When you have the tree module working, submit it using the following command,

  ~abrahamsonk/5220/bin/submit trees tree.c tree.h

Include any other files that are needed for testing trees.

Deadline: February 16, 11:59pm

4. Symbol table

Create a table whose keys are strings and whose values are trees. Keep it simple. You should be able to

ask whether a particular string is a key in the table,
insert a particular string, with a corresponding tree, into the table,
look up a string and get the corresponding tree.

You will use this table for identifiers that are defined in a def definition. For example, at definition def main = A end, you store key "main", tree A into the table. (Here, A is both the expression and the tree that it stands for.)

When you have the symbol table working, submit it using the following command.

  ~abrahamsonk/5220/bin/submit symboltable symboltable.c symboltable.h

Deadline: February 23, 11:59pm

5. Parser

Write a context-free grammar for SFL. You will need, at a minimum, a nonterminal for Program, one for Definition, and one for each level of precedence for expressions. You will also need some others to help out. Keep the grammar small where you can. For example, there are expressions of the form isNull A, head A, etc. Instead of having a rule for each, why not have one token for a standard function, and let the lexical analyzer handle all of the different lexemes. Make the lexeme be an attribute of the token.

Write the grammar in Bison. See the Bison manual (pdf version). Do not include actions, just write the grammar. Put the rules for Program first, so that parsing will start with Program.

Before the grammar, list all of the tokens that have names. Your Bison file might look something like this, where ... indicates more of similar things.

%{
#define YYDEBUG
%}

%union {
  char* str;
}

%token <str>    TOK_ID
%token          TOK_ARROW
...

%%
Program	        : Definition
	        | Program  Definition
                ;

Definition      : TOK_DEF  TOK_ID  '='  Expression  TOK_END
                ;
...
%%
int main(int argc, char** argv)
{
  if(argc == 2) {
    yyin = fopen(argv[1], "r");
    if(yyin == NULL) return 1;
    yydebug = 1;
    return yyparse();
  }
  else return 1;
}

Remove the main function from your lexer. It will be replaced with the main program in the parser.

Instead of including token.h, include y.tab.h, which is created by Bison. You will not need token.h any more. Remove the declaration of yylval from your lexer. That is now created automatically by Bison. To use yylval, just say

 extern YYSTYPE yylval;

To compile your Bison program, use the following command.

  bison -y -v -d parser.y

That will create three files: y.tab.c, defining a function yyparse that does the parsing; y.tab.h, which defines the tokens; and y.output, which shows the grammar and the states of the LALR(1) parser. If there are conflicts, use y.output to understand what is causing them, and make modifications to your grammar to get rid of them.

Test the parser. The way it is shown above (with YYDEBUG defined and yydebug set to 1) it will show a trace of the parsing process. To turn off the trace, just set yydebug to 0.

When you have the parser working, submit it using the following command (presuming given file names).

  ~abrahamsonk/5220/bin/submit parser parser.y lexer.lex stringtable.c stringtable.h

Be sure to submit all of your source files. Do not assume that I already have them because you submitted them before.

Deadline: March 5, 11:59pm

6. Make the parser produce abstract syntax trees

Modify your parser so that it builds abstract syntax trees for expressions. Modify the %union directive to the following, assuming that Node is the type of a tree node.

%union {
  char* str;
  Node* tree;
}

In the same section where you define tokens using %token lines, add lines indicating types of all nonterminals that produce expressions. For example, write

%type <tree> Expression
%type <tree> SimpleExpresion

using the names of your expression nonterminals.

Now make each expression rule build a tree. For example, a rule that handles juxtaposition might look as follows.

 ApplyExpression : ApplyExpression  Term
                        {$$ = applyNode($1, $2);
                        }

Modify your Definition rules so that they just print the definition by saying which identifier is being defined, and show the tree (in debug form) that it is equal to. The Definition rule would become something like this.

 Definition      : TOK_DEF  TOK_ID  '='  Expression  TOK_END
                        {printf("Define %s = \n", $2);
                         print_tree($4);
                         printf("\n");
                        }

When you have this working, submit it using the following command (presuming given file names).

  ~abrahamsonk/5220/bin/submit semanticparser parser.y lexer.lex stringtable.c stringtable.h tree.c tree.h

Be sure to submit all of your source files. Do not assume that I already have them because you submitted them before.

Deadline: March 19, 11:59pm

7. Simplification

Write a function that simplfies a tree. It should perform transformations on a tree with the idea of reducing it to a form that does not require further simplification.

The rules for simplifying trees should be fairly obvious. (Ok, they look obvious to me.) Mainly use call-by-value. Here are some suggestions.

An identifier node always stands for something that was defined using def. To simplify identifierNode(x), look up x in the symbol table and return the tree that is stored there for x.
To simplify a constant node, just return it unchanged.
To simplify consNode(a, b), let a' be the result of simplifying a and let b' be the result of simplifying b. Return consNode(a', b').
To simplify an operator node opNode(op, a, b), first simplify a, then simplify b, then perform operator op on the results. For example, to simplify an addition node, check that both subtrees are integer constants and produce a new integer constant (their sum).
To simplify a standard function node funNode(fun, a), first evaluate a, then apply function fun to the result.
To simplify a conditional node ifNode(a, b, c), first simplify a. If a simplifies to a constant true, the return the result of simplifying b. If a simplifies to a constant false, the return the result of simplifying c. If a yields any other kind of result, it is an error.
To simplify applyNode(a, b), first simplify a, then simplify b. If the simplified version of a is functionNode(E), then find all param nodes in E that refer to this function (by looking at their numbers). Replace each of them by the simplified version of b. That yields a modified version of E. Call it E'. Now simplify E', and return that simplified result.

If the result of simplifying a is not a function node, it is an error.
To simplify a paramNode(n) just return it as is.
To simplify functionNode(E), just return it as is. Do not make any changes.
To simplify a reader node, an action node or a sequence node, return it as is. Do not make any changes.

Modify your parser so that the rule for Program looks up the value of "main", simplifies the value, then prints the simplified tree.

When you have this working, submit it using the following command.

  ~abrahamsonk/5220/bin/submit simplify parser.y lexer.lex tree.c tree.h symboltable.c symboltable.h stringtable.c stringtable.h simplify.c simplify.h

Deadline: April 6, 11:59pm

8. The top level

Write two functions, performAction, print, where

performAction(t) takes a tree t and "performs" it, returning a tree that is its answer.
print(mode, t) takes a tree t and prints its value, in a form suitable for the final result. The mode tells whether the print is 'pretty'. In a pretty print, you show character 'c' as simply c, without quotes. In an ugly print, you show 'c' as 'c', with quotes. Below, we say to pretty-print a node if you should print it with mode = pretty, and to ugly-print a node to print it with mode = ugly.

Here are some rules for performAction.

Always start by simplifying the tree. Then look at the result of the simplification.
If the node is a constant node or a function node then ugly-print it and return a tree for constant '\0'.
To perform consNode(a, b) or functionNode(a, just ugly-print that node and return constant '\0'.
To perform a reader node, read a character or integer from the standard input and return it, without printing anything.
To perform a print node, first simplify the subtree. Then pretty-print the simplified subtree and return constant '\0'.
To perform a produce node, simplify the subtree. Then return the simplified subtree.
To perform sequenceNode(a, b), make x = performAction(a). Then build y = applyNode(b, x) and z = performAction(y). Return z.

Here are some rules for print.

Always start by simplifying the tree. Then look at the result of the simplification.
If the node is a constant node or a function node then print it in a format suitable for showing. Show 321 as 321. In ugly mode, show character 'y' as 'y' (with the quotes). In pretty mode, show 'y' as y. Show an empty list as [ ].
To print consNode(a, b), print a then a colon then print b, where the recursive calls have the same mode as the call at the consNode.
To print functionNode(a), just show "(a function)".
To print a reader or action or sequence node, just show "(an action)".

Modify your parser so that now, after fetching the tree t called "main", it does performAction(t). You should now have a full, working program.

When you have this working, submit it using the following command.

  ~abrahamsonk/5220/bin/submit full parser.y lexer.lex tree.c tree.h symboltable.c symboltable.h stringtable.c stringtable.h simplify.c simplify.y toplevel.c toplevel.h

Deadline: April 13, 11:59pm

9. Garbage collection [Graduate students only]

Your current implementation will go through memory very quickly. This improvement allows you to recycle unused memory.

When the program starts, create a large array of tree nodes. Make it easy to change the size of this array. Now link all of the tree nodes in the array together into a linked list (for example, using their left-subtree pointers). That list is called the free space list.
Write a function to allocate a tree node. It should get the first node from the free space list, and remove that node from the list. If there are no more nodes in the free space list then it should call function GarbageCollect(), which can initially be a stub, and try again. If it still can't get a node after calling GarbageCollect it should print a message that the program is out of memory and stop the program.
Modify the tree module so, whenever it needs a tree node, it calls the function written above to get a node from the free space list. Test your program to ensure that it still works.
Add a field to the tree nodes called the mark bit.
Create a linked list of tree** pointers. It will hold pointers to local variables in functions that point to trees that need to be remembered. This list should initially be empty.

Write a function getMark() that returns a pointer to the current front of the linked list (or that returns NULL if the linked list is empty).

Write function remember(p) adds tree** pointer p to the beginning of the list, and returns a pointer to the list cell that was formerly the start of the linked list, or that returns NULL if the list used to be empty.

Write function forget(q) that takes a pointer that was returned by remember and removes all list cells up to but not including q from the list.
Modify your functions that work on trees. When a function starts it should do something like
```
  RememberList* mark = getMark();
```
to remember where things were. After it stores a tree into variable t, it should do
```
  remember(&t);
```
to ensure that the tree will be visible to the garbage collector. (If a function changes what is in a variable that has already been remembered, it should not remember that variable again.) When the function is finished, it should do
```
  forget(mark);
```
to remove things that were stored here. It is very important that you do this forget call in every function that did any remember calls.
Write GarbageCollect. It should work as follows.
1. Scan the entire array of nodes, setting all of the mark bits to 0.
2. Scan all of the trees in the symbol table. Set the mark bit of each node in each tree to 1. If you ever encounter a node that already has its mark bit set, do not mark it or its subtrees again.
3. Scan all the "remember" list. If it contains tree** pointer v, then set the mark node of every node in tree *v to 1. Again, if you ever encounter a node that already has its mark bit set, do not mark it or its subtrees again.
4. Scan the entire array of nodes. Each node that has a mark bit of 0 is inaccessible. Add it to the free space list.

When you have this working, submit it using the following command.

  ~abrahamsonk/5220/bin/submit fullgc parser.y lexer.lex tree.c tree.h symboltable.c symboltable.h stringtable.c stringtable.h simplify.c simplify.y toplevel.c toplevel.h memmanage.c memmanage.h

Deadline: April 27, 11:59pm