The following describes a step by step plan for creating an implementation of SFL. Follow the plan.
You will need to use strings quite a bit in the implementation, and you will not want to worry about memory management for them. A good idea is to create a table with all of the strings that you have seen stored in it. Create such a table. Include a function intern(s) that looks up string s in the table and returns the string stored for s. If s is not in the table, it should be inserted.
Choose a very simple implementation of the string table. For example, just use an array, and do a linear search of the array to perform an intern operation. Do not try to be fancy.
When you have the lexer working, submit it using the following command, if your string table files are stringtable.c and stringtable.h.
~abrahamsonk/5220/bin/submit strings stringtable.c stringtable.h(The first word, strings, is the name of the assignment. After that are names of files to submit.)
Deadline: February 2, 11:59pm
Look at the language description.
Now you have a list of kinds of lexemes. Decide on a token name for each kind. For lexemes that are one character, such as '(', you can use the character itself as the token. For more complicated lexemes, you will need to choose a name. For example, lexeme -> will need to have a name, such as TOK_ARROW. Create a file that defines numbers for token names. Choose numbers that are 256 or larger. For example, your file might be called token.h, and begin as follows.
#define TOK_ARROW 256 #define TOK_ID 257 ...Include a token for every reserved word. Do not try to treat reserved words as identifiers. Also include a definition of type YYSTYPE, as follows.
typedef union { char* str; } YYSTYPE;(YYSTYPE is the type of a token attribute. This assumes that a token attribute is a string.)
Now create a lexical analyzer using Flex. See
the Flex manual
(pdf version). The lexer should
return the tokens. Here is a start.
%{
#include <stdio.h>
#include "token.h"
YYSTYPE yylval;
%}
letter [a-zA-Z]
digit [0-9]
id {letter}+
%%
def {return TOK_DEF;
}
'(' {return '(';
}
"->" {return TOK_ARROW;
}
{id} {yylval.str = intern(yytext);
return TOK_ID;
}
%%
int yywrap()
{
return 1;
}
int main(int argc, int argv)
{
if(argc == 2) {
yyin = fopen(argv[1], "r");
if(yyin == NULL) return 1;
while(1) {
int tok = yylex();
if(tok == 0) break;
else if(tok == TOK_ID) printf("ID, name = %s\n", yylval.str);
else printf("Token %d\n");
}
return 0;
}
}
You will need to handle all of the lexemes and tokens. You will also need
to include handling of spaces and comments. Before you do this, read
about the parser. It has some suggestions for lexical analysis.
To compile a flex lexer that is in file lexer.lex, use command
flex lexer.lexFlex will create a file called lex.yy.c containing the lexer function, called yylex, written in C. Each time you call yylex( ), it returns the next token. yylex( ) returns 0 at the end of the file.
Now test the lexer. Write a small test SFL program. Modify the given main program so that it runs the lexer over and over, printing out the tokens and their attributes, when they have attributes Pay attention to what it says, and make sure that it works.
By default, yylex reads from the standard input. To get it to read from a file called filename, include the following in your main program.
extern FILE* yyin; yyin = fopen(filename, "r");Be sure to do this before you call yylex for the first time. That is done in the main program above, so that the name of the program is written on the command line.
When you have the lexer working, submit it using the following command, if your lexer is called lexer.lex.
~abrahamsonk/5220/bin/submit lexer lexer.lex tokens.h stringtable.c stringtable.h
Deadline: February 9, 11:59pm
Create a type of abstract syntax trees that describe an expression or value. (You will use them both for describing expressions and for describing values of expressions.) Look at the kinds of expressions that you need to model, and decide what the trees will look like. That should be a matter of reading the expressions section.
Create a collection of functions for building particular forms of trees. For example, you might have a constructor numberNode(n) that builds a tree representing the integer n, and another applyNode(A, B) that builds a tree representing an application of A to B.
Keep the number of different kinds of nodes as small as you reasonably can. Instead of building a separate kind of node for each basic function, just have one that has a function number stored in it. For example, basicFunNode(BASICFUN_ISNULL, A) might build a node that stands for expression isNull A. Just define BASICFUN_ISNULL to be a particular integer.
Some kinds of expressions do not need separate kinds of nodes. For example, expession let x = A in B end is equivalent to ((x -> B) A). Just use a function and apply node to build it. There is no 'let' node.
Instead of a 'case' node with a child for every case, just use an 'if' node with three children (a condition, a 'then' expression and an 'else' condition. Translate a case expression into if nodes.
For reasons that will not be at all clear, but that are important, use the following form to handle function expressions, (x -> A). Build a tree that just has one subtree, representing the body of the function. Let's say that the root of such a tree is labeled FUNCTION. Do not store the name of the parameter in the FUNCTION node, or anywhere else. Inside the body (the subtree), use a node of kind PARAMETER, holding a nonnegative integer. A PARAM node holding n stands for the (single) parameter of the FUNCTION node that is found by walking up the tree from the PARAMETER node and skipping over n FUNCTION nodes before reaching the desired FUNCTION node.
To create a tree for expression (x -> A), create a modified version A' of the tree for A. In the modified version, replace each occurrence of x by an appropriate PARAMETER node. You will find it convenient to write a function fix_param(T,x,n) that modifies tree T, replacing each occurrence of identifier x by a PARAMETER node, where n is the number to use for that PARAMETER node. Note that, each time fix_param descends past a FUNCTION node, it needs to increase n by 1.
Here are some suggestions about kinds of nodes.
An identifier
An identifier node holds the name of the identifier. It is written identifierNode(x) below, where x is the name.
A constant node
This is used for character, boolean and integer constants, as well as for the empty list. It should hold the kind of constant and the constant value.
A cons node
has two subtrees, and stands for expression A : B. It is really a linked list node. It is written consNode(A, B) below.
An operator node
An operator node is used for standard binary operators, excluding the colon. It stores an indication of the operator and has two subtrees, representing the parameters. This kind of node is indicated by opNode(op, a,b) below, where a and b are the subtrees.
A standard function node
A standard function node stands for one of the standard functions, such as head, applied to an argument. It has just one subtree, the argument, plus an indication of which function it runs. represents. This kind of node is indicated by funNode(fun,a) below,
A conditional node
This has three children: the condition, the true part and the false part. This kind of node is indicated by ifNode(a,b,c) below. (Note that any case construct can be translated into uses of the simpler concept of an if node.)
An apply node
An apply node indicates a function application. It has two subtrees: the function and its argument. This kind of node is indicated by applyNode(a,b) below.
A parameter node
This holds a parameter number. It is written paramNode(n) below.
A function expression node
This kind of node stands for an expression of the form x -> A, and is indicated by functionNode(a) below, where a is the body. Remember that the name of the parameter, x, is not stored.
A reader node
This kind of node stands for readChar or readInt. It is written readerNode(fun) below, where fun indicates the function (readChar or readInt).
An action node with a child
This kind of node stands for print A or produce A, where A is its subtree. It is written actionNode(fun, A) below.
A sequence node
This has two children, A and B, and stands for expression A ~> B. It is written sequenceNode(A, B) below. (Why don't you need a node kind for A ; B?)
Write a function that prints a syntax tree for debugging purposes. Then test your trees, by themselves. Write a main program that builds and prints a tree. Did you get what you expected?
When you have the tree module working, submit it using the following command,
~abrahamsonk/5220/bin/submit trees tree.c tree.hInclude any other files that are needed for testing trees.
Deadline: February 16, 11:59pm
Create a table whose keys are strings and whose values are trees. Keep it simple. You should be able to
You will use this table for identifiers that are defined in a def definition. For example, at definition def main = A end, you store key "main", tree A into the table. (Here, A is both the expression and the tree that it stands for.)
When you have the symbol table working, submit it using the following command.
~abrahamsonk/5220/bin/submit symboltable symboltable.c symboltable.h
Deadline: February 23, 11:59pm
Write a context-free grammar for SFL. You will need, at a minimum, a nonterminal for Program, one for Definition, and one for each level of precedence for expressions. You will also need some others to help out. Keep the grammar small where you can. For example, there are expressions of the form isNull A, head A, etc. Instead of having a rule for each, why not have one token for a standard function, and let the lexical analyzer handle all of the different lexemes. Make the lexeme be an attribute of the token.
Write the grammar in Bison. See the Bison manual (pdf version). Do not include actions, just write the grammar. Put the rules for Program first, so that parsing will start with Program.
Before the grammar, list all of the tokens that have names. Your Bison file might look something like this, where ... indicates more of similar things.
%{ #define YYDEBUG %} %union { char* str; } %token <str> TOK_ID %token TOK_ARROW ... %% Program : Definition | Program Definition ; Definition : TOK_DEF TOK_ID '=' Expression TOK_END ; ... %% int main(int argc, char** argv) { if(argc == 2) { yyin = fopen(argv[1], "r"); if(yyin == NULL) return 1; yydebug = 1; return yyparse(); } else return 1; }
Remove the main function from your lexer. It will be replaced with the main program in the parser.
Instead of including token.h, include y.tab.h, which is created by Bison. You will not need token.h any more. Remove the declaration of yylval from your lexer. That is now created automatically by Bison. To use yylval, just say
extern YYSTYPE yylval;
To compile your Bison program, use the following command.
bison -y -v -d parser.yThat will create three files: y.tab.c, defining a function yyparse that does the parsing; y.tab.h, which defines the tokens; and y.output, which shows the grammar and the states of the LALR(1) parser. If there are conflicts, use y.output to understand what is causing them, and make modifications to your grammar to get rid of them.
Test the parser. The way it is shown above (with YYDEBUG defined and yydebug set to 1) it will show a trace of the parsing process. To turn off the trace, just set yydebug to 0.
When you have the parser working, submit it using the following command (presuming given file names).
~abrahamsonk/5220/bin/submit parser parser.y lexer.lex stringtable.c stringtable.hBe sure to submit all of your source files. Do not assume that I already have them because you submitted them before.
Deadline: March 5, 11:59pm
Modify your parser so that it builds abstract syntax trees for expressions. Modify the %union directive to the following, assuming that Node is the type of a tree node.
%union { char* str; Node* tree; }In the same section where you define tokens using %token lines, add lines indicating types of all nonterminals that produce expressions. For example, write
%type <tree> Expression %type <tree> SimpleExpresionusing the names of your expression nonterminals.
Now make each expression rule build a tree. For example, a rule that handles juxtaposition might look as follows.
ApplyExpression : ApplyExpression Term {$$ = applyNode($1, $2); }
Modify your Definition rules so that they just print the definition by saying which identifier is being defined, and show the tree (in debug form) that it is equal to. The Definition rule would become something like this.
Definition : TOK_DEF TOK_ID '=' Expression TOK_END {printf("Define %s = \n", $2); print_tree($4); printf("\n"); }
When you have this working, submit it using the following command (presuming given file names).
~abrahamsonk/5220/bin/submit semanticparser parser.y lexer.lex stringtable.c stringtable.h tree.c tree.hBe sure to submit all of your source files. Do not assume that I already have them because you submitted them before.
Deadline: March 19, 11:59pm
Write a function that simplfies a tree. It should perform transformations on a tree with the idea of reducing it to a form that does not require further simplification.
The rules for simplifying trees should be fairly obvious. (Ok, they look obvious to me.) Mainly use call-by-value. Here are some suggestions.
An identifier node always stands for something that was defined using def. To simplify identifierNode(x), look up x in the symbol table and return the tree that is stored there for x.
To simplify a constant node, just return it unchanged.
To simplify consNode(a, b), let a' be the result of simplifying a and let b' be the result of simplifying b. Return consNode(a', b').
To simplify an operator node opNode(op, a, b), first simplify a, then simplify b, then perform operator op on the results. For example, to simplify an addition node, check that both subtrees are integer constants and produce a new integer constant (their sum).
To simplify a standard function node funNode(fun, a), first evaluate a, then apply function fun to the result.
To simplify a conditional node ifNode(a, b, c), first simplify a. If a simplifies to a constant true, the return the result of simplifying b. If a simplifies to a constant false, the return the result of simplifying c. If a yields any other kind of result, it is an error.
To simplify applyNode(a, b), first simplify a, then simplify b. If the simplified version of a is functionNode(E), then find all param nodes in E that refer to this function (by looking at their numbers). Replace each of them by the simplified version of b. That yields a modified version of E. Call it E'. Now simplify E', and return that simplified result.
If the result of simplifying a is not a function node, it is an error.
To simplify a paramNode(n) just return it as is.
To simplify functionNode(E), just return it as is. Do not make any changes.
To simplify a reader node, an action node or a sequence node, return it as is. Do not make any changes.
Modify your parser so that the rule for Program looks up the value of "main", simplifies the value, then prints the simplified tree.
When you have this working, submit it using the following command.
~abrahamsonk/5220/bin/submit simplify parser.y lexer.lex tree.c tree.h symboltable.c symboltable.h stringtable.c stringtable.h simplify.c simplify.h
Deadline: April 6, 11:59pm
Write two functions, performAction, print, where
performAction(t) takes a tree t and "performs" it, returning a tree that is its answer.
print(mode, t) takes a tree t and prints its value, in a form suitable for the final result. The mode tells whether the print is 'pretty'. In a pretty print, you show character 'c' as simply c, without quotes. In an ugly print, you show 'c' as 'c', with quotes. Below, we say to pretty-print a node if you should print it with mode = pretty, and to ugly-print a node to print it with mode = ugly.
Here are some rules for performAction.
Always start by simplifying the tree. Then look at the result of the simplification.
If the node is a constant node or a function node then ugly-print it and return a tree for constant '\0'.
To perform consNode(a, b) or functionNode(a, just ugly-print that node and return constant '\0'.
To perform a reader node, read a character or integer from the standard input and return it, without printing anything.
To perform a print node, first simplify the subtree. Then pretty-print the simplified subtree and return constant '\0'.
To perform a produce node, simplify the subtree. Then return the simplified subtree.
To perform sequenceNode(a, b), make x = performAction(a). Then build y = applyNode(b, x) and z = performAction(y). Return z.
Here are some rules for print.
Always start by simplifying the tree. Then look at the result of the simplification.
If the node is a constant node or a function node then print it in a format suitable for showing. Show 321 as 321. In ugly mode, show character 'y' as 'y' (with the quotes). In pretty mode, show 'y' as y. Show an empty list as [ ].
To print consNode(a, b), print a then a colon then print b, where the recursive calls have the same mode as the call at the consNode.
To print functionNode(a), just show "(a function)".
To print a reader or action or sequence node, just show "(an action)".
Modify your parser so that now, after fetching the tree t called "main", it does performAction(t). You should now have a full, working program.
When you have this working, submit it using the following command.
~abrahamsonk/5220/bin/submit full parser.y lexer.lex tree.c tree.h symboltable.c symboltable.h stringtable.c stringtable.h simplify.c simplify.y toplevel.c toplevel.h
Deadline: April 13, 11:59pm
Your current implementation will go through memory very quickly. This improvement allows you to recycle unused memory.
Modify the tree module so, whenever it needs a tree node, it calls the function written above to get a node from the free space list. Test your program to ensure that it still works.
Create a linked list of tree** pointers. It will hold pointers to local variables in functions that point to trees that need to be remembered. This list should initially be empty.
Write a function getMark() that returns a pointer to the current front of the linked list (or that returns NULL if the linked list is empty).
Write function remember(p) adds tree** pointer p to the beginning of the list, and returns a pointer to the list cell that was formerly the start of the linked list, or that returns NULL if the list used to be empty.
Write function forget(q) that takes a pointer that was returned by remember and removes all list cells up to but not including q from the list.
Modify your functions that work on trees. When a function starts it should do something like
RememberList* mark = getMark();to remember where things were. After it stores a tree into variable t, it should do
remember(&t);to ensure that the tree will be visible to the garbage collector. (If a function changes what is in a variable that has already been remembered, it should not remember that variable again.) When the function is finished, it should do
forget(mark);to remove things that were stored here. It is very important that you do this forget call in every function that did any remember calls.
When you have this working, submit it using the following command.
~abrahamsonk/5220/bin/submit fullgc parser.y lexer.lex tree.c tree.h symboltable.c symboltable.h stringtable.c stringtable.h simplify.c simplify.y toplevel.c toplevel.h memmanage.c memmanage.h
Deadline: April 27, 11:59pm