CSCI 5220
Spring 2002
Projects

As part of your software development work for CSCI 5220 you will make a modification to the Astarte compiler. This page discusses choosing what modification to make, and also provides some pointers on how to navigate around the compiler.

Background

Astarte is a programming language that I have defined and implemented. It is intended to support several paradigms within a single framework. The implementation has two major parts: a compiler that translates from Astarte into a byte code; and an interpreter that executes the byte code.

Documentation on the language is available. It is not necessary to read it all, but it is a good idea to familiarize yourself with a few of the examples before choosing what to do. You will need to read some more that is closely related to what you want to do as you go. You will probably want to ask me some questions.

Choosing what to do

You can choose any reasonable project that involves making a change to the compiler. The change can involve the front end or the back end. If you have any ideas, I will be happy to hear them. Below are some suggestions. If done well, whatever you do might be incorporated into the implementation.

I do not recommend that you choose a project that involves substantial modification of the interpreter (as oppossed to the compiler). You will probably find that kind of project too ambitious. If what you want to do involves changes to the interpreter, I will need to help make the necessary modifications there, and you will need to coordinate with me.

Here are suggestions for projects.

Astarte allows you to override one definition with another. Sometimes, when writing a new definition of a function, you want to be able to use the prior definition that you are replacing.

Add expression prior x, where x is an identifier. The idea is that this will occur inside a definition of function x that overrides a prior definition. prior x is the prior (overridden) definition.

This involves creating a {copy} definition of x and referring to the copy definition. For example, if function f is written
```
    Define f(?x) =
      Let y = (prior f)(x).
      If y == 0 then x else y %If
    %Define
```
then this definition is processed as if it were written as follows.
```
    Team
      Define{copy} oldf = f.
      Define f(?x) =
        Let y = oldf(x).
        If y == 0 then x else y %If
      %Define
    %Team
```
Of course, you cannot use the fixed name oldf, but must choose a new name. The name should involve the name of the current package, the name f and a number in case several of these occurs in a single package.

Modifications will involve the following files.
- The parser (src/parser/t_parser.y) will need to handle the new syntax and create the new token prior.
- The lexer (src/lexer/t_lexer.lex) needs to recognize the new reserved word prior.
- src/lexer/t_lexsup.c contains some tables related to tokens that will need updating. The tables are in the same order as the tokens, so be sure to put the new token in the right place.
- messages/syntax.txt holds a simple example for each kind of construct. The lines are in the same order as the tokens. Add a line the gives a brief synopsis of the syntax prior x.
- src/exprs/t_exprutil.c holds general utilities for creating expression trees (syntax trees). Add a function that builds the expression tree for the team. See how the parser handles teams to see how to build a team.
- src/exprs/expr.doc contains documentation on expr nodes, so that you can see what each kind of node can contain. This is a text file, not a Microsoft Word file.
- src/utils/t_strops.c contains general utilities for building and modifying strings. Add one that builds the name to use for oldf.
- You will need to see how modes such as {copy} are handled, so that you can create the mode in the definition of oldf. Modes are handled in src/lexer/t_modes.c.
- src/misc/types.h contains most of the types, including the EXPR type.
- src/debug/t_dprtexpr.c contains a function called print_expr that can be used to print an EXPR data structure, so that you can see what you have built.
Documentation files of interest include the following.
Provide expression After a do b %After. The idea is to perform statement b after doing expression a, but to have the same value as expression a. Statement b should be performed even if evaluation of a fails. After a do b %After should behave like the following.
```
   Try catchingTermination catchingEachThread
     Let res = a.
   then
     b 
     res
   else
     b
     PropagateFailure.
   %Try
```
This says to evaluate a. If evaluation of a succeeds then the then part is executed. Statement b is performed and the result is res, the value of a. But if evaluation of a fails then the else part is executed. Statement b is performed, and the failure is propagated outwards.

Modifications will involve the following files.
- The parser (src/parser/t_parser.y) will need to handle the new syntax and create the new token After.
- The lexer (src/lexer/t_lexer.lex) needs to recognize the new reserved word After.
- src/lexer/t_lexsup.c contains some tables related to tokens that will need updating. The tables are in the same order as the tokens, so be sure to put the new token in the right place.
- messages/syntax.txt holds a simple example for each kind of construct. The lines are in the same order as the tokens. Add a line the gives a brief synopsis of the syntax prior x.
- src/exprs/t_exprutil.c holds general utilities for creating expression trees (syntax trees). Add a function that builds the expression tree for try expression. Since expression b occurs twice you will need to copy it. See src/exprs/t_copyexpr.c.
- src/exprs/t_expr.c contains a functions try_expr and apply_expr that you will find useful.
- src/exprs/expr.doc contains documentation on expr nodes, so that you can see what each kind of node can contain. This is a text file, not a Microsoft Word file.
- src/misc/types.h contains most of the types, including the EXPR type.
- src/debug/t_dprtexpr.c contains a function called print_expr that can be used to print an EXPR data structure, so that you can see what you have built.
Documentation files of interest include the following.
A class (object-oriented programming) can be thought of as a collection of variables, constants and functions. It is common practice to define a function that gets the value of a given variable and another that sets the value of the variable. For example, if your class has a variable called amount, then you might write a function SetAmount(x) that sets amount to x, and another function getAmount that returns the value of amount. Writing those functions is just busywork, and can easily be automated.

Add options (modes) {set} and {get} to var declarations so that a var declaration in a class that has the get option will cause a get method to be defined automatically, and a var declaration with the set option will cause a set method to be defined automatically. For example, declaration
```
    Var{get,set} size holding Integer.
```
would cause the following definitions to be made.
```
    Define my-getSize = \my-size.
    Define My-SetSize ?x. = 
      Assign my-size := x.
    %Define
```
In Astarte, notation my-getSize stands for getSize(self) (or getSize(?self) in pattern match context), my-size stands for size(self), etc. Also, my-size := x abbreviates (my-size, x). So, in less abbreviated form, the definitions would be as follows.
```
    Define getSize(?self) = \(size self).
    Define SetSize ?self, ?x. = 
      Assign (size self, x).
    %Define
```
This involves the following modifications.
- Introduce modes get and set into src/lexer/modes.h and src/lexer/t_modes.c. Be sure to add them to the allowed modes for var declarations in modes.h.
- Add functions to src/exprs/t_exprutil.c to create the get and set functions for a variable. Just build the definitions as shown above, as EXPR data structures. See other functions in t_exprutil.c and t_choose.c for examples. See src/exprs/expr.doc for a description of what can be put into different kinds of EXPR nodes. You will find functions apply_expr and id_expr from src/expr/t_expr.c useful.
- See src/standard/s_stdids.c for an array of standard identifiers. You can use std_id[ASSIGN_ID] as the name of the Assign function and std_id[CONTENT_SYM] as the name of the \ operator.
- When processing a var-declaration in a class, add entries to list class_expects for the get and set methods that are called for. Those entries are processed later so that the compiler knows what was defined inside the class. (This is important in a class interface, where the definitions are only expected, not defined.) You can look at src/dcls/t_dclclass.c:do_class_expectations for how list class_expects is processed later during compilation.
  
  See src/exprs/t_exprutil.c:make_var_expr_p, where if says if(context == 2) ... This is handling var declaration inside a class. The type of the variable being defined by that function is id_ty. The type of the get function for that variable is also id_ty. (The domain, which is the class itself, is implicit, and is added elsewhere.) The type of the set function is id_ty -> (), which can be built as follows in the compiler.
```
          function_t(id_ty, hermit_type)
```
  Use function add_class_expects (src/dcls/t_dclclass.c) to add the expectations.
- Function declare_selectors (src/dcls/t_dclclass.c) declares the selectors for a class. (They are the functions that extract the variables and constants from an object. Without them you would not know how to find the variables and constants.) Create a similar function to create the get and set functions. Look at how declare_selectors works. Call your function where declare_selectors is called in declare_class_polymorphics (src/dcls/t_dclclass.c).
- See src/misc/types.h for definitions of most of the types.
- src/debug/t_dprtexpr.c contains a function called print_expr that can be used to print an EXPR data structure, so that you can see what you have built.
Documentation pages that will be of interest include the following.
Sometimes a program needs to manage resources. For example, if an object gains control of some resource (such as a semaphore) it will later want to release control of the resource. Sometimes an object wants to hold a resource for as long as the object is alive. The object releases the resource only when the object is destroyed. The problem is how an object realizes that it is about to be destroyed.

Astarte currently has a simple resource management facility called finalizer that allows you to attach a function to a box so that, when the box is recovered by the garbage collector, the function is automatically run. The function is supposed to release resources. But finalizer can only attach a function to a box, not to a more general object.

Make it possible to define a finalizer in a class, so that attaching the finalizer to a box related to an object is automatic. Whenever a new object is created, a finalizer for it is created as well. The finalizer should be attached to a box that is part of the object. Every object contains all of the characteristics of type Object, so a box stored there will do the job.

Add a reasonable syntax indicating that a particular function is a finalizer for the class. Think carefully about the semantics of this feature. For example, should the finalizer work for all objects in subclasses of C, or just for objects that belong only to class C?

This will involve the following modifications.
- The parser (src/parser/t_parser.y) and lexer (src/parser/t_lexer.lex) might need to be modified, depending on how you define the syntax for a finalizer. See the first suggested modification for what needs to be done to modify the parser and lexer.
- Semantic actions will need to be added to src/dcls/dclclass.c Each class has a basic constructor that is created by function define_basic_class_constructor. You modify the definition of the basic constructor to create the finalizer.
- The name of function Finalizer in the compiler is std_id[FINALIZER_ID]. See src/standard/t_stdids.c.
- The structure of type Object is defined in ast/object.ast.
Documentation of interest includes the following.
Astarte supports definition by cases. (See defining values by cases and defining functions by cases.) Loop and Choose expressions allow a case to have subcases. (See subcases for choose expressions.), but currently definitions by cases do not allow subcases. Modify the compiler to allow subcases in definition by cases. For example,
```
    Define f by
      ...
      case f(p) when c
        Subcases
          case f(q) = a
          case f(r) = b
          ...
        %Subcases
      ...
    %Define
```
indicates that, when this case is selected, further cases should be tried in the subcases construct.

This will involve modifications to the following parts of the compiler.
- The parser (src/parser/t_parser.y) will need to handle the new syntax. See nonterminal defLetCases.
- You will need to build the syntax tree (EXPR data structure) for this kind of thing. See src/exprs/t_choose.c for functions that build parts of choose/loop expressions and definitions by cases. See src/exprs/expr.doc for documentation on what each kind of EXPR node can contain.
- src/debug/t_dprtexpr.c contains a function called print_expr that can be used to print an EXPR data structure, so that you can see what you have built.
A copy statement has the form
```
     Let y = x.
```
where y is defined to be the same as x. It is common to get copy statements, particularly after the early stages of semantic processing of a definition, even when processing a definition that has no obvious copy statements. There is usually no need for them. Statement Let y = x. can typically be eliminated by simply using x whereever y occurs. That improves efficiency.

Add a semantic processing stage that eliminates copy definitions. For example,
```
    Let x = 0.
    Let y = x.
    Let z = x + y.
```
is replaced by
```
    Let x = 0.
    Let z = x + x.
```
or possibly
```
    Let x = 0.
    ()
    Let z = x + x.
```
where the copy definition has been replaced by a "do nothing" operation (). (The do-nothing operation will not have any code generated for it, so it is not a problem.) Notice that all occurrences of y are replaced by x. You should only eliminate a definition Let y = x if there are no relets to either x or y, so that neither can be changed. You need to check for that.

This does not involve any syntactic changes, but involves adding a phase to the semantic processing. Perform this phase just before the second phase of identifier processing. See Compiler.txt. Files of interest include the following.
- src/exprs/expr.doc describes EXPR data structures. src/exprs/t_exprutil.c contains a function called scan_expr that traverses an expression tree.
- src/ids/t_lateids.c contains the second phase of identifier processing, and src/ids/t_ids.c contains the first phase.
Allow a package to contain an embedded package, where things can be hidden. This involves lengthening the names of packages to include subpackage names, as well as some syntactic and semantic processing to handle the embedded packages. For example,
```
    Package Test
      ...
      Package Embed
         ...
      %Package
      ...
    %Package
```
would use a private name test.embed.x for a private thing called x inside the Embed package. Names like this are already in use, so it should not be to hard just to change the name. It will also be necessary to update which names are visible. Inside Embed, both Test and Test.Embed are visible. Outside the embedded package, only Test is visible.

This will involve modification to or use of the following files.
- The parser (src/parser/t_parser.y) will need to handle the new syntax.
- src/tables/t_fileinfo.c maintains information about the package that is being read, including the name of the package and visibility lists. The information is kept in a stack, so that you push a new frame when you start a new package, and pop that frame when you end a package. You can push a frame onto this stack for the embedded package.
- src/tables/t_globtbl.c and src/tables/t_visible.c handle the visibility issues of identifiers defined in the outer environment. You probably do not need to modify this, though.
- src/util/t_strops.c and src/util/s_strops.c handle naming issues, including looking for package separators. Be sure that names are handled correctly. The "." that separates a package name from the rest of the name is called NAME_SEP_CHAR. You need to be sure that your code remains viable even if the definition of NAME_SEP_CHAR is changed.
Packages are described in packages and the package hierarchy.
(This is probably the most involved of the suggestions.) Astarte supports inline expansions (expand declarations) and pattern matching (pattern declarations). Pattern matching is handled by performing inline expansions.

Currently pattern and expand declarations are only allowed at the outer level. They cannot be embedded inside other declarations. Modify the compiler to allow pattern and expand declarations to be local to a function definition. For example, it should be possible to write
```
    Define f(?x) =
      Expand g(?y) => x + y.
      ...
      Let z = g(n).
      ...
    %Define
```
which is translated to
```
    Define f(?x) =
      ...
      Let z = x + n.
      ...
    %Define
```
Information about pattern and expand definitions is stored in a table. This modification involves adding pattern and expand declarations to a local table for each declaration, and inspecting that local table when looking for a pattern or expand definition. The semantic processing is fairly complex, and the expression tree needs to be scanned more than once. Each time, during scanning, local tables need to be regenerated based on what is in the expression. The pattern and expand definitions will need to be stored explicitly in the EXPR data structure.

This involves changes in expression processing (probably in several modules, but especially in src/exprs/t_exprutil.c) and in pattern match and expand substitution (src/patmatch/t_translt.c). Local tables are handled in src/tables/src/t_loctbl.c.

Navigating around the compiler

The compiler and interpreter are available as source code. Get file astarte-0-11-3.tar.gz. Put it into a subdirectory in your account, and pull it apart:

  gunzip ast*.gz
  tar xvf ast*.tar

You should get directories called ast, htdoc, messages, src and test. Directory ast/ contains the library. Directory messages/ contains strings that the compiler and interpreter print, in a collection of files. Directory src/ contains the C source code. Directory test/ contains some test files. You can run them by going to directory test and doing a make. To rerun them, do make clean and make again.

After unpacking, edit the following files.

src/misc/config.h: Define BASE_DIR to be the directory where you have put the source files.
src/misc/options.h: Comment out the line #include "../misc/prod.h" and uncomment the line #include "../misc/debug.h".
src/exec/Makefile: Define TOPDIR to be the directory where the source is put and define BINDIR to be the directory where you want the binaries put. Change the definition of CFLAGS to be
```
    CFLAGS = -g -W -Wshadow -Wcomment -Wformat -Wimplicit -Wreturn-type
```

After making these changes, build the compiler by setting your current directory to src/exec and performing command

    make

Modify your PATH variable to contain the directory where you have put the binaries. (You might, for example, choose to put them in a directory called bin that you create beside the src, ast, test, etc. directories.)

The source files should initially be read only. Change a file's permission to read/write if you need to edit it. When you are done you will be able to see which files you modified simply by seeing which ones have write permission.

Files Compiler.txt and Interpreter.txt describe the anatomy of the compiler and interpreter, respectively. You should be able to find the files that you need, and avoid reading things that are irrelevant or uninteresting to you.

Ask questions when you need to. I expect to get questions.

Reference counts

The compiler uses reference counts for memory management. For example, each EXPR node has a ref_cnt variable that tells how many pointers are pointing to it. You need to handle reference counts. Thre is a simple recipe for that.

When you create a pointer p to an EXPR node, call bump_expr(p) to add 1 to the reference count of that node.
When you lose a pointer p to an EXPR node for any reason at all, call drop_expr(p). The pointer might, for example, be a variable that is going out of scope. Function drop_expr decreases the reference count of a node and frees that node if the reference count goes to 0. Function drop_expr also knows to do recursive drops for all pointers contained in any node that is being destroyed.
If you want to change the value of a pointer variable that points to an EXPR node, use SET_EXPR instead of an assignment. SET_EXPR(p,x) does the following.
```
     bump_expr(x);
     drop_expr(p);
     p = x;
```

Handling strings

Strings are kept in hash tables. That way, there is one permanent copy of each string that the compiler has seen. That makes it easy to put strings in data structures, without worrying about ever deallocating the strings.

You will probably want to use the identifier table, which contains strings that are identifiers. Function id_tb(s) takes any string s and returns a pointer to the official copy of s in the identifier table. (If there is not an entry yet, it creates one.) You can then use that pointer without worrying about whether it will ever change or be deallocated. Function id_tb is defined in file src/tables/t_strtbl.c.

If you create an EXPR node with the name of an identifier in it, then be sure that the identifier is the official copy, as returned by id_tb. All pointers in the str_tb array are official, so you do not need to call id_tb on them. The str_tb array is defined in src/standard/s_stdids.c.

Never change a string that is in the identifier table. They are constants. Do not even change them temporarily.