Computer Science 2311
Spring 2005
Laboratory Assignment 8

Handed out: 3/23/05

Note. Read this entire assignment before beginning to work on it.


Edit distance

We studied the edit distance problem in class. I will repeat the problem definition and the basic facts that we derived.

The problem

A basic edit operation on a string is either to insert a character anywhere, to delete any character, or to replace one character by another.

You are given two strings X and Y. You want to know the smallest number of basic edit operations that can be used to change X into Y. Define that to be distance(X, Y). For example, distance("cat", "cute") = 2, since you can change "cat" into "cute" by performing two basic edit operations. (Replace the a by u, then insert e at the end.)

Prefixes and prefix distances

Suppose that X and Y are two strings. Define X(i) to be the length i prefix of X, and Y(j) to be the length j prefix of Y. For example, if X is "kangaroo", then X(5) is "kanga".

Define dist(X,Y,i,j) to be distance(X(i), Y(j)). That is, dist(X,Y,i,j) is the edit distance between the length i prefix of X and the length j prefix of Y.

Facts

Suppose that the characters in strings X and Y are numbered starting at 0. So the first character of X is X0, the second character is X1, etc.

We derived the following facts in class.

  1. dist(X, Y, 0, j) = j

  2. dist(X, Y, i, 0) = i

  3. dist(X, Y, i, j) = dist(X, Y, i-1, j-1)  provided Xi-1 = Yj-1

  4. dist(X, Y, i, j) = 1 + min(dist(X, Y, i-1, j-1), dist(X, Y, i-1, j), dist(X, Y, i, j-1)  provided Xi-1 =/= Yj-1


Algorithm 1

A straightforward algorithm is to use the above facts in a recursive function definition. Here is pseudo-code (not Java) for the recursive definition.

   dist(X,Y,i,j)

     if i = 0 return j
     else if j = 0 return i
     else if X[i-1] = Y[j-1] return dist(X,Y,i-1,j-1)
     else return 1 + min(dist(X,Y,i-1,j-1), dist(X,Y,i-1,j), dist(X,Y,i,j-1))

   distance(X,Y)
     return dist(X,Y,length(X),length(Y))

Algorithm 2

Algorithm 1 has the problem that it recomputes the same thing many times. That makes it very slow. Recomputation can be avoided by writing things down, and using the results that you wrote down earlier rather than recomputing them. You can create a two-dimensional array (a grid) D where Di, j will be set to hold dist(X,Y,i,j). If you store values in the D array in the right order, (such as by rows, from row 0 to row n, left-to-right within a row) you will find that the values you need have always been computed earlier, and you just get them out of the array.

The following algorithm uses the same facts as the first algorithnm, but avoids repeated computation of the same thing. Again, it is written in pseudo-code.

   distance(X,Y)

   xlen = length(X)
   ylen = length(Y)

   --Create a two-dimensional array D with xlen+1 rows and ylen+1 columns.
   --The idea is that D[i][j] will be set to hold dist(X,Y,i,j) for
   --all i from 0 to xlen and j from 0 to ylen.

   --Using fact 1:

   for j = 0,...,ylen
     D[0][j] = j
   end for

   --Using fact 2:
   
   for i = 0,...,xlen
     D[i][0] = i
   end for

   --Using facts 3 and 4:

   for i = 1,...,xlen
     for j = 1,...,ylen
       if X[i-1] = Y[j-1]
         D[i][j] = D[i-1][j-1]
       else
         D[i][j] = 1 + min(D[i-1][j-1], D[i-1][j], D[i][j-1])
       end if
     end for
   end for
  
   return D[xlen][ylen]

The assignment

  1. Implement algorithm 1 in Java. Call your function distance1. Write a short main program to test it. For example, a start might be

        public static void main(String[] args)
        {
          System.out.println("The distance between \"cat\" and \"cute\" is "
                             + distance1("cat", "cute"));
        }
      
    Try at least two examples, but keep them reasonably short. Do not proceed until you believe that distance1 is working correctly.

     

  2. Implement algorithm 2 in Java. Call your function distance2. Do not remove algorithm 1. Leave your distance1 function in your program. Just add this new one.

    Change your main program to call distance2. Test it. Do not proceed until it works.

     

  3. Comment out your main program that performs tests. Do not remove it.

     

  4. Write a main program that gets two strings X and Y from the user using message boxes. See your text, page 118, for an example of messages boxes. There are notes below on message boxes as well. Using another message box, your program should ask the user for a number, either 1 or 2, telling which algorithm to use. Your main program should then pop up a box telling the edit distance between X and Y, using algorithm 1 if the number is 1, and algorithm 2 if the number is 2.

    Test your program before continuing.

     

  5. Try your algorithms on some long strings. Your algorithms will take the most time when the two strings do not contain any characters in common. For example, compute the edit distance between abcdefghijklm and nopqrstuvwxyz. Which algorithm apperars to be more efficient? Is the difference significant? What do you think will happen if you use two strings of length 30?


Strings

You can get the i-th character of a string x using Java expression x.charAt(i). The characters are numbered starting at 0.

You can get the length of string x using x.length(). Notice that the pseudo-code writes this length(x). You need to remember that when people describe algorithms to you, they are not writing programs, and you cannot expect them to use Java notation. You need to translate into Java.


Two-dimensional arrays

For algorithm 2, you need to use a two-dimensional array. A two dimensional array of integers has type int[][]. So

  int[][] d;
creates a variable d that can refer to a two-dimensional array of integers. Now you need to build an actual array, and make d refer to it. Statement
  d = new int[m][n];
builds a two dimensional array of integers with m rows and n columns, and puts that array into variable d. For example, assignment
  d = new int[2][3];
creates an array that looks like this:

              
              

Each slot can hold one integer.

To use the value at row i and column j in array d, write d[i][j]. But remember that numbering is from 0, so the first row of the array is row 0, and the first column is column 0.

What a two-dimensional array really is

A two-dimensional array is really just an array of arrays. It is an array of its rows, and each row is an array of colums. If d is a two-dimensional array, then d.length is the number of rows, and d[0].length is the number of columns in row 0. Typically, all of the rows have the same number of columns, but you can create a two-dimensional array where different rows have different sizes.


Message boxes using the Swing library

The Java library has been built up over time, and has different sections that were contributed by different groups, or at different times. One part is the Swing library, which helps you do graphics, and is popular and easy to use. To use Swing, you should put

  import javax.swing.*;
in your program.

Java statement

  String xstr = JOptionPane.showInputDialog("What is the first string?");
causes a message box to pop up that contains the text "What is the first string?", and also has a box where the user can type a string. It will set xstr to the string that the user types.

If you just want to show a message, not ask for information, use something like the following. If msg is a string that you want to show, then

  JOptionPane.showMessageDialog(null, msg);
will pop up a box showing message msg, with an ok button to allow the user to way when he or she is done reading the message.

Important. If you use graphics, then you need to shut down the graphics support when the program is done. At the end of main, add statement

  System.exit(0);
to shut everything down. If this statement is performed anywhere in a program, the entire program is stopped. You cannot do anything after this statement.