Computer Science 2611
Fall 2004
Laboratory Assignment 7

Handed out: 11/1/04


The edit distance problem

We studied the edit distance problem in class. I will repeat the problem definition and the basic facts that we derived.

The problem

A basic edit operation on a string is either to insert a character anywhere, to delete any character, or to replace one character by another.

You are given two strings X and Y. You want to know the smallest number of basic edit operations that can be used to change X into Y. Define that to be distance(X, Y). For example, distance("cat", "cute") = 2, since you can change "cat" into "cute" by performing two basic edit operations. (Replace the a by u, then insert e at the end.)

Prefixes and prefix distances

Suppose that X and Y are two strings. Define X(i) to be the length i prefix of X, and Y(j) to be the length j prefix of Y. For example, if X is "kangaroo", then X(5) is "kanga".

Define dist(X,Y,i,j) to be distance(X(i), Y(j)). That is, dist(X,Y,i,j) is the edit distance between the length i prefix of X and the length j prefix of Y.

Facts

Suppose that the characters in strings X and Y are numbered starting at 0. So the first character of X is X0, the second character is X1, etc.

We derived the following facts in class.

  1. dist(X, Y, 0, j) = j
  2. dist(X, Y, i, 0) = i
  3. dist(X, Y, i, j) = dist(X, Y, i-1, j-1)  provided Xi-1 = Yj-1
  4. dist(X, Y, i, j) = 1 + min(dist(X, Y, i-1, j-1), dist(X, Y, i-1, j), dist(X, Y, i, j-1)  provided Xi-1 =/= Yj-1


Algorithm 1

A straightforward algorithm is to use the above facts in a recursive function definition. Here is pseudo-code (not C++) for the recursive definition.

   dist(X,Y,i,j)

     if i = 0 return j
     else if j = 0 return i
     else if X[i-1] = Y[j-1] return dist(X,Y,i-1,j-1)
     else return 1 + min3(dist(X,Y,i-1,j-1), dist(X,Y,i-1,j), dist(X,Y,i,j-1))

   distance(X,Y)
     return dist(X,Y,strlen(X),strlen(Y))

Algorithm 2

Algorithm 1 has the problem that it recomputes the same thing many times. That makes it slow. Recomputation can be avoided by writing things down, and using the results that you wrote down earlier rather than recomputing. The following algorithm uses the same facts, but avoids repeated computation of the same thing. Again, it is written in pseudo-code.

   distance(X,Y)

   xlen = strlen(X)
   ylen = strlen(Y)

   Create a two-dimensional array D with xlen+1 rows and ylen+1 columns.
   The idea is that D[i][j] will be set to hold dist(X,Y,i,j) for
   all i from 0 to xlen and j from 0 to ylen.

   for j = 0,...,ylen
     D[0][j] = j
   end for

   for i = 0,...,xlen
     D[i][0] = i
   end for

   for i = 1,...,xlen
     for j = 1,...,ylen
       if X[i-1] = Y[j-1]
         D[i][j] = D[i-1][j-1]
       else
         D[i][j] = 1 + min3(D[i-1][j-1], D[i-1][j], D[i][j-1])
       end if
     end for
   end for
  
   return D[xlen][ylen]

The assignment

  1. Implement algorithm 1 in C++. Call your function distance1. Write a short main program to test it. For example, a start might be

        int main()
        {
          cout << "The distance between \"cat\" and \"cute\" is "
               << distance1("cat", "cute")
               << endl;
          return 0;
        }
      
    Try at least two examples, but keep them reasonably short. Do not proceed until you believe that distance1 is working correctly.

     

  2. Implement algorithm 2 in C++. Call your function distance2. Do not remove algorithm 1. Leave your distance1 function in your program. Just add this new one.

    Change your main program to call distance2. Test it. Do not proceed until it works.

     

  3. Comment out your main program that performs tests. Do not remove it.

     

  4. Write a main program that reads two strings X and Y from the user followed by a number, either 1 or 2. Your main program should then print the edit distance between X and Y, using algorithm 1 if the number is 1, and algorithm 2 if the number is 2. For example, if the user types

        mickey
        mouse
        2
      
    then your program should compute the edit distance between "mickey" and "mouse" using algorithm 2 to compute the distance. Make the output sensible. It should not just be a number. Say what you computed.

    Assume that the two strings do not contain blanks or other white-space characters. So you can read the first string into an array of characters X using

        cin >> X;
      
    Test your program before continuing.

     

  5. [This part requires a compiler that supports type long long (64-bit integers) and that provides function gettimeofday. Not all systems support those. The Gnu C++ compiler on Unix does support them.] Add the following function to your program.

      #include <ctime>
      using namespace std;
    
      //=======================================================
      // currentTime() returns the number of microseconds that
      // have elapsed since the beginning of January 1, 1970.
      //=======================================================
    
      long long currentTime()
      {
        timeval tv;
        gettimeofday(&tv, NULL);
        return ((long long) tv.tv_sec)*1000000 + tv.tv_usec;
      }
      
    Now modify your main program so that it gets the current time just before computing the distance and again just after computing the distance. Take the difference between those times to get the number of microseconds used. After reporting the edit distance, your program should now report how many microseconds were used. Do not just write a number. Make the output clear and readable. Test your program. Do the numbers look reasonable?

     

  6. Now time the algorithms. Try each on pairs of strings of the same length, where the length is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. That means you will make 20 tests, since you need to test each algorithm. Choose strings X and Y that do not have any characters in common. For example, try both algorithms on "a" and "b". Then try them on "ab" and "cd". Keep trying longer strings. Record your results. Draw a graph to show the times of the two algorithms on strings of equal length, as a function of the length. You are not required to use graph paper, but produce the neatest work that you can.

    Which algorithm is more efficient? Is the difference significant? What do you think will happen if you use strings of length 20?


What to turn in

Turn in your finished program. It should contain functions distance1 and distance 2, a testing main program that has been commented out, and a main program that reads information from the user, as explained above.

Also turn in your timing data and a graph of the time functions for the two algorithms.


Remark

Recursion, as tool for designing algorithms, acts in some ways as an amplifier. If you do something that is a little bit clever, then recursion can amplify it, and make it into a very efficient algorithm. On the other hand, if you do something bad, recursion will amplify that too, and make it into a very inefficient algorithm. The recursive edit distance algorithm has the problem that it computes something more than once. That is bad. Recursion amplifies that. You should see the consequences.