Assigned: | Monday, November 16 |
Due: | Wednesd, December 2, 11:59pm |
There are three basic edit operations on a string.
cat cut (replace a by u) cute (insert e at the end)You can do the steps in a different order, but you cannot use fewer than two basic edit operations.
Your assignment is to write a Java program that reads two lines of text from the standard input, holding two strings. Trim any white space from the front and back of each string. (s.trim() yields the result of trimming s.) Then your program should show s and t and the edit distance between s and t. The output must be readable. Here is a sample of a readable output.
The edit distance between "cat" and "cute" is 2.
To derive an algorithm for the edit distance problem, we turn again to dynamic programming. Here are two key ideas.
It is not enough to compute the edit distance between strings s and t. We must compute the edit distance between every prefix of s and every prefix of t. Suppose s = s1…sm and t = t1…tn. Define d(i, j) to be the edit distance between s1…si and t1…tj. It is convenient to define s(i) to be s1…si and t(j) to be t1…tj. Then d(i, j) is defined to be the edit distance between s(i) and t(j).
It is important to store the answers for the prefixes so that they can be used to compute the values for longer prefixes. Eventually, we compute d(m, n), which is the edit distance between s and t, since s = s(m) and t = t(n).
Notice that
[Fact 1] d(i, 0) = i (since the only way to convert a length i string to a length 0 string is to remove each character, which uses i remove operations).
[Fact 2] d(0, j) = j (since the only way to convert a length 0 string to a length j string is to add each character, which uses i insert operations).
If i > 0 and j > 0, there are two cases.
[Fact 3] Suppose si = tj. Then the last characters of s(i) and t(j) are the same. So d(i,j) = d(i−1, j−1).
For example, suppose that s(6) = "flavor" and t(7) = "quarter". Since they both end on r, the edit distance between "flavor" and "quarter" is the same as the edit distance between "flavo" and "quarte". That is d(6,7) = d(5,6).
Suppose that si ≠ tj. Since the last characters of s(i) and t(j) do not match, that mismatch will need to be dealt with as some point. So we might as well deal with it now. Options for converting s(i) into t(j) are as follows.
Delete the last character of s(i), then convert s(i−1) to t(j). For example, one way to convert "flavo" to "quarte" is to remove the o from "flavo" and finish by converting "flav" to "quarte".
Add tj to the end of s(i) Then the last characters have been forced to match. Getting rid of the matched characters, it suffices to convert s(i) to t(j−1). For example, one way to convert "flavo" to "quarte" is to add e to the end of "flavoe", and then to convert "flavoe" to "quarte". But now the two strings end on the same character, so it suffices to convert "flavo" to "quart".
Replace the last character of s(i) by tj, then convert s(i−1) to t(j−1). For example, one way to convert "flavo" to "quarte" is to replace the o in "flavo" by e, which yields "flave". But now the two strings end on the same character. So finish by converting "flav" to "quart".
There are three options. Which one is the best? Why guess? We try them all and pick the one that yields the smallest number. Each option requires performing one basic edit operation and then continuing by converting different prefixes.
[Fact 4] If si ≠ tj then d(i, j) = 1 + min(d(i−1, j), d(i, j−1), d(i−1, j−1)).
The algorithm to find the edit distance between s = s1…sm and t = t1…tm starts by creating a two-dimensional array D of integers with m+1 rows and n+1 columns. Then it fills it in, with D[i][j] = d(i, j).
First fill in row 0 and column 0 using facts 1 and 2. Then fill in the rest of the values be rows, from top to bottom. (Top is index 1 and bottom is index m.) Fill in each row from left to right. For each entry, use either fact 3 or fact 4, whichever is appropriate.
Finally, the edit distance between s and t is found in D[m][n].
For extra credit, show a sequence of modified versions of s as you convert s into t. For example, if s = "forest" and t = "frost", then the program would show the following.
forest forst foost frost
Do this after you have filled in the D array. Keep track of the current version of s, after zero or more modifications (say, in variable current). Start at the lower right corner (D[m][n]) of the D array. So set i = m and j = m.
You can tell which operation was used to compute D[j][j]. If sj = tj then no operation was done, so move up and to the left (by decrementing i and j).
If si ≠ tj, then some basic edit operation must have been performed. Recompute which one was performed, just as you did when using fact 4 to fill in D[i][j]. But pay attention to which operation was selected in the min operation. (It suffices to try each possibility and to select any one that yields the answer found in D[i][j].) Alter variable current by performing the appropriate operation at index i. Change i and j to move to the appropriate location (the one that occurs in the min). For example, if a replace operation was selected, then replace current[i] with tj and subtract 1 for both i and j, since in that case the min operation selected D[i−1][j−1].
If you reach i = 0, then there are j remaining steps. If you reach j = 0, then there are i remaining steps. Try an example to see what they are. Stop when both i and j are 0.
To turn in your work, log into the Linux system, change your
directory for the one for assignment 6,
use the following command.
~abrahamsonk/3200/bin/submit 6 EditDistance.java
After submitting, you should receive confirmation that the
submission was successful. If you do not receive confirmation,
assume that the submission did not work.
Command
~abrahamsonk/3200/bin/submit 6
will show you what you have submitted for
assignment 6.
You can do repeated submissions. New submissions will replace old ones.
Late submissions will be accepted for one day after the due date. If you miss a late submission deadline by a microsecond, your work will not be accepted.