|
|
|
The current incarnation of the misspeller is based on the minimum edit distance algorithm for computational efficiency. It has been modified to work with probabilities, although not completely, so that is what you would want to grill me on. The program starts at the origin and follows arcs to the opposite corner, consuming the good word and producing the bad one. So, someone might type a ‘b’ for a ‘g’ and get charged for a substitution error. Another path would include the deletion of the ‘g’ and subsequent insertion of the ‘b’. Where arcs meet, probabilities are added. Along each arc leaving a node, the node's probability is multiplied by the arc's probability. One can compute a total probability and a best path probability in this way.
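The lattice walk described above can be sketched roughly as follows. This is only an illustration, not the misspeller's actual code: the function name `lattice_probabilities` and the four arc weights (`p_match`, `p_sub`, `p_del`, `p_ins`) are made-up placeholders, and the weights are not normalized into a proper distribution. For the best path, the sum where arcs meet is swapped for a maximum.

```python
def lattice_probabilities(good, bad, p_match=0.9, p_sub=0.02,
                          p_del=0.04, p_ins=0.04):
    """Return (total, best_path) probability of turning `good` into `bad`.

    The arc weights are hypothetical placeholders, not trained values.
    Node (i, j) means: i letters of the good word consumed,
    j letters of the bad word produced.
    """
    rows, cols = len(good) + 1, len(bad) + 1
    total = [[0.0] * cols for _ in range(rows)]
    best = [[0.0] * cols for _ in range(rows)]
    total[0][0] = best[0][0] = 1.0  # the origin

    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            incoming = []  # (source row, source column, arc probability)
            if i > 0:
                incoming.append((i - 1, j, p_del))   # delete a good letter
            if j > 0:
                incoming.append((i, j - 1, p_ins))   # insert a bad letter
            if i > 0 and j > 0:
                same = good[i - 1] == bad[j - 1]
                incoming.append((i - 1, j - 1, p_match if same else p_sub))
            # Multiply along each arc; add (or take the max) where arcs meet.
            total[i][j] = sum(total[a][b] * p for a, b, p in incoming)
            best[i][j] = max(best[a][b] * p for a, b, p in incoming)

    return total[-1][-1], best[-1][-1]


total_p, best_p = lattice_probabilities("g", "b")
```

With these placeholder weights, typing ‘b’ for ‘g’ has three paths: one substitution, deletion followed by insertion, and insertion followed by deletion. The best path is the substitution, and the total adds the other two paths on top of it.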
|
|
You might notice that ‘good’ is the word ‘goo’ with a ‘d’ added and ‘goo’ is the word ‘go’ with an ‘o’ added. In performing the calculations for the longer words, values for ‘go’ and ‘goo’ can be reused. This is why I mentioned once that we only have to correct the ends of words.
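One way to picture the reuse, as a sketch under assumptions: keep one row of the lattice per letter of the good word, so that the row computed for ‘go’ is the starting point for ‘goo’, and that one for ‘good’. The helper names (`first_row`, `extend_row`, `score_prefix_chain`) and the arc weights are hypothetical, chosen only for this example.

```python
# Hypothetical arc weights, for illustration only.
P_MATCH, P_SUB, P_DEL, P_INS = 0.9, 0.02, 0.04, 0.04


def first_row(bad):
    """Row for the empty good word: every bad letter must be inserted."""
    row = [1.0]
    for _ in bad:
        row.append(row[-1] * P_INS)
    return row


def extend_row(prev_row, good_char, bad):
    """Add one letter to the good word, reusing the previous row."""
    row = [prev_row[0] * P_DEL]
    for j, bad_char in enumerate(bad, start=1):
        diag = P_MATCH if good_char == bad_char else P_SUB
        row.append(prev_row[j] * P_DEL        # delete good_char
                   + row[j - 1] * P_INS       # insert bad_char
                   + prev_row[j - 1] * diag)  # match or substitute
    return row


def score_prefix_chain(good, bad):
    """Total probability of `bad` for every prefix of `good`, in one pass."""
    row = first_row(bad)
    scores = {}
    for k, ch in enumerate(good, start=1):
        row = extend_row(row, ch, bad)  # only the new last letter costs work
        scores[good[:k]] = row[-1]
    return scores


scores = score_prefix_chain("good", "god")
```

Here `scores` holds a value for ‘g’, ‘go’, ‘goo’ and ‘good’, and each longer word costs only one extra row because the earlier rows are carried forward.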
|
|
There is a strange rule here called vowel change. I don’t intend the system to discover rules automatically, not even based on a template as with the Brill tagger. Instead, the rules will be written by a practitioner of linguistics or a teacher of second language learners.
|
|
Arc probabilities can be trained by reinforcing the paths corresponding to the kind of typo that the user makes.
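A very simple way to read "reinforcing" is as counting: each arc on the path that explains an observed typo gets its count bumped, and the counts are then renormalized into probabilities. The sketch below assumes exactly that, with one global count per arc type; the names `reinforce` and `arc_probabilities` and the starting counts are invented for the example, and the real training scheme may well differ.

```python
from collections import Counter


def reinforce(counts, path, amount=1.0):
    """Bump the count of every arc on the path that explains a typo."""
    for arc in path:
        counts[arc] += amount


def arc_probabilities(counts):
    """Renormalize the counts so the arc probabilities sum to one."""
    total = sum(counts.values())
    return {arc: count / total for arc, count in counts.items()}


# Start every arc type at one so nothing has zero probability (an assumption).
counts = Counter({"match": 1.0, "substitute": 1.0,
                  "delete": 1.0, "insert": 1.0})

# The user typed 'bood' for 'good': one substitution, then three matches.
reinforce(counts, ["substitute", "match", "match", "match"])
probs = arc_probabilities(counts)
```

After this single observation the substitution arc has moved from one quarter of the mass to 2 out of 8, while deletion and insertion have dropped to 1 out of 8 each.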
|