Click to edit Master text styles

Second level

Third level

Fourth level

Fifth level

‹header›

‹date/time›

‹footer›

‹#›

My project involves a spell checker that works on a probabilistic basis.

These are the topics I’m going to cover. There aren’t too many slides, but I’m going to go through them really fast so that I can get to the demo which doesn’t run so well on paper.

I feel obligated to present this slide. I’m going to use the technical term “good” for the correct spelling of a word and “bad” for the misspelled version of the word. The equation on the lower right needs to be solved by searching through all the good words for the maximum value of the expression with the help of a dictionary and a misspeller, which is also a technical term.

The dictionaries I am using have been downsampled from a large dictionary that includes approximately 25,000 words of frequency greater than one taken from a corpus of over 6 million words. To do this you have to more or less reconstitute the corpus and resample a smaller number of words. This process preserves the frequency distributions. Various problems arise if this is not done, such as inconsistencies when the dictionary is trained and correction results that aren’t comparable across dictionary sizes.

The density of English words increases with the number of words in the dictionary. This reduces both storage and search costs, but it means that the number of lexical neighbors that a word has increases, making spell correction more difficult.

The current incarnation of the misspeller is based on the minimum edit distance algorithm for computational efficiency. It has been modified to work with probabilities, although not completely, so that is what you would want to grill me on. The program starts at the origin and follows arcs to the corner, consuming the good word and producing the bad one. So, someone might type a ‘b’ for a ‘g’ and get charged for a substitution error. Another path would include the deletion of the ‘g’ and subsequent insertion of the ‘b’. Where arcs meet, probabilities are added. Where arcs leave a node, probabilities are multiplied. One can compute a total probability and a best path probability this in this way. You might notice that ‘good’ is the word ‘goo’ with a ‘d’ added and ‘goo’ is the word ‘go’ with an ‘o’ added. In performing the calculations for the longer words, values for ‘go’ and ‘goo’ can be reused. This is why I mentioned once that we only have to correct the ends of words. There is a strange rule here called vowel change. I don’t intend the system to discover rules automatically, not even based on a template as with the Brill tagger. Instead, they will be made by some practitioner of linguistics or a teacher of second language learners. Arc probabilities can be trained by reinforcing the paths corresponding to the kind of typo that the user makes.

Here are some results showing the effects of training. The vertical scale is the average rank of a misspelled word in the list of suggested words, so lower is better. You can see that more training helps, but that larger dictionaries are harder to train.

This kind of spell checker might be used to correct real word errors which might be quite prevalent as dictionary size increases. For instance, a large dictionary might include the words ‘very’ and ‘veery’ (a kind of bird). How well this works depends on the accuracy of the typist. For a very poor typist, all words will be converted to ‘the’. From this graph it is difficult to tell what will happen, but it doesn’t rule out the possibility.

I have two little demonstrations. The first shows how the system can be trained differently for different typists. The second shows how a slightly trained version works.