Click to edit Master text styles
Second level
Third level
Fourth level
Fifth level
‹header›
‹date/time›
‹footer›
‹#›
My project involves a spell checker that
works on a probabilistic basis.
These are the topics I’m going to
cover. There aren’t too many slides,
but I’m going to go through them really fast so that I can get to the demo
which doesn’t run so well on paper.
I feel obligated to present this
slide. I’m going to use the technical
term “good” for the correct spelling of a word and “bad” for the misspelled
version of the word. The equation on
the lower right needs to be solved by searching through all the good words for
the maximum value of the expression with the help of a dictionary and a
misspeller, which is also a technical term.
The dictionaries I am using have been
downsampled from a large dictionary that includes approximately 25,000 words
of frequency greater than one taken from a corpus of over 6 million
words. To do this you have to more or
less reconstitute the corpus and resample a smaller number of words. This process preserves the frequency
distributions. Various problems arise
if this is not done, such as inconsistencies when the dictionary is trained
and correction results that aren’t comparable across dictionary sizes.
The density of English words increases
with the number of words in the dictionary.
This reduces both storage and search costs, but it means that the
number of lexical neighbors that a word has increases, making spell correction
more difficult.
The current incarnation of the misspeller
is based on the minimum edit distance algorithm for computational
efficiency. It has been modified to
work with probabilities, although not completely, so that is what you would
want to grill me on. The program starts
at the origin and follows arcs to the corner, consuming the good word and
producing the bad one. So, someone
might type a ‘b’ for a ‘g’ and get charged for a substitution error. Another path would include the deletion of
the ‘g’ and subsequent insertion of the ‘b’.
Where arcs meet, probabilities are added. Where arcs leave a node, probabilities are multiplied. One can compute a total probability and a
best path probability this in this way.
You might notice that ‘good’ is
the word ‘goo’ with a ‘d’ added and ‘goo’ is the word ‘go’ with an ‘o’ added. In performing the calculations for the
longer words, values for ‘go’ and ‘goo’ can be reused. This is why I mentioned once that we only
have to correct the ends of words.
There is a strange rule here called
vowel change. I don’t intend the system
to discover rules automatically, not even based on a template as with the
Brill tagger. Instead, they will be
made by some practitioner of linguistics or a teacher of second language
learners.
Arc probabilities can be trained by reinforcing the paths
corresponding to the kind of typo that the user makes.
Here are some results showing the effects
of training. The vertical scale is the
average rank of a misspelled word in the list of suggested words, so lower is
better. You can see that more training
helps, but that larger dictionaries are harder to train.
This kind of spell checker might be used
to correct real word errors which might be quite prevalent as dictionary size
increases. For instance, a large
dictionary might include the words ‘very’ and ‘veery’ (a kind of bird). How well this works depends on the accuracy
of the typist. For a very poor typist,
all words will be converted to ‘the’.
From this graph it is difficult to tell what will happen, but it
doesn’t rule out the possibility.
I have two little demonstrations. The first shows how the system can be
trained differently for different typists.
The second shows how a slightly trained version works.