Similarity-Based Estimation of Word Cooccurrence Probabilities

Ido Dagan (AT&T Bell Laboratories, Murray Hill, NJ 07974, USA), Fernando Pereira (AT&T Bell Laboratories, Murray Hill, NJ 07974, USA), Lillian Lee (DAS, Harvard University, Cambridge MA 02138, USA)

arxiv: cmp-lg/9405001 · v1 · submitted 1994-05-02 · cmp-lg · cs.CL

keywords: word · combinations · determine · unseen · bigrams · combination · corpus · given

In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
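The idea of borrowing probability mass from "most similar" words can be illustrated with a small sketch. This is not the paper's model: the abstract gives no formulas, so the code below assumes a plain similarity-weighted average of maximum-likelihood bigram estimates. The toy corpus, the function names (`train_bigrams`, `mle`, `similarity_estimate`), and the hand-set similarity weights are all hypothetical; the paper derives similarity from distributional statistics and embeds the estimate in a variant of Katz's back-off model, neither of which is reproduced here.

```python
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count bigrams: counts[w1][w2] = number of times w2 follows w1."""
    counts = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1
    return counts


def mle(counts, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 if w1 was never seen."""
    following = counts.get(w1, Counter())
    total = sum(following.values())
    return following[w2] / total if total else 0.0


def similarity_estimate(counts, similar, w1, w2):
    """Assumed scheme: average P(w2 | n) over the neighbours n of w1,
    weighted by the (hypothetical) similarity weights in similar[w1]."""
    neighbours = similar.get(w1, {})
    norm = sum(neighbours.values())
    if norm == 0:
        return 0.0
    weighted = sum(weight * mle(counts, n, w2) for n, weight in neighbours.items())
    return weighted / norm


# Toy corpus in which "eat peach" never occurs but "devour peach" does.
tokens = "eat apple devour peach devour apple".split()
counts = train_bigrams(tokens)

# Hand-set similarity weights, for illustration only.
similar = {"eat": {"devour": 1.0}}

unseen_mle = mle(counts, "eat", "peach")
peach_estimate = similarity_estimate(counts, similar, "eat", "peach")
beach_estimate = similarity_estimate(counts, similar, "eat", "beach")
```

In this toy setting the maximum-likelihood estimate for the unseen bigram "eat peach" is zero, while the similarity-weighted estimate is positive because the similar word "devour" has been seen before "peach". The estimate for "eat beach" stays at zero, which mirrors the peach/beach contrast in the abstract.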
