Lexical Affinities and Language Applications
Understanding interactions among words is fundamental for natural language applications. However, many statistical NLP methods still ignore this important characteristic of language; information retrieval models, for example, still assume word independence. This work focuses on the creation of lexical affinity models and their application to natural language problems. The thesis develops two approaches for computing lexical affinity. In the first, co-occurrence frequency is calculated by point estimation. The second uses parametric models for co-occurrence distances.
For the point estimation approach, we study several alternative methods for computing the degree of affinity from point estimates of co-occurrence frequency. We propose two new point estimators for co-occurrence and evaluate the measures and the estimation procedures on synonym questions. In our evaluation, synonyms are checked both directly, by their co-occurrence, and indirectly, using other lexical units as supporting evidence.
For the parametric approach, we build lexical affinity models from two parametric models of co-occurrence distance: an independence model and an affinity model. The independence model is based on the geometric distribution; the affinity model is based on the gamma distribution. Both are fit to the data by maximum likelihood. Two measures of affinity are derived from these parametric models and applied to the synonym questions, resulting in the best absolute performance on these questions by a method not trained for the task.
We also explore the use of lexical affinity in information retrieval tasks. We propose a new method that uses lexical affinities to score missing terms. In particular, we adapt two probabilistic scoring functions for information retrieval, one for document retrieval and one for passage retrieval, so that all query terms can be scored. Our new method, using replacement terms, shows significant improvement over the original methods.
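As a rough illustration of the parametric approach summarized above, the following sketch fits a geometric distribution (the independence model) and a gamma distribution (the affinity model) to a list of co-occurrence distances by maximum likelihood. It is a minimal sketch under stated assumptions, not the procedure used in the thesis: the function names, the toy distances, and the use of a ternary search over the gamma shape parameter are all hypothetical choices made for this example, and the two affinity measures derived in the thesis are not reproduced here.

```python
import math


def fit_geometric(distances):
    """MLE for a geometric distribution on {1, 2, ...}: p_hat = 1 / mean."""
    n = len(distances)
    total = sum(distances)
    p = n / total
    # Log-likelihood: n*log(p) + (total - n)*log(1 - p).
    # The second term vanishes when every distance equals 1 (p == 1).
    loglik = n * math.log(p)
    if total > n:
        loglik += (total - n) * math.log(1.0 - p)
    return p, loglik


def gamma_loglik(distances, shape, scale):
    """Log-likelihood of positive distances under Gamma(shape, scale)."""
    return sum(
        (shape - 1.0) * math.log(d)
        - d / scale
        - shape * math.log(scale)
        - math.lgamma(shape)
        for d in distances
    )


def fit_gamma(distances, lo=1e-3, hi=1e3, iters=200):
    """Approximate gamma MLE (assumption: shape lies in [lo, hi]).

    For a fixed shape k, the likelihood-maximizing scale is mean / k, so
    only k needs a numerical search. The profile likelihood is unimodal
    in k, which is what makes a ternary search applicable.
    """
    mean = sum(distances) / len(distances)

    def objective(k):
        return gamma_loglik(distances, k, mean / k)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    shape = (lo + hi) / 2.0
    scale = mean / shape
    return shape, scale, gamma_loglik(distances, shape, scale)


# Toy data (hypothetical): distances, in tokens, between two words.
distances = [1, 2, 3, 4, 10]
p_hat, geometric_ll = fit_geometric(distances)
shape_hat, scale_hat, gamma_ll = fit_gamma(distances)
```

Comparing `geometric_ll` with `gamma_ll` for a given word pair gives one informal sense of whether the observed distances look more like independent placement or like attraction at short range. How such fits are turned into affinity measures is a design decision of the thesis that this sketch does not attempt to reproduce.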