A Study of Using Chinese Restaurant Process Mixture Models in Information Retrieval
Retrieval systems help users isolate relevant information from massive data collections. Usually, a user obtains useful information by submitting a query to such a system. One critical issue is that a query may have many subtopics. The Web query "apple products" is one example: it may indicate that the user wants Web pages about iPhones, or about products made from the fruit "apple". Determining which is relevant is difficult without feedback from the user. Query-specific clustering is one approach to discovering the relevant aspects of a query: relevant documents are grouped into clusters, and each cluster represents one relevant aspect of the query. We study Chinese restaurant process mixture models as the clustering algorithms in this approach. To the best of our knowledge, our work is the first to study such models in this context. Classical clustering models such as K-means and K-mixture Gaussian models must first guess the number of clusters, K, and then estimate the clusters from data. Chinese restaurant process mixture models can learn the number of clusters and the clusters themselves from data simultaneously. This thesis first reviews K-means, K-mixture Gaussian models and Bayesian K-mixture models. We then review Chinese restaurant process mixture models, which extend the Bayesian models so that K is not required to be finite. Among these mixture models, we pay particular attention to distance-dependent Chinese restaurant process mixture models, since they allow external pairwise measures to be used in modeling. Next, we propose two similarity-like measures for use with the Chinese restaurant process mixture models in information retrieval. We also review a Gibbs sampling scheme for both types of models. Finally, we test the models' performance experimentally on a pseudo-relevance feedback task via query expansion.
In this task, the top-retrieved documents are treated as relevant documents; we use the document collection from the Robust track of TREC 2004. We investigate the effectiveness of the Chinese restaurant process mixture models on three query sets, each of which contains 50 queries with relevance judgments. To confirm the robustness of these models, we conduct a sensitivity analysis of their hyper-parameters. Results show that the Chinese restaurant process mixture models perform better than the baseline models used in the feedback task, and that they are not sensitive to their hyper-parameters when these are reasonably selected. The proposed measures used in the distance-dependent Chinese restaurant process mixture models perform comparably. However, the proposed measures barely help these models outperform the standard Chinese restaurant process mixture models.
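
To illustrate how a Chinese restaurant process lets the number of clusters be learned rather than fixed in advance, the following is a minimal sketch of drawing one partition from the standard Chinese restaurant process prior. It is not the thesis's implementation: the function name `sample_crp`, the concentration parameter name `alpha`, and the `seed` argument are assumptions made for this illustration, and the sketch covers only the prior, not the mixture likelihood or the Gibbs sampler.

```python
import random


def sample_crp(n_customers, alpha, seed=None):
    """Draw one seating arrangement from a Chinese restaurant process prior.

    Hypothetical helper for illustration; alpha must be positive.
    Returns (assignments, table_sizes), where assignments[i] is the table
    (cluster) index of customer i.
    """
    rng = random.Random(seed)
    assignments = []  # table index chosen by each customer
    table_sizes = []  # number of customers currently at each table

    for _ in range(n_customers):
        # An existing table k is chosen with weight table_sizes[k];
        # a new table is opened with weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)  # open a new table (new cluster)
        else:
            table_sizes[k] += 1    # join an existing table
        assignments.append(k)

    return assignments, table_sizes


if __name__ == "__main__":
    assignments, sizes = sample_crp(n_customers=50, alpha=1.0, seed=0)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

In this sketch the number of tables is an outcome of the draw, not an input, which is the property that distinguishes these models from K-means and K-mixture models. Larger values of `alpha` tend to produce more tables. In the distance-dependent variant, customers link to other customers with weights that depend on a pairwise measure, rather than choosing tables by size.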