High-dimensional discriminant analysis and covariance matrix estimation
MetadataShow full item record
Statistical analysis in high-dimensional settings, where the data dimension p is close to or larger than the sample size n, has been an intriguing area of research. Applications include gene expression data analysis, financial economics, text mining, and many others. Estimating large covariance matrices is an essential part of high-dimensional data analysis because of the ubiquity of covariance matrices in statistical procedures. The estimation is also a challenging part, since the sample covariance matrix is no longer an accurate estimator of the population covariance matrix in high dimensions. In this thesis, a series of matrix structures, that facilitate the covariance matrix estimation, are studied. Firstly, we develop a set of innovative quadratic discriminant rules by applying the compound symmetry structure. For each class, we construct an estimator, by pooling the diagonal elements as well as the off-diagonal elements of the sample covariance matrix, and substitute the estimator for the covariance matrix in the normal quadratic discriminant rule. Furthermore, we develop a more general rule to deal with nonnormal data by incorporating an additional data transformation. Theoretically, as long as the population covariance matrices loosely conform to the compound symmetry structure, our specialized quadratic discriminant rules enjoy low asymptotic classification error. Computationally, they are easy to implement and do not require large-scale mathematical programming. Then, we generalize the compound symmetry structure by considering the assumption that the population covariance matrix (or equivalently its inverse, the precision matrix) can be decomposed into a diagonal component and a low-rank component. The rank of the low-rank component governs to what extent the decomposition can simplify the covariance/precision matrix and reduce the number of unknown parameters. In the estimation, this rank can either be pre-selected to be small or controlled by a penalty function. Under moderate conditions on the population covariance/precision matrix itself and on the penalty function, we prove some consistency results for our estimator. A blockwise coordinate descent algorithm, which iteratively updates the diagonal component and the low-rank component, is then proposed to obtain the estimator in practice. In the end, we consider jointly estimating large covariance matrices of multiple categories. In addition to the aforementioned diagonal and low-rank matrix decomposition, it is further assumed that there is some common matrix structure shared across the categories. We assume that the population precision matrix of category k can be decomposed into a diagonal matrix D, a shared low-rank matrix L, and a category-specific low-rank matrix Lk. The assumption can be understood under the framework of factor models --- some latent factors affect all categories alike while others are specific to only one of these categories. We propose a method that jointly estimates the precision matrices (therefore, the covariance matrices) --- D and L are estimated with the entire dataset whereas Lk is estimated solely with the data of category k. An AIC-type penalty is applied to encourage the decomposition, especially the shared component. Under certain conditions on the population covariance matrices, some consistency results are developed for the estimators. The performances in finite dimensions are shown through numerical experiments. Using simulated data, we demonstrate certain advantages of our methods over existing ones, in terms of classification error for the discriminant rules and Kullback--Leibler loss for the covariance matrix estimators. The proposed methods are also applied to real life datasets, including microarray data, stock return data and text data, to perform tasks, such as distinguishing normal from diseased tissues, portfolio selection and classifying webpages.
Cite this work
Yilei Wu (2017). High-dimensional discriminant analysis and covariance matrix estimation. UWSpace. http://hdl.handle.net/10012/12774