|dc.description.abstract||Different types of correlated data arise commonly in many studies and present considerable challenges in modeling and characterizing complex dependence structures. This thesis considers statistical issues in analyzing such data. Chapters 2-4 of the thesis develop models to account for complex dependence structures and propose new statistical inference methods. In particular, our attention focuses on using copula models and their variants to delineate association structures for dependent data. As "big data" finds increasingly varied applications in many fields, more and more data with irregular distributions emerge, calling for more flexible and robust nonparametric statistical methods. Chapters 5 and 6 of the thesis develop novel Bayesian nonparametric methods for sampling algorithms and regression models.
More specifically, in Chapter 2, we consider longitudinal data with a time-span, of which common examples include temperature and precipitation data. We utilize a vine copula model to account for the dependence among longitudinal responses; the joint distribution of responses is factorized as a product of marginal distributions and bivariate conditional copulas. To relieve the computational burden and concentrate on the structure of interest, we propose composite likelihood methods which divide the responses into time blocks and leave the connecting structure between time blocks unspecified. We explore the efficiency, robustness, model selection and prediction of our proposed methods through simulation studies. The proposed model is applied to analyze an Ontario temperature dataset.
In Chapter 3, we consider dependent data with a hierarchical structure. Analysis of such data is often challenging due to the complexity of modeling different dependence structures as well as the demand for intensive computational resources. To alleviate these issues, we propose a Bayesian hierarchical copula model (BHCM) to accommodate the hierarchical structures of the dependent data, where the subject-level dependence is facilitated by the copula-based model and the hierarchical structure is described using random dependence parameters. We introduce a layer-by-layer sampling scheme for conducting inferences. Our proposed BHCM enjoys the flexibility of modeling various complex association structures, while retaining manageable computation. Extensive simulation studies show that our proposed estimators outperform conventional likelihood-based estimators in finite sample settings. We apply the BHCM to analyze the Vertebral Column dataset from the UCI Machine Learning Repository.
In Chapter 4, we consider dependent data coming from multiple sources where we aim to group similar dependence structures together and then conduct model selection and parameter estimation based on copula models. We propose a mixture of Dirichlet process mixture copula model (M-DPM-CM) to identify similar dependence structures and select copula models, in which the model selection parameters and copula parameters are assigned a Dirichlet process prior. Simulation studies and data analysis are conducted to compare the M-DPM-CM to the conventional copula selection method using the AIC criterion. The results show that the M-DPM-CM can accurately recover the true grouping structure with a moderate sample size, and achieves more accurate model selection results than the conventional AIC method. The M-DPM-CM is also applied to analyze the Vertebral Column dataset used in Chapter 3 to obtain more insights into the dependence structures.
In Chapter 5, we focus on developing algorithms for sampling from a complex distribution. To remedy the limitations of Markov chain Monte Carlo (MCMC) algorithms, we propose a novel sampling method, called Polya tree Monte Carlo (PTMC). Our proposed PTMC method approximates the posterior Polya tree by the Monte Carlo method, and we justify it theoretically by showing that the approximated Polya tree posterior converges to the target distribution under regularity conditions.
We further propose a series of simple and efficient sampling algorithms that are useful for different scenarios. Extensive numerical studies are conducted to demonstrate the appealing performance of the proposed method, including its superiority to the usual MCMC algorithms, under various settings. The evaluation and comparison are carried out in terms of sampling efficiency, computational speed and the capacity to identify distribution modes.
In Chapter 6, we consider the topic of nonparametric regression models. The Polya tree (PT) based nearest neighbor regression model is introduced as a fully nonparametric regression method.
To approximate the true conditional probability measure of the response given the covariate value, we construct a PT-distributed probability measure of the response in the nearest neighborhood of the covariate value of interest. Our proposed method gives consistent and robust estimators, and has a faster convergence rate than kernel density estimation. We conduct extensive simulation studies and analyze the Combined Cycle Power Plant dataset to compare the performance of our method to other nonparametric or semi-parametric methods.
Summary remarks and discussion of future research topics are presented in Chapter 7.||en