Che, Menglu
2020-12-18
2020-12-18
2020-12-18
2020-12-17
http://hdl.handle.net/10012/16578
Incomplete data often make estimation and inference difficult. In most cases, a complete case (CC) analysis leads to biased estimates or fails to achieve the desired estimation efficiency. In this thesis, we develop statistical methods for estimating regression parameters when covariates are missing. We are interested in improving estimation efficiency by incorporating information from the partially observed cases. Chapter 1 is an introduction to incomplete data problems and some existing estimation frameworks. We present the major tool we use to improve estimation efficiency, namely empirical likelihood for general estimating functions. A brief introduction to the problems we solve in the subsequent chapters is also provided. Chapter 2 considers a regression problem with covariates missing not at random, where the missingness depends on the missing covariate values. For this type of missingness, CC analysis leads to consistent estimation when the missingness is independent of the response given all covariates, but it may not have the desired level of efficiency. We propose a general empirical likelihood framework to improve estimation efficiency over CC analysis. We expand on the methods in Bartlett, Carpenter, Tilling & Vansteelandt (2014) and Xie & Zhang (2017). Instead of improving efficiency by modelling the missingness probability conditional on the response and fully observed covariates, our method allows the possibility of modelling other data distribution-related quantities. We also give guidelines on what quantities to model and demonstrate that our proposal has the potential to yield smaller biases than existing methods when the missingness probability model is incorrect. Simulation studies are presented, as well as an application to data collected from the US National Health and Nutrition Examination Survey.
Chapters 3 and 4 concern another type of incomplete data, namely the two-phase, response-dependent or outcome-dependent sample. This type of sampling is often used in regression settings that involve expensive covariate measurements. Conditional maximum likelihood (CML) is an attractive approach in many cases as it avoids modelling the covariate distribution, unlike full maximum likelihood. Moreover, it can handle zero selection probabilities in the Phase 2 sampling. In Chapter 3, we consider general regression models with either a discrete or continuous response. We show that the estimator of covariate effects proposed by Scott & Wild (2011) has the same asymptotic efficiency as two empirical likelihood estimators, and that these estimators dominate the CML estimator. Chapter 4 proposes a more general empirical likelihood method within the CML framework to incorporate the information in the Phase 1 sample and improve estimation efficiency. The proposed method exploits a model which only involves the fully observed variates. It maintains the ability to handle zero selection probabilities and avoids modelling the covariate distribution. The proposed methods exhibit improvement upon CML as well as the estimator of Scott & Wild (2011) considered in Chapter 3. In these two chapters, we compare the efficiencies of various estimators in simulation studies and illustrate the methodologies in a two-phase genetics study. Chapter 5 presents some additional discussion and some topics for future research. We summarize the key points in our framework utilizing auxiliary information to improve estimation efficiency. Some additional remarks are given on the issues of numerical implementation, model diagnosis, and model compatibility.
Finally, we discuss some topics for future research that are related to the methods considered in the thesis.
en
empirical likelihood
missing data
estimating equations
two-phase samples
Empirical Likelihood Methods for Some Incomplete Data Problems
Doctoral Thesis