Statistical methods for analyzing missing covariate data

Date of Completion

January 2004






Missing covariate data arise in many settings, including surveys, clinical trials, and epidemiological, biological, and environmental studies. Large-scale studies often have large fractions of missing data, which can present serious problems for the data analyst. Motivated by real data applications, this dissertation addresses several aspects of modeling and analyzing data with missing covariates.

First, we propose Bayesian methods for estimating parameters in generalized linear models (GLMs) with nonignorably missing covariate data. We specify a parametric distribution for the response variable given the covariates (the GLM), a parametric distribution for the missing covariates, and a parametric multinomial selection model for the missing data mechanism. We then characterize general conditions for the propriety of the joint posterior distribution of the parameters and extend two model selection criteria, the weighted L measure and the Deviance Information Criterion (DIC), to model comparison in the presence of missing covariates.

Second, we develop a novel modeling strategy for analyzing data with repeated binary responses over time and time-dependent missing covariates. We use the generalized linear mixed logistic model for the repeated binary responses and propose a joint model for the time-dependent missing covariates using information from different sources. A Monte Carlo EM algorithm is developed for computing the maximum likelihood estimates. An extended version of the AIC is proposed to identify factors of interest that may disrupt the cyclical pattern of flowering.

Third, we develop an efficient Gibbs sampling algorithm to sample from the joint posterior distribution for the generalized linear mixed logistic model. Moreover, we propose a novel Monte Carlo method to compute a Bayesian model comparison criterion, DIC, for any variable subset model using a single Markov chain Monte Carlo sample from the full model, without sampling from the posterior distribution under each subset model. We conclude with a brief discussion of future research.
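
For readers unfamiliar with DIC, the following is a minimal sketch of the standard complete-data definition, DIC = D(beta_bar) + 2 p_D with p_D = mean(D) - D(beta_bar), where D is the deviance and beta_bar the posterior mean. It is not the dissertation's missing-covariate extension, nor its single-sample method for subset models. The one-covariate logistic model, the toy data, the function names, and the random "draws" standing in for MCMC output are all assumptions made for illustration.

```python
import math
import random


def deviance(beta, xs, ys):
    """Deviance D(beta) = -2 * log-likelihood of a one-covariate logistic model."""
    loglik = 0.0
    for x, y in zip(xs, ys):
        eta = beta[0] + beta[1] * x
        # Bernoulli log-likelihood with logit link: y*eta - log(1 + exp(eta))
        loglik += y * eta - math.log1p(math.exp(eta))
    return -2.0 * loglik


def dic(draws, xs, ys):
    """Return (DIC, p_D) computed from posterior draws of beta."""
    n = len(draws)
    dim = len(draws[0])
    # Posterior mean of the parameters
    beta_bar = [sum(d[j] for d in draws) / n for j in range(dim)]
    # Posterior mean of the deviance
    mean_dev = sum(deviance(d, xs, ys) for d in draws) / n
    dev_at_mean = deviance(beta_bar, xs, ys)
    # Effective number of parameters
    p_d = mean_dev - dev_at_mean
    return dev_at_mean + 2.0 * p_d, p_d


# Hypothetical toy data (fully observed covariate, binary response).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
ys = [0, 0, 1, 0, 1, 1]

# Placeholder draws: random values standing in for real MCMC output.
rng = random.Random(0)
draws = [(rng.gauss(0.0, 0.3), rng.gauss(1.0, 0.3)) for _ in range(500)]

dic_value, p_d = dic(draws, xs, ys)
print(f"DIC = {dic_value:.3f}, p_D = {p_d:.3f}")
```

When comparing candidate models fitted to the same data, a smaller DIC is preferred. In practice the draws would come from a sampler such as the Gibbs algorithm described above rather than from the placeholder generator used here.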