Incomplete data arise commonly in applications and have received extensive research attention over the past few decades. Numerous inference methods have been developed to address various issues related to incomplete data, such as different types of missing observations and distinct missing data mechanisms, which are often classified as missing completely at random, missing at random, and missing not at random. However, research gaps remain.
Determining a plausible missing data mechanism is typically difficult because validation data are lacking, and the presence of spurious variables among the covariates further complicates the task. Prediction in the presence of incomplete data is another area worth exploring. Using newly emerging techniques, we explore new avenues in the analysis of incomplete data. This thesis aims to contribute fresh insights into statistical inference with incomplete data and to provide valid methods that address several existing research gaps.
Focusing on missingness in the response variable, the first project proposes a unified framework to address the effects of missing data. Using the generalized linear model to characterize the dependence of the response on associated covariates, we develop simultaneous estimation and variable selection procedures based on regularized likelihood. We rigorously establish the asymptotic properties of the resultant estimators. The proposed methods offer flexibility and generality, eliminating the need to assume a specific missing data mechanism, which most available methods require. Empirical studies demonstrate the satisfactory performance of the proposed methods in finite sample settings. Furthermore, the project outlines extensions to accommodate missingness in both the response and covariates.
The second problem of interest approaches missing data from a different perspective by placing it within the framework of statistical machine learning, with a specific emphasis on boosting techniques; it gives rise to two projects. Despite the increasing attention that boosting has received, many advances in this area have focused on numerical implementation procedures, with relatively limited theoretical work. Moreover, existing boosting approaches are predominantly designed for datasets with complete observations, and their validity is compromised by the presence of missing data. In this thesis, we employ semiparametric estimation approaches to develop unbiased boosting estimation methods for data with missing responses. We investigate several strategies to account for the missingness effects. The proposed methods are implemented using the functional gradient descent algorithm and are justified by established theoretical properties. Numerical studies confirm the satisfactory performance of the proposed methods in finite sample settings.
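To make the idea of boosting with missing responses concrete, the following is a minimal sketch rather than the procedure developed in the thesis. It assumes one common semiparametric adjustment, inverse probability weighting (IPW), applied to the squared-error loss, with regression stumps as base learners in functional gradient descent. The function names (`fit_stump`, `ipw_l2_boost`, `predict`) and the arguments `delta` (1 if the response is observed, 0 otherwise) and `pi` (the probability of observing the response, treated here as known) are hypothetical; in practice `pi` would typically have to be estimated.

```python
import numpy as np


def fit_stump(x, residual, w):
    """Weighted least-squares stump: returns (feature, threshold, left, right)."""
    best = None
    for j in range(x.shape[1]):
        # Drop the largest value so the right-hand side is never empty.
        for t in np.unique(x[:, j])[:-1]:
            left = x[:, j] <= t
            wl, wr = w[left].sum(), w[~left].sum()
            if wl == 0 or wr == 0:
                continue  # no observed responses on one side of the split
            cl = np.dot(w[left], residual[left]) / wl
            cr = np.dot(w[~left], residual[~left]) / wr
            pred = np.where(left, cl, cr)
            loss = np.dot(w, (residual - pred) ** 2)
            if best is None or loss < best[0]:
                best = (loss, j, t, cl, cr)
    if best is None:
        return 0, np.inf, 0.0, 0.0  # fallback: a stump that adds nothing
    return best[1:]


def ipw_l2_boost(x, y, delta, pi, n_steps=50, lr=0.1):
    """Functional gradient descent on the IPW-weighted squared-error loss."""
    w = delta / pi                        # weight is 0 where the response is missing
    y0 = np.where(delta == 1, y, 0.0)     # placeholder value; never used (weight 0)
    f0 = np.dot(w, y0) / w.sum()          # weighted mean as the initial fit
    f = np.full(len(y0), f0)
    stumps = []
    for _ in range(n_steps):
        residual = y0 - f                 # negative gradient of 0.5 * (y - f)^2
        j, t, cl, cr = fit_stump(x, residual, w)
        f += lr * np.where(x[:, j] <= t, cl, cr)
        stumps.append((j, t, cl, cr))
    return f0, lr, stumps


def predict(model, x):
    """Evaluate the boosted fit at new covariate values."""
    f0, lr, stumps = model
    f = np.full(x.shape[0], f0)
    for j, t, cl, cr in stumps:
        f += lr * np.where(x[:, j] <= t, cl, cr)
    return f
```

The weights `delta / pi` give each observed response extra influence in proportion to how unlikely it was to be observed, so the weighted loss targets the same quantity as the full-data loss when `pi` is correctly specified. Other adjustments, such as imputation-based or augmented (doubly robust) losses, would change only how the residuals and weights are formed.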
The third topic further explores boosting procedures in the context of interval censored data, where the exact value of the response variable is unobserved and is known only to fall within an interval. Such data commonly arise in survival analysis and other fields involving time-to-event outcomes, and they present a unique challenge in data analysis. In this project, we develop boosting methods for both regression and classification problems with interval censored data. We address the censoring effects by adjusting the loss functions or imputing transformed responses. The proposed methods are implemented using a functional gradient descent algorithm, and we rigorously establish their theoretical properties, including mean squared error tradeoffs and the optimality of the proposed estimators. Numerical studies are conducted to assess the performance of the proposed methods in finite sample settings.