Statistical inference and learning with incomplete data


Date
Apr 3, 2024
Event
PhD Public Lecture
Location
University of Western Ontario

Abstract
Missing data arise commonly in applications, and research on this topic has received extensive attention over the past few decades. Numerous inference methods have been developed under different missing data mechanisms, including missing completely at random, missing at random, and missing not at random. However, determining which missing data mechanism is plausible is difficult due to the lack of validation data. The challenge is further complicated by the presence of spurious variables among the covariates. By utilizing newly emerging techniques, we explore new avenues in missing data analysis. This thesis aims to contribute fresh insights into statistical inference with incomplete data and to provide valid methods that address some existing research gaps.

Focusing on missingness in the response variable, the first project proposes a unified framework to address the missingness effects. Using the generalized linear model to describe the dependence of the response on the associated covariates, we develop simultaneous estimation and variable selection procedures based on regularized likelihood, and we rigorously establish the asymptotic properties of the resulting estimators. The proposed methods offer flexibility and generality, eliminating the need to assume a specific missing data mechanism, a requirement in most available methods. Empirical studies demonstrate the satisfactory performance of the proposed methods in finite sample settings. Furthermore, the project outlines extensions to accommodate missingness in both the response and covariates.
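
For orientation, regularized likelihood estimation with variable selection generally takes the form sketched below. This is the generic textbook objective, not the specific objective of the thesis, which is not stated in this abstract. The symbols are assumptions introduced for illustration: \ell_n is a log-likelihood for the regression coefficients \beta based on n observations, p_\lambda is a sparsity-inducing penalty (for example, the lasso penalty p_\lambda(t) = \lambda t), and \lambda is a tuning parameter.

```latex
\hat{\beta}
  = \arg\max_{\beta \in \mathbb{R}^{p}}
    \left\{ \ell_n(\beta) - n \sum_{j=1}^{p} p_{\lambda}\bigl(|\beta_j|\bigr) \right\}
```

Coefficients estimated as exactly zero correspond to covariates dropped from the model, which is how estimation and variable selection can be carried out in a single step.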

The second project approaches missing data from a different perspective by placing the problem within the framework of statistical machine learning, with a specific emphasis on boosting techniques. Despite the increasing attention boosting has received, many advances in this area have focused primarily on numerical implementation procedures, with relatively limited theoretical work. Moreover, existing boosting approaches are predominantly designed for datasets with complete observations, and their validity is compromised by the presence of missing data. In this project, we employ semiparametric estimation approaches to develop unbiased boosting estimation methods for data with missing responses. We investigate several strategies to account for the missingness effects. The proposed methods are implemented using the functional gradient descent algorithm and are justified by theoretical properties, including convergence and consistency of the proposed estimators. Numerical studies confirm the satisfactory performance of the proposed methods in finite sample settings.
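
To make the idea of boosting with missing responses concrete, the following is a minimal sketch, not the method of the thesis. It assumes one common semiparametric adjustment, inverse probability weighting (IPW), with observation probabilities that are known; in practice they would have to be estimated. It uses squared-error loss, a single covariate, and regression stumps as base learners. All function and variable names (`fit_stump`, `ipw_l2_boost`, `prob_observed`) are hypothetical.

```python
import numpy as np

def fit_stump(x, residual, weights, thresholds):
    """Weighted least-squares regression stump on one covariate."""
    best = None
    for t in thresholds:
        left = x <= t
        w_left, w_right = weights[left].sum(), weights[~left].sum()
        if w_left == 0 or w_right == 0:
            continue  # no observed responses on one side of the split
        left_value = np.sum(weights[left] * residual[left]) / w_left
        right_value = np.sum(weights[~left] * residual[~left]) / w_right
        pred = np.where(left, left_value, right_value)
        loss = np.sum(weights * (residual - pred) ** 2)
        if best is None or loss < best[0]:
            best = (loss, t, left_value, right_value)
    if best is None:
        return thresholds[0], 0.0, 0.0  # fallback: a stump that changes nothing
    return best[1:]

def ipw_l2_boost(x, y, observed, prob_observed, n_rounds=300, step=0.1):
    """Functional gradient descent for IPW-weighted squared-error loss."""
    weights = observed / prob_observed           # zero weight when y is missing
    y_filled = np.where(observed == 1, y, 0.0)   # placeholder; weight is zero there
    thresholds = np.quantile(x, np.linspace(0.05, 0.95, 19))
    intercept = np.sum(weights * y_filled) / np.sum(weights)
    fitted = np.full(len(x), intercept)
    stumps = []
    for _ in range(n_rounds):
        residual = y_filled - fitted             # negative gradient of squared loss
        t, lv, rv = fit_stump(x, residual, weights, thresholds)
        fitted += step * np.where(x <= t, lv, rv)
        stumps.append((t, lv, rv))

    def predict(x_new):
        out = np.full(len(x_new), intercept)
        for t, lv, rv in stumps:
            out += step * np.where(x_new <= t, lv, rv)
        return out

    return predict

# Simulated example: responses are missing at random, depending on x.
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-2, 2, n)
y_full = np.sin(x) + rng.normal(0, 0.3, n)
prob_observed = 1 / (1 + np.exp(-(1 + x)))       # assumed known here
observed = (rng.uniform(size=n) < prob_observed).astype(float)
y = np.where(observed == 1, y_full, np.nan)      # missing responses stored as NaN

predict = ipw_l2_boost(x, y, observed, prob_observed)
grid = np.linspace(-1.8, 1.8, 50)
mean_abs_error = np.mean(np.abs(predict(grid) - np.sin(grid)))
print(mean_abs_error)
```

Each round fits a base learner to the negative gradient of the weighted loss and takes a small step in that direction, which is the functional gradient descent idea in its simplest form. Weighting observed cases by the inverse of their observation probability is what keeps the loss unbiased for the complete-data loss under missing at random.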

The third project delves deeper into boosting procedures in the context of interval-censored data, where the exact value of the response variable is not observed and is only known to fall within an interval. Such data arise commonly in survival analysis and other fields involving time-to-event outcomes, and they present a unique challenge in data analysis. Capitalizing on the censoring unbiased transformation, we propose a method based on interval-censored recursive forests, which allows unbiased boosting for both regression and classification problems. Preliminary numerical results show the promise of the proposed method. This work is ongoing.

Yuan Bian
Incoming Postdoc in Biostatistics