Research Projects

I am broadly interested in statistics and biostatistics, with a current focus on statistical learning, precision medicine, integrative analysis, behavioral task analysis, missing data, survival data, conformal inference, nonparametric and semiparametric statistics, and applications in mental health and diabetes. This page provides an overview of the main projects I have worked on and am currently pursuing. Click on each project title for more details, including related papers and presentations. If you prefer to view my papers by year, please see this link.


Learning Individualized Treatment Rules from Multi-Study and Multi-Site
This work introduces statistical methodologies for integrating data from multiple randomized controlled trials, which may share a common treatment arm but differ in their alternative treatment options, to learn more robust and reliable individualized treatment rules. In addition, this work develops methods to integrate electronic health records from multiple hospitals while respecting data privacy and institutional sharing constraints, enabling more comprehensive and generalizable analyses across heterogeneous healthcare systems.
Characterizing Decision-Making Dynamics and Brain Activities
This work develops advanced statistical modeling techniques to investigate decision-making in behavioral tasks, integrate data across multiple tasks, uncover latent neural dynamics, and elucidate brain–behavior relationships. The overarching goal is to identify and quantify distinctive decision-making patterns associated with mental health conditions and varied cognitive strategies. Our primary focus is on the Establishing Moderators and Biosignatures of Antidepressant Response in Clinical Care (EMBARC) and Adolescent Brain Cognitive Development (ABCD) studies.
Boosting Learning with Incomplete Data
This work extends boosting methods by developing multiple strategies to handle missing and interval-censored response variables. These strategies are implemented using functional gradient descent, and we establish rigorous theoretical guarantees. Numerical studies demonstrate that the proposed methods perform well in finite-sample settings.
Hypothesis Testing with Genetic Variants in Longitudinal and Time-to-Event Data
This work investigates joint models for genetic association involving longitudinal biomarkers and time-to-event outcomes. We develop and validate a closed-form sample size formula for assessing the overall effects of single-nucleotide polymorphisms. To improve robustness against model misspecification due to nonlinear trajectories in the longitudinal traits, we incorporate spline functions to capture subject-specific nonlinear evolutions. This study is motivated by and applied to data from the Diabetes Control and Complications Trial (DCCT).
Inference and Variable Selection with Missing Data
This work addresses key challenges in missing data analysis by proposing a unified modeling framework based on generalized additive models. The framework flexibly accommodates a wide range of missing data mechanisms without relying on strong parametric assumptions. To enable simultaneous estimation and variable selection, it incorporates a regularized likelihood approach. The proposed method is supported by rigorous theoretical guarantees and demonstrates strong empirical performance across a variety of settings.
Causal Inference with Artificial Intelligence
This work examines the causal impact of stroke and myocardial infarction on self-rated health, revealing a negative effect on individuals’ perceptions of having very good or excellent health and highlighting the importance of interventions to support subjective health in aging populations.
Assessing Physician Performance in Critical Care
This work evaluates ICU physician performance using tree ensemble methods with TreeSHAP. To assess physician effects, it addresses confounding and achieves covariate balance through propensity weighting with parametric models and super learning methods.
Examining COVID-19 Incubation Times
This work explores several models for analyzing COVID-19 incubation times from different angles, finding that the currently recommended 14-day quarantine period is not long enough to keep the probability of releasing infected individuals early small.
Identifying the Popularity of TED Talks
This work uses network analysis, principal component analysis, and LASSO to predict TED Talk popularity, revealing that Number of Languages significantly influences popularity and that publishing on weekends, particularly Saturdays, enhances idea dissemination.
Analysing Randomized Controlled Trials with Missing Response
This work evaluates the efficacy of Diclectin for nausea and vomiting during pregnancy (NVP) using data from a double-blind randomized controlled trial. It finds that statistical inferences about the drug's effectiveness depend on the choice of missing data methods and models, with results suggesting that its benefit is not clinically significant under the pre-specified minimal clinically important difference.