Research

I am broadly interested in Statistics and Biostatistics, with a current focus on statistical learning, personalized medicine, computational psychiatry, integrative analysis, neuroimaging data modeling, and their applications in mental health and diabetes. My research has included work on incomplete data, noisy data, and conformal inference. This page provides an overview of the main projects I have worked on and am currently pursuing. If you’re interested, please click on each project title for more details, including related papers and presentations. Feel free to connect with me for further discussion. If you prefer to view my papers by year, please see this link.


Characterizing Decision-Making Dynamics and Brain Activities
This work develops advanced statistical modeling techniques to build efficient and scalable computational frameworks for investigating decision-making in behavioral tasks, integrating data across multiple tasks, uncovering latent neural dynamics, and elucidating brain–behavior relationships. The overarching goals are to identify and quantify distinctive decision-making patterns associated with mental health conditions and diverse cognitive strategies, to localize brain regions activated by specific tasks, and to determine the causal effects of brain activity on behavior.
Learning Individualized Treatment Rules from Multi-Study and Multi-Site
This work introduces statistical methodologies for integrating data from multiple randomized controlled trials, which may share a common treatment arm but differ in their alternative treatment options, to learn more robust and reliable individualized treatment rules. In addition, this work develops methods to integrate electronic health records from multiple hospitals while respecting data privacy and institutional sharing constraints, enabling more comprehensive and generalizable analyses across heterogeneous healthcare systems.
Boosting Learning with Incomplete Data
This work extends boosting methods by developing multiple strategies to handle missing and interval-censored response variables. These strategies are implemented using functional gradient descent, and we establish rigorous theoretical guarantees. Numerical studies demonstrate that the proposed methods perform well in finite-sample settings.
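As a rough illustration only, and not the estimators developed in this project, the sketch below shows generic functional gradient descent boosting with squared-error loss and decision stumps. Missing responses are encoded as `None` and simply dropped (a complete-case placeholder); the function names `fit_stump`, `boost`, and `predict` are hypothetical and chosen for this example.

```python
from typing import List, Optional, Tuple

Stump = Tuple[float, float, float]  # (threshold, left value, right value)


def fit_stump(x: List[float], r: List[float]) -> Stump:
    """Least-squares decision stump fitted to the current residuals r."""
    mean_r = sum(r) / len(r)
    best: Stump = (x[0], mean_r, mean_r)  # constant fallback if no split exists
    best_sse = float("inf")
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if sse < best_sse:
            best_sse, best = sse, (t, lm, rm)
    return best


def stump_value(s: Stump, xi: float) -> float:
    t, lm, rm = s
    return lm if xi <= t else rm


def boost(x: List[float], y: List[Optional[float]], n_rounds: int = 50, step: float = 0.1):
    """Functional gradient descent for the loss 0.5 * (y - f(x))^2."""
    # Assumption for illustration: keep only cases with an observed response.
    xs = [xi for xi, yi in zip(x, y) if yi is not None]
    ys = [yi for yi in y if yi is not None]
    f0 = sum(ys) / len(ys)          # initial constant fit
    fitted = [f0] * len(xs)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        # The negative gradient of the squared-error loss is the residual.
        resid = [yi - fi for yi, fi in zip(ys, fitted)]
        s = fit_stump(xs, resid)
        stumps.append(s)
        # Take a small step in the direction of the fitted base learner.
        fitted = [fi + step * stump_value(s, xi) for fi, xi in zip(fitted, xs)]
    return f0, step, stumps


def predict(model, xi: float) -> float:
    f0, step, stumps = model
    return f0 + step * sum(stump_value(s, xi) for s in stumps)
```

For example, `boost([0, 1, 2, 3, 4, 5, 6, 7], [0, 0, None, 0, 1, 1, None, 1], n_rounds=200)` fits a step function from the six observed responses, and `predict` can then be evaluated at the two points whose responses were missing. Dropping incomplete cases is generally valid only under restrictive missingness assumptions, which is the kind of limitation that dedicated strategies for incomplete responses aim to overcome.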
Hypothesis Testing with Genetic Variants in Longitudinal and Time-to-Event Data
This work investigates joint models for genetic association involving longitudinal biomarkers and time-to-event outcomes. We develop and validate a closed-form sample size formula for assessing the overall effects of single-nucleotide polymorphisms. To improve robustness against model misspecification due to nonlinear trajectories in the longitudinal traits, we incorporate spline functions to capture subject-specific nonlinear evolutions.
Inference and Variable Selection with Missing Data
This work addresses key challenges in missing data analysis by proposing a unified modeling framework based on generalized additive models. The framework flexibly accommodates a wide range of missing data mechanisms without relying on strong parametric assumptions. To enable simultaneous estimation and variable selection, it incorporates a regularized likelihood approach. The proposed method is supported by rigorous theoretical guarantees and demonstrates strong empirical performance across a variety of settings.
Causal Inference with Artificial Intelligence
This work examines the causal impact of stroke and myocardial infarction on self-rated health, revealing a negative effect on individuals’ perceptions of having very good or excellent health and highlighting the importance of interventions to support subjective health in aging populations.
Assessing Physician Performance in Critical Care
This work evaluates ICU physician performance using tree ensemble methods with TreeSHAP. To address confounding and achieve covariate balance, it estimates physician effects using propensity weighting with parametric models and super learning methods.
Examining COVID-19 Incubation Times
This work explores several models to analyze COVID-19 incubation times from different angles, finding that the currently recommended 14-day quarantine period is not long enough to keep the probability of releasing infected individuals early at a small level.
Identifying the Popularity of TED Talks
This work uses network analysis, principal component analysis, and LASSO to predict TED Talk popularity, revealing that Number of Languages significantly influences popularity and that publishing on weekends, particularly Saturdays, enhances idea dissemination.
Analyzing Randomized Controlled Trials with Missing Response
This work evaluates the efficacy of Diclectin for nausea and vomiting during pregnancy (NVP) using data from a double-blind randomized controlled trial. Statistical inferences about its effectiveness depend on the choice of missing data methods and models, and the results suggest that its benefit is not clinically significant under the pre-specified minimal clinically important difference.