Research

My research lies at the intersection of statistics and machine learning. I develop rigorous theoretical methods and scalable computational tools to tackle complex, noisy datasets arising from public health challenges. For more details, including related papers and presentations, click on each project below. You can also explore my papers by year here and my presentations by year here.

Modeling Decision-Making Dynamics and Brain Activity

This work develops advanced statistical modeling techniques to build efficient and scalable computational frameworks for investigating decision-making in behavioral tasks, integrating data across multiple tasks, uncovering latent dynamics, and elucidating brain–behavior relationships. We aim to identify and quantify distinctive decision-making patterns associated with mental health conditions and diverse cognitive strategies, localize brain regions activated by specific tasks, and determine the causal effects of brain activity on behavior.

Advancing Individualized Treatment Rules from Multiple Studies

This work introduces statistical methodologies for integrating data from multiple randomized controlled trials, which may share a common treatment arm but differ in their alternative treatment options, to learn more robust and reliable individualized treatment rules. In addition, this work develops methods to integrate electronic health records from multiple hospitals while respecting data privacy and institutional sharing constraints, enabling more comprehensive and generalizable analyses across heterogeneous healthcare systems.

Boosting Learning with Incomplete Data

This work extends boosting methods by developing multiple strategies to handle missing and interval-censored response variables. These strategies are implemented using functional gradient descent, and we establish rigorous theoretical guarantees. Numerical studies demonstrate that the proposed methods perform well in finite-sample settings.
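
As a rough illustration of functional gradient descent in its simplest form, the sketch below runs boosting under squared-error loss with regression stumps on simulated data. It assumes fully observed responses and does not reproduce the strategies for missing or interval-censored responses developed in this work. The function names, the simulated data, and the tuning values (50 rounds, step size 0.1) are hypothetical choices made only for this example.

```python
import math
import random


def fit_stump(x, r):
    """Least-squares regression stump: choose the split that best fits r."""
    values = sorted(set(x))
    best = None
    for a, b in zip(values, values[1:]):
        t = (a + b) / 2.0  # candidate threshold between neighboring points
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((ri - lv) ** 2 for ri in left) + sum((ri - rv) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]  # (threshold, left value, right value)


def predict_stump(stump, x):
    t, lv, rv = stump
    return [lv if xi <= t else rv for xi in x]


def l2_boost(x, y, n_rounds=50, step=0.1):
    """Functional gradient descent for the loss 0.5 * (y - f)^2."""
    f0 = sum(y) / len(y)  # start from the constant fit
    f = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # The negative gradient of the loss with respect to f is the residual.
        residual = [yi - fi for yi, fi in zip(y, f)]
        stump = fit_stump(x, residual)
        update = predict_stump(stump, x)
        f = [fi + step * ui for fi, ui in zip(f, update)]
        stumps.append(stump)
    return f0, stumps, f


def mse(y, f):
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)


# Simulated, fully observed data (hypothetical; for illustration only).
random.seed(0)
x = [random.uniform(-3.0, 3.0) for _ in range(200)]
y = [math.sin(xi) + random.gauss(0.0, 0.2) for xi in x]

f0, stumps, fitted = l2_boost(x, y)
baseline_mse = mse(y, [f0] * len(y))
boosted_mse = mse(y, fitted)
print(f"constant fit MSE: {baseline_mse:.3f}, boosted MSE: {boosted_mse:.3f}")
```

Each round fits a weak learner to the negative gradient of the loss at the current fit and takes a small step in that direction. Handling incomplete responses would require replacing the residual step with a quantity that can be computed from the observed data, which this sketch does not attempt.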

Hypothesis Testing with Genetic Variants in Longitudinal and Time-to-Event Data

This work investigates joint models for genetic association involving longitudinal biomarkers and time-to-event outcomes. We develop and validate a closed-form sample size formula for assessing the overall effects of single-nucleotide polymorphisms. To improve robustness against model misspecification due to nonlinear trajectories in the longitudinal traits, we incorporate spline functions to capture subject-specific nonlinear evolutions.

Inference and Variable Selection with Missing Data

This work addresses key challenges in missing data analysis by proposing a unified modeling framework based on generalized additive models. The framework flexibly accommodates a wide range of missing data mechanisms without relying on strong parametric assumptions. To enable simultaneous estimation and variable selection, it incorporates a regularized likelihood approach. The proposed method is supported by rigorous theoretical guarantees and demonstrates strong empirical performance across a variety of settings.

Examining COVID-19 Incubation Times

This work analyzes COVID-19 incubation times under several different models and from several angles. The results indicate that the currently recommended 14-day quarantine period is not long enough to keep the probability of releasing infected individuals early small.

Machine Learning Applications

This work applies machine learning and statistical modeling to real-world prediction and inference problems across diverse domains, including social media influence, infrastructure reliability, historical survival analysis, pharmacological evaluation, and healthcare diagnostics. The studies identify key predictive and causal factors, incorporate uncertainty quantification, cost-aware decision-making, and robust missing-data handling, and systematically compare model performance across a range of methodological approaches. Results demonstrate how data-driven methods can uncover important drivers of outcomes, such as language accessibility in TED talk popularity, meteorological conditions in pipeline failures, treatment effects in nausea and vomiting during pregnancy, demographic and socioeconomic factors in Titanic survival, and multimodal clinical indicators in Parkinson’s disease diagnosis. Collectively, the work highlights the importance of interpretability, reliability, sensitivity to modeling choices, and practical efficiency in deploying statistical and machine learning methods in real-world settings.