Machine Learning Applications

Last updated on Jun 24, 2026


This work presents seven studies that apply machine learning and statistical methods to address challenges in diverse real-world contexts.

The first study investigates factors influencing the popularity of TED talks. A predictive model is developed using network analysis, principal component analysis (PCA), and LASSO regression. Results show that the number of available languages is the most influential factor, with a contribution at least three times that of the other variables, while publishing talks on weekends (especially Saturdays) is associated with higher audience reach.

The second study focuses on predicting passenger survival in the 1912 Titanic disaster using multiple machine learning methods, including logistic regression, ridge regression, LASSO, decision trees, random forests, conditional forests, support vector machines, and k-nearest neighbors. The analysis confirms that women and passengers with higher socioeconomic status had substantially higher survival probabilities. A simple rule-based model achieved 75.60% accuracy, while the conditional forest achieved the best performance at 81.34%.

The third study evaluates the effectiveness of Diclectin versus placebo for treating nausea and vomiting during pregnancy in a double-blind randomized controlled trial with substantial missing longitudinal data. Results show that statistical inferences about the treatment effect vary markedly depending on the missing-data method and modeling approach. A statistically significantly greater improvement with Diclectin (based on the PUQE score) is observed only under last observation carried forward (LOCF), not under complete-case or mean imputation analyses. Moreover, the estimated effect does not reach the pre-specified minimal clinically important difference of 3 points, suggesting limited clinical relevance. These findings highlight the sensitivity of conclusions to missing-data assumptions and underscore the importance of robust inference in clinical trials.
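To make the three missing-data strategies concrete, the following is a minimal sketch of how each one treats the same incomplete series. The participant IDs, scores, and function names are invented for illustration and are not taken from the trial; the point is only that the final-visit summary changes with the method.

```python
from statistics import mean

# Toy longitudinal scores per participant; None marks a missed visit.
# These numbers are made up for illustration and are NOT trial data.
scores = {
    "p1": [12, 10, 8],
    "p2": [11, None, None],
    "p3": [13, 9, None],
    "p4": [12, 11, 9],
}

def locf(series):
    """Carry the last observed value forward over later missing visits."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def final_complete_case(data):
    """Final-visit values from participants with no missing visits."""
    return [s[-1] for s in data.values() if None not in s]

def final_locf(data):
    """Final-visit values after last observation carried forward."""
    return [locf(s)[-1] for s in data.values()]

def final_mean_imputed(data):
    """Replace a missing final visit with the mean of observed final visits."""
    observed = [s[-1] for s in data.values() if s[-1] is not None]
    fill = mean(observed)
    return [s[-1] if s[-1] is not None else fill for s in data.values()]

for name, fn in [("complete case", final_complete_case),
                 ("LOCF", final_locf),
                 ("mean imputation", final_mean_imputed)]:
    print(name, mean(fn(scores)))
```

In this toy data, LOCF pulls the final-visit mean toward earlier (higher) scores, while mean imputation leaves the mean equal to the complete-case value but shrinks the spread. Either behavior can change a significance test, which is the kind of sensitivity the study reports.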

The fourth study examines the impact of geographical and meteorological conditions on pipeline failures. Neural networks and random forests are used to identify key predictors. Meteorological factors dominate, with snow on ground, total snow, and total rainfall emerging as the most important variables. Snow on ground consistently ranks as the top predictor, while geographical characteristics contribute relatively little.

The fifth study develops a data-driven framework to evaluate physician performance in intensive care units (ICUs). Using tree-based ensemble models (XGBoost, random forests, and tree boosting mixed models) together with explainable AI tools such as TreeSHAP, the study quantifies the contributions of physician practice patterns and ICU departments to patient outcomes. Propensity weighting is used to adjust for patient heterogeneity, enabling fairer comparisons across physicians, while super learner–based approaches improve robustness under model misspecification.

The sixth study estimates the causal effects of myocardial infarction and ischemic stroke on self-rated health using propensity weighting and matching methods. Results indicate that both events significantly reduce the probability of reporting very good or excellent health.
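As a rough illustration of propensity weighting, the sketch below computes a normalized inverse-probability-weighted difference in outcome means. It assumes a single discrete covariate and estimates the propensity score as the exposure fraction within each stratum; the study's actual covariates and propensity model are not described here, and the records are invented.

```python
from collections import defaultdict

# Toy records: (age_group, had_event, reports_good_health).
# Invented for illustration; not data from the study.
records = [
    ("old", 1, 0), ("old", 1, 0), ("old", 1, 1), ("old", 0, 1),
    ("young", 1, 0), ("young", 0, 1), ("young", 0, 1), ("young", 0, 0),
]

def propensity_by_stratum(rows):
    """Estimate P(event | covariate) as the event fraction within each stratum."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_exposed, n_total]
    for x, t, _ in rows:
        counts[x][0] += t
        counts[x][1] += 1
    return {x: exposed / total for x, (exposed, total) in counts.items()}

def ipw_effect(rows):
    """Normalized (Hajek) inverse-probability-weighted difference in outcome means.

    Assumes every stratum contains both exposed and unexposed rows (positivity);
    otherwise a weight would divide by zero.
    """
    e = propensity_by_stratum(rows)
    num1 = den1 = num0 = den0 = 0.0
    for x, t, y in rows:
        if t == 1:
            w = 1.0 / e[x]          # weight for exposed rows
            num1 += w * y
            den1 += w
        else:
            w = 1.0 / (1.0 - e[x])  # weight for unexposed rows
            num0 += w * y
            den0 += w
    return num1 / den1 - num0 / den0

def naive_effect(rows):
    """Unadjusted difference in outcome means, for comparison."""
    exposed = [y for _, t, y in rows if t == 1]
    unexposed = [y for _, t, y in rows if t == 0]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)
```

On this toy data the weighted estimate differs from the unadjusted one because exposure is more common in the older group. That imbalance is what the weights are meant to correct.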

The seventh study introduces a cost-aware conformal cascade triage model for interpretable diagnosis of Parkinson’s disease, augmented with large language model (LLM)-guided reasoning. The framework uses a staged cascade architecture that routes patients through progressively more expensive diagnostic stages. Early stages rely on low-cost features (e.g., demographics, basic motor tests, wearable summaries), with conformal prediction providing formal uncertainty guarantees. More complex and costly data (such as speech analysis, detailed kinematics, and imaging) are incorporated only when necessary, improving efficiency while controlling diagnostic cost. The model produces valid prediction sets under conformal inference, while maintaining interpretability through transparent decision pathways and feature attributions. An LLM component generates natural-language explanations, contextualizes predictions using clinical knowledge, and supports differential diagnostic reasoning. The approach demonstrates strong performance in accuracy, calibration, cost reduction, and interpretability compared with traditional end-to-end models.
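The staged routing idea can be sketched with split conformal prediction, under several assumptions. Each stage is represented by a hypothetical dictionary holding its cost, the probabilities a calibration set assigned to the true labels, and a `predict` function; none of these names come from the study. The nonconformity score used here (one minus the probability of the label) is one common choice, not necessarily the one the study uses, and the sketch does not address how coverage is preserved across multiple stages.

```python
import math

def conformal_threshold(cal_probs_true, alpha):
    """Split-conformal quantile of nonconformity scores 1 - p(true label)."""
    scores = sorted(1.0 - p for p in cal_probs_true)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: every label is kept
    return scores[k - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score does not exceed the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}

def cascade(patient, stages, alpha=0.1):
    """Run stages in order of cost; stop at the first singleton prediction set."""
    total_cost = 0.0
    result = set()
    for stage in stages:
        total_cost += stage["cost"]
        qhat = conformal_threshold(stage["cal_probs_true"], alpha)
        result = prediction_set(stage["predict"](patient), qhat)
        if len(result) == 1:
            break  # confident enough: skip the more expensive stages
    return result, total_cost

# Hypothetical two-stage setup; probabilities and costs are invented.
# Each patient dict carries precomputed class probabilities per stage.
stages = [
    {"cost": 1.0, "cal_probs_true": [0.5, 0.75, 0.875, 0.75],
     "predict": lambda patient: patient["cheap_probs"]},
    {"cost": 10.0, "cal_probs_true": [0.875, 0.9375, 0.75, 0.875],
     "predict": lambda patient: patient["costly_probs"]},
]
```

A patient whose low-cost prediction set is already a single label exits after the first stage, while an ambiguous set (more than one label) triggers the costlier stage. That is how a cascade of this kind trades diagnostic cost against certainty.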

Overall, these seven studies demonstrate how machine learning and statistical modeling can generate actionable insights across domains ranging from social media analytics and infrastructure reliability to clinical decision support and causal inference in health outcomes. Collectively, the work emphasizes robustness, reliability, interpretability, uncertainty quantification, and practical efficiency in real-world applications.

Case Study