Machine Learning Applications
This work presents five studies that apply machine learning and statistical techniques to understand and predict outcomes in diverse real-world contexts.
The first study investigates the factors influencing the popularity of TED talks. A predictive model is developed using network analysis, principal component analysis (PCA), and LASSO. Results show that the number of available languages is the most influential factor, contributing at least three times more than other variables, while publishing talks on weekends (especially Saturdays) significantly enhances audience reach.
The second study focuses on predicting passenger survival in the 1912 Titanic disaster using multiple machine learning methods, including logistic regression, ridge regression, LASSO, decision trees, random forests, conditional forests, support vector machines, and k-nearest neighbors. The analysis confirms that women and passengers with higher socioeconomic status had substantially higher survival probabilities. A simple rule-based model achieved 75.60% accuracy, while the conditional forest reached the highest performance at 81.34%.
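To make the idea of a simple rule-based baseline concrete, here is a minimal sketch. The summary above does not say which rule the study used, so the rule below (predict survival for women, death for men) is an assumption, chosen only because it is a common Titanic baseline. The passenger records are made up for illustration and are not the study's data.

```python
from typing import List, Tuple

def gender_rule(sex: str) -> int:
    """Assumed baseline rule: predict survival (1) for women, death (0) for men."""
    return 1 if sex.strip().lower() == "female" else 0

def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical (sex, survived) records, invented for this example only.
toy_passengers: List[Tuple[str, int]] = [
    ("female", 1), ("male", 0), ("female", 1), ("male", 1),
    ("female", 0), ("male", 0), ("male", 0), ("female", 1),
]

y_true = [survived for _, survived in toy_passengers]
y_pred = [gender_rule(sex) for sex, _ in toy_passengers]
toy_accuracy = accuracy(y_true, y_pred)
print(f"Rule-based accuracy on toy data: {toy_accuracy:.2%}")  # 75.00%
```

A baseline like this gives the fitted models a reference point: a more complex method is only worth its complexity if it clearly beats the rule on held-out data.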
The third study evaluates the effectiveness of Diclectin versus placebo for treating nausea and vomiting during pregnancy, using data from a double-blind randomized controlled trial with substantial missing longitudinal data. Statistical inferences about the treatment effect vary markedly with the missing-data method and modeling approach. A statistically significant greater improvement in the Pregnancy-Unique Quantification of Emesis (PUQE) score with Diclectin is observed only under last observation carried forward (LOCF), not under complete-case (CC) analysis or mean (MEAN) imputation. Moreover, the magnitude of the difference does not reach the pre-specified minimal clinically important difference of 3 points, indicating that any benefit is not clinically meaningful. These findings provide evidence that Diclectin is not effective for this indication and offer guidance for Health Canada in supporting pregnant women and newborns.
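The sensitivity to the missing-data method can be illustrated with a small sketch of the three approaches named above. Everything here is an assumption made for illustration: the scores are invented, the helper names are hypothetical, and MEAN imputation is interpreted as replacing a missing value with the mean of the observed values at the same visit, which the summary does not confirm.

```python
from statistics import mean
from typing import List, Optional

Series = List[Optional[float]]  # one participant's scores by visit; None = missing

def locf(series: Series) -> Series:
    """Last observation carried forward; gaps before the first observation stay missing."""
    filled: Series = []
    last: Optional[float] = None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def complete_cases(data: List[Series]) -> List[Series]:
    """Keep only participants with no missing visits."""
    return [s for s in data if all(v is not None for v in s)]

def mean_impute(data: List[Series]) -> List[Series]:
    """Fill each gap with the mean of observed scores at that visit.

    Assumes every visit has at least one observed score.
    """
    n_visits = len(data[0])
    visit_means = [mean(s[j] for s in data if s[j] is not None) for j in range(n_visits)]
    return [[visit_means[j] if v is None else v for j, v in enumerate(s)] for s in data]

def mean_reduction(data: List[Series]) -> float:
    """Average of (first visit score - last visit score) across participants."""
    return mean(s[0] - s[-1] for s in data)

# Invented scores for four participants over three visits (not trial data).
toy_scores: List[Series] = [
    [10.0, 8.0, 6.0],
    [12.0, 11.0, None],
    [13.0, None, None],
    [11.0, 7.0, 5.0],
]

locf_result = mean_reduction([locf(s) for s in toy_scores])
cc_result = mean_reduction(complete_cases(toy_scores))
mean_result = mean_reduction(mean_impute(toy_scores))
print(locf_result, cc_result, mean_result)  # 2.75 5.0 6.0
```

On the same toy data the three methods give three different estimates of the average score reduction, which is the kind of divergence that makes a single-method analysis hard to trust.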
The fourth study examines the impact of geographical and meteorological conditions on pipeline failures. Neural networks and random forests are employed to identify key predictors. Meteorological factors dominate, with snow on ground, total snow, and total rainfall emerging as the most important variables. Snow on ground consistently ranks as the top predictor, while geographical characteristics show limited contribution.
The fifth study introduces a cost-aware conformal cascade triage model for the interpretable diagnosis of Parkinson’s disease, augmented with LLM-guided reasoning. This framework uses a staged cascade architecture that routes patients through progressively more expensive diagnostic stages. Early stages rely on low-cost features (e.g., demographics, basic motor tests, wearable summaries) with conformal prediction providing rigorous uncertainty guarantees. More complex and costly data (speech analysis, detailed kinematics, imaging, etc.) are used only when needed, optimizing diagnostic cost and efficiency. Conformal prediction ensures reliable prediction sets with statistical validity guarantees, while interpretability is maintained through transparent decision pathways and feature attributions. An LLM layer generates natural-language explanations, contextualizes predictions with clinical knowledge, and supports differential diagnosis reasoning. The model demonstrates strong performance in accuracy, uncertainty calibration, cost reduction, and clinical usability compared to traditional end-to-end approaches.
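A staged cascade with conformal prediction sets can be sketched as follows. This is not the study's implementation: the nonconformity score (one minus the predicted probability of a label), the stopping rule (stop as soon as the prediction set contains exactly one label), the stage names, costs, calibration scores, and the `p_motor` and `p_speech` fields are all assumptions made for illustration. The stage models are stand-ins that simply read a precomputed probability.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: every label is kept
    return sorted(cal_scores)[k - 1]

def prediction_set(p_positive: float, qhat: float) -> Set[int]:
    """Labels whose assumed nonconformity score, 1 - p(label), does not exceed qhat."""
    probs = {1: p_positive, 0: 1.0 - p_positive}
    return {label for label, p in probs.items() if 1.0 - p <= qhat}

# (stage name, cost, model returning P(disease), conformal threshold)
Stage = Tuple[str, float, Callable[[Dict[str, float]], float], float]

def triage(patient: Dict[str, float], stages: List[Stage]) -> Tuple[str, Set[int], float]:
    """Run stages in order of cost; stop at the first single-label prediction set."""
    spent = 0.0
    name, labels = "", set()
    for name, cost, model, qhat in stages:
        spent += cost
        labels = prediction_set(model(patient), qhat)
        if len(labels) == 1:
            break  # confident enough: no need for a costlier stage
    return name, labels, spent

ALPHA = 0.2  # assumed target miscoverage per stage
# Invented calibration nonconformity scores for each stage.
cheap_cal = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.80]
costly_cal = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.40]

stages: List[Stage] = [
    ("motor", 1.0, lambda p: p["p_motor"], conformal_threshold(cheap_cal, ALPHA)),
    ("speech", 20.0, lambda p: p["p_speech"], conformal_threshold(costly_cal, ALPHA)),
]

clear_case = triage({"p_motor": 0.9, "p_speech": 0.95}, stages)
unclear_case = triage({"p_motor": 0.5, "p_speech": 0.1}, stages)
print(clear_case)    # ('motor', {1}, 1.0)
print(unclear_case)  # ('speech', {0}, 21.0)
```

In this sketch the first patient is resolved by the cheap stage alone, while the second receives a two-label set at the cheap stage and is escalated, so the expensive test is paid for only when the cheap one is inconclusive.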
Overall, these five studies demonstrate how machine learning and statistical modeling can provide actionable insights and improve predictive performance across domains ranging from social media analytics and historical analysis to evidence-based evaluation of pharmacological interventions, infrastructure management, and clinical decision support. The progression of work shows an increasing emphasis on reliability, interpretability, uncertainty awareness (including missing-data handling), and practical efficiency when deploying machine learning and statistical methods in real-world settings.





