Hybrid Ensemble Learning and Explainable AI for Accurate and Interpretable Cardiovascular Risk Prediction

A hybrid ensemble learning model combining Gradient Boosting, CatBoost, and Neural Networks predicts cardiovascular risk with an AUC-ROC of 0.82. Explainable AI tools like SHAP enhance transparency for clinical use.

Published on: May 24, 2025

Predicting Cardiovascular Risk with Hybrid Ensemble Learning and Explainable AI

Abstract

Cardiovascular diseases (CVDs) remain a leading cause of death worldwide, highlighting the need for accurate early risk prediction to guide prevention and treatment. This study introduces a hybrid ensemble learning framework that combines advanced machine learning models with explainable AI techniques for predicting cardiovascular risk.

The framework integrates Gradient Boosting, CatBoost, and Neural Networks within a stacked ensemble architecture, improving prediction accuracy beyond individual models. Visualization tools like SHAP values, t-SNE, and PCA reveal complex relationships among key risk factors such as blood pressure, BMI, and cholesterol-glucose ratios, alongside lifestyle variables.

Explainable AI methods enable clinicians to understand each feature’s contribution to the predictions, supporting transparent decision-making. The hybrid model achieved an AUC-ROC score of 0.82, with balanced classification performance: Precision at 81%, Recall at 83%, and F1-Score at 82% on a test dataset. These results demonstrate the effectiveness of ensemble learning in medical prediction tasks and underline the importance of interpretable models for clinical trust.

Introduction

Cardiovascular diseases pose a significant public health challenge as the top cause of global mortality. Early prediction combined with timely intervention can reduce patient risk and improve outcomes. Traditional risk models often rely on linear assumptions and pre-defined variable relationships, limiting their ability to capture complex feature interactions.

Modern ensemble learning methods leverage diverse models to improve prediction accuracy but often lack interpretability, restricting their clinical adoption. This work proposes a hybrid ensemble framework that combines Gradient Boosting, CatBoost, and neural networks as base learners, with XGBoost as a meta-model, while incorporating explainability via SHAP (SHapley Additive exPlanations).

By capturing data multidimensionality through engineered features and clustering, the approach achieves stable and interpretable predictions. Visualizations using PCA, t-SNE, and SHAP enhance understanding of the underlying data patterns, bridging the gap between performance and clinical relevance.

Motivation

CVDs cause approximately 17.9 million deaths annually, accounting for 32% of global mortality and imposing substantial healthcare costs. Despite advances in research, early detection remains challenging due to the disease’s multifactorial nature and the complexity of clinical data.

Conventional models often fail to capture nonlinear and high-dimensional relationships inherent in medical datasets. Ensemble machine learning methods offer higher accuracy but tend to be “black boxes,” which limits their use in clinical settings where transparency is vital.

Explainable AI provides a pathway to make complex models interpretable and actionable for healthcare professionals. However, few existing studies integrate high-performing ensemble models with scalable XAI frameworks tailored for cardiovascular risk prediction. This gap motivates the development of a hybrid ensemble model combining best-in-class classifiers with explainability tools like SHAP, PCA, and t-SNE, aiming for both accuracy and clinical trust.

Literature Survey

Recent research has applied deep learning to electronic health records for early CVD prediction, focusing on integrating clinical and imaging features to forecast adverse events. Predictive models have ranged from decision trees and support vector machines to neural networks.

Stacked ensemble architectures have gained attention for enhancing predictive accuracy by combining multiple base models with a meta-model such as XGBoost. These methods improve generalizability but often lack interpretability, a critical factor in healthcare decision-making.

Explainable AI techniques like SHAP have made it possible to interpret complex models, allowing clinicians to understand how features such as BMI, blood pressure, and cholesterol impact predictions. Unlike studies centered on single models or limited explanation capacity, this work leverages stacked ensembles alongside XAI to balance predictive power with transparency.

Methodology

Data Collection and Preprocessing

The study uses a comprehensive dataset compiled from three publicly available sources: IEEE Dataport Cardiovascular Disease Dataset, Cleveland Heart Disease Dataset, and Hungarian Dataset. Together, they provide 70,000 instances and 12 clinical features including age, gender, blood pressure, BMI, cholesterol, and glucose.

To address class imbalance—more healthy subjects than cardiovascular cases—SMOTE (Synthetic Minority Over-sampling Technique) was employed alongside random undersampling. Missing numerical values were imputed using mean values, and categorical features were filled with the mode. Outliers were detected and removed using the interquartile range (IQR) method.

Continuous features were normalized via Min-Max scaling to a 0–1 range for consistent model input. The data was split into 80% training and 20% testing sets, with stratified sampling to maintain class distribution.
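The preprocessing steps described above can be sketched as follows. The data here is synthetic stand-in material, not the study's datasets, and the SMOTE/undersampling step is left as a comment (it would require the imbalanced-learn package) so the example stays dependency-free:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 4))  # stand-in clinical features
y = (rng.random(1000) < 0.3).astype(int)              # imbalanced binary labels

# 1) IQR-based outlier removal: keep rows inside the 1.5*IQR fences per feature
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
inliers = np.all((X >= q1 - 1.5 * iqr) & (X <= q3 + 1.5 * iqr), axis=1)
X, y = X[inliers], y[inliers]

# 2) Stratified 80/20 split so both sets keep the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 3) Min-Max scaling to [0, 1]; fitting on the training split avoids leakage
scaler = MinMaxScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# SMOTE / random undersampling would then be applied to (X_tr, y_tr) only,
# e.g. with imblearn.over_sampling.SMOTE; omitted here to keep the sketch
# dependency-free.
```

Note that resampling and scaling are applied only to the training portion; applying them before the split would leak test-set statistics into training.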

Model Selection and Hybrid Ensemble Architecture

This study adopts a hybrid ensemble framework to capture the multidimensional nature of cardiovascular risk factors. Base classifiers including Gradient Boosting, CatBoost, and Neural Networks form a stacked ensemble, with XGBoost acting as the meta-model to combine their predictions.

A 5-fold stratified cross-validation protocol was used to ensure balanced class representation during training and evaluation. This method enhances reliability when handling imbalanced data. Final testing was performed on a separate 20% holdout set to provide an unbiased performance estimate.

Stacking leverages the strengths of individual models while compensating for their weaknesses, resulting in improved predictive accuracy and generalizability. Explainable AI tools are integrated to maintain transparency of the decision process.
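A minimal stacking sketch in scikit-learn illustrates the architecture. The paper's CatBoost base learner and XGBoost meta-model are replaced here with library-available stand-ins (sklearn's gradient boosting) so the example runs without extra dependencies; swap in `catboost.CatBoostClassifier` and `xgboost.XGBClassifier` when those packages are available:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 12-feature cardiovascular dataset
X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("nn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                             random_state=0)),
    ],
    # Stand-in for the XGBoost meta-model described in the text
    final_estimator=GradientBoostingClassifier(random_state=0),
    # 5-fold stratified CV generates the out-of-fold base predictions
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)  # evaluated on the held-out 20%
```

The `cv` argument is what makes stacking honest: the meta-model is trained on out-of-fold predictions rather than on outputs the base models produced for data they were fitted on.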

Results and Discussion

The hybrid ensemble model outperformed individual base classifiers across multiple performance metrics. Confusion matrices showed balanced classification between positive and negative cases. The ROC-AUC score reached 0.82, with Precision at 81%, Recall at 83%, and an F1-Score of 82% on unseen test data.
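The reported metrics are standard and easy to reproduce with scikit-learn; the labels and probabilities below are illustrative, not the paper's outputs:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])      # ground-truth labels
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.3,
                   0.6, 0.7, 0.1, 0.85, 0.35])          # model probabilities
y_pred = (y_prob >= 0.5).astype(int)                    # threshold at 0.5

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)     # TP / (TP + FN)
f1        = f1_score(y_true, y_pred)         # harmonic mean of the two
auc       = roc_auc_score(y_true, y_prob)    # uses probabilities, not labels
cm        = confusion_matrix(y_true, y_pred) # rows: true class, cols: predicted
```

One detail worth noting: ROC-AUC is computed from the predicted probabilities, while precision, recall, and F1 depend on the chosen classification threshold.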

Explainability through SHAP values and dimensionality reduction techniques like PCA and t-SNE provided insight into how features interact and influence risk predictions. These visualizations not only validated model behavior but also offered clinically meaningful interpretations of risk factors.
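The idea behind SHAP can be shown from first principles: for a small model, the exact Shapley value of each feature is its marginal contribution averaged over all feature coalitions. The linear "risk model," its weights, and the baseline values below are illustrative assumptions, not the paper's fitted model (in practice the `shap` library handles this for large ensembles):

```python
from itertools import combinations
from math import factorial

import numpy as np

weights  = np.array([0.5, 0.3, 0.2])      # illustrative linear model weights
baseline = np.array([120.0, 25.0, 1.0])   # assumed population means
x        = np.array([150.0, 32.0, 1.4])   # one hypothetical patient

def f(subset):
    """Model output with features outside `subset` held at the baseline."""
    z = baseline.copy()
    for i in subset:
        z[i] = x[i]
    return float(weights @ z)

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for size in range(n):
        for S in combinations(others, size):
            # Shapley kernel weight |S|! (n - |S| - 1)! / n!
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            phi[i] += w * (f(S + (i,)) - f(S))

# For a linear model this reduces to w_i * (x_i - baseline_i), and the values
# sum to f(x) - f(baseline): the "efficiency" property SHAP plots rely on.
```

The efficiency property is what makes SHAP plots clinically readable: each patient's prediction decomposes exactly into per-feature contributions relative to a population baseline.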

Combining predictive strength with interpretability makes the model suitable for practical healthcare applications where understanding the rationale behind predictions is essential.

Conclusion

This work presents a hybrid ensemble learning model that integrates multiple base classifiers optimized for heterogeneous cardiovascular data. Using XGBoost as a meta-model, the approach enhances both accuracy and generalizability.

By embedding Explainable AI methods such as SHAP, the model maintains transparency, making it clinically applicable and interpretable. This balance between performance and explainability addresses critical needs in cardiovascular risk prediction and supports more informed healthcare decisions.

For those interested in expanding their skills in AI and machine learning applied to healthcare, exploring courses on Complete AI Training can provide valuable knowledge and practical tools.