Import Libraries and Data
Data cleaning/Preparation
Rename the columns for a better understanding
Exploring Data
Display the frequency distribution of the data
Display the distribution of the continuous data columns
Display the Correlation matrix of the columns
Prediction
Split the data with scaled features
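The notebook's own cell for this step is not shown, so the following is only a minimal sketch of one common way to split and scale, assuming scikit-learn. The synthetic arrays `X` and `y`, the 80/20 split, and the `random_state` values are illustrative assumptions, not the original settings. The scaler is fitted on the training split only, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 4 continuous features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y = rng.integers(0, 2, size=200)

# Stratified split keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then reuse it for the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```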
Create an Evaluation Function and split the features into categories
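As a rough illustration of this step, the sketch below defines a hypothetical `evaluate_model` helper that returns AUC and F1 Score, and applies it to each feature category. The helper name, the column-index layout in `feature_groups`, and the synthetic data are all assumptions made for the example; the notebook's actual column names and function are not shown.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit the model and return AUC and F1 Score on the test split."""
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # positive-class probability for AUC
    preds = model.predict(X_test)              # hard labels for F1
    return {"auc": roc_auc_score(y_test, proba), "f1": f1_score(y_test, preds)}


# Synthetic stand-in data (assumption): only columns 2 and 3 carry signal.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 6))
y = (X[:, 2] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Hypothetical column layout for the feature categories.
feature_groups = {
    "objective": [0, 1],
    "examination": [2, 3],
    "subjective": [4, 5],
    "combined": [0, 1, 2, 3, 4, 5],
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

results = {}
for name, cols in feature_groups.items():
    model = GradientBoostingClassifier(n_estimators=50, random_state=42)
    results[name] = evaluate_model(
        model, X_train[:, cols], X_test[:, cols], y_train, y_test
    )

for name, scores in results.items():
    print(f"{name}: AUC={scores['auc']:.4f}, F1={scores['f1']:.4f}")
```

Running the same helper over every category keeps the comparison between feature sets consistent, since each one uses the same split and the same metrics.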
After progressively testing several models, namely Random Forest, Gradient Boosting Classifier, and Logistic Regression (with and without cross-validation), and running grid search to tune the parameters of both Random Forest and Gradient Boosting Classifier, the chosen model was the Gradient Boosting Classifier (GBC). The comparison was deliberate: it weighed the strengths of each model against the complexities of predicting cardiovascular disease. These conditions often involve nonlinear relationships and interactions between features, so the model must be flexible enough to capture them. The initial standalone models showed promising results. Random Forest achieved a high training accuracy but a lower test accuracy, indicating overfitting. The Gradient Boosting Classifier was more balanced, with closer training and test accuracies. Logistic Regression provided a baseline for comparison and showed the need for more flexible, fine-tuned methods to capture the nuanced patterns of cardiovascular risk factors. Fine-tuning the Gradient Boosting Classifier then aimed to retain its predictive power while mitigating overfitting and improving generalization to unseen data. The evaluation metrics, AUC and F1 Score, were chosen to give a comprehensive assessment of each model's ability to classify individuals by cardiovascular disease risk: F1 Score balances precision and recall, which both matter in medical predictions, while AUC measures how well the model ranks positive cases above negative ones across decision thresholds.
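The grid search described above can be sketched as follows. This is a minimal example under stated assumptions: the parameter grid, the 3-fold cross-validation, the `roc_auc` scoring choice, and the synthetic dataset from `make_classification` are illustrative, not the settings used in the notebook. Comparing training and test accuracy of the best estimator is one simple way to spot the kind of overfitting noted for Random Forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative grid (assumption); the original search space is not shown.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# A large gap between these two accuracies suggests overfitting.
best_model = search.best_estimator_
train_acc = best_model.score(X_train, y_train)
test_acc = best_model.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print(f"Cross-validated AUC: {search.best_score_:.4f}")
print(f"Train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")
```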
Benchmark for Success:
AUC: the goal is ≥ 0.75, reflecting the model's ability to accurately predict cardiovascular disease occurrence.
F1 Score: the target is ≥ 0.70, indicating an effective balance between precision and recall.
The fine-tuned model is well suited to predicting cardiovascular disease and to handling the nonlinear relationships and complex interactions typical of medical data. Evaluating it with both AUC and F1 Score gives a balanced view of classification performance in medical diagnostics: F1 Score captures precision and recall at the chosen decision threshold, while AUC summarizes discrimination across all thresholds.
The results from the Gradient Boosting Classifier support this suitability. For the combined feature set, the model achieved an AUC of 0.8026 and an F1 Score of 0.7222, both above the benchmarks and indicative of strong predictive capability with a balanced precision-recall trade-off. Examination features alone yielded an AUC of 0.7746 and an F1 Score of 0.7071, which also meet the benchmarks. In contrast, objective and subjective features yielded lower performance, with AUCs of 0.6716 and 0.5200 respectively; an AUC of 0.5200 is close to the chance level of 0.5. These results highlight the gain in predictive power from a comprehensive set of features and validate the model's efficacy, particularly when the different data types are integrated. The fine-tuned Gradient Boosting Classifier therefore performs well on examination features alone and best on the combined set, promising reliable and actionable insights for cardiovascular disease prediction and management.
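The comparison against the benchmarks can be written out as a short check. The sketch below uses only the figures reported in this section; the names `reported` and `meets_benchmark` are hypothetical, and F1 Scores for the objective and subjective feature sets are left as `None` because they are not reported here.

```python
AUC_TARGET = 0.75
F1_TARGET = 0.70

# Scores as reported in the text; None marks values that are not reported.
reported = {
    "combined": {"auc": 0.8026, "f1": 0.7222},
    "examination": {"auc": 0.7746, "f1": 0.7071},
    "objective": {"auc": 0.6716, "f1": None},
    "subjective": {"auc": 0.5200, "f1": None},
}


def meets_benchmark(scores):
    """Return True/False, or None when a missing F1 leaves the check undecided."""
    if scores["auc"] < AUC_TARGET:
        return False  # fails regardless of F1
    if scores["f1"] is None:
        return None
    return scores["f1"] >= F1_TARGET


summary = {name: meets_benchmark(scores) for name, scores in reported.items()}
for name, passed in summary.items():
    print(f"{name}: {passed}")
```

Under these thresholds, only the combined and examination feature sets pass; the objective and subjective sets already fall short on AUC.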
Practical Significance
Clinical Impact: The model's practical significance will be evaluated by its ability to enhance early detection of cardiovascular diseases, thereby facilitating timely medical interventions. A significant reduction in late-stage diagnosis rates of CVD among the screened population would demonstrate the model's practical value. For example, if the model is integrated into routine health check-ups, its effectiveness can be measured by the increased rate of early-stage CVD detection and the corresponding improvement in patient management and treatment outcomes.
Healthcare Cost Reduction: Another crucial aspect of practical significance is the model's impact on healthcare costs. By preventing advanced stages of cardiovascular diseases through early intervention, the model should lead to a noticeable decrease in the financial burden associated with CVD treatments, such as hospital admissions, surgeries, and long-term care. This cost reduction can be quantified by comparing the healthcare expenses incurred before and after implementing the predictive model in clinical practice.