Heart Attack Prediction Report
Team identification
Seminar day and time: Thursday 12:45
Team number: B
Team members: Matěj Krones, Matěj Krček, Michael Ay, Michal Klukas, Martin Košťálek
Introduction
In today's rapidly evolving healthcare landscape, machine learning plays an increasingly important role in revolutionising patient care and medical decision-making. As the core of our team project, we developed a machine learning model to address a critical problem in the healthcare industry: the early assessment of cardiovascular disease risk.

Cardiovascular disease (CVD) remains a leading cause of morbidity and mortality worldwide, making it a significant public health concern. According to the WHO, "Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year." CVD is a broad term covering a range of conditions that affect the heart and blood vessels, including coronary heart disease, stroke and heart failure. It has many risk factors, including age, gender, family history, lifestyle and underlying medical conditions. Early diagnosis and effective treatment are essential to reduce CVD's impact on individual patients and healthcare systems.

In this context, our project aims to create a tailored solution that lets healthcare providers, particularly doctors, harness the power of machine learning. Our ML model is trained on a comprehensive dataset of patient information, including age, gender, medical history and clinical measurements. The model analyses these data points to identify patterns and associations that predict CVD risk, enabling us to flag individuals at elevated risk of developing CVD even if they do not exhibit any apparent symptoms. The model can also provide business value: by predicting whether or not a patient is at risk of cardiovascular disease, it can reduce the cost of unnecessary or additional testing.

Ideal example
A doctor may use the ML model to assess the CVD risk of a 55-year-old male patient with a history of high blood pressure and high cholesterol. The model then returns a binary value of 0 or 1 indicating whether the patient is at risk of cardiovascular disease.

Chosen customization
Target attribute: HeartDisease
Instance of interest: Individual patient
Attribute of interest: Patient data (age, gender, medical history, clinical measurements…)
Subset of interest: High-risk patients
Cost matrix: The cost matrix should be designed to minimize both false negatives and false positives. False negatives can lead to missed diagnoses and delayed treatment, with potentially serious consequences for the patient's health, while false positives can lead to unnecessary investigations and treatments.
Target variable: Binary [1: heart disease, 0: normal]
Dataset: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
Dataset Explanation
ChestPainType: [TA, ATA, NAP, ASY]
Impact on heart disease: the type of chest pain indicates the nature of the discomfort the patient is experiencing; typical angina, for instance, is more closely associated with heart-related issues. Understanding the type of chest pain contributes to diagnosing the underlying cause of symptoms.
TA: Typical Angina - predictable chest pain related to exertion or stress, relieved by rest or medication; most directly associated with CVD.
ATA: Atypical Angina - chest discomfort not fitting the typical pattern; may include non-specific symptoms.
NAP: Non-Anginal Pain - chest discomfort unrelated to reduced blood flow to the heart, stemming from various causes.
ASY: Asymptomatic - absence of noticeable chest pain or discomfort, though other heart-related indicators may be present.

FastingBS: [1: if FastingBS > 120 mg/dl, 0: otherwise]
Impact on heart disease: elevated fasting blood sugar may indicate diabetes or prediabetes, and diabetes is a risk factor for heart disease. The 120 mg/dl threshold is a diagnostic marker for diabetes and prediabetes; keeping blood sugar below it is crucial for maintaining cardiovascular health.

RestingECG: [Normal: normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]

ExerciseAngina: [Y: yes, N: no]
Impact on heart disease: angina during exercise suggests that the heart is not receiving enough blood flow, which can be a symptom of coronary artery disease, a common cause of heart disease.

Oldpeak: [Numeric value measured in depression]
Impact on heart disease: Oldpeak measures the extent of ST depression induced by exercise relative to rest. Significant ST depression can indicate myocardial ischemia, i.e. a compromised blood supply to the heart. The ST segment is the portion of the electrocardiogram (ECG) waveform between ventricular depolarization and repolarization; abnormalities in this segment can indicate various heart conditions.

ST_Slope: [Up: upsloping, Flat: flat, Down: downsloping]
Impact on heart disease: the slope of the peak exercise ST segment provides additional information about the heart's response to exercise. Upsloping is a normal response and generally less concerning; a flat or downsloping ST segment is more concerning and may suggest myocardial ischemia or other heart-related issues.
Run to view results
Data exploration
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Result interpretation
1. Histograms and boxplots: these plots provide a basic overview of the distribution of the numerical variables but do not reveal detailed insights.
2. Histogram with hue: higher cholesterol appears to be associated with a higher frequency of heart disease, as suggested by the colour-coded histogram.
3. Violin plot: older individuals appear to have a higher likelihood of heart disease.
4. Scatterplots: the scatterplots show a slight correlation of age, cholesterol and RestingBP with a higher chance of heart disease.
5. Scatterplots with colour coding:
Age: higher age is associated with an increased likelihood of heart disease.
RestingBP: extremely high resting blood pressure often indicates a higher chance of heart disease.
MaxHR: higher MaxHR appears to be associated with a lower likelihood of heart disease.
Oldpeak: higher Oldpeak values are associated with an increased chance of heart disease.
Preprocessing for supervised machine learning
Run to view results
Our target attribute is already binary, so it does not need to be converted. A value of 1 means the patient has, or has had, heart disease; a value of 0 means they do not.
Run to view results
Split data
Here we split the data into training and test sets.
Run to view results
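A minimal sketch of this step, assuming the dataset has been loaded into a pandas DataFrame named df (the split ratio and random seed in the notebook may differ):

```python
from sklearn.model_selection import train_test_split

# Separate the features from the binary target attribute.
X = df.drop(columns="HeartDisease")
y = df["HeartDisease"]

# Hold out 25% of the data for testing; stratify so both sets keep
# the same proportion of positive and negative cases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```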
Missing values
We need to check whether there are any missing values in our dataset.
Run to view results
We have no missing values in this dataset.
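A sketch of the check, assuming the training features live in a DataFrame named X_train as above:

```python
# Count missing values per column; all zeros confirms a complete dataset.
print(X_train.isna().sum())
```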
Zero values
We need to check for values that do not make sense, such as Cholesterol = 0 and RestingBP = 0.
Run to view results
Run to view results
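One common way to handle such implausible zeros is to treat them as missing and impute them; a sketch under that assumption, using the training median (the notebook's actual strategy may differ):

```python
import numpy as np

train_medians = {}
for col in ["Cholesterol", "RestingBP"]:
    # A value of 0 is physiologically implausible, so treat it as missing...
    X_train[col] = X_train[col].replace(0, np.nan)
    # ...and impute it with the median of the training data only,
    # to avoid leaking information from the test set.
    train_medians[col] = X_train[col].median()
    X_train[col] = X_train[col].fillna(train_medians[col])
```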
Numerical variables
Tree-based algorithms do not require feature rescaling, so we did not transform or modify our numerical variables.
Ordinal variables
Run to view results
Here we encode our ordinal variables as numbers so that they can be used by the algorithms. We do this with scikit-learn's OrdinalEncoder.
Run to view results
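A sketch of the encoding, assuming ST_Slope is the ordinal variable and the category order Down < Flat < Up (the notebook's variable list and ordering may differ):

```python
from sklearn.preprocessing import OrdinalEncoder

# The explicit category list fixes the mapping: Down -> 0, Flat -> 1, Up -> 2.
ordinal_encoder = OrdinalEncoder(categories=[["Down", "Flat", "Up"]])

# Fit on the training data; the fitted encoder is reused later on the test set.
X_train[["ST_Slope"]] = ordinal_encoder.fit_transform(X_train[["ST_Slope"]])
```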
Nominal variables
Run to view results
Here we encode the nominal values of our dataset as numbers so that they work with the algorithms from scikit-learn.
Run to view results
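A sketch using one-hot encoding via pandas, assuming Sex, ChestPainType, RestingECG and ExerciseAngina are the nominal columns (the notebook may instead use scikit-learn's OneHotEncoder or encode some of these differently):

```python
import pandas as pd

nominal_cols = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina"]

# One-hot encode each nominal column; drop_first avoids a redundant
# dummy column per variable.
X_train = pd.get_dummies(X_train, columns=nominal_cols, drop_first=True)
```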
Feature selection
Here we identify which features have low variance and are therefore candidates to drop from our dataset.
Run to view results
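A sketch of the low-variance check with scikit-learn's VarianceThreshold; the cutoff below is illustrative:

```python
from sklearn.feature_selection import VarianceThreshold

# Flag features whose variance falls below an (illustrative) threshold.
selector = VarianceThreshold(threshold=0.05)
selector.fit(X_train)

# get_support() marks the retained features; the inverse marks candidates to drop.
low_variance = X_train.columns[~selector.get_support()]
print("Low-variance candidates to drop:", list(low_variance))
```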
A heat map can help us visualise the features.
Run to view results
Here we found that RestingBP could be dropped, but after testing this we found that it worsened our models, so we kept the feature as it was.
Run to view results
Run to view results
Apply same steps on test data
Here we apply all the preprocessing steps used on the training set to the test set as well.
Run to view results
Run to view results
Run to view results
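A sketch of these reapplied steps, reusing the objects fitted on the training set (names follow the earlier sketches) so the test set receives exactly the same transformations:

```python
import numpy as np
import pandas as pd

# Impute the test set's implausible zeros with the *training* medians.
for col in ["Cholesterol", "RestingBP"]:
    X_test[col] = X_test[col].replace(0, np.nan).fillna(train_medians[col])

# Apply the ordinal encoder fitted on the training data (transform only).
X_test[["ST_Slope"]] = ordinal_encoder.transform(X_test[["ST_Slope"]])

# One-hot encode the nominal columns and align them with the training columns,
# filling any category unseen in the test set with 0.
X_test = pd.get_dummies(X_test, columns=nominal_cols, drop_first=True)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```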
Preprocessing for unsupervised model (clustering)
Run to view results
Run to view results
Run to view results
Modeling
Supervised model that predicts the target attribute
Decision Tree
Here we search for the best hyperparameters for our decision tree model.
Run to view results
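A sketch of the search with a cross-validated grid over a few common decision tree hyperparameters; the actual grid in the notebook may be larger:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6, None],
    "min_samples_leaf": [1, 2, 5],
}

# 5-fold cross-validated grid search, scored on recall because missed
# positives (undiagnosed patients) are the costliest mistake here.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
tree_model = search.best_estimator_
```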
Quick evaluation of results
Run to view results
Plotting graph of our tree
Run to view results
Random Forest
Finding the best hyperparameters for the random forest.
Run to view results
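A sketch using RandomizedSearchCV (the non-round n_estimators=183 in the final model suggests a randomized rather than exhaustive search); the distributions below are illustrative:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(5, 120),
    "max_features": randint(1, 8),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
    "max_samples": [0.5, 0.75, None],
    "class_weight": ["balanced", "balanced_subsample", None],
}

# Sample 100 random configurations, each evaluated with 5-fold CV.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,
    scoring="recall",
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

forest_model = search.best_estimator_
```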
Quick evaluation of results
Run to view results
Run to view results
Run to view results
Run to view results
Clustering model for chosen subset of data - KMeans
Run to view results
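A sketch of the clustering, assuming the features prepared in the clustering preprocessing step are standardised here as X_scaled and an illustrative choice of two clusters:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# KMeans is distance-based, so the features are standardised first.
X_scaled = StandardScaler().fit_transform(X_train)

# n_init=10 restarts the algorithm from 10 random centroid seeds and
# keeps the best solution.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)
```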
Clustering model for chosen subset of data - hierarchical
Run to view results
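A sketch of the hierarchical model on the same scaled data, assuming Ward linkage (the notebook's linkage choice may differ):

```python
from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges, at each step, the pair of clusters whose merge
# least increases the total within-cluster variance.
hier = AgglomerativeClustering(n_clusters=2, linkage="ward")
hier_labels = hier.fit_predict(X_scaled)
```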
Evaluation
Supervised
Decision Tree
Evaluation on training data
Here we check that our model is not overfitted.
Run to view results
Run to view results
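A sketch of the overfitting check, comparing accuracy on the training and test sets for the tuned tree from the search above:

```python
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, tree_model.predict(X_train))
test_acc = accuracy_score(y_test, tree_model.predict(X_test))

# Similar scores suggest the tree generalises; a much higher training
# score would be a sign of overfitting.
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```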
Evaluation on test data
Here we evaluate our decision tree model with metrics such as accuracy, recall, precision, F1-score, support and AUC score.
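The cells below compute these metrics; a compact sketch of the computation, again assuming the tuned model is named tree_model:

```python
from sklearn.metrics import classification_report, roc_auc_score

y_pred = tree_model.predict(X_test)

# Precision, recall, F1-score and support per class, plus overall accuracy.
print(classification_report(y_test, y_pred))

# AUC needs the predicted probability of the positive class.
y_proba = tree_model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, y_proba))
```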
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Random Forest
Evaluation on training data
Here we look for signs of overfitting or underfitting.
Run to view results
Run to view results
Evaluation on test data
Here we evaluate our random forest model with metrics such as accuracy, recall, precision, F1-score, support and AUC score.
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
a. Which metric is most suitable for the current problem (accuracy, F-measure)?
The most important metric for us is recall for the positive class: we do not want people with heart disease to go undiagnosed, because a missed diagnosis becomes costlier or even deadly. Other metrics such as the F-measure, precision and AUC score are still very important and should not be overlooked.
b. Compare the performance metrics for all types of models (e.g., decision tree and forest). Which model is the best one?
The two models are not far apart, and each could be used in different cases. Overall, however, the random forest is the better model: it scored better on almost every metric, and where it was worse, it was only slightly worse than the decision tree.
c. Combine (multiply) the predefined cost matrix with the values in the confusion matrix for each model. Which model is the best one?
Now we apply costs to TN, FP, FN and TP to see which model would cost less.
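A sketch of the comparison with purely illustrative cost values (a false negative penalised most heavily, as argued above); the notebook's actual cost matrix may differ:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Rows = actual class, columns = predicted class: [[TN, FP], [FN, TP]].
# Illustrative costs: correct predictions cost nothing, a false negative
# costs five times as much as a false positive.
cost_matrix = np.array([[0, 1],
                        [5, 0]])

for name, model in [("decision tree", tree_model), ("random forest", forest_model)]:
    cm = confusion_matrix(y_test, model.predict(X_test))
    # Element-wise product, then sum = total misclassification cost.
    total_cost = (cm * cost_matrix).sum()
    print(f"{name}: total cost = {total_cost}")
```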
Run to view results
Run to view results
Run to view results
When it comes to cost, the random forest model is slightly better.
Unsupervised
KMeans
Run to view results
Hierarchical
Run to view results
Explanation
Supervised models
Global explanation
Run to view results
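A sketch of the global explanation via the impurity-based feature importances that both tree models expose:

```python
import pandas as pd

# Importances sum to 1; a higher value means the feature contributed
# more to the splits across the forest.
importances = pd.Series(
    forest_model.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
print(importances.head())
```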
Local explanation
Run to view results
Explain how the decision tree model reached its conclusion (which branches of the tree/decision nodes were activated).
The decision process for this instance, as described by its attribute values, follows a specific path within the decision tree: the model moves right because ST_Slope equals 1, right again because ChestPainType equals 3, then left because FastingBS equals 0, and finally left again because Oldpeak equals 0.295455.
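A sketch of how such a path can be extracted with scikit-learn's decision_path, assuming the instance of interest is the first test row:

```python
sample = X_test.iloc[[0]]

# decision_path returns the nodes visited for each sample; for a single
# row, the nonzero columns are the activated decision nodes.
node_indicator = tree_model.decision_path(sample)
visited_nodes = node_indicator.indices

for node in visited_nodes:
    feature = tree_model.tree_.feature[node]
    threshold = tree_model.tree_.threshold[node]
    if feature >= 0:  # negative values mark leaf nodes
        name = X_train.columns[feature]
        went_left = sample.iloc[0, feature] <= threshold
        direction = "left" if went_left else "right"
        print(f"node {node}: {name} <= {threshold:.3f} -> go {direction}")
```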
Unsupervised model (clustering - KMeans)
Run to view results
Unsupervised model (clustering - Hierarchical)
Run to view results
Conclusion
1. Which machine learning result has the highest value and is most interesting?
In terms of the cost matrix, the Random Forest (RF) achieves the highest value, indicating its effectiveness in minimizing misclassifications with respect to the specified costs. For Precision, the Random Forest (RF) demonstrates the superior ability to correctly identify positive instances among those predicted as positive. The Random Forest also leads in terms of Recall, signifying its proficiency in capturing a higher proportion of actual positive instances. Furthermore, when considering the F1 score, a measure that balances precision and recall, the Random Forest outperforms other models. Lastly, the Area Under the Curve (AUC) Score, which assesses the classifier's ability to distinguish between classes, attains the highest value for the Random Forest, emphasizing its overall discriminative power.
2. What setting provided the best result?
DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=42)
RandomForestClassifier(class_weight='balanced_subsample', max_depth=110, max_features=2, max_samples=0.5, min_samples_leaf=2, min_samples_split=5, n_estimators=183)
3. Which attributes are the most important?
Decision Tree: ST_Slope, Oldpeak, ChestPainType
Random Forest: ST_Slope, Oldpeak, ExerciseAngina