Heart Attack Prediction Report
Team identification
Seminar day and time: Thursday 12:45
Team number: B
Team members: Matěj Krones, Matěj Krček, Michael Ay, Michal Klukas, Martin Košťálek
Introduction
In today's rapidly evolving healthcare landscape, machine learning plays an increasingly important role in revolutionising patient care and medical decision-making. As the core of our team project, we developed a machine learning model to address a critical problem in the healthcare industry: the early assessment of cardiovascular disease risk.

Cardiovascular disease (CVD) remains a leading cause of morbidity and mortality worldwide, making it a significant public health concern. According to the WHO, "Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year." CVD is a broad term covering a range of conditions that affect the heart and blood vessels, including coronary heart disease, stroke and heart failure. It has many risk factors, including age, gender, family history, lifestyle and underlying medical conditions. Early diagnosis and effective treatment are essential to reduce CVD's impact on individual patients and healthcare systems.

In this context, our project aims to create a tailored solution that lets healthcare providers, particularly doctors, harness the power of machine learning. Our ML model is trained on a comprehensive dataset of patient information, including age, gender, medical history and clinical measurements. The model analyses these data points to identify patterns and associations that predict CVD risk, enabling us to flag individuals at elevated risk of developing CVD even if they do not exhibit any apparent symptoms. The model can also provide business value: by predicting whether or not a patient is at risk of cardiovascular disease, it can reduce the cost of unnecessary or additional testing.

Ideal example
A doctor may use the ML model to assess the CVD risk of a 55-year-old male patient with a history of high blood pressure and high cholesterol. The model then returns a binary value of 0 or 1 indicating whether the patient is at risk of cardiovascular disease.

Chosen customization
Target attribute: HeartDisease
Instance of interest: Individual patient
Attribute of interest: Patient data (age, gender, medical history, clinical measurements…)
Subset of interest: High-risk patients
Cost matrix: The cost matrix should be designed to minimize both false negatives and false positives. False negatives can lead to missed diagnoses and delayed treatment, with potentially serious consequences for the patient's health, while false positives can lead to unnecessary investigations and treatments.
Target variable: Binary [1: heart disease, 0: normal]
Dataset: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
Dataset Explanation
ChestPainType: [TA, ATA, NAP, ASY]
Impact on heart disease: the type of chest pain indicates the nature of the discomfort the patient is experiencing; typical angina, for instance, is more closely associated with heart-related issues. Understanding the type of chest pain contributes to diagnosing the underlying cause of symptoms.
TA: Typical Angina - predictable chest pain related to exertion or stress, relieved by rest or medication; most directly associated with CVD.
ATA: Atypical Angina - chest discomfort not fitting the typical pattern; may include non-specific symptoms.
NAP: Non-Anginal Pain - chest discomfort unrelated to reduced blood flow to the heart, stemming from various causes.
ASY: Asymptomatic - absence of noticeable chest pain or discomfort, though other heart-related indicators may be present.

FastingBS: [1: if FastingBS > 120 mg/dl, 0: otherwise]
Impact on heart disease: elevated fasting blood sugar may indicate diabetes or prediabetes, and diabetes is a risk factor for heart disease. The 120 mg/dl threshold is a diagnostic marker for diabetes and prediabetes; keeping blood sugar below it is crucial for maintaining cardiovascular health.

RestingECG: [Normal: normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]

ExerciseAngina: [Y: yes, N: no]
Impact on heart disease: angina during exercise suggests that the heart is not receiving enough blood flow, which can be a symptom of coronary artery disease, a common cause of heart disease.

Oldpeak: [Numeric value measured in depression]
Impact on heart disease: Oldpeak measures the extent of ST depression induced by exercise relative to rest. Significant ST depression can indicate myocardial ischemia, i.e. a compromised blood supply to the heart. The ST segment is the portion of the electrocardiogram (ECG) waveform between ventricular depolarization and repolarization; abnormalities in this segment can indicate various heart conditions.

ST_Slope: [Up: upsloping, Flat: flat, Down: downsloping]
Impact on heart disease: the slope of the peak exercise ST segment provides additional information about the heart's response to exercise. Upsloping is a normal response and generally less concerning; a flat or downsloping ST segment is more concerning and may suggest myocardial ischemia or other heart-related issues.
Run to view results
Data exploration
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Result interpretation
1. Histograms and boxplots: these plots provide a basic overview of the distribution of the numerical variables but do not reveal detailed insights.
2. Histogram with hue: higher cholesterol appears to be associated with a higher frequency of heart disease, as suggested by the colour-coded histogram.
3. Violin plot: older individuals appear to have a higher likelihood of heart disease.
4. Scatterplots: the scatterplots show a slight correlation of age, cholesterol and RestingBP with a higher chance of heart disease.
5. Scatterplots with colour coding:
Age: higher age is associated with an increased likelihood of heart disease.
RestingBP: extremely high resting blood pressure often indicates a higher chance of heart disease.
MaxHR: higher MaxHR appears to be associated with a lower likelihood of heart disease.
Oldpeak: higher Oldpeak values are associated with an increased chance of heart disease.
Preprocessing for supervised machine learning
Run to view results
Our target attribute is already binary, so it does not need to be converted. A value of 1 means the patient has, or has had, heart disease; a value of 0 means they do not.
Run to view results
Split data
Here we split the data into training and test sets.
Run to view results
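A minimal sketch of this step, assuming the dataset has been loaded into a pandas DataFrame named df (the split ratio and random seed in the notebook may differ):

```python
from sklearn.model_selection import train_test_split

# Separate the features from the binary target attribute.
X = df.drop(columns="HeartDisease")
y = df["HeartDisease"]

# Hold out 25% of the data for testing; stratify so both sets keep
# the same proportion of positive and negative cases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```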
Missing values
We need to check whether there are any missing values in our dataset.
Run to view results
We have no missing values in this dataset.
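A sketch of the check, assuming the training features live in a DataFrame named X_train as above:

```python
# Count missing values per column; all zeros confirms a complete dataset.
print(X_train.isna().sum())
```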
Zero values
We need to check for values that do not make sense, such as Cholesterol = 0 and RestingBP = 0.
Run to view results
Run to view results
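One common way to handle such implausible zeros is to treat them as missing and impute them; a sketch under that assumption, using the training median (the notebook's actual strategy may differ):

```python
import numpy as np

train_medians = {}
for col in ["Cholesterol", "RestingBP"]:
    # A value of 0 is physiologically implausible, so treat it as missing...
    X_train[col] = X_train[col].replace(0, np.nan)
    # ...and impute it with the median of the training data only,
    # to avoid leaking information from the test set.
    train_medians[col] = X_train[col].median()
    X_train[col] = X_train[col].fillna(train_medians[col])
```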
Numerical variables
Tree-based algorithms do not require feature rescaling, so we did not transform or modify our numerical variables.
Ordinal variables
Run to view results
Here we encode our ordinal variables as numbers so that they can be used by the algorithms. We do this with scikit-learn's OrdinalEncoder.
Run to view results
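A sketch of the encoding, assuming ST_Slope is the ordinal variable and the category order Down < Flat < Up (the notebook's variable list and ordering may differ):

```python
from sklearn.preprocessing import OrdinalEncoder

# The explicit category list fixes the mapping: Down -> 0, Flat -> 1, Up -> 2.
ordinal_encoder = OrdinalEncoder(categories=[["Down", "Flat", "Up"]])

# Fit on the training data; the fitted encoder is reused later on the test set.
X_train[["ST_Slope"]] = ordinal_encoder.fit_transform(X_train[["ST_Slope"]])
```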
Nominal variables
Run to view results
Here we encode the nominal values of our dataset as numbers so that they work with the algorithms from scikit-learn.
Run to view results
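A sketch using one-hot encoding via pandas, assuming Sex, ChestPainType, RestingECG and ExerciseAngina are the nominal columns (the notebook may instead use scikit-learn's OneHotEncoder or encode some of these differently):

```python
import pandas as pd

nominal_cols = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina"]

# One-hot encode each nominal column; drop_first avoids a redundant
# dummy column per variable.
X_train = pd.get_dummies(X_train, columns=nominal_cols, drop_first=True)
```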
Feature selection
Here we identify which features have low variance and are therefore candidates to drop from our dataset.
Run to view results
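A sketch of the low-variance check with scikit-learn's VarianceThreshold; the cutoff below is illustrative:

```python
from sklearn.feature_selection import VarianceThreshold

# Flag features whose variance falls below an (illustrative) threshold.
selector = VarianceThreshold(threshold=0.05)
selector.fit(X_train)

# get_support() marks the retained features; the inverse marks candidates to drop.
low_variance = X_train.columns[~selector.get_support()]
print("Low-variance candidates to drop:", list(low_variance))
```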
A heat map can help us visualise the features.
Run to view results
Here we found that RestingBP could be dropped, but after testing this we found that it worsened our models, so we kept the feature as it was.
Run to view results
Run to view results
Apply same steps on test data
Here we apply all the preprocessing steps used on the training set to the test set as well.
Run to view results
Run to view results
Run to view results
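A sketch of these reapplied steps, reusing the objects fitted on the training set (names follow the earlier sketches) so the test set receives exactly the same transformations:

```python
import numpy as np
import pandas as pd

# Impute the test set's implausible zeros with the *training* medians.
for col in ["Cholesterol", "RestingBP"]:
    X_test[col] = X_test[col].replace(0, np.nan).fillna(train_medians[col])

# Apply the ordinal encoder fitted on the training data (transform only).
X_test[["ST_Slope"]] = ordinal_encoder.transform(X_test[["ST_Slope"]])

# One-hot encode the nominal columns and align them with the training columns,
# filling any category unseen in the test set with 0.
X_test = pd.get_dummies(X_test, columns=nominal_cols, drop_first=True)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```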
Preprocessing for unsupervised model (clustering)
Run to view results
Run to view results
Run to view results
Modeling
Supervised model that predicts the target attribute
Decision Tree
Here we search for the best hyperparameters for our decision tree model.
Run to view results
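A sketch of the search with a cross-validated grid over a few common decision tree hyperparameters; the actual grid in the notebook may be larger:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6, None],
    "min_samples_leaf": [1, 2, 5],
}

# 5-fold cross-validated grid search, scored on recall because missed
# positives (undiagnosed patients) are the costliest mistake here.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
tree_model = search.best_estimator_
```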
Quick evaluation of results
Run to view results
Plotting graph of our tree
Run to view results
Random Forest
Finding the best hyperparameters for the random forest.
Run to view results
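A sketch using RandomizedSearchCV (the non-round n_estimators=183 in the final model suggests a randomized rather than exhaustive search); the distributions below are illustrative:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(5, 120),
    "max_features": randint(1, 8),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
    "max_samples": [0.5, 0.75, None],
    "class_weight": ["balanced", "balanced_subsample", None],
}

# Sample 100 random configurations, each evaluated with 5-fold CV.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,
    scoring="recall",
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

forest_model = search.best_estimator_
```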
Quick evaluation of results
Run to view results
Run to view results
Run to view results
Run to view results
Clustering model for chosen subset of data - KMeans
Run to view results
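A sketch of the clustering, assuming the features prepared in the clustering preprocessing step are standardised here as X_scaled and an illustrative choice of two clusters:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# KMeans is distance-based, so the features are standardised first.
X_scaled = StandardScaler().fit_transform(X_train)

# n_init=10 restarts the algorithm from 10 random centroid seeds and
# keeps the best solution.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)
```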
Clustering model for chosen subset of data - hierarchical
Run to view results
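A sketch of the hierarchical model on the same scaled data, assuming Ward linkage (the notebook's linkage choice may differ):

```python
from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges, at each step, the pair of clusters whose merge
# least increases the total within-cluster variance.
hier = AgglomerativeClustering(n_clusters=2, linkage="ward")
hier_labels = hier.fit_predict(X_scaled)
```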
Evaluation
Supervised
Decision Tree
Evaluation on training data
Here we check that our model is not overfitted.
Run to view results
Run to view results
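A sketch of the overfitting check, comparing accuracy on the training and test sets for the tuned tree from the search above:

```python
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, tree_model.predict(X_train))
test_acc = accuracy_score(y_test, tree_model.predict(X_test))

# Similar scores suggest the tree generalises; a much higher training
# score would be a sign of overfitting.
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```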
Evaluation on test data
Here we evaluate our decision tree model with metrics such as accuracy, recall, precision, F1-score, support and AUC score.
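The cells below compute these metrics; a compact sketch of the computation, again assuming the tuned model is named tree_model:

```python
from sklearn.metrics import classification_report, roc_auc_score

y_pred = tree_model.predict(X_test)

# Precision, recall, F1-score and support per class, plus overall accuracy.
print(classification_report(y_test, y_pred))

# AUC needs the predicted probability of the positive class.
y_proba = tree_model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, y_proba))
```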
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Random Forest
Evaluation on training data
Here we look for signs of overfitting or underfitting.
Run to view results
Run to view results
Evaluation on test data
Here we evaluate our random forest model with metrics such as accuracy, recall, precision, F1-score, support and AUC score.
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
a. Which metric is most suitable for the current problem (accuracy, F-measure)?
The most important metric for us is recall for the positive class: we do not want people with heart disease to go undiagnosed, because a missed diagnosis becomes costlier or even deadly. Other metrics such as the F-measure, precision and AUC score are still very important and should not be overlooked.
b. Compare the performance metrics for all types of models (e.g., decision tree and forest). Which model is the best one?
The two models are not far apart, and each could be used in different cases. Overall, however, the random forest is the better model: it scored better on almost every metric, and where it was worse, it was only slightly worse than the decision tree.
c. Combine (multiply) the predefined cost matrix with the values in the confusion matrix for each model. Which model is the best one?
Now we apply costs to TN, FP, FN and TP to see which model would cost less.
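A sketch of the comparison with purely illustrative cost values (a false negative penalised most heavily, as argued above); the notebook's actual cost matrix may differ:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Rows = actual class, columns = predicted class: [[TN, FP], [FN, TP]].
# Illustrative costs: correct predictions cost nothing, a false negative
# costs five times as much as a false positive.
cost_matrix = np.array([[0, 1],
                        [5, 0]])

for name, model in [("decision tree", tree_model), ("random forest", forest_model)]:
    cm = confusion_matrix(y_test, model.predict(X_test))
    # Element-wise product, then sum = total misclassification cost.
    total_cost = (cm * cost_matrix).sum()
    print(f"{name}: total cost = {total_cost}")
```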
Run to view results
Run to view results
Run to view results
When it comes to cost, the random forest model is slightly better.
Unsupervised
KMeans
Run to view results
Hierarchical
Run to view results
Explanation
Supervised models
Global explanation
Run to view results
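A sketch of the global explanation via the impurity-based feature importances that both tree models expose:

```python
import pandas as pd

# Importances sum to 1; a higher value means the feature contributed
# more to the splits across the forest.
importances = pd.Series(
    forest_model.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
print(importances.head())
```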
Local explanation
Run to view results
Explain how the decision tree model reached its conclusion (which branches of the tree/decision nodes were activated).
The decision process for this instance, as described by its attribute values, follows a specific path within the decision tree: the model moves right because ST_Slope equals 1, right again because ChestPainType equals 3, then left because FastingBS equals 0, and finally left again because Oldpeak equals 0.295455.
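A sketch of how such a path can be extracted with scikit-learn's decision_path, assuming the instance of interest is the first test row:

```python
sample = X_test.iloc[[0]]

# decision_path returns the nodes visited for each sample; for a single
# row, the nonzero columns are the activated decision nodes.
node_indicator = tree_model.decision_path(sample)
visited_nodes = node_indicator.indices

for node in visited_nodes:
    feature = tree_model.tree_.feature[node]
    threshold = tree_model.tree_.threshold[node]
    if feature >= 0:  # negative values mark leaf nodes
        name = X_train.columns[feature]
        went_left = sample.iloc[0, feature] <= threshold
        direction = "left" if went_left else "right"
        print(f"node {node}: {name} <= {threshold:.3f} -> go {direction}")
```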
Unsupervised model (clustering - KMeans)
Run to view results
Unsupervised model (clustering - Hierarchical)
Run to view results
Conclusion
1. Which machine learning result has the highest value and is most interesting?
In terms of the cost matrix, the Random Forest (RF) achieves the highest value, indicating its effectiveness in minimizing misclassifications with respect to the specified costs. For Precision, the Random Forest (RF) demonstrates the superior ability to correctly identify positive instances among those predicted as positive. The Random Forest also leads in terms of Recall, signifying its proficiency in capturing a higher proportion of actual positive instances. Furthermore, when considering the F1 score, a measure that balances precision and recall, the Random Forest outperforms other models. Lastly, the Area Under the Curve (AUC) Score, which assesses the classifier's ability to distinguish between classes, attains the highest value for the Random Forest, emphasizing its overall discriminative power.
2. What setting provided the best result?
DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=42)
RandomForestClassifier(class_weight='balanced_subsample', max_depth=110, max_features=2, max_samples=0.5, min_samples_leaf=2, min_samples_split=5, n_estimators=183)
3. Which attributes are the most important?
Decision Tree: ST_Slope, Oldpeak, ChestPainType
Random Forest: ST_Slope, Oldpeak, ExerciseAngina