From Data to Decisions: Transforming Cardiovascular Care through Predictive Analytics

Import Libraries and Data

# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             ConfusionMatrixDisplay, roc_auc_score, roc_curve)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.inspection import permutation_importance
%matplotlib inline

# Read the CSV into a pandas DataFrame
data = pd.read_csv('cardio_train.csv', sep=';')

# Check the first rows of the data
data.head()

# Check the data types and completeness of the columns
data.info()

# Show summary statistics of the data (all columns except the first, id)
data.iloc[:, 1:].describe()

Data cleaning/Preparation

# Convert age from days to years for easier interpretation and analysis
data['age'] = data['age'] / 365.25

# Display the age column
data.age

Rename the columns for a better understanding

# Rename all abbreviated columns in a single call
data = data.rename(columns={
    'ap_hi': 'systolic_b_pressure',
    'ap_lo': 'diastolic_b_pressure',
    'gluc': 'glucose',
    'alco': 'alcohol',
    'active': 'physically_active',
    'cardio': 'cardio_disease',
})

Exploring Data

Display the frequency distribution of the data

data.iloc[:,1:].hist(figsize=(15,15));

# Display the columns of the data
data.columns

Display the distribution of the continuous data columns

cols = ['age', 'height', 'weight', 'systolic_b_pressure', 'diastolic_b_pressure', 'cholesterol']
ncols = 3
nrows = 2
fig, ax = plt.subplots(nrows, ncols, figsize=(12, 12))

# Draw one boxplot per column, filling the grid row by row
cont = 0
for i in range(nrows):
    for j in range(ncols):
        sns.boxplot(data[cols[cont]], ax=ax[i][j], color='blue')
        ax[i][j].set_title(cols[cont])
        cont = cont + 1

Display the Correlation matrix of the columns

# Create a correlation matrix
corr_matrix = data[['age', 'gender', 'height', 'weight', 'systolic_b_pressure',
                    'diastolic_b_pressure', 'cholesterol', 'glucose', 'smoke',
                    'alcohol', 'physically_active', 'cardio_disease']].corr()

# Plot the heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.title('Correlation Matrix of Dataset');

Prediction

Split the data with scaled features

# Split the dataset into features and target variable
X = data.drop(columns=['id', 'cardio_disease'])  # excluding 'id' as it is not a feature
y = data['cardio_disease']

# Keep the column names for later reference
feature_names = X.columns.tolist()

# Split into training and testing sets BEFORE scaling, so that the scaler
# never sees the test rows (fitting it on the full data leaks test statistics)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardization: fit on the training set only, then apply to both sets
scaler_std = StandardScaler()

# Convert standardized data back to DataFrames for interpretability
X_train = pd.DataFrame(scaler_std.fit_transform(X_train_raw), columns=feature_names, index=X_train_raw.index)
X_test = pd.DataFrame(scaler_std.transform(X_test_raw), columns=feature_names, index=X_test_raw.index)

# Instantiate and train the Random Forest model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Display the train and test scores
print('Train Accuracy: ', rf.score(X_train, y_train))
print('Test Accuracy: ', rf.score(X_test, y_test))

# Instantiate and train the Gradient Boosting model
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)

# Display the train and test scores
print('Train Accuracy: ', gbc.score(X_train, y_train))
print('Test Accuracy: ', gbc.score(X_test, y_test))

# Instantiate and train the Logistic Regression model
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# Display the train and test scores
print('Train Accuracy: ', lr.score(X_train, y_train))
print('Test Accuracy: ', lr.score(X_test, y_test))

Create an Evaluation Function and split the features into categories

# Define the feature categories
objective_features = ['age', 'height', 'weight', 'gender']
examination_features = ['systolic_b_pressure', 'diastolic_b_pressure', 'cholesterol', 'glucose']
subjective_features = ['smoke', 'alcohol', 'physically_active']

# Function to evaluate a model
def evaluate_model(features, model):
    """
    Evaluate the performance of a machine learning model on a specified set of features.

    Parameters:
        features (list): A list of column names from the dataset to be used as features for the model.
        model (model object): The machine learning model to be evaluated, instantiated outside this function.

    The function trains the model on a subset of the dataset defined by the specified features
    and then evaluates its performance on a separate test set. It relies on the global
    X_train, X_test, y_train and y_test created in the split step.

    Returns:
        dict: A dictionary containing the following key-value pairs representing the model's
        performance metrics:
        - 'accuracy': The accuracy of the model on the test set.
        - 'precision': The precision of the model on the test set.
        - 'recall': The recall of the model on the test set.
        - 'f1': The F1 score of the model on the test set.
        - 'auc': The area under the ROC curve for the model on the test set.
    """
    model.fit(X_train[features], y_train)
    predictions = model.predict(X_test[features])
    # Probability scores of the positive class
    probabilities = model.predict_proba(X_test[features])[:, 1]
    return {
        'accuracy': accuracy_score(y_test, predictions),
        'precision': precision_score(y_test, predictions),
        'recall': recall_score(y_test, predictions),
        'f1': f1_score(y_test, predictions),
        'auc': roc_auc_score(y_test, probabilities),
    }

# Function to print the scores of a model for each feature set
def print_score(model):
    """
    This function evaluates the given model on different sets of features: objective,
    examination, subjective, and a combination of all. It then prints out the performance
    metrics for each feature set.

    Parameters:
        model (model object): The machine learning model to be evaluated. It should already be
        instantiated and capable of fitting data and making predictions.

    The function calls `evaluate_model` for each set of features and prints the results,
    which include accuracy, precision, recall, F1 score, and AUC metrics.

    Returns:
        None: This function does not return anything but directly prints the evaluation results.
    """
    # Evaluate the model on the different feature sets
    objective_results = evaluate_model(objective_features, model)
    examination_results = evaluate_model(examination_features, model)
    subjective_results = evaluate_model(subjective_features, model)
    combined_results = evaluate_model(objective_features + examination_features + subjective_features, model)

    # Display the scores, one labelled block per feature set
    print(f"Objective_results:\n{objective_results}\n\n"
          f"Subjective_results:\n{subjective_results}\n\n"
          f"Examination_results:\n{examination_results}\n\n"
          f"Combined_results:\n{combined_results}")

# Hyper-parameter tuning with GridSearchCV: parameter grid for Random Forest
param_grid = {'n_estimators': [100, 150, 200], 'max_depth': [5, 10, 20]}

# Choose the best parameters for Random Forest with 5-fold cross-validation
grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print('Best Parameters: ', best_params)

# Define the parameter grid for Gradient Boosting
param_grid = {
    'n_estimators': [100, 150, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
}

# Choose the best parameters for Gradient Boosting with 5-fold cross-validation
grid_search = GridSearchCV(gbc, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print('Best Parameters: ', best_params)

# Instantiate the tuned Random Forest model
rf = RandomForestClassifier(random_state=0, max_depth=10, n_estimators=150)

# Display results
print_score(rf)

# Instantiate the tuned Gradient Boosting model
gbc = GradientBoostingClassifier(learning_rate=0.1, max_depth=3, n_estimators=200)

# Display results
print_score(gbc)

# Train the Random Forest model on all features
model = rf
model.fit(X_train, y_train)

# Perform permutation importance to assess the importance of each feature
results = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0, n_jobs=-1)

# Organize the results in a DataFrame
importance_df = pd.DataFrame({'feature': X_train.columns, 'importance_mean': results.importances_mean})

# Sort the features by their importance
importance_df = importance_df.sort_values(by='importance_mean', ascending=False)

# Plot the feature importances
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance_mean'], color='skyblue')
plt.xlabel('Mean Importance')
plt.title('Feature Importance')
plt.gca().invert_yaxis()
plt.show()

# Train the Gradient Boosting model on all features
model = gbc
model.fit(X_train, y_train)

# Perform permutation importance to assess the importance of each feature
results = permutation_importance(gbc, X_test, y_test, n_repeats=10, random_state=0, n_jobs=-1)

# Organize the results in a DataFrame
importance_df = pd.DataFrame({'feature': X_train.columns, 'importance_mean': results.importances_mean})

# Sort the features by their importance
importance_df = importance_df.sort_values(by='importance_mean', ascending=False)

# Plot the feature importances
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance_mean'], color='skyblue')
plt.xlabel('Mean Importance')
plt.title('Feature Importance')
plt.gca().invert_yaxis()
plt.show()

# Function to get the probability scores
def probability_score(features, X_train, X_test, y_train, y_test, model):
    model.fit(X_train[features], y_train)
    probabilities = model.predict_proba(X_test[features])[:, 1]  # Get probabilities for the positive class
    fpr, tpr, thresholds = roc_curve(y_test, probabilities)  # Calculate ROC curve
    auc_score = roc_auc_score(y_test, probabilities)  # Calculate AUC score
    return fpr, tpr, auc_score

combined_features = objective_features + examination_features + subjective_features

# Evaluate the tuned Gradient Boosting model on the combined feature set
fpr, tpr, auc_score = probability_score(combined_features, X_train, X_test, y_train, y_test, gbc)

# Plot the ROC curve
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', label=f'ROC Curve (area = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')  # chance-level diagonal
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

After testing Random Forest, Gradient Boosting Classifier (GBC), and Logistic Regression (with and without cross-validation), and running a grid search to tune the parameters of both Random Forest and GBC, the chosen model was GBC. Cardiovascular conditions often involve nonlinear relationships and interactions between features, so the comparison favored models that can capture such structure.

Among the initial standalone models, Random Forest achieved a high training accuracy but a lower test accuracy, indicating overfitting. The Gradient Boosting Classifier was more balanced, with closer training and test accuracies. Logistic Regression served as a baseline and showed the need for more flexible methods to capture the patterns in cardiovascular risk factors.

Fine-tuning the Gradient Boosting Classifier then aimed to keep its predictive power while mitigating overfitting and improving generalization to unseen data. The evaluation metrics, AUC and F1 Score, were chosen to assess how well each model classifies individuals by cardiovascular disease risk: F1 balances precision and recall, while AUC summarizes how well the model ranks positive cases above negative ones across all decision thresholds.
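The overfitting check described above (comparing training accuracy, test accuracy, and a cross-validated score) can be sketched as follows. This is a minimal illustration, not the notebook's actual run: it uses synthetic data from `make_classification` in place of the cardiovascular dataset, and the helper name `fit_and_compare` and the reduced `n_estimators=50` are assumptions chosen to keep the example small.

```python
# Illustrative only: synthetic data stands in for the cardiovascular dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)


def fit_and_compare(model):
    """Return train accuracy, test accuracy and mean 5-fold CV accuracy."""
    model.fit(X_tr, y_tr)
    return {
        "train": model.score(X_tr, y_tr),
        "test": model.score(X_te, y_te),
        # cross_val_score clones the model, so the fitted one is untouched
        "cv_mean": cross_val_score(model, X_tr, y_tr, cv=5).mean(),
    }


summary = {
    "random_forest": fit_and_compare(
        RandomForestClassifier(n_estimators=50, random_state=0)
    ),
    "gradient_boosting": fit_and_compare(
        GradientBoostingClassifier(n_estimators=50, random_state=0)
    ),
}

for name, scores in summary.items():
    gap = scores["train"] - scores["test"]  # a large gap suggests overfitting
    print(f"{name}: train={scores['train']:.3f} test={scores['test']:.3f} "
          f"cv={scores['cv_mean']:.3f} gap={gap:.3f}")
```

A large train/test gap together with a cross-validated score close to the test score is the pattern the text attributes to the untuned Random Forest; the numbers printed here come from the synthetic data and say nothing about the real dataset.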

Benchmark for Success:

AUC: The goal is ≥ 0.75, reflecting the model's ability to distinguish individuals with cardiovascular disease from those without.
F1 Score: The target is ≥ 0.70, indicating an effective balance between precision and recall.

The fine-tuned model is well suited to predicting cardiovascular disease and to handling the nonlinear relationships and feature interactions typical of medical data. Taken together, the two metrics give a balanced view of model performance in a diagnostic setting.

The results from the Gradient Boosting Classifier support this choice. For the combined feature set, the model achieved an AUC of 0.8026 and an F1 Score of 0.7222, both above the benchmarks. Examination features alone gave an AUC of 0.7746 and an F1 Score of 0.7071, which also meet both targets. In contrast, objective and subjective features alone performed worse, with AUCs of 0.6716 and 0.5200 respectively; the latter is barely above the 0.5 expected from random guessing. The model therefore performs best when it integrates all three feature categories, and the fine-tuned Gradient Boosting Classifier meets the success criteria defined above, making it a reasonable basis for cardiovascular disease prediction and management.
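The comparison of the reported scores against the benchmarks can be written out as a small check. This is a sketch: the thresholds and scores are copied from the text, the function name `check_benchmark` is hypothetical, and F1 is left as `None` for the objective and subjective feature sets because the text does not report it.

```python
# Thresholds from the "Benchmark for Success" section.
AUC_TARGET = 0.75
F1_TARGET = 0.70

# Scores reported in the text for the tuned Gradient Boosting Classifier.
# F1 is None where the text does not report a value.
reported = {
    "combined": {"auc": 0.8026, "f1": 0.7222},
    "examination": {"auc": 0.7746, "f1": 0.7071},
    "objective": {"auc": 0.6716, "f1": None},
    "subjective": {"auc": 0.5200, "f1": None},
}


def check_benchmark(scores, auc_target=AUC_TARGET, f1_target=F1_TARGET):
    """Return 'pass', 'fail' or 'unknown' for one feature set."""
    if scores["auc"] < auc_target:
        return "fail"  # missing the AUC target is enough to fail
    if scores["f1"] is None:
        return "unknown"  # AUC is fine but F1 was not reported
    return "pass" if scores["f1"] >= f1_target else "fail"


verdicts = {name: check_benchmark(s) for name, s in reported.items()}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

With the notebook's own output, the dictionaries returned by `evaluate_model` could be passed to the same check, since they already contain `'auc'` and `'f1'` keys.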

Practical Significance

Clinical Impact: The model's practical significance will be evaluated by its ability to enhance early detection of cardiovascular diseases, thereby facilitating timely medical interventions. A significant reduction in late-stage diagnosis rates of CVD among the screened population would demonstrate the model's practical value. For example, if the model is integrated into routine health check-ups, its effectiveness can be measured by the increased rate of early-stage CVD detection and the corresponding improvement in patient management and treatment outcomes.

Healthcare Cost Reduction: Another crucial aspect of practical significance is the model's impact on healthcare costs. By preventing advanced stages of cardiovascular diseases through early intervention, the model should lead to a noticeable decrease in the financial burden associated with CVD treatments, such as hospital admissions, surgeries, and long-term care. This cost reduction can be quantified by comparing the healthcare expenses incurred before and after implementing the predictive model in clinical practice.
