Diabetes Prediction using Logistic Regression

Import libraries

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline

Read the data. The data was obtained from Kaggle.

https://www.kaggle.com/datasets/vikasukani/diabetes-data-set

All people in the dataset are women. In the Outcome column, 1 means the woman was diagnosed with diabetes and 0 means she was not.

Explore

df = pd.read_csv("/work/diabetes-dataset.csv")

df.head()

df.info()

There are no null values. The Outcome column indicates whether a person has diabetes: 1 means the person has diabetes and 0 means the person does not.

Because our model is going to be linear, we need to check for multicollinearity. To do that, we are going to analyze the correlation between the features.

correlation = df.drop(columns="Outcome").corr()
sns.heatmap(correlation);

There is some correlation between Pregnancies and Age, but it is not strong.
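A heatmap can be hard to read precisely, so it may help to list the feature pairs ranked by absolute correlation. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its values are invented for illustration); with the real data, `correlation` from the cell above could take the place of `corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the feature columns of `df`.
rng = np.random.default_rng(0)
age = rng.integers(21, 70, size=200)
toy = pd.DataFrame({
    "Age": age,
    # Pregnancies loosely tied to age, plus some noise, never negative.
    "Pregnancies": np.clip((age - 20) // 8 + rng.integers(-1, 3, size=200), 0, None),
    "Glucose": rng.normal(120, 30, size=200),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```

The first rows of `pairs` are the candidates to inspect for multicollinearity.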

Now we can look at the relationship between Glucose and Outcome with a box plot.

# Create boxplot
sns.boxplot(x="Outcome", y="Glucose", data=df)
# Label axes
plt.xlabel("Has Diabetes")
plt.ylabel("Glucose")
plt.title("Distribution of Glucose by Outcome");

We can see that people with diabetes tend to have higher glucose levels than people without diabetes.

Now it is necessary to see the balance between the two classes.

df["Outcome"].value_counts(normalize=True).plot(
    kind="bar",
    xlabel="Class",
    ylabel="Relative Frequency",
    title="Class Balance",
);

We can work with this balance. Now we are going to store the proportions of the majority class and the minority class.

majority_class_prop, minority_class_prop = df["Outcome"].value_counts(normalize=True)
print(majority_class_prop, minority_class_prop)

We are going to make a pivot table to see whether women with more pregnancies are more likely to have diabetes.

# aggfunc="mean" is equivalent to np.mean and avoids a warning in recent pandas
pregnancies_pivot = pd.pivot_table(
    df, index="Pregnancies", values="Outcome", aggfunc="mean"
).sort_values(by="Outcome")
pregnancies_pivot

We can see that women with more than 7 pregnancies are more likely to have diabetes. We are going to plot a bar chart to analyze this information. In this bar chart, we are going to mark the majority and minority class proportions.

# Plot bar chart of `pregnancies_pivot`
pregnancies_pivot.plot(kind="barh", legend=None)
plt.axvline(
    majority_class_prop, linestyle="--", color="red", label="majority class"
)
plt.axvline(
    minority_class_prop, linestyle="--", color="green", label="minority class"
)
plt.legend(loc="lower right");

We can see that women with 14, 15 and 17 pregnancies have a much higher rate of diabetes diagnoses. In contrast, women with 1 or 2 pregnancies have a lower rate.
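A mean per group says nothing about how many women are in each group, and a rate computed from very few rows can be misleading. One way to check, sketched here on hypothetical toy data (the `toy` frame is invented; only the column names match `df`), is to report the group size next to the rate.

```python
import pandas as pd

# Hypothetical toy data: one row per woman, same column names as `df`.
toy = pd.DataFrame({
    "Pregnancies": [0, 0, 0, 0, 1, 1, 1, 2, 2, 14],
    "Outcome":     [0, 1, 0, 0, 0, 0, 1, 0, 1, 1],
})

# Diabetes rate and number of women for each pregnancy count.
summary = (
    toy.groupby("Pregnancies")["Outcome"]
    .agg(rate="mean", n="count")
    .sort_values("rate")
)
print(summary)
```

In this toy example the group with 14 pregnancies has a rate of 1.0 but contains a single woman, so the rate alone would overstate what the data shows.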

Split

We are going to split the data frame into X and y, and then use train_test_split to obtain X_train, X_test, y_train and y_test.

target = "Outcome"
X = df.drop(columns=target)
y = df[target]

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 )

Model

Baseline

We calculate the baseline accuracy score for our model, that is, the accuracy we would get by always predicting the majority class.

acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
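To see why the largest class proportion works as a baseline, the sketch below builds a naive predictor that ignores the features and always returns the most common class, then compares its accuracy with the `value_counts` shortcut. The labels in `y_toy` are hypothetical and stand in for `y_train`.

```python
import pandas as pd

# Hypothetical labels standing in for `y_train`: seven 0s and three 1s.
y_toy = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# A "model" that ignores the features and always predicts the most common class.
majority_class = y_toy.mode()[0]
y_pred_naive = pd.Series(majority_class, index=y_toy.index)

# Accuracy of the naive predictor vs. the proportion of the majority class.
acc_naive = (y_pred_naive == y_toy).mean()
acc_from_counts = y_toy.value_counts(normalize=True).max()
print(acc_naive, acc_from_counts)
```

Both numbers are the same, so a trained model is only useful if it beats this value.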

Iterate

We build the model and fit it to the training data.

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

Evaluate

Now we evaluate the model. First, we calculate the training accuracy, then the test accuracy.

acc_train = accuracy_score(y_train, model.predict(X_train))
acc_test = model.score(X_test, y_test)
print("Training Accuracy:", round(acc_train, 2))
print("Test Accuracy:", round(acc_test, 2))

We can see that both are higher than the baseline accuracy.

Communicate

We are going to print the predicted probabilities for the first five rows of X_train.

y_train_pred_proba = model.predict_proba(X_train)
print(y_train_pred_proba[:5])
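Each row returned by `predict_proba` holds one probability per class, in the order given by the model's `classes_` attribute. The sketch below uses hypothetical toy data (not the diabetes dataset) and default `LogisticRegression` settings to show that, for a binary problem, the class-1 column can be rebuilt by applying the sigmoid function to the linear score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, not the diabetes dataset.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
noise = rng.normal(scale=0.5, size=100)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(int)

toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
proba = toy_model.predict_proba(X_toy)

# Column order follows toy_model.classes_, here [0, 1].
print(toy_model.classes_)

# Rebuild the class-1 probability by hand: sigmoid of the linear score.
z = X_toy @ toy_model.coef_[0] + toy_model.intercept_[0]
p1_manual = 1 / (1 + np.exp(-z))

print(np.allclose(proba[:, 1], p1_manual))
print(np.allclose(proba.sum(axis=1), 1.0))
```

The second column is therefore the estimated probability of diabetes, and the first column is its complement.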

We obtain the features and coefficients of our model.

features = model.feature_names_in_
features

importances = model.coef_[0]
importances

Finally, we calculate the odds ratios and we plot them.

odds_ratios = pd.Series(np.exp(importances), index=features).sort_values()
odds_ratios

odds_ratios.plot(kind="barh")
plt.xlabel("Odds Ratios");
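An odds ratio is the exponential of a coefficient, and it tells us by what factor the odds of the positive class are multiplied when the feature increases by one unit, with the other features held fixed. The following worked example uses a hypothetical intercept and coefficient for a single feature (the values are invented, not taken from the fitted model) to verify that relationship numerically.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability."""
    return 1 / (1 + math.exp(-z))

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

# Hypothetical intercept and coefficient for a single feature.
intercept, beta = -2.0, 0.7

p_before = sigmoid(intercept + beta * 3)  # feature value 3
p_after = sigmoid(intercept + beta * 4)   # feature value 4, one unit more

# The ratio of the two odds equals exp(beta), whatever the starting value.
ratio = odds(p_after) / odds(p_before)
print(round(ratio, 6), round(math.exp(beta), 6))
```

An odds ratio above 1 means the feature pushes predictions towards diabetes, and a value below 1 means it pushes them away.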

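We can see that Diabetes Pedigree Function has the largest odds ratio, so a one-unit increase in it changes the predicted odds of diabetes more than a one-unit increase in any other feature.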
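Odds ratios are expressed per unit of each feature, so features measured on different scales are not directly comparable. The model above was fit on unscaled features. A possible refinement, sketched here on hypothetical toy data, is to standardize the features with `StandardScaler` inside a pipeline built with `make_pipeline`, so that each odds ratio refers to a change of one standard deviation. The feature names and values are invented for illustration, and this is not the approach used in the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two features on very different scales.
rng = np.random.default_rng(0)
n = 500
X_toy = pd.DataFrame({
    "small_scale": rng.normal(0.5, 0.3, size=n),
    "large_scale": rng.normal(120, 30, size=n),
})
# Both features contribute equally once measured in standard deviations.
score = (X_toy["small_scale"] - 0.5) / 0.3 + (X_toy["large_scale"] - 120) / 30
y_toy = (score + rng.normal(size=n) > 0).astype(int)

# Same model, fit on raw features and on standardized features.
raw = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
scaled = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000)
).fit(X_toy, y_toy)

raw_or = pd.Series(np.exp(raw.coef_[0]), index=X_toy.columns)
scaled_or = pd.Series(
    np.exp(scaled.named_steps["logisticregression"].coef_[0]),
    index=X_toy.columns,
)
print(raw_or)
print(scaled_or)
```

In the toy data the raw odds ratios make `small_scale` look far more influential, while the standardized odds ratios are similar, which matches how the data was generated.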