Diabetes Prediction using Logistic Regression
Import libraries
Read the data. The data was obtained from Kaggle.
https://www.kaggle.com/datasets/vikasukani/diabetes-data-set
All people in the dataset are women. In the Outcome column, 1 means the woman was diagnosed with diabetes and 0 means she was not.
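A minimal sketch of the loading step, assuming the file is a plain CSV with the nine columns listed in the summary further down. The rows here are illustrative values typed inline so the snippet runs without the download; the real file path is not shown in this notebook and is left as a placeholder in the comment.

```python
import io

import pandas as pd

# Illustrative rows only: the column names match the dataset,
# the values are placeholders and not a faithful copy of the file.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,138,62,35,0,33.6,0.127,47,1
0,84,82,31,125,38.2,0.233,23,0
0,145,0,0,0,44.2,0.630,31,1
"""

# With the downloaded file this would be pd.read_csv("<path to the CSV>").
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                  # (rows, columns)
print(df.isnull().sum().sum())   # total number of missing values
```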
Explore
Pregnancies Glucose
0 2 138
1 0 84
2 0 145
3 0 135
4 1 139
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 2000 non-null int64
1 Glucose 2000 non-null int64
2 BloodPressure 2000 non-null int64
3 SkinThickness 2000 non-null int64
4 Insulin 2000 non-null int64
5 BMI 2000 non-null float64
6 DiabetesPedigreeFunction 2000 non-null float64
7 Age 2000 non-null int64
8 Outcome 2000 non-null int64
dtypes: float64(2), int64(7)
memory usage: 140.8 KB
There are no null values. The Outcome column records whether a person has diabetes: 1 for a person who has diabetes and 0 for a person who does not.
Because our model is linear, we need to check for multicollinearity, so we analyze the correlation between the features.
There is some correlation between Pregnancies and Age, but it is not strong.
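The correlation check described above could be sketched as follows. The DataFrame is a synthetic stand-in (random values with a built-in link between Age and Pregnancies), so the numbers it prints are not the notebook's results. Dropping the target before calling `corr()` is an assumption about how the check was set up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (values are made up).
rng = np.random.default_rng(0)
n = 200
age = rng.integers(21, 70, size=n)
df = pd.DataFrame({
    # Pregnancies loosely increases with age, plus noise, never negative.
    "Pregnancies": np.clip((age - 21) // 6 + rng.integers(-2, 3, size=n), 0, None),
    "Glucose": rng.integers(70, 200, size=n),
    "Age": age,
    "Outcome": rng.integers(0, 2, size=n),
})

# Correlate the features only; the target is dropped so the matrix
# describes relationships among the predictors.
corr = df.drop(columns="Outcome").corr()
print(corr.round(2))
```

Pairs with an absolute correlation close to 1 would be the ones to worry about in a linear model.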
Next, we use a box plot to look at the relationship between Glucose and Outcome.
Glucose levels are higher for people who have diabetes than for those who do not.
Now we need to check the balance between the two classes.
The classes are imbalanced, but not so much that we cannot work with them. We store the proportions of the majority class and the minority class.
0.658 0.342
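One way to obtain these two proportions is `value_counts(normalize=True)`. This is a sketch on a small made-up target column, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical target column: 1 = diabetes, 0 = no diabetes (made-up values).
outcome = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="Outcome")

# normalize=True returns class proportions instead of raw counts.
proportions = outcome.value_counts(normalize=True)
majority_class = proportions.loc[0]
minority_class = proportions.loc[1]

print(round(majority_class, 3), round(minority_class, 3))
```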
We build a pivot table to see whether women with more pregnancies are more likely to have diabetes. Each row shows the proportion of women with that number of pregnancies who were diagnosed.
Pregnancies Outcome (mean)
2 0.16901408450704225
1 0.22191011235955055
0 0.33222591362126247
6 0.33587786259541985
4 0.3507853403141361
3 0.358974358974359
5 0.36879432624113473
10 0.4074074074074074
12 0.43478260869565216
8 0.5520833333333334
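A table of this shape can be produced with `pd.pivot_table`, taking the mean of Outcome per number of pregnancies. Because Outcome is 0 or 1, its mean is the share of diagnosed women in each group. The data below is made up, and sorting by the mean is an assumption based on the order of the rows above.

```python
import pandas as pd

# Made-up sample: number of pregnancies and diagnosis (1 = diabetes).
df = pd.DataFrame({
    "Pregnancies": [0, 0, 1, 1, 1, 2, 2, 8, 8, 8],
    "Outcome":     [0, 1, 0, 0, 1, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 column = proportion of positives in each group.
pregnancy_pivot = pd.pivot_table(
    df, index="Pregnancies", values="Outcome", aggfunc="mean"
).sort_values(by="Outcome")

print(pregnancy_pivot)
```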
The table suggests that women with more than 7 pregnancies are more likely to have diabetes. We plot a bar chart to examine this further, including both the majority and minority classes.
The chart shows that women with 14, 15 and 17 pregnancies have a much higher rate of diabetes diagnoses. In contrast, women with 1 or 2 pregnancies have a lower rate.
Split
We split the data frame into the feature matrix X and the target vector y, and then use train_test_split to obtain X train, X test, y train and y test.
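The split could look like the sketch below. The `test_size=0.2` and `random_state=42` values are assumptions, since the notebook does not state them, and the DataFrame is a synthetic stand-in with only two features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (values are made up).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Glucose": rng.integers(70, 200, size=100),
    "BMI": rng.uniform(18, 45, size=100).round(1),
    "Outcome": rng.integers(0, 2, size=100),
})

target = "Outcome"
X = df.drop(columns=target)   # feature matrix
y = df[target]                # target vector

# Hold out 20% of the rows for testing (assumed proportion).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```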
Model
Baseline
We calculate the baseline accuracy score for our model, that is, the accuracy we would get by always predicting the majority class.
Baseline Accuracy: 0.66
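A sketch of this calculation on a made-up training target: the baseline is the proportion of the most frequent class. As a cross-check, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` gives the same score. The dummy model is an addition for illustration, not something the notebook shows.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up training target: seven negatives, three positives.
y_train = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="Outcome")

# Accuracy of always predicting the most frequent class.
acc_baseline = y_train.value_counts(normalize=True).max()

# Cross-check: the dummy model ignores the features entirely.
X_placeholder = np.zeros((len(y_train), 1))
dummy = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_train)
acc_dummy = dummy.score(X_placeholder, y_train)

print("Baseline Accuracy:", round(acc_baseline, 2))
```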
Iterate
We build the logistic regression model and fit it to the training data.
Evaluate
Now we evaluate the model. First, we calculate the training accuracy, then the test accuracy.
Training Accuracy: 0.77
Test Accuracy: 0.8
Both scores (0.77 on the training set and 0.80 on the test set) are higher than the baseline accuracy of 0.66.
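The fit and evaluation steps might be sketched as follows. The notebook does not show the exact model setup, so a plain `LogisticRegression(max_iter=1000)` without scaling is an assumption. The data is synthetic, which means the scores will differ from the ones reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on glucose and BMI plus noise.
rng = np.random.default_rng(0)
n = 400
glucose = rng.normal(120, 30, size=n)
bmi = rng.normal(32, 7, size=n)
logit = 0.05 * (glucose - 125) + 0.05 * (bmi - 32)
prob = 1 / (1 + np.exp(-logit))
X = pd.DataFrame({"Glucose": glucose, "BMI": bmi})
y = pd.Series((rng.random(n) < prob).astype(int), name="Outcome")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build and fit the model on the training data only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Compare against the majority-class baseline.
acc_baseline = y_train.value_counts(normalize=True).max()
acc_train = accuracy_score(y_train, model.predict(X_train))
acc_test = model.score(X_test, y_test)

print("Baseline Accuracy:", round(acc_baseline, 2))
print("Training Accuracy:", round(acc_train, 2))
print("Test Accuracy:", round(acc_test, 2))
```

A test score slightly above the training score, as in the notebook, can happen by chance with a single random split.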
Communicate
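We print the predicted class probabilities for the first five observations in X train. In each row, the first value is the probability of class 0 (no diabetes) and the second is the probability of class 1 (diabetes).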
[[0.18740709 0.81259291]
[0.93259789 0.06740211]
[0.98258797 0.01741203]
[0.60292581 0.39707419]
[0.63980051 0.36019949]]
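Output of this form comes from `predict_proba`, which returns one row per observation and one column per class, ordered as in `model.classes_`. A small sketch with a single made-up feature:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: one feature (glucose) and a binary label.
X_train = np.array([[85.0], [90.0], [100.0], [110.0],
                    [140.0], [150.0], [165.0], [180.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Columns follow model.classes_: [P(class 0), P(class 1)].
proba = model.predict_proba(X_train)[:5]

print(model.classes_)
print(proba)
```

Each row sums to 1, so the second column alone is enough to read the predicted probability of diabetes.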
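We extract the feature names and the coefficients of our model.
Finally, we calculate the odds ratios by exponentiating the coefficients, and we plot them.
Diabetes Pedigree Function has the largest odds ratio, so it is the feature that affects the model's predictions the most per unit increase.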
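The odds ratio of a feature is the exponential of its coefficient: the factor by which the predicted odds of diabetes change when that feature increases by one unit. The sketch below uses synthetic data and an unscaled `LogisticRegression`, both of which are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic features on very different scales (values are made up).
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=n),
    "BMI": rng.normal(32, 7, size=n),
    "DiabetesPedigreeFunction": rng.uniform(0.1, 2.0, size=n),
})
logit = 0.04 * (X["Glucose"] - 120) + 1.0 * (X["DiabetesPedigreeFunction"] - 1.0)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

features = X.columns
coefficients = model.coef_[0]

# exp(coefficient) = multiplicative change in odds per unit increase.
odds_ratios = pd.Series(np.exp(coefficients), index=features).sort_values()
print(odds_ratios)
```

Because each ratio is per unit of the original feature, a feature measured on a small scale, such as Diabetes Pedigree Function, can show a large ratio. Standardizing the features before fitting would make the ratios directly comparable.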