Diabetes Prediction using Logistic Regression
Import libraries
Read the data. The data was obtained from Kaggle.
https://www.kaggle.com/datasets/vikasukani/diabetes-data-set
All people in the dataset are women. In the Outcome column, 1 means the woman was diagnosed with diabetes and 0 means she was not.
Explore
There are no null values. The Outcome column records whether a person has diabetes: 1 for a person who has diabetes and 0 for a person who does not.
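The null check described above can be sketched as follows. This is a minimal illustration, not the notebook's original cell: the small data frame is made up for the example (in the notebook, `df` would come from reading the Kaggle file), and the column names are assumed to match the dataset.

```python
import pandas as pd

# Toy stand-in for the Kaggle data (assumption: values are invented,
# column names are assumed to match the real file).
df = pd.DataFrame({
    "Pregnancies": [2, 0, 7, 1, 4],
    "Glucose": [120, 95, 170, 88, 140],
    "Age": [34, 22, 48, 25, 39],
    "Outcome": [1, 0, 1, 0, 1],
})

# Count missing values per column; all zeros means there are no nulls.
null_counts = df.isnull().sum()
print(null_counts)

# Confirm that Outcome only contains the two expected labels.
outcome_labels = sorted(df["Outcome"].unique().tolist())
print(outcome_labels)
```

`df.info()` gives the same information together with the column types.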
Because our model is going to be linear, we need to check for multicollinearity. To do so, we analyze the correlation between the variables.
There is a correlation between pregnancies and age, but it is not strong.
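One way to inspect multicollinearity is a correlation matrix of the features. The sketch below uses invented values and assumes the column names shown; the target column is dropped first because only correlations between features matter here.

```python
import pandas as pd

# Toy data (assumption: invented values, not rows from the dataset).
df = pd.DataFrame({
    "Pregnancies": [0, 1, 2, 3, 5, 8],
    "Age": [21, 25, 24, 30, 41, 45],
    "Glucose": [90, 140, 100, 150, 95, 130],
    "Outcome": [0, 1, 0, 1, 0, 1],
})

# Correlate the features only; Pearson correlation is the pandas default.
features = df.drop(columns="Outcome")
corr = features.corr()
print(corr.round(2))
```

Values close to 1 or -1 off the diagonal would point to strongly correlated features.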
Now we can look at the relationship between the Glucose variable and the Outcome using a box plot.
We can see that glucose is higher for people who have diabetes than for people who do not.
Now it is necessary to see the balance between the two classes.
We can work with this balance. Now we are going to store the values for the majority class and the minority class so we can reuse them later.
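A minimal sketch of the class-balance check, assuming the target is a pandas Series named `Outcome`. The labels below are invented for illustration, and storing proportions (rather than raw counts) is an assumption.

```python
import pandas as pd

# Toy target (assumption: invented labels, 7 negatives and 3 positives).
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="Outcome")

# Relative frequency of each class.
class_share = outcome.value_counts(normalize=True)

# Keep the majority and minority labels and their shares for later use.
majority_class = class_share.idxmax()
minority_class = class_share.idxmin()
majority_share = class_share.max()
minority_share = class_share.min()
print(majority_class, majority_share, minority_class, minority_share)
```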
We are going to make a pivot table to see whether women with more pregnancies are more likely to have diabetes.
We can see that women with more than 7 pregnancies are more likely to have diabetes. We are going to plot a bar chart to analyze this information. In this bar chart, we are going to include the majority and minority classes.
We can see that women with 17, 14 and 15 pregnancies have a much higher rate of diabetes diagnosis. In contrast, women with 1 or 2 pregnancies have a lower rate.
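The pivot table can be sketched as below. Because Outcome is 0 or 1, its mean per group equals the share of diagnosed women in that group. The data are invented and the choice of `aggfunc="mean"` is an assumption about how the table was built.

```python
import pandas as pd

# Toy data (assumption: invented values).
df = pd.DataFrame({
    "Pregnancies": [1, 1, 1, 2, 2, 8, 8, 8],
    "Outcome":     [0, 0, 1, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 column per group = proportion diagnosed in that group.
pivot = pd.pivot_table(
    df, index="Pregnancies", values="Outcome", aggfunc="mean"
).sort_values(by="Outcome")
print(pivot)
```

Calling `pivot.plot(kind="barh")` on the result would draw the bar chart, provided matplotlib is installed.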
Split
We are going to split the data frame into X and y, and then use train_test_split to obtain X_train, X_test, y_train and y_test.
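The split could look like the sketch below. The data are invented, and `test_size=0.2` and `random_state=42` are assumptions rather than values taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption: invented values).
df = pd.DataFrame({
    "Glucose": [90, 140, 100, 150, 95, 130, 85, 160, 110, 145],
    "BMI": [22.0, 33.5, 25.1, 35.0, 24.3, 31.2, 21.9, 36.4, 27.0, 32.8],
    "Outcome": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Feature matrix X: everything except the target. Target vector y: Outcome.
target = "Outcome"
X = df.drop(columns=target)
y = df[target]

# Hold out 20% of the rows for testing (assumed ratio); fix the seed so the
# split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```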
Model
Baseline
We calculate the baseline accuracy score for our model.
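For a classifier, a common baseline is the accuracy obtained by always predicting the majority class, which equals the share of that class in the training target. A minimal sketch with invented labels:

```python
import pandas as pd

# Toy training target (assumption: invented labels, 5 negatives, 3 positives).
y_train = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="Outcome")

# Always guessing the majority class is right this fraction of the time.
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline accuracy:", round(acc_baseline, 3))
```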
Iterate
We build the model and fit it to the training data.
Evaluate
Now we evaluate the model. First, we calculate the training accuracy, then the test accuracy.
We can see that both are higher than the baseline accuracy.
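The build, fit and evaluate steps can be sketched together as follows. The data are synthetic (generated so that glucose tends to be higher for positive cases), and the model settings, including `max_iter=1000` and the absence of any scaling step, are assumptions rather than the notebook's configuration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 30 negatives and 30 positives.
rng = np.random.default_rng(0)
glucose = np.concatenate([rng.normal(100, 10, 30), rng.normal(150, 10, 30)])
X = pd.DataFrame({"Glucose": glucose})
y = pd.Series([0] * 30 + [1] * 30, name="Outcome")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: accuracy of always predicting the majority class.
acc_baseline = y_train.value_counts(normalize=True).max()

# Build and fit the model on the training data only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the data the model saw, and on the held-out data.
acc_train = accuracy_score(y_train, model.predict(X_train))
acc_test = model.score(X_test, y_test)
print("Baseline:", acc_baseline, "Train:", acc_train, "Test:", acc_test)
```

A large gap between training and test accuracy would suggest overfitting.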
Communicate
We are going to print the first five predictions using X_train.
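A sketch of inspecting the first five predictions, again on invented data. `predict` returns class labels, while `predict_proba` returns the estimated probability of each class, which can be easier to communicate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data (assumption: one invented feature, glucose).
X_train = np.array([[85], [90], [100], [105], [145], [150], [160], [170]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted labels (0 or 1) for the first five training rows.
first_five = model.predict(X_train)[:5]
print(first_five)

# Matching class probabilities: column 0 is P(no diabetes), column 1 is P(diabetes).
first_five_proba = model.predict_proba(X_train)[:5]
print(first_five_proba.round(3))
```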
We obtain the features and coefficients of our model.
Finally, we calculate the odds ratios and plot them.
We can see that Diabetes Pedigree Function is the feature that affects the model the most.
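In logistic regression, each coefficient is a change in log-odds per one-unit increase of its feature, so exponentiating it gives an odds ratio. The sketch below shows that calculation on invented data with two assumed column names; it is not the notebook's fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (assumption: invented values, two features only).
X = pd.DataFrame({
    "Glucose": [85, 90, 100, 105, 145, 150, 160, 170],
    "DiabetesPedigreeFunction": [0.2, 0.3, 0.25, 0.4, 0.6, 0.5, 0.9, 0.7],
})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1], name="Outcome")

model = LogisticRegression(max_iter=1000).fit(X, y)

# Feature names and their coefficients (log-odds per unit increase).
features = X.columns
coefs = model.coef_[0]

# Odds ratio = exp(coefficient); sort so the largest ends up last.
odds_ratios = pd.Series(np.exp(coefs), index=features).sort_values()
print(odds_ratios)
```

An odds ratio above 1 means the odds of a diabetes diagnosis rise as the feature increases, and below 1 means they fall. Because the ratio is per unit of the feature, features measured on different scales are not directly comparable unless they are standardized first. `odds_ratios.plot(kind="barh")` would draw the chart if matplotlib is installed.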