Diabetes Prediction using Logistic Regression
Read the data. The data was obtained from Kaggle.
All people in the dataset are women. In the Outcome column, 1 means the woman was diagnosed with diabetes and 0 means she does not have diabetes.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               2000 non-null   int64
 1   Glucose                   2000 non-null   int64
 2   BloodPressure             2000 non-null   int64
 3   SkinThickness             2000 non-null   int64
 4   Insulin                   2000 non-null   int64
 5   BMI                       2000 non-null   float64
 6   DiabetesPedigreeFunction  2000 non-null   float64
 7   Age                       2000 non-null   int64
 8   Outcome                   2000 non-null   int64
dtypes: float64(2), int64(7)
memory usage: 140.8 KB
There are no null values. The Outcome column records whether a person has diabetes: 1 is for a person who has diabetes and 0 is for a person who does not.
Because our model is going to be linear, we check for multicollinearity by analyzing the correlation between variables.
There is a correlation between pregnancies and age, but it is not strong.
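The correlation check described above can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's original cell: the synthetic columns, the random seed, and the 0.8 threshold are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (all values are made up).
rng = np.random.default_rng(0)
age = rng.integers(21, 70, size=200)
df = pd.DataFrame({
    "Age": age,
    # Loosely tie pregnancies to age so the two columns correlate.
    "Pregnancies": (age - 20) // 8 + rng.integers(0, 8, size=200),
    "Glucose": rng.normal(120, 30, size=200).round(),
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))

# List feature pairs whose absolute correlation exceeds a chosen threshold.
threshold = 0.8  # assumed cut-off, not taken from the original analysis
high = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high)
```

Pairs that show up in `high` would be candidates for dropping one of the two features before fitting a linear model.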
Now we look at the relationship between Glucose and Outcome with a box plot.
We can see that glucose is higher for people who have diabetes than for those who don't.
Next, we need to check the balance between the two classes.
We can work with this balance. Now we store the values for the majority class and the minority class.
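One way to check the class balance and store the two values is sketched below. The toy `outcome` series is made up, and storing the class proportions (rather than raw counts) is an assumption about what the original cell kept.

```python
import pandas as pd

# Toy Outcome column (made-up values): 1 = diabetes, 0 = no diabetes.
outcome = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="Outcome")

# Relative frequency of each class, sorted from most to least frequent.
balance = outcome.value_counts(normalize=True)
print(balance)

# Store the proportions of the majority and minority classes for later use.
majority_class_prop = balance.max()
minority_class_prop = balance.min()
print(majority_class_prop, minority_class_prop)
```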
We make a pivot table to see whether women with more pregnancies are more likely to have diabetes.
We can see that women with more than 7 pregnancies are more likely to have diabetes. We plot a bar chart to analyze this information, including both the majority and minority classes.
We can see that women with 14, 15 and 17 pregnancies have a much higher rate of diabetes diagnosis. In contrast, women with 1 or 2 pregnancies have a lower rate of diagnosis.
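The pivot table step can be sketched like this. The small data frame is invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Made-up rows using the dataset's column names.
df = pd.DataFrame({
    "Pregnancies": [1, 1, 1, 2, 2, 8, 8, 8, 9, 9],
    "Outcome":     [0, 0, 1, 0, 0, 1, 1, 0, 1, 1],
})

# The mean of a 0/1 column is the share of positive cases in each group.
rate = pd.pivot_table(df, index="Pregnancies", values="Outcome", aggfunc="mean")
rate = rate.sort_values("Outcome")
print(rate)
```

When reading such a table, it helps to also look at how many women fall in each group, since a rate computed from only a handful of rows can be misleading.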
We split the data frame into X and y, and then use train_test_split to obtain X train, X test, y train and y test.
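A minimal sketch of the split, assuming scikit-learn's `train_test_split`. The synthetic data, the 80/20 split, the `random_state`, and the use of `stratify` are assumptions for the example, not settings taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data frame (values are made up).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=100),
    "BMI": rng.normal(32, 6, size=100),
    "Outcome": rng.integers(0, 2, size=100),
})

# Separate the feature matrix from the target vector.
target = "Outcome"
X = df.drop(columns=target)
y = df[target]

# Hold out 20% of the rows for testing; stratify keeps the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```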
We calculate the baseline accuracy score for our model, that is, the accuracy we would get by always predicting the majority class.
Baseline Accuracy: 0.66
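The baseline can be computed as the share of the most frequent class in the training target. The sketch below uses a made-up `y_train` rather than the real one.

```python
import pandas as pd

# Toy training target (made-up values): seven 0s and three 1s.
y_train = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# Always predicting the majority class is correct this often.
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
```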
We build the model and fit it to the training data.
Now we evaluate the model. First we calculate the training accuracy, then the test accuracy.
Training Accuracy: 0.77
Test Accuracy: 0.8
We can see that both are higher than the baseline accuracy.
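Building, fitting and evaluating the model might look like the sketch below, assuming scikit-learn. The data comes from `make_classification`, and the `StandardScaler` step and `max_iter=1000` are assumptions added for the example; the original notebook may have used different settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the diabetes dataset.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale the features, then fit a logistic regression on the training set only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Accuracy on the data the model saw, and on the held-out data.
acc_train = accuracy_score(y_train, model.predict(X_train))
acc_test = model.score(X_test, y_test)
print("Training Accuracy:", round(acc_train, 2))
print("Test Accuracy:", round(acc_test, 2))
```

A test accuracy close to the training accuracy suggests the model is not overfitting badly.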
We print the predicted probabilities for the first five rows of X train.
[[0.18740709 0.81259291]
 [0.93259789 0.06740211]
 [0.98258797 0.01741203]
 [0.60292581 0.39707419]
 [0.63980051 0.36019949]]
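Output of this shape is what scikit-learn's `predict_proba` returns, so the sketch below assumes that method was used. Each row holds one probability per class, in the order given by `model.classes_`, so with classes 0 and 1 the second column is the probability of diabetes. The data here is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for X train and y train.
X_train, y_train = make_classification(n_samples=200, n_features=4, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1.
proba = model.predict_proba(X_train)[:5]
print(proba)

# Picking the most probable class per row gives the predicted label.
labels = model.classes_[proba.argmax(axis=1)]
print(labels)
```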
We obtain the features and coefficients of our model.
Finally, we calculate the odds ratios and plot them.
We can see that Diabetes Pedigree Function is the feature that affects the model the most.
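For a logistic regression, the odds ratio of a feature is the exponential of its coefficient. A minimal sketch, assuming a scikit-learn model, with synthetic data and hypothetical feature names:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data; the feature names below are hypothetical placeholders.
X, y = make_classification(n_samples=300, n_features=4, random_state=2)
features = ["f0", "f1", "f2", "f3"]
model = LogisticRegression(max_iter=1000).fit(X, y)

# Pair each feature name with its fitted coefficient.
coefs = pd.Series(model.coef_[0], index=features)

# exp(coefficient) = multiplicative change in the odds per one-unit increase.
odds_ratios = np.exp(coefs).sort_values()
print(odds_ratios)
```

An odds ratio above 1 means the odds of the positive class grow as the feature increases, and below 1 means they shrink. Because the ratio is expressed per one-unit change, features measured on a small scale can show large odds ratios, which is worth keeping in mind when comparing unscaled features.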