Medical Data Visualizer
The idea of this project is to analyze the correlation between different diseases in a medical examination DataFrame. The rows represent patients, and the columns represent information such as body measurements, results from various blood tests, and lifestyle choices.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
The first rows of the DataFrame look like this:
df = pd.read_csv('medical_examination.csv')
df.head()
We add an 'overweight' column, set to 1 if the patient is overweight and 0 if not. A patient counts as overweight when the body mass index (weight in kilograms divided by the square of the height in meters) is greater than 25. The result looks like this:
height2Meters = (df['height']/100)*(df['height']/100)
IMC = df['weight'] / height2Meters
df.loc[IMC <= 25, 'overweight'] = 0
df.loc[IMC > 25, 'overweight'] = 1
df.head()
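The same rule can be written more compactly. The following is a minimal sketch on made-up values (the `toy` frame is hypothetical and not taken from medical_examination.csv), assuming height is in centimeters and weight in kilograms as in the code above:

```python
import pandas as pd

# Toy data (hypothetical values, not from medical_examination.csv).
toy = pd.DataFrame({
    "height": [170, 160, 180],     # centimeters
    "weight": [60.0, 80.0, 75.0],  # kilograms
})

# BMI = weight (kg) divided by height (m) squared.
bmi = toy["weight"] / (toy["height"] / 100) ** 2

# The boolean comparison is cast to 0/1 in a single step.
toy["overweight"] = (bmi > 25).astype(int)
print(toy)
```

Unlike the two `df.loc` assignments, this variant produces an integer column and maps a missing BMI to 0 instead of leaving it empty, so it is only equivalent when height and weight have no missing values.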
We also normalize the cholesterol and glucose values: a value of 1 becomes 0, and any value greater than 1 becomes 1.
df.loc[df['cholesterol']==1, 'cholesterol'] = 0
df.loc[df['cholesterol']>1, 'cholesterol'] = 1
df.loc[df['gluc']==1, 'gluc'] = 0
df.loc[df['gluc']>1, 'gluc'] = 1
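The order of the assignments above matters: if the `>1` step ran first, the following `==1` step would reset every row to 0. A possible single-pass alternative that does not depend on ordering is sketched below on made-up values (the `toy` frame is hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values) using the same column names.
toy = pd.DataFrame({
    "cholesterol": [1, 2, 3, 1],
    "gluc":        [1, 1, 3, 2],
})

# A value of 1 maps to 0; anything above 1 maps to 1.
for col in ["cholesterol", "gluc"]:
    toy[col] = (toy[col] > 1).astype(int)

print(toy)
```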
Next, we reshape the DataFrame into long format, so that we can plot each of the variables (diseases) separately.
df_cat = pd.melt(df, id_vars=['cardio'], value_vars=['active','alco','cholesterol','gluc','overweight','smoke'])
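To make the long format concrete, here is a small sketch of what `pd.melt` does on a made-up two-row frame (the `toy` data is hypothetical). Each selected column is stacked into a `variable`/`value` pair, while `cardio` is repeated as the identifier:

```python
import pandas as pd

# Toy wide-format data (hypothetical values).
toy = pd.DataFrame({
    "cardio": [0, 1],
    "smoke":  [0, 1],
    "alco":   [1, 0],
})

# One output row per (input row, melted column) combination.
toy_long = pd.melt(toy, id_vars=["cardio"], value_vars=["alco", "smoke"])
print(toy_long)
```

With two rows and two melted columns, the result has four rows and the columns `cardio`, `variable` and `value`.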
The count plot of each variable, split by the two cardio values, is the following:
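As a rough illustration of what such a count plot displays, the sketch below tallies how often each value occurs per variable and per cardio group. It uses a made-up long-format frame (`toy_long` is hypothetical) and only pandas, so it shows the numbers behind the bars and not the figure itself:

```python
import pandas as pd

# Toy long-format data (hypothetical values), shaped like the output of pd.melt.
toy_long = pd.DataFrame({
    "cardio":   [0, 0, 1, 1, 1, 0],
    "variable": ["smoke", "smoke", "smoke", "alco", "alco", "alco"],
    "value":    [0, 1, 1, 0, 0, 1],
})

# Count the rows for every observed (cardio, variable, value) combination.
counts = (
    toy_long.groupby(["cardio", "variable", "value"])
    .size()
    .reset_index(name="total")
)
print(counts)
```

One possible way to draw this with seaborn is `sns.catplot(data=counts, x="variable", y="total", hue="value", col="cardio", kind="bar")`, which gives one panel per cardio value.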
We see that most of the variables appear to be correlated with heart disease (cardio), with cholesterol and overweight showing the most pronounced differences.
Next, we filter out incorrect data, in particular the records where the high blood pressure value (ap_hi) is lower than the low blood pressure value (ap_lo). We also discard the height and weight values that fall below the 2.5th percentile or at or above the 97.5th percentile, in order to avoid likely measurement or data-entry errors.
df_filtered = df[
    (df['ap_lo'] <= df['ap_hi'])
    & (df['height'] >= df['height'].quantile(0.025))
    & (df['height'] < df['height'].quantile(0.975))
    & (df['weight'] >= df['weight'].quantile(0.025))
    & (df['weight'] < df['weight'].quantile(0.975))
]
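The effect of these conditions is easier to see on a handful of rows. The sketch below applies the same comparisons to made-up data; the `toy` frame and the helper `within_quantiles` are hypothetical names introduced only for this illustration:

```python
import pandas as pd

def within_quantiles(s, low=0.025, high=0.975):
    """Return True where s lies in [low quantile, high quantile)."""
    return (s >= s.quantile(low)) & (s < s.quantile(high))

# Toy data (hypothetical values) using the same column names.
toy = pd.DataFrame({
    "ap_hi":  [120, 80, 130, 125, 110],
    "ap_lo":  [80, 120, 85, 80, 70],
    "height": [150, 165, 170, 175, 200],
    "weight": [50, 65, 70, 75, 120],
})

keep = (
    (toy["ap_lo"] <= toy["ap_hi"])      # drops row 1 (ap_lo > ap_hi)
    & within_quantiles(toy["height"])   # drops rows 0 and 4 (extreme heights)
    & within_quantiles(toy["weight"])   # drops rows 0 and 4 (extreme weights)
)
filtered = toy[keep]
print(filtered)
```

Only rows 2 and 3 survive here. On such a tiny sample the quantile cut always removes the minimum and maximum; on the full dataset it trims roughly 2.5% at each end of each variable.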
With this filtered DataFrame we create a correlation matrix, which gives the pairwise correlation between the different variables (diseases).
Corr = df_filtered.corr()
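By default, `DataFrame.corr()` computes the Pearson correlation coefficient for every pair of columns. As a small check of what a single entry of the matrix means, the sketch below computes the coefficient by hand on made-up values (the `toy` frame is hypothetical) and compares it with the pandas result:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [52.0, 58.0, 71.0, 79.0],
})

# Pearson r = sum(x*y) / sqrt(sum(x^2) * sum(y^2)), with x and y mean-centered.
x = toy["height"] - toy["height"].mean()
y = toy["weight"] - toy["weight"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same entry taken from the correlation matrix.
r_pandas = toy.corr().loc["height", "weight"]
print(r_manual, r_pandas)
```

Both values agree up to floating-point rounding (about 0.99 for this toy sample).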
Finally, we create a heatmap of the correlation matrix, which shows numerically which variables are the most correlated.
# Hide the upper triangle (diagonal included) so each pair is shown only once
mask = np.triu(np.ones_like(Corr, dtype=bool))
with sns.axes_style("white"):
    fig, ax = plt.subplots(figsize=(12, 7))
    sns.heatmap(Corr, vmin=0, vmax=.25, square=True, annot=True, linewidths=.5, fmt=".1f", mask=mask, ax=ax)
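The mask passed to `sns.heatmap` hides every cell where it is True. The sketch below shows, on a made-up 3x3 matrix (the values in `corr` are hypothetical), which cells remain visible when the upper triangle including the diagonal is masked:

```python
import numpy as np

# Toy symmetric correlation matrix (hypothetical values).
corr = np.array([
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# True on and above the diagonal, False below it.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Replace the masked cells with NaN to see what would still be drawn.
visible = np.where(mask, np.nan, corr)
print(mask)
print(visible)
```

Because the correlation matrix is symmetric, the three values left below the diagonal carry all the pairwise information.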
As expected, the most correlated pairs of variables are weight and overweight, gender and smoke, cholesterol and glucose, and gender and height.