a) Use the seaborn library with default parameters to visualize the distribution of those values using Kernel Density Estimation. Based on the visualization, at which location(s) do you think the underlying distribution has a mode (peak)?
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.signal import find_peaks

# Read the whitespace-separated values and name the single column.
values = pd.read_csv('values.txt', sep='\\s+')
values.columns = ['values']
sns.displot(values, x="values", kind="kde")
The peak of the distribution is between 1.8 and 2.
b) Produce alternative KDE plots by adjusting the bandwidth to higher and lower values. Briefly describe in your own words how this changes the shape of the estimated distribution. Visualize the same data in a different way to help you decide which setting most faithfully reflects the distribution which generated the data. In particular, at which location(s) would you assume it has modes? (3P)
sns.displot(values, x="values", kind="kde",bw_adjust=0.1)
With bw_adjust=0.1, the estimate is under-smoothed: the curve has strong spikes and looks like a combination of several individual peaks, here seven.
sns.displot(values, x="values", kind="kde",bw_adjust=.25)
Increasing bw_adjust to 0.25 smooths the curve, which now looks like a combination of four peaks. The peaks start to merge, and the highest one is around 2.
sns.displot(values, x="values", kind="kde",bw_adjust=.5)
With bw_adjust=0.5, the second and third peaks are almost merged.
sns.displot(values, x="values", kind="kde",bw_adjust=.75)
With bw_adjust=0.75, the peaks merge further and look like parts of a single distribution. The middle peak, around 2, is the highest.
sns.displot(values, x="values", kind="kde",bw_adjust=0.9)
With bw_adjust=0.9, the curve smooths further and looks like a combination of three peaks. They are merged more strongly and look like parts of a single distribution. The highest peak is again around 2.
sns.displot(values, x="values", kind="kde",bw_adjust=1)
With bw_adjust=1 (the default), there are no signs of multimodality; we get a wide, smooth unimodal distribution.
sns.displot(values, x="values", kind="kde",bw_adjust=1.5)
With bw_adjust=1.5, the curve is clearly unimodal. A large bandwidth leads to over-smoothing: the density plot looks unimodal and hides any multimodal structure in the data.
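To check the peak counts without reading them off the plots, one can count the local maxima of the estimated density directly. The following is only a minimal sketch, not the computation seaborn performs internally: it uses synthetic data as a hypothetical stand-in for values.txt, it assumes Scott's rule as the reference bandwidth that the adjustment factor scales, and the helper names gaussian_kde_1d and count_modes are our own.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, adjust=1.0):
    """Evaluate a Gaussian KDE on `grid`.

    Assumption: the bandwidth is `adjust` times Scott's rule,
    h = adjust * std * n**(-1/5).
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    h = adjust * samples.std(ddof=1) * n ** (-1 / 5)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))


def count_modes(density):
    """Count strict local maxima of a curve sampled on a grid."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))


rng = np.random.default_rng(0)
# Hypothetical stand-in for values.txt: three clusters near 0, 2 and 3.
samples = np.concatenate([
    rng.normal(0.0, 0.3, 80),
    rng.normal(2.0, 0.3, 140),
    rng.normal(3.0, 0.3, 80),
])
grid = np.linspace(samples.min() - 1, samples.max() + 1, 2000)

mode_counts = {}
for adjust in (0.1, 0.25, 0.5, 0.75, 1.0, 1.5):
    mode_counts[adjust] = count_modes(gaussian_kde_1d(samples, grid, adjust))
    print(f"bw_adjust={adjust}: {mode_counts[adjust]} mode(s)")
```

For a Gaussian kernel the number of modes cannot increase as the bandwidth grows, so the counts should shrink from many spurious peaks at small factors toward a single mode at large ones.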
We think bw_adjust=0.75 is the best setting, as it avoids both over-smoothing and under-smoothing and reflects the underlying structure better. The modes are at around 0, 2 and 3.
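The task also asks for a different view of the same data. A histogram with fairly narrow bins is one option, since it shows the raw counts without any kernel smoothing. Below is a minimal sketch on synthetic data (a hypothetical stand-in for values.txt); the choice of 30 bins is arbitrary, and the text rendering only replaces a plot so that the example needs nothing beyond NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the column loaded from values.txt.
samples = np.concatenate([
    rng.normal(0.0, 0.3, 80),
    rng.normal(2.0, 0.3, 140),
    rng.normal(3.0, 0.3, 80),
])

# Bin the data; 30 equal-width bins is an arbitrary illustrative choice.
counts, edges = np.histogram(samples, bins=30)

# Print one text bar per bin as a crude histogram.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:6.2f}, {right:6.2f}) {'#' * int(count)}")

# Locate the tallest bin as a rough estimate of the main mode.
tallest = int(np.argmax(counts))
tallest_center = 0.5 * (edges[tallest] + edges[tallest + 1])
print(f"tallest bin is centered at {tallest_center:.2f}")
```

On the real data, the same binning can be drawn with sns.histplot, optionally with sns.rugplot to mark the individual observations. Peaks that show up in both the histogram and the KDE are more credible than peaks that appear only at a small bandwidth.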
d) Read the dataset
import pandas as pd

# The first column of the sheet becomes the index.
data = pd.read_excel('chronic_kidney_disease_numerical.xls', index_col=0)
data.info()
pip install xlrd
<class 'pandas.core.frame.DataFrame'>
Float64Index: 400 entries, 48.0 to 58.0
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   blood pressure          388 non-null    float64
 1   specific gravity        353 non-null    float64
 2   albumin                 354 non-null    float64
 3   sugar                   351 non-null    float64
 4   blood glucose random    356 non-null    float64
 5   blood urea              381 non-null    float64
 6   serum creatinine        383 non-null    float64
 7   sodium                  313 non-null    float64
 8   potassium               312 non-null    float64
 9   hemoglobin              348 non-null    float64
 10  packed cell volume      329 non-null    float64
 11  white blood cell count  294 non-null    float64
 12  red blood cell count    269 non-null    float64
 13  class                   400 non-null    object
dtypes: float64(13), object(1)
memory usage: 46.9+ KB
Use pandas.melt to transform the data from wide to long format
data_melted = pd.melt(data, id_vars=["class"])
data_melted.info()
data_melted.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   class     5200 non-null   object
 1   variable  5200 non-null   object
 2   value     4431 non-null   float64
dtypes: float64(1), object(2)
memory usage: 122.0+ KB
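The row count of the melted frame follows directly from the reshaping: every measurement column is stacked under "variable", so 400 rows times 13 measurement columns gives 5200 rows, and missing values are kept as NaN. A toy sketch of the same call; the numbers and the label "notckd" are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Tiny wide-format frame: one row per patient, one column per measurement.
wide = pd.DataFrame({
    "blood pressure": [80.0, 70.0, None],
    "albumin": [1.0, 0.0, 4.0],
    "class": ["ckd", "notckd", "ckd"],  # "notckd" is a hypothetical label
})

# Keep "class" as the identifier; stack every other column into
# (variable, value) pairs.
long = pd.melt(wide, id_vars=["class"])
print(long)
```

The result has 3 rows times 2 measurement columns = 6 rows, with the columns class, variable and value.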
e) Create two boxplots side by side
import seaborn as sns
import matplotlib.pyplot as plt

column_list = ['blood pressure', 'specific gravity', 'albumin', 'sugar',
               'blood glucose random', 'blood urea', 'serum creatinine',
               'sodium', 'potassium', 'hemoglobin', 'packed cell volume',
               'white blood cell count', 'red blood cell count']

# Boolean masks selecting patients with and without chronic kidney disease.
data_ckd = data['class'] == 'ckd'
data_nockd = data['class'] != 'ckd'

# One row per variable; left column: ckd, right column: not ckd.
fig, axes = plt.subplots(nrows=len(column_list), ncols=2)
fig.set_size_inches(15.5, 60.5)
box_style = dict(meanline=True, showmeans=True, showcaps=True,
                 showbox=True, showfliers=False)
for i, column in enumerate(column_list):
    # axes has shape (13, 2), so each boxplot needs a single Axes, not a row.
    data.loc[data_ckd].boxplot(column=column, ax=axes[i, 0], **box_style)
    data.loc[data_nockd].boxplot(column=column, ax=axes[i, 1], **box_style)
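Each box spans the first to the third quartile of its group, with the median marked inside, so the comparison between the two classes can also be checked numerically. A minimal sketch with made-up numbers; the label "notckd" and all values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Made-up measurements for two groups of patients.
toy = pd.DataFrame({
    "blood pressure": [70.0, 80.0, 90.0, 100.0, 60.0, 70.0, 80.0],
    "class": ["ckd", "ckd", "ckd", "ckd", "notckd", "notckd", "notckd"],
})

# Quartiles per class: the numbers a boxplot draws as box edges and median.
summary = (
    toy.groupby("class")["blood pressure"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(summary)
```

The rows of summary are the classes and the columns are the quantile levels 0.25, 0.5 and 0.75, computed with pandas' default linear interpolation.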