Introduction to Exploratory Data Analysis (EDA)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_wine_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Wine_Dataset/winequality-red.csv',sep=";")
Quantitative EDA
1° head()
df_wine_data.head(5)
2° shape
df_wine_data.shape
3° describe()
df_wine_data.describe()
Exploring the features
1° columns
df_wine_data.columns
2° Unique Values of Quality (Target Variable)
unique()
df_wine_data['quality'].unique()
3° Frequency Counts of each Quality Value
value_counts()
df_wine_data['quality'].value_counts()
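Raw counts are often easier to read when ordered by score and shown as proportions. The following is a minimal sketch on a hypothetical toy Series (the values are made up and only stand in for `df_wine_data['quality']`):

```python
import pandas as pd

# Hypothetical toy scores standing in for df_wine_data['quality']
quality = pd.Series([5, 6, 5, 7, 5, 6, 4, 5], name="quality")

# Raw counts, ordered by score instead of by frequency
counts = quality.value_counts().sort_index()

# Relative frequencies (fractions that sum to 1), in the same order
shares = quality.value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(3))
```

Sorting by index keeps the scores in their natural order, which makes class imbalance in the target easier to spot.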
Renaming columns
df_wine_data.rename(
    columns={
        'fixed acidity': 'fixed_acidity',
        'volatile acidity': 'volatile_acidity',
        'citric acid': 'citric_acid',
        'residual sugar': 'residual_sugar',
        'free sulfur dioxide': 'free_sulfur_dioxide',
        'total sulfur dioxide': 'total_sulfur_dioxide',
    },
    inplace=True
)
df_wine_data.columns
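When every rename follows the same rule (space to underscore), the mapping can also be generated instead of typed out. This is only a sketch of that alternative, using a hypothetical empty frame that mimics a few of the wine column names:

```python
import pandas as pd

# Hypothetical empty frame that only mimics a few of the wine column names
df = pd.DataFrame(columns=["fixed acidity", "citric acid", "pH", "quality"])

# Replace every space with an underscore in a single pass over the column index
df.columns = df.columns.str.replace(" ", "_", regex=False)

print(list(df.columns))
```

The explicit dictionary above remains the safer choice when only some columns should change or when the new names do not follow a single pattern.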
Identify Missing Values
1° isna()
df_wine_data.isna().sum()
2° info()
df_wine_data.info()
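A count of missing values is easier to judge next to the share of rows it represents. The sketch below assumes a hypothetical toy frame with one missing entry (the numbers are invented for illustration and are not taken from the wine data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing pH value
df = pd.DataFrame({
    "pH": [3.5, np.nan, 3.2, 3.3],
    "alcohol": [9.4, 9.8, 9.8, 10.0],
})

# isna() gives a boolean frame: sum() counts the True values, mean() gives their fraction
missing_count = df.isna().sum()
missing_pct = df.isna().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```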
Data duplicates
duplicated()
duplicate_data = df_wine_data[df_wine_data.duplicated()]
print("Duplicate data:",duplicate_data.shape)
print("Raw data:",df_wine_data.shape)
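If the duplicated rows turn out to be unwanted, `drop_duplicates()` removes them. The following sketch uses a hypothetical toy frame; whether dropping is appropriate for the wine data is a judgment call, since two different wines could share identical measurements:

```python
import pandas as pd

# Hypothetical toy frame in which rows 0 and 2 are identical
df = pd.DataFrame({"pH": [3.5, 3.2, 3.5, 3.3], "quality": [5, 6, 5, 5]})

# duplicated() flags every repeat after the first occurrence
n_duplicates = df.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
df_unique = df.drop_duplicates().reset_index(drop=True)

print("Duplicates found:", n_duplicates)
print("Shape after dropping:", df_unique.shape)
```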
Graphical EDA
Separating input and target variables. Which columns are targets and which are inputs depends on the case being analyzed.
y = df_wine_data.quality  # quality is the target variable
x = df_wine_data.drop('quality', axis=1)  # every column except 'quality' is an input
df_wine_data.hist(bins=8,figsize=(16,12))
plt.show()
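Because the target takes only a few integer values, a bar chart of its counts can be clearer than a binned histogram. This is a minimal sketch under the assumption of a hypothetical toy Series in place of `df_wine_data['quality']`:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy scores standing in for df_wine_data['quality']
quality = pd.Series([5, 6, 5, 7, 5, 6, 4, 5], name="quality")
counts = quality.value_counts().sort_index()

# One bar per distinct score; labels are cast to str so they are treated as categories
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel("quality")
ax.set_ylabel("number of wines")
ax.set_title("Distribution of the target variable")
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.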