Assignment 1: Determine the shape of the data distribution (positively skewed / negatively skewed / symmetric)
1. Import packages
# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np
# matplotlib for plotting
import matplotlib.pyplot as plt
import matplotlib
from scipy import stats
from scipy.stats import norm
# Set text size
matplotlib.rcParams['font.size'] = 18
# Seaborn
import seaborn as sns
sns.set_context('talk', font_scale=1.2)
2. Display the dataset
df = pd.read_csv('hw1常態分配資料集.csv')
df.columns = ['Object','A1','A2','A3']
df
3. Determine the distribution shape of each column
data = df.A1
sns.distplot(data, fit=norm, rug=True, hist=False)
# sns.displot(data,kind="kde",rug=True)
(mu, sigma) = norm.fit(data)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
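# Raw string so that \m and \s are not treated as (invalid) escape sequences
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')  # legend label for the blue line in the plot
plt.ylabel('Density')  # a KDE curve shows density, not frequency counts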
plt.title('Distribution')
plt.show()
data = df.A2
sns.distplot(data, fit=norm, rug=True, hist=False)
# sns.displot(data,kind="kde",rug=True)
(mu, sigma) = norm.fit(data)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
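# Raw string so that \m and \s are not treated as (invalid) escape sequences
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Density')  # a KDE curve shows density, not frequency counts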
plt.title('Distribution')
plt.show()
data = df.A3
sns.distplot(data, fit=norm, rug=True, hist=False)
# sns.displot(data,kind="kde",rug=True)
(mu, sigma) = norm.fit(data)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
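# Raw string so that \m and \s are not treated as (invalid) escape sequences
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Density')  # a KDE curve shows density, not frequency counts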
plt.title('Distribution')
plt.show()
4. Does each column follow a normal distribution?
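Yes, because every p-value is greater than 0.05, so the Kolmogorov-Smirnov test does not reject normality at the 5% significance level.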
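# Fit mu and sigma separately for each column; reusing the values left over from A3 would test A1 and A2 against the wrong normal
a1 = stats.kstest(df.A1, 'norm', norm.fit(df.A1))
a2 = stats.kstest(df.A2, 'norm', norm.fit(df.A2))
a3 = stats.kstest(df.A3, 'norm', norm.fit(df.A3))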
print(a1)
print(a2)
print(a3)
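To show how the test reacts to clearly normal and clearly skewed data, here is a minimal sketch on synthetic samples. The helper name `ks_normality`, the seed, and the sample parameters are assumptions made for illustration; none of them come from the homework dataset. Note that estimating mu and sigma from the same sample that is being tested makes the plain Kolmogorov-Smirnov p-value conservative, so a p-value above 0.05 is weaker evidence of normality than it looks.

```python
import numpy as np
from scipy import stats


def ks_normality(sample, alpha=0.05):
    """Hypothetical helper: KS test of `sample` against a normal fitted to that same sample.

    Returns (statistic, pvalue, not_rejected).
    """
    mu, sigma = stats.norm.fit(sample)  # per-sample estimates of mean and std
    result = stats.kstest(sample, 'norm', args=(mu, sigma))
    return result.statistic, result.pvalue, bool(result.pvalue > alpha)


# Synthetic data for illustration only (not the homework CSV)
rng = np.random.default_rng(0)
synthetic = {
    'normal-like': rng.normal(loc=50, scale=10, size=500),
    'right-skewed': rng.exponential(scale=10, size=500),
}

for name, sample in synthetic.items():
    d, p, not_rejected = ks_normality(sample)
    print('{}: D = {:.3f}, p = {:.3g}, normality not rejected: {}'.format(
        name, d, p, not_rejected))
```

In the notebook the same helper could be called as `ks_normality(df.A1)`, and likewise for `A2` and `A3`.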
5. Conclusion
The plots show that A1, A2, and A3 are all positively skewed: in each column the mean is greater than the median.
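The mean-versus-median comparison used in the conclusion can be checked numerically rather than read off the plots. The sketch below uses only the standard library. The function name `describe_skew`, the cut-off `tol`, and the sample values are assumptions chosen for illustration, not part of the assignment. It reports the mean, the median, and the moment coefficient of skewness g1 = m3 / m2^1.5, where a positive g1 indicates a longer right tail.

```python
import statistics


def describe_skew(values, tol=0.1):
    """Hypothetical helper: return (mean, median, g1, label) for a list of numbers.

    g1 is computed from population (biased) central moments.
    `tol` is an arbitrary illustrative cut-off, not a standard threshold.
    Assumes the values are not all identical (otherwise m2 is zero).
    """
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    if g1 > tol:
        label = 'positively skewed'
    elif g1 < -tol:
        label = 'negatively skewed'
    else:
        label = 'roughly symmetric'
    return mean, median, g1, label


# Made-up sample with a long right tail (not taken from the homework CSV)
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]
mean, median, g1, label = describe_skew(sample)
print('mean = {:.2f}, median = {:.2f}, g1 = {:.2f} -> {}'.format(mean, median, g1, label))
```

In the notebook, each column could be passed in as a list, for example `describe_skew(df.A1.tolist())`. pandas also offers `Series.skew()`, which applies a bias correction and so returns a slightly different value from g1.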