An explanation and application of Principal Component Analysis (PCA)
PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
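Before applying PCA to a real dataset, the core idea can be sketched on made-up data: center the points, eigendecompose the covariance matrix, and project onto the leading eigenvector. This is a minimal sketch with synthetic 2-D data (all names here are illustrative, not from the article):

```python
import numpy as np

# Two strongly correlated synthetic variables (made-up data)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 0.9 * x + 0.1 * rng.normal(size=500)])

# Center the data, then eigendecompose its covariance matrix
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eig(np.cov(centered.T))

# Project onto the eigenvector with the largest eigenvalue:
# a 1-D representation that keeps most of the variance
order = np.argsort(eigvals)[::-1]
projected = centered @ eigvecs[:, order[0]]

print(eigvals[order[0]] / eigvals.sum())  # close to 1
```

Because the two variables are nearly collinear, one principal component keeps almost all of the variance, which is exactly the dimensionality reduction PCA promises.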
Jolliffe, I. T., Principal Component Analysis, 2nd ed., Springer, 2002.
Definition and Derivation of Principal Components
Application of PCA to a dataset
First we import the libraries and load the CSV file:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('gym_crowdedness.csv')
df.head()
We take a first glance at the data:
df.dtypes
df.describe()
df.shape
Then we check whether the data contains any null values:
df.isnull().any()
In this example we drop the 'date' column for simplicity, then standardize each remaining feature by subtracting its mean and dividing by its standard deviation:

gym = df.drop('date', axis=1)
mean = gym.mean()
standard_deviation = gym.std()
scaled = (gym - mean) / standard_deviation
scaled
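As a sanity check, z-scoring should leave every column with mean approximately 0 and sample standard deviation approximately 1. A small sketch with a stand-in DataFrame (the gym data itself is not reproduced here):

```python
import numpy as np
import pandas as pd

# A tiny stand-in DataFrame (not the notebook's gym data)
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                     'b': [10.0, 20.0, 30.0, 40.0]})

scaled_demo = (demo - demo.mean()) / demo.std()

# After z-scoring, each column has mean ~0 and sample std ~1
print(scaled_demo.mean().abs().round(10).tolist())  # [0.0, 0.0]
print(scaled_demo.std().round(10).tolist())         # [1.0, 1.0]
```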
Next we compute the covariance matrix of the standardized data and visualize it as a heatmap:

covariance_matrix = np.cov(scaled.T)
covariance_matrix

plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)
sns.heatmap(covariance_matrix,
            cbar=True,
            annot=True,
            square=True,
            fmt='.2f',
            annot_kws={'size': 12})
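A side note that may clarify the heatmap: for z-scored data the covariance matrix coincides with the correlation matrix, so the off-diagonal cells can be read directly as correlations. A small sketch with synthetic stand-in data:

```python
import numpy as np

# Synthetic data: three columns, two of them correlated
rng = np.random.default_rng(1)
data = rng.normal(size=(200, 3))
data[:, 1] += 0.5 * data[:, 0]

# z-score with the sample standard deviation (ddof=1, as pandas uses)
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

cov_of_scaled = np.cov(z.T)      # covariance of standardized data...
corr = np.corrcoef(data.T)       # ...equals the correlation matrix
print(np.allclose(cov_of_scaled, corr))  # True
```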
We then obtain the eigenvalues and eigenvectors of the covariance matrix and plot the percentage of variance each principal component explains:

eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)

variance_explained = []
for i in eigen_values:
    variance_explained.append(i / sum(eigen_values) * 100)

x = list(range(len(variance_explained)))
plt.bar(x, variance_explained, width=0.5, color=['red', 'blue'])
plt.xticks(np.arange(10),
           ('pca_1', 'pca_2', 'pca_3', 'pca_4', 'pca_5',
            'pca_6', 'pca_7', 'pca_8', 'pca_9', 'pca_10'),
           rotation=45)
plt.title('Variance by principal component')
plt.xlabel('Principal component')
plt.ylabel('Percentage of variance explained')
plt.show()
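One caveat: np.linalg.eig does not return eigenvalues in any guaranteed order, so the bars above are not necessarily sorted by variance. A small sketch of sorting components by descending eigenvalue (the 3x3 matrix is made up for illustration):

```python
import numpy as np

# Illustrative 3x3 covariance matrix (made up, not the gym data)
cov_demo = np.array([[1.0, 0.8, 0.1],
                     [0.8, 1.0, 0.2],
                     [0.1, 0.2, 1.0]])

# np.linalg.eig gives eigenvalues in no guaranteed order...
vals, vecs = np.linalg.eig(cov_demo)

# ...so sort them (and the matching eigenvector columns) descending
order = np.argsort(vals)[::-1]
sorted_values = vals[order]
sorted_vectors = vecs[:, order]

print(sorted_values)  # largest variance first
```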
From the previous graphic we can select, for example, PCA_1, PCA_2, PCA_6, PCA_7 and PCA_10, which together capture 78.79% of the variance.
variance_captured = variance_explained[0]+variance_explained[1]+variance_explained[5]+variance_explained[6]+variance_explained[9]
variance_captured
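An alternative to summing hand-picked entries is the cumulative explained variance of the sorted eigenvalues, which directly answers "how many components reach X% of the variance?". A sketch with hypothetical eigenvalues (stand-ins, not the ones computed from the gym data):

```python
import numpy as np

# Hypothetical eigenvalues, stand-ins for the ones computed above
hypothetical_values = np.array([4.1, 2.3, 1.2, 0.9, 0.7,
                                0.4, 0.2, 0.1, 0.07, 0.03])

sorted_vals = np.sort(hypothetical_values)[::-1]
cumulative = np.cumsum(sorted_vals) / sorted_vals.sum() * 100

# Smallest number of components whose cumulative share reaches 80%
k = int(np.searchsorted(cumulative, 80) + 1)
print(k)  # 4
```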
Now we build the feature vector from the five chosen eigenvectors (each column of eigen_vectors is one eigenvector, so we take the corresponding rows of its transpose):

feature_vector = np.array([eigen_vectors.T[0], eigen_vectors.T[1],
                           eigen_vectors.T[5], eigen_vectors.T[6],
                           eigen_vectors.T[9]])
Finally, we check the shapes, project the standardized data onto the selected components, and visualize the result:

scaled.shape
feature_vector.T.shape

gym_recast = np.dot(scaled, feature_vector.T)
gym_recast = pd.DataFrame(gym_recast)
sns.pairplot(gym_recast)
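A final consistency check: projecting onto all eigenvectors is just an orthogonal rotation, so the original standardized data can be recovered exactly; information is lost only when components are dropped, as we did above. A sketch with synthetic stand-in data:

```python
import numpy as np

# Synthetic centered data (stand-in for the scaled gym features)
rng = np.random.default_rng(2)
data = rng.normal(size=(100, 5))
data -= data.mean(axis=0)

eigvals, eigvecs = np.linalg.eig(np.cov(data.T))

# Keeping ALL eigenvectors is just an orthogonal rotation,
# so the original data comes back exactly
projected = data @ eigvecs
recovered = projected @ eigvecs.T
print(np.allclose(recovered, data))  # True
```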