An explanation and application of Principal Component Analysis (PCA)
PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
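Before applying PCA to a real dataset, the core idea can be sketched on made-up data: center the points, eigendecompose the covariance matrix, and project onto the leading eigenvector. This is a minimal sketch with synthetic 2-D data (all names here are illustrative, not from the article):

```python
import numpy as np

# Two strongly correlated synthetic variables (made-up data)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 0.9 * x + 0.1 * rng.normal(size=500)])

# Center the data, then eigendecompose its covariance matrix
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eig(np.cov(centered.T))

# Project onto the eigenvector with the largest eigenvalue:
# a 1-D representation that keeps most of the variance
order = np.argsort(eigvals)[::-1]
projected = centered @ eigvecs[:, order[0]]

print(eigvals[order[0]] / eigvals.sum())  # close to 1
```

Because the two variables are nearly collinear, one principal component keeps almost all of the variance, which is exactly the dimensionality reduction PCA promises.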
Jolliffe, I. T., Principal Component Analysis, 2nd ed., Springer, 2002.
Definition and Derivation of Principal Components
Application of PCA to a dataset
First we import the libraries and load the CSV file:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('gym_crowdedness.csv')
df.head()
We take a first glance at the data:
df.dtypes
df.describe()
df.shape
Then we check whether the data contains any null values:
df.isnull().any()
In this example we drop the 'date' column for simplicity, then standardize each remaining feature by subtracting its mean and dividing by its standard deviation:

gym = df.drop('date', axis=1)
mean = gym.mean()
standard_deviation = gym.std()
scaled = (gym - mean) / standard_deviation
scaled
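As a sanity check, z-scoring should leave every column with mean approximately 0 and sample standard deviation approximately 1. A small sketch with a stand-in DataFrame (the gym data itself is not reproduced here):

```python
import numpy as np
import pandas as pd

# A tiny stand-in DataFrame (not the notebook's gym data)
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                     'b': [10.0, 20.0, 30.0, 40.0]})

scaled_demo = (demo - demo.mean()) / demo.std()

# After z-scoring, each column has mean ~0 and sample std ~1
print(scaled_demo.mean().abs().round(10).tolist())  # [0.0, 0.0]
print(scaled_demo.std().round(10).tolist())         # [1.0, 1.0]
```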
Next we compute the covariance matrix of the standardized data and visualize it as a heatmap:

covariance_matrix = np.cov(scaled.T)
covariance_matrix

plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)
sns.heatmap(covariance_matrix,
            cbar=True,
            annot=True,
            square=True,
            fmt='.2f',
            annot_kws={'size': 12})
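A side note that may clarify the heatmap: for z-scored data the covariance matrix coincides with the correlation matrix, so the off-diagonal cells can be read directly as correlations. A small sketch with synthetic stand-in data:

```python
import numpy as np

# Synthetic data: three columns, two of them correlated
rng = np.random.default_rng(1)
data = rng.normal(size=(200, 3))
data[:, 1] += 0.5 * data[:, 0]

# z-score with the sample standard deviation (ddof=1, as pandas uses)
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

cov_of_scaled = np.cov(z.T)      # covariance of standardized data...
corr = np.corrcoef(data.T)       # ...equals the correlation matrix
print(np.allclose(cov_of_scaled, corr))  # True
```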
We then obtain the eigenvalues and eigenvectors of the covariance matrix and plot the percentage of variance each principal component explains:

eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)

variance_explained = []
for i in eigen_values:
    variance_explained.append(i / sum(eigen_values) * 100)

x = list(range(len(variance_explained)))
plt.bar(x, variance_explained, width=0.5, color=['red', 'blue'])
plt.xticks(np.arange(10),
           ('pca_1', 'pca_2', 'pca_3', 'pca_4', 'pca_5',
            'pca_6', 'pca_7', 'pca_8', 'pca_9', 'pca_10'),
           rotation=45)
plt.title('Variance by principal component')
plt.xlabel('Principal component')
plt.ylabel('Percentage of variance explained')
plt.show()
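One caveat: np.linalg.eig does not return eigenvalues in any guaranteed order, so the bars above are not necessarily sorted by variance. A small sketch of sorting components by descending eigenvalue (the 3x3 matrix is made up for illustration):

```python
import numpy as np

# Illustrative 3x3 covariance matrix (made up, not the gym data)
cov_demo = np.array([[1.0, 0.8, 0.1],
                     [0.8, 1.0, 0.2],
                     [0.1, 0.2, 1.0]])

# np.linalg.eig gives eigenvalues in no guaranteed order...
vals, vecs = np.linalg.eig(cov_demo)

# ...so sort them (and the matching eigenvector columns) descending
order = np.argsort(vals)[::-1]
sorted_values = vals[order]
sorted_vectors = vecs[:, order]

print(sorted_values)  # largest variance first
```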
From the previous graphic we can select, for example, PCA_1, PCA_2, PCA_6, PCA_7 and PCA_10, which together capture 78.79% of the variance.
variance_captured = variance_explained[0]+variance_explained[1]+variance_explained[5]+variance_explained[6]+variance_explained[9]
variance_captured
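An alternative to summing hand-picked entries is the cumulative explained variance of the sorted eigenvalues, which directly answers "how many components reach X% of the variance?". A sketch with hypothetical eigenvalues (stand-ins, not the ones computed from the gym data):

```python
import numpy as np

# Hypothetical eigenvalues, stand-ins for the ones computed above
hypothetical_values = np.array([4.1, 2.3, 1.2, 0.9, 0.7,
                                0.4, 0.2, 0.1, 0.07, 0.03])

sorted_vals = np.sort(hypothetical_values)[::-1]
cumulative = np.cumsum(sorted_vals) / sorted_vals.sum() * 100

# Smallest number of components whose cumulative share reaches 80%
k = int(np.searchsorted(cumulative, 80) + 1)
print(k)  # 4
```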
Now we build the feature vector from the five chosen eigenvectors (each column of eigen_vectors is one eigenvector, so we take the corresponding rows of its transpose):

feature_vector = np.array([eigen_vectors.T[0], eigen_vectors.T[1],
                           eigen_vectors.T[5], eigen_vectors.T[6],
                           eigen_vectors.T[9]])
Finally, we check the shapes, project the standardized data onto the selected components, and visualize the result:

scaled.shape
feature_vector.T.shape

gym_recast = np.dot(scaled, feature_vector.T)
gym_recast = pd.DataFrame(gym_recast)
sns.pairplot(gym_recast)
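A final consistency check: projecting onto all eigenvectors is just an orthogonal rotation, so the original standardized data can be recovered exactly; information is lost only when components are dropped, as we did above. A sketch with synthetic stand-in data:

```python
import numpy as np

# Synthetic centered data (stand-in for the scaled gym features)
rng = np.random.default_rng(2)
data = rng.normal(size=(100, 5))
data -= data.mean(axis=0)

eigvals, eigvecs = np.linalg.eig(np.cov(data.T))

# Keeping ALL eigenvectors is just an orthogonal rotation,
# so the original data comes back exactly
projected = data @ eigvecs
recovered = projected @ eigvecs.T
print(np.allclose(recovered, data))  # True
```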