2nd Assignment - UML with Pokemon

# Standard packaging
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
pokemon = pd.read_csv("https://sds-aau.github.io/SDS-master/00_data/pokemon.csv")

1. Give a brief overview of data, what variables are there, how are the variables scaled and variation of the data columns

To get an overview of the data, we use the general functions head, info and describe. The dataset contains integer, object/string and boolean columns. The variables relevant for the dimensionality reduction are the columns from Total to Generation. The variables are on different scales, and we do not know the unit of measurement for each column. The means and standard deviations differ considerably between variables.

# Check the data
pokemon.head()

# Check the data
pokemon.info()

# Check the data
pokemon.describe()

2. Execute a PCA analysis on all numerical variables in the dataset. Hint: Don’t forget to scale them first. Use 4 components. What is the cumulative explained variance ratio? Hint: I am not sure this terminology and code was introduced during class, but try and look into cumulative explained variance and sklearn(package) and see if you can figure out the code needed.

# Scale the data
# Load up Standard Scaler from sklearn
from sklearn.preprocessing import StandardScaler

# And scale all relevant variables into a new matrix
scaled_pokemon = StandardScaler().fit_transform(pokemon.loc[:, 'Total':'Generation'])

# All variables now have a mean of 0 and std of 1
# (kdeplot replaces the deprecated distplot(..., hist=False))
for i in range(8):
    sns.kdeplot(x=scaled_pokemon[:, i])
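The density plot shows the effect of scaling visually; it can also be checked numerically. The following is a minimal sketch on synthetic data (the array `demo` and its three columns are assumptions for illustration, not the Pokemon data), showing that StandardScaler leaves every column with a mean of roughly 0 and a standard deviation of roughly 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three columns on very different scales (assumption)
rng = np.random.default_rng(1)
demo = rng.normal(loc=[50, 400, 3], scale=[10, 100, 1.5], size=(300, 3))

demo_scaled = StandardScaler().fit_transform(demo)

# Per-column mean and population standard deviation (ddof=0)
col_means = demo_scaled.mean(axis=0)
col_stds = demo_scaled.std(axis=0)

print("means:", col_means.round(6))
print("stds: ", col_stds.round(6))
```

The same two lines applied to scaled_pokemon would give the check for the actual data.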

# Execute PCA analysis on all numerical variables
# Import PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=4)

# Use PCA to transform the data
pca_pokemon = pca.fit_transform(scaled_pokemon)

pca_pokemon.shape

# x and y are passed as keywords, as required by recent seaborn versions
sns.scatterplot(x=pca_pokemon[:, 0], y=pca_pokemon[:, 1], hue=pokemon['Type1'])

We can compare the variance in the overall dataset to what was captured by the four principal components using .explained_variance_ratio_. Together, the first four principal components explain the majority of the variance in the dataset: 82.31%. This indicates how much of the information in the original data is retained.

# Variance of each component
print('Variance of each component:', pca.explained_variance_ratio_)

# Total variance explained
print('Total variance explained:', sum(list(pca.explained_variance_ratio_))*100)
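The question asks for the cumulative explained variance ratio, which is the running total of the per-component ratios. A minimal sketch of one way to compute it with np.cumsum is shown here, using synthetic data as a stand-in (the array `demo` is an assumption, not the Pokemon data, so the printed numbers will not match the 82.31% above).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for eight numerical columns (assumption)
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 8))

demo_scaled = StandardScaler().fit_transform(demo)
demo_pca = PCA(n_components=4).fit(demo_scaled)

# Running total: entry k is the share of variance explained by components 1..k+1
cumulative_ratio = np.cumsum(demo_pca.explained_variance_ratio_)
print(cumulative_ratio)
```

The last entry of cumulative_ratio equals the sum of all four ratios, so on the real data np.cumsum(pca.explained_variance_ratio_) should end at the total reported above.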

3. Use a different dimensionality reduction method (eg. UMAP/NMF) – do the findings differ?

UMAP

The UMAP scatter plot looks different from the PCA scatter plot above.

!pip install -q umap-learn

import umap
reducer = umap.UMAP()

umap_pokemon = reducer.fit_transform(scaled_pokemon)
umap_pokemon.shape

# x and y are passed as keywords, as required by recent seaborn versions
sns.scatterplot(x=umap_pokemon[:, 0], y=umap_pokemon[:, 1], hue=pokemon['Type1'])

4. Perform a cluster analysis (KMeans) on all numerical variables (scaled & before PCA). Pick a realistic number of clusters (up to you where the large clusters remain mostly stable).

from sklearn.cluster import KMeans

# Check the ideal number of clusters, use the 'elbow rule' for deciding
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)

    # Fit model to samples
    model.fit(scaled_pokemon)

    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
# --> 3 looks like a good number of clusters

clusterer = KMeans(n_clusters=3)
clusterer.fit(scaled_pokemon)

# x and y are passed as keywords, as required by recent seaborn versions
sns.scatterplot(x=scaled_pokemon[:, 0], y=scaled_pokemon[:, 1], hue=clusterer.labels_)
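The question also asks for a number of clusters at which the large clusters remain mostly stable. The elbow plot does not measure this directly. One possible complementary check, which is not part of the original notebook, is to fit KMeans twice with different random seeds and compare the two labelings with the adjusted Rand index. The sketch below uses synthetic blobs from make_blobs as an assumption in place of the Pokemon data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data with three well-separated groups (assumption)
demo, _ = make_blobs(n_samples=300, centers=3, n_features=8, random_state=0)

# Two independent KMeans runs that differ only in their random seed
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(demo)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(demo)

# 1.0 means identical partitions (up to renaming of cluster ids);
# values near 0 mean the agreement is no better than chance
stability = adjusted_rand_score(labels_a, labels_b)
print("adjusted Rand index between runs:", stability)
```

Repeating this for several values of k and preferring a k where the index stays close to 1 is one way to make "mostly stable" concrete.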

5. Visualise the first 2 principal components and color the datapoints by cluster.

sns.set(color_codes=True, rc={'figure.figsize':(10,8)})

# x and y are passed as keywords, as required by recent seaborn versions
sns.scatterplot(x=pca_pokemon[:, 0], y=pca_pokemon[:, 1], hue=clusterer.labels_)

6. Inspect the distribution of the variable Type1 across clusters. Does the algorithm separate the different types of pokemon?

A Pokemon's type does not determine its stats, so it makes sense that the types are spread across the clusters and that the algorithm does not separate them well.

# we can check out a cross-tab
pd.crosstab(clusterer.labels_, pokemon['Type1'])
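Raw counts in a cross-tab are hard to compare when some types are much more common than others. A minimal sketch of a normalised cross-tab follows, using a small made-up example (the labels and types below are assumptions, not results from the Pokemon data): with normalize="columns", each column shows how one type is split across the clusters.

```python
import pandas as pd

# Made-up cluster labels and types for six observations (assumption)
demo_labels = pd.Series([0, 0, 1, 1, 1, 2], name="cluster")
demo_types = pd.Series(
    ["Fire", "Water", "Fire", "Fire", "Water", "Water"], name="Type1"
)

# Each column sums to 1: the share of that type falling in each cluster
shares = pd.crosstab(demo_labels, demo_types, normalize="columns")
print(shares.round(2))
```

A type that the clustering separated well would have nearly all of its share in a single row; evenly spread shares indicate no separation.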

7. Perform a cluster analysis on all numerical variables scaled and AFTER dimensionality reduction and visualize the first 2 principal components.

clusterer = KMeans(n_clusters=2)
clusterer.fit(umap_pokemon)

# x and y are passed as keywords, as required by recent seaborn versions
sns.scatterplot(x=umap_pokemon[:, 0], y=umap_pokemon[:, 1], hue=clusterer.labels_)

8. Again, inspect the distribution of the variable “Type 1” across clusters, does it differ from the distribution before dimensionality reduction?

The distribution is somewhat different from the one above. As mentioned before, however, the stat variables do not determine a Pokemon's type, so the types are still not well separated across the clusters.

pd.crosstab(clusterer.labels_, pokemon['Type1'])
