2nd Assignment - UML with Pokemon
1. Give a brief overview of data, what variables are there, how are the variables scaled and variation of the data columns
To get an overview of the data, we use three general pandas functions: head, info and describe. The dataset contains integer, object (string) and boolean columns. The variables relevant for dimensionality reduction are the numerical columns from Total through Generation. These variables are on different scales, and the dataset does not state the unit of measurement for each column. The means and standard deviations also differ considerably between variables.
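The three overview calls can be sketched as follows. This is a minimal example on a few hand-typed stand-in rows, not the assignment's file; the column names ("Type 1", "Total", "Legendary" and so on) are assumptions about how the dataset is laid out.

```python
import pandas as pd

# Stand-in rows for illustration only; the real notebook would load the
# full dataset (file name not given in this write-up). Column names are assumed.
df = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Mewtwo"],
    "Type 1": ["Grass", "Fire", "Water", "Psychic"],
    "Total": [318, 309, 314, 680],
    "HP": [45, 39, 44, 106],
    "Attack": [49, 52, 48, 110],
    "Generation": [1, 1, 1, 1],
    "Legendary": [False, False, False, True],
})

print(df.head())         # first rows of the table
df.info()                # dtype and non-null count per column
summary = df.describe()  # count, mean, std, min, quartiles, max (numeric columns only)
print(summary.loc[["mean", "std"]])  # the rows that show the differing scales
```

By default describe only summarises the numeric columns, so the string and boolean columns do not appear in `summary`.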
2. Execute a PCA analysis on all numerical variables in the dataset. Hint: Don’t forget to scale them first. Use 4 components. What is the cumulative explained variance ratio? Hint: I am not sure this terminology and code was introduced during class, but try and look into cumulative explained variance and sklearn(package) and see if you can figure out the code needed.
Using .explained_variance_ratio_, we can compare the variance captured by the principal components with the total variance in the dataset. Summing the ratios of the first four principal components gives a cumulative explained variance of 82.31%, so these four components retain most of the information in the original scaled data.
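A minimal sketch of how the cumulative explained variance ratio can be computed with scikit-learn. The data here is synthetic (random numbers in eight columns on different scales), so the printed value will not match the figure reported for the Pokemon data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows, 8 numeric columns on very different scales
X = rng.normal(size=(200, 8)) * np.array([100, 25, 30, 30, 30, 25, 30, 2])

# Scale first so that no column dominates just because of its units
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=4)
scores = pca.fit_transform(X_scaled)   # shape (200, 4)

ratios = pca.explained_variance_ratio_  # share of variance per component
cumulative = np.cumsum(ratios)          # running total over components
print(ratios)
print(cumulative[-1])                   # cumulative share of the 4 components
```

The last entry of `cumulative` is the number the question asks for: the share of total variance retained by all four components together.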
3. Use a different dimensionality reduction method (eg. UMAP/NMF) – do the findings differ?
UMAP
The scatter plot of the UMAP embedding looks different from the one produced by PCA above.
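The question also names NMF as an alternative. The sketch below shows that option rather than the UMAP run described above, because NMF ships with scikit-learn and needs no extra package. It assumes synthetic data, and it swaps the standard scaler for MinMaxScaler because NMF requires non-negative input.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows, 8 numeric columns on different scales
X = rng.normal(size=(200, 8)) * np.array([100, 25, 30, 30, 30, 25, 30, 2])

# NMF needs non-negative values, so rescale every column to [0, 1]
X_nonneg = MinMaxScaler().fit_transform(X)

nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X_nonneg)  # 2-D coordinates per row, usable for a scatter plot
H = nmf.components_              # contribution of each original column to each component
print(W[:5])
```

Plotting the two columns of `W` against each other gives a scatter plot that can be compared with the PCA one in the same way as the UMAP embedding.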
4. Perform a cluster analysis (KMeans) on all numerical variables (scaled & before PCA). Pick a realistic number of clusters (up to you where the large clusters remain mostly stable).
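One way to approach this step is sketched below, on synthetic data with three built-in groups rather than the Pokemon table. The range of k values and the final choice of three clusters are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: three groups of 50 points in 3 dimensions
centers = np.array([[0, 0, 0], [5, 5, 5], [10, 0, 5]])
X = np.vstack([c + rng.normal(size=(50, 3)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Within-cluster sum of squares for several k; a flattening curve
# suggests that additional clusters add little
inertias = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_
print(inertias)

# Final fit with the chosen number of clusters (3 is an assumption here)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))  # cluster sizes
```

Checking the cluster sizes for neighbouring values of k is a simple way to see whether the large clusters stay mostly stable.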
5. Visualise the first 2 principal components and color the datapoints by cluster.
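A possible way to draw this plot with matplotlib is sketched here. The data is synthetic and the number of clusters is an assumption; only the pattern of projecting to two components and passing the cluster labels as colours carries over.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: three groups of 50 points in 3 dimensions
centers = np.array([[0, 0, 0], [5, 5, 5], [10, 0, 5]])
X = np.vstack([c + rng.normal(size=(50, 3)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Clusters come from the scaled data; PCA is used only for plotting
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
pcs = PCA(n_components=2).fit_transform(X_scaled)

fig, ax = plt.subplots()
sc = ax.scatter(pcs[:, 0], pcs[:, 1], c=labels, cmap="viridis", s=15)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend(*sc.legend_elements(), title="Cluster")
# In a notebook the figure renders inline; in a script, call plt.show()
```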
6. Inspect the distribution of the variable Type1 across clusters. Does the algorithm separate the different types of pokemon?
A Pokemon's type does not determine its stats, so it makes sense that the clusters do not line up with Type1: the types are mixed across clusters, and the algorithm does not separate them well.
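The distribution of types across clusters can be inspected with a cross-tabulation. The sketch below uses a handful of made-up type and cluster labels; the column names "Type 1" and "cluster" are assumptions.

```python
import pandas as pd

# Made-up labels for illustration; in the notebook these would be the
# Type 1 column and the KMeans labels
df = pd.DataFrame({
    "Type 1": ["Water", "Water", "Water", "Fire", "Fire", "Grass", "Grass", "Grass"],
    "cluster": [0, 1, 2, 0, 1, 0, 0, 2],
})

# Absolute counts: one row per type, one column per cluster
counts = pd.crosstab(df["Type 1"], df["cluster"])
print(counts)

# Row-wise shares: what fraction of each type falls into each cluster
shares = pd.crosstab(df["Type 1"], df["cluster"], normalize="index")
print(shares.round(2))
```

If the clusters separated the types, each row of `shares` would have one value close to 1; values spread over several columns indicate that a type is mixed across clusters.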
7. Perform a cluster analysis on all numerical variables scaled and AFTER dimensionality reduction and visualize the first 2 principal components.
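Scaling, dimensionality reduction and clustering can be chained in a single scikit-learn pipeline. This is a sketch on synthetic data; using four components and three clusters mirrors the earlier steps but is an assumption here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: three groups of 50 points in 6 dimensions
centers = np.array([[0, 0, 0, 0, 0, 0], [6, 6, 6, 6, 6, 6], [12, 0, 6, 0, 12, 0]])
X = np.vstack([c + rng.normal(size=(50, 6)) for c in centers])

# Scale, reduce to 4 components, then cluster on the component scores
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=4),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels_pca = pipe.fit_predict(X)

# Component scores from the fitted scaler + PCA steps; the first two
# columns are the coordinates for the scatter plot
scores = pipe[:-1].transform(X)
print(scores[:3, :2])
```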
8. Again, inspect the distribution of the variable “Type 1” across clusters, does it differ from the distribution before dimensionality reduction?
The distribution is somewhat different from the one before dimensionality reduction, but as noted above, the numerical variables do not determine a Pokemon's type, so the types are still not well separated across the clusters.
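To put a number on how much the two clusterings differ, the cluster labels from before and after dimensionality reduction can be compared directly. The sketch below does this on synthetic data with the adjusted Rand index, a measure that is not used in the original analysis and is added here only as one possible check.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: three groups of 50 points in 6 dimensions
centers = np.array([[0, 0, 0, 0, 0, 0], [6, 6, 6, 6, 6, 6], [12, 0, 6, 0, 12, 0]])
X = np.vstack([c + rng.normal(size=(50, 6)) for c in centers])

X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=4).fit_transform(X_scaled)

labels_before = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
labels_after = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)

# 1.0 means identical groupings (up to renaming), values near 0 mean chance-level agreement
ari = adjusted_rand_score(labels_before, labels_after)
print(ari)

# Cross-tabulation shows which clusters map onto which
table = pd.crosstab(labels_before, labels_after,
                    rownames=["before PCA"], colnames=["after PCA"])
print(table)
```

A high index would suggest that the Type 1 distribution across clusters can only change slightly, since most rows keep the same cluster membership.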