Assignment 3: Clustering
by Cleo Matzken & Matthieu Moutot - Group 80 (~12 hrs each)
Exercise 1
Distribution of phi and psi combinations using a scatter plot
Let's plot a scatter graph representing every sample, with its phi angle on the x-axis and its psi angle on the y-axis.
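A minimal sketch of this plot, assuming the samples live in a pandas DataFrame with 'phi' and 'psi' columns (the file name and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed input: one row per residue, with 'phi' and 'psi' angle columns.
df = pd.read_csv("data_assignment3.csv")

plt.figure(figsize=(6, 6))
plt.scatter(df["phi"], df["psi"], s=2, alpha=0.3)
plt.xlabel("phi (degrees)")
plt.ylabel("psi (degrees)")
plt.title("Distribution of phi/psi combinations")
plt.show()
```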
On this graph, we can roughly identify 4 clusters: on the top left (the largest one), on the middle left, on the lower left and on the middle right. It seems like the data in the lower left is actually part of the top left cluster, which makes sense considering that the angles are periodic. We'll look more into that in question 2-d.
Using a heat map
Another way to display the different combinations is to use a heat map: the number of samples having the same combination is represented through a color scale.
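A sketch of such a heat map using a 2D histogram (the bin count is a choice, not a value from the assignment):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.hist2d(df["phi"], df["psi"], bins=180, cmap="hot")
plt.colorbar(label="number of samples")
plt.xlabel("phi (degrees)")
plt.ylabel("psi (degrees)")
plt.title("Heat map of phi/psi combinations")
plt.show()
```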
Unlike on the scatter plot, only 2 clusters are clearly identifiable on the heat map (a third, very faint one appears on the middle right).
Exercise 2
Now, let's carry on with the clustering process. First, we import the libraries allowing us to use the clustering tools.
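The exact import list is not shown here; a plausible minimal set, assuming scikit-learn is used for the clustering:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
```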
Choosing the right number of clusters
The first step is to provide the algorithm with an estimated number of clusters. According to our observations above, we could estimate that there are between 2 and 5 clusters. In order to get a better estimate, we can use the Elbow method.
We plot the distortion score (the sum of squared distances from each point to its assigned center) on the y-axis against the number of clusters fed to the K-means algorithm. The point where the curve bends most sharply is the "elbow", from which we deduce the optimal number of clusters.
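A minimal sketch of the Elbow method, using KMeans.inertia_ (the sum of squared distances to the closest centroid) as the distortion score; the range of K values is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Feature matrix: one (phi, psi) pair per sample.
X = df[["phi", "psi"]].to_numpy()

distortions = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_)  # distortion score for this K

plt.plot(ks, distortions, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("distortion score")
plt.show()
```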
Clustering using K-means method
According to the Elbow method, the optimal number of clusters in our case is 3. We can now split the data into 3 clusters thanks to the KMeans tool. The 'k-means++' algorithm is a seeding technique introduced by David Arthur and Sergei Vassilvitskii (http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf) that provides a smarter centroid initialization than the random initialization.
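A sketch of the fit, under the same assumptions as above:

```python
from sklearn.cluster import KMeans

# Final model with the number of clusters suggested by the elbow.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
```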
In order to have a graphical representation of the clustering, we can plot a scatter graph and color every sample belonging to the same cluster, as well as the centroids.
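For example (colors and marker styles are arbitrary choices):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=2, cmap="viridis", alpha=0.4)
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="x", s=100,
            label="centroids")
plt.xlabel("phi (degrees)")
plt.ylabel("psi (degrees)")
plt.legend()
plt.show()
```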
Validation
In order to verify that our clustering is stable, we consider a subset of the initial data set by randomly removing a certain proportion of the data, and then see if the clustering remains mostly the same. First we will remove 25% of the data, and then 40%.
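A sketch of the subsampling step (the helper function is ours, not part of the assignment):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def cluster_subset(X, drop_fraction):
    """Refit K-means on a random subset with drop_fraction of the rows removed."""
    keep = rng.choice(len(X), size=int(len(X) * (1 - drop_fraction)), replace=False)
    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
    return X[keep], km.fit_predict(X[keep])

X75, labels75 = cluster_subset(X, 0.25)  # 25% of the data removed
X60, labels60 = cluster_subset(X, 0.40)  # 40% of the data removed
```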
We can clearly see that the shape of the clusters does not change when the number of samples is reduced, so we can deduce that our clustering with K=3 is stable and relevant for this data set.
According to the additional information provided in the links, only 3 regions of phi and psi combinations actually occur in practice. Hence, the two clusters identified on the heat map are most likely relevant (the third one is also faintly distinguishable), as well as the three biggest clusters found on the scatter plot (but not the fourth one).
Modification of the data set
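The exact modification is not reproduced here; one plausible version, given the periodicity observed in Exercise 1, is to shift the wrapped psi values up by 360 degrees so that the lower-left points rejoin the top-left cluster (the -120 degree threshold is an assumption):

```python
# Shift psi angles that wrapped around the periodic boundary;
# the -120 degree cut-off is an assumed, hand-picked threshold.
df_shifted = df.copy()
df_shifted.loc[df_shifted["psi"] < -120, "psi"] += 360
X = df_shifted[["phi", "psi"]].to_numpy()
```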
The sample distribution looks more consistent now, and we can clearly identify three major clusters and maybe a smaller one on the top right.
We can now repeat the whole clustering procedure in order to see if there is any major change:
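For instance, refitting K-means on the modified data:

```python
from sklearn.cluster import KMeans

# Same procedure as before, now on the modified data set.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
```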
The clustering procedure shows similar results with three clusters, but the samples are now distributed across the three clusters in a more consistent way thanks to the modifications we made to the data set.
Exercise 3
Definition of parameters for a neighborhood
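A common way to choose the neighborhood parameters, sketched here as an assumption about the approach, is the k-distance plot: fix a candidate min_samples, sort every point's distance to its min_samples-th nearest neighbor, and look for a knee in the curve; the distance at the knee is a good candidate for eps.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_samples = 100  # candidate value for the neighborhood size

# Distance from each point to its min_samples-th nearest neighbor.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel("points sorted by neighbor distance")
plt.ylabel(f"distance to {min_samples}-th nearest neighbor")
plt.show()
```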
Clustering with DBSCAN
We can proceed to the clustering with the DBSCAN method using the optimal parameters that we found: eps=10 and min_samples=100.
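A sketch of the fit (noise points get the label -1):

```python
from sklearn.cluster import DBSCAN

db_labels = DBSCAN(eps=10, min_samples=100).fit_predict(X)

# Label -1 marks outliers; the remaining labels are cluster indices.
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_outliers = (db_labels == -1).sum()
```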
Bar chart for outliers' origins
Let's plot a bar chart in order to find out which amino acid residue types produce the most outliers.
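A sketch of the chart, assuming the data set has a 'residue name' column (the column name is an assumption about the data layout):

```python
import matplotlib.pyplot as plt

# Count residue types among the DBSCAN noise points (label -1).
outliers = df_shifted[db_labels == -1]
outliers["residue name"].value_counts().plot(kind="bar")
plt.xlabel("residue type")
plt.ylabel("number of outliers")
plt.show()
```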
We can clearly see that most of the outliers are of the GLY type.
Comparison of DBSCAN results with K-means results
Robustness of DBSCAN to eps and min_samples variations
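One way to probe this is to rerun DBSCAN over a small grid around the chosen values and compare the resulting cluster and outlier counts; a minimal sketch (the grid values are assumptions):

```python
from sklearn.cluster import DBSCAN

for eps in (5, 10, 15):
    for min_samples in (50, 100, 150):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(f"eps={eps:>2} min_samples={min_samples:>3} "
              f"clusters={n_clusters} outliers={(labels == -1).sum()}")
```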
Exercise 4
Clustering samples with residue type PRO
First, we plot the data in order to see where the PRO residue samples are mainly located.
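For example (again assuming a 'residue name' column):

```python
import matplotlib.pyplot as plt

# Keep only the proline residues.
pro = df_shifted[df_shifted["residue name"] == "PRO"]

plt.figure(figsize=(6, 6))
plt.scatter(pro["phi"], pro["psi"], s=4, alpha=0.5)
plt.xlabel("phi (degrees)")
plt.ylabel("psi (degrees)")
plt.title("PRO residues")
plt.show()
```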
We can then proceed to the DBSCAN clustering with the optimal parameters that we just found.
Clustering samples with residue type GLY
We first plot the data in order to have an overview.
Once we have found the relevant values for eps and min_samples, we can proceed with the DBSCAN clustering.