This plot illustrates how the clustering is affected when random data points are removed. With k=4, the results are strongly consistent: the groups persist throughout the iterations and the overall structure is preserved, with points that were clustered together in a previous iteration remaining together.
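The stability check described above can be sketched as follows. This is a minimal illustration, not the code used for the plot: it assumes scikit-learn is available, uses synthetic blobs as a stand-in for the real data, and the 10% removal fraction, the number of repetitions, and the use of the adjusted Rand index as an agreement measure are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the real data (assumption): four 2-D blobs.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

rng = np.random.default_rng(0)

# Reference clustering on the full data set.
reference = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

subsample_scores = []
for _ in range(5):
    # Remove a random 10% of the points and recluster the rest.
    keep = rng.choice(len(X), size=int(0.9 * len(X)), replace=False)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X[keep])
    # The adjusted Rand index ignores label permutations, so it measures
    # whether the retained points are still grouped the same way.
    subsample_scores.append(adjusted_rand_score(reference[keep], labels))

print(np.round(subsample_scores, 3))
```

Scores close to 1 mean that points which shared a cluster before the removal still share one afterwards.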
The plot shows that the clustering with 4 groups is highly consistent regardless of the initial positions of the centroids in the K-means algorithm. This supports both the stability of k=4 and the validity of the clusters.
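One way to quantify this sensitivity to initialization is to run K-means once per seed with random starting centroids and compare every run against the first. The sketch below makes several assumptions: synthetic data in place of the real data set, `init="random"` with a single initialization per run, and the adjusted Rand index as the agreement measure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the real data (assumption).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

# One K-means run per seed, each with a single random initialization,
# so differences between runs come only from the starting centroids.
runs = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit_predict(X)
    for seed in range(6)
]

# Agreement of every later run with the first one (1.0 = identical partition).
init_scores = [adjusted_rand_score(runs[0], labels) for labels in runs[1:]]
print(np.round(init_scores, 3))
```

A run that converges to a poor local minimum shows up as a noticeably lower score, which is the situation the plot rules out for k=4.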
The silhouette score ranges from -1 to 1, and values closer to 1 indicate well-separated, cohesive clusters. Comparing the candidate values of k (ranging from 2 to 8), we find that k=3 and k=4 give the best silhouette scores. Because the two scores are so similar, the silhouette score alone cannot decide between them. Using the findings from the 'elbow method' and visual inspection of the plots, we choose k=4 as the best fit.
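The comparison over k can be sketched as below. This is an illustrative version only, assuming scikit-learn's `silhouette_score` and synthetic data in place of the real data set; the variable names are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real data (assumption).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

silhouette_by_k = {}
for k in range(2, 9):  # candidate values k = 2, ..., 8
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient over all points, in [-1, 1].
    silhouette_by_k[k] = silhouette_score(X, labels)

for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")

# Candidate with the highest mean silhouette score.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print("best k by silhouette:", best_k)
```

When two candidates score almost equally, as k=3 and k=4 do here, the maximum alone is not a reliable tie-breaker, which is why the elbow method and the plots are consulted as well.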
The most intuitive fit now appears to be 3 clusters instead. Let's check this against the silhouette score by comparing k=3 and k=4 (which was the most effective before shifting the data set).
The silhouette score is now clearly optimal for k=3, which is consistent with the plots above.
The initial parameters seem to produce consistent results; even varying them slightly does not impact the solution.
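A parameter sweep of this kind can be sketched as follows. The `eps` and `min_samples` values are placeholders chosen for the synthetic data used here, not the parameters from the analysis, and the data itself is an assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real data (assumption).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

sweep = []
for eps in (0.4, 0.5, 0.6):        # hypothetical neighbourhood radii
    for min_samples in (5, 10):    # hypothetical core-point thresholds
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # DBSCAN labels noise points with -1; exclude it from the cluster count.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        sweep.append((eps, min_samples, n_clusters, n_noise))

for eps, min_samples, n_clusters, n_noise in sweep:
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} clusters, {n_noise} noise points")
```

If the cluster count and the number of noise points stay roughly the same across neighbouring settings, the solution can be considered insensitive to small parameter changes.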
The clustering using only the residue type PRO differs from the general DBSCAN clustering in that it has no clusters with positive phi values. Furthermore, it produces two well-defined clusters and finds no values in the top left corner, a region that was very prevalent in the previous DBSCAN clusters. This is interesting: DBSCAN never seems to isolate these exact spots, whereas for large k the K-means algorithm seems to find these clusters (the ones found for residue type PRO) more accurately.
The initial parameters seem to produce consistent results; even varying them slightly does not impact the solution.
The residue type GLY seems to reproduce more of the clusters found in the general case. We see one cluster with phi>0, as well as clusters in both the upper left and the middle left. Some data points also fall in the remaining clusters found in the general case, but DBSCAN classifies these as outliers.
It is important to consider that in previous tasks we found that the GLY residue had the highest number of outliers, several times more than other residue types. This is also visible in the clustering of only GLY residues: there are no clear clusters, there are data points in every quadrant of the graph, and some appear almost randomly scattered.