This plot illustrates how the clustering is affected by the removal of random data points. With k=4 the clustering is highly consistent: the groups keep their overall structure throughout the iterations, and points that were clustered together in one iteration remain clustered together in the next.
The plot shows that, with 4 groups, the clustering is highly consistent regardless of the initial positions of the centroids in the K-means algorithm. This supports the stability of k=4 and the validity of the clusters.
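The consistency across initial centroid positions can also be quantified rather than judged by eye. The following is a minimal sketch, not the code behind the plot above: it uses synthetic data from `make_blobs` as a stand-in for the real data set, and it swaps in the adjusted Rand index (ARI) as an agreement measure, which is an assumption of this example. The helper name `init_stability` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the real data (assumption: four compact groups).
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)

def init_stability(X, k, n_runs=10):
    """Mean ARI between a reference K-means run and reruns with other seeds."""
    # One random initialisation per run, so each seed gives a different start.
    reference = KMeans(n_clusters=k, init="random", n_init=1,
                       random_state=0).fit_predict(X)
    scores = []
    for seed in range(1, n_runs + 1):
        labels = KMeans(n_clusters=k, init="random", n_init=1,
                        random_state=seed).fit_predict(X)
        # ARI ignores label permutations; 1.0 means identical partitions.
        scores.append(adjusted_rand_score(reference, labels))
    return float(np.mean(scores))

stability = init_stability(X, k=4)
print(f"Mean ARI across initialisations for k=4: {stability:.3f}")
```

A mean ARI close to 1 would indicate that different starting centroids lead to essentially the same partition.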
Silhouette score for k = 2 is: 0.6328209708884562
Silhouette score for k = 3 is: 0.6724895253169637
Silhouette score for k = 4 is: 0.6674392423283723
Silhouette score for k = 5 is: 0.5095212375670435
Silhouette score for k = 6 is: 0.4698172916742672
Silhouette score for k = 7 is: 0.48149562265196355
The optimal silhouette score is for k = 3 and is: 0.6724895253169637
The silhouette score ranges from -1 to 1. Values closer to 1 indicate a better clustering. Among the candidate values of k (ranging from 2 to 7), k=3 and k=4 give the best silhouette scores; they are so similar that the score alone cannot decide between them. Using the findings from the 'elbow method' and visual inspection of the plots, we choose k=4 as the best fit.
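A scan of this kind can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual cell: the data is synthetic (`make_blobs`), and the K-means settings (`n_init=10`, `random_state=0`) are chosen here only for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real data (assumption).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=1)

scores = {}
for k in range(2, 8):  # k = 2, ..., 7
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient over all samples, in [-1, 1].
    scores[k] = silhouette_score(X, labels)
    print(f"Silhouette score for k = {k} is: {scores[k]}")

# Pick the k with the highest mean silhouette coefficient.
best_k = max(scores, key=scores.get)
print(f"The optimal silhouette score is for k = {best_k} and is: {scores[best_k]}")
```

When two values of k score almost identically, as here, the maximum alone is a weak criterion, which is why the elbow method and the plots are used as tie-breakers.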
It appears that the most intuitive fit is now 3 clusters instead. Let's see how this looks for the silhouette score by comparing k=3 and k=4 (which was the most effective before shifting the data set).
Silhouette score for k = 3 is: 0.6797531685533981
Silhouette score for k = 4 is: 0.6045983974626823
The optimal silhouette score is for k = 3 and is: 0.6797531685533981
The silhouette score is now clearly optimal for k=3 instead, which is consistent with the graphical displays above.
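The text does not show how the data set was shifted. For angular data such as phi/psi values, one common way to translate the data is to add an offset and wrap the result back into the original range, so that a cluster split across the ±180° boundary becomes contiguous. The sketch below assumes angles in degrees in [-180, 180); the function name `shift_angles` and the 30° offset are hypothetical.

```python
import numpy as np

def shift_angles(angles_deg, offset_deg):
    """Translate angles by offset_deg and wrap the result into [-180, 180)."""
    angles = np.asarray(angles_deg, dtype=float)
    return (angles + offset_deg + 180.0) % 360.0 - 180.0

# Example values only; 175 degrees wraps around to the negative side.
phi = np.array([-170.0, -60.0, 60.0, 175.0])
shifted = shift_angles(phi, 30.0)
print(shifted)
```

After such a shift, K-means and the silhouette score can be recomputed on the translated coordinates exactly as before.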
Estimated number of clusters: 1
Estimated number of noise points: 0
Estimated number of clusters: 5
Estimated number of noise points: 172
Estimated number of clusters: 2
Estimated number of noise points: 1590
DBSCAN with eps = 0.3 and various values for min_samples
Estimated number of clusters: 4
Estimated number of noise points: 1252
Estimated number of clusters: 2
Estimated number of noise points: 1977
Estimated number of clusters: 2
Estimated number of noise points: 2083
DBSCAN with eps = 0.4 and various values for min_samples
Estimated number of clusters: 3
Estimated number of noise points: 826
Estimated number of clusters: 3
Estimated number of noise points: 897
Estimated number of clusters: 2
Estimated number of noise points: 1700
DBSCAN with eps = 0.5 and various values for min_samples
Estimated number of clusters: 2
Estimated number of noise points: 693
Estimated number of clusters: 2
Estimated number of noise points: 710
Estimated number of clusters: 2
Estimated number of noise points: 733
Estimated number of clusters: 3
Estimated number of noise points: 826
For non-translated data, k = 4 is optimal
Estimated number of clusters: 3
Estimated number of noise points: 826
Estimated number of clusters: 2
Estimated number of noise points: 226
The initial parameters seem to produce consistent results; even varying them slightly does not impact the solution.
The clustering using only the residue type PRO differs from the general DBSCAN clustering in that it has no clusters with positive Phi values. Furthermore, it produces two well-defined clusters and does not find any values in the top left corner, a region that was very prevalent in the previous DBSCAN clusters. This is interesting: DBSCAN never seems to cluster these exact spots, whereas for large k the K-means algorithm seems to find these clusters (found in residue type PRO) more accurately.
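One way to check this kind of robustness is to sweep `eps` and `min_samples` and record the cluster and noise counts for each combination. The sketch below is an assumption-laden illustration: the data is synthetic, the `min_samples` values (5, 10, 20) are placeholders because the values actually used are not listed, and `dbscan_summary` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption), standardised so that
# a single eps value is meaningful in both dimensions.
X, _ = make_blobs(n_samples=750, centers=3, cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X)

def dbscan_summary(X, eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    # DBSCAN marks noise with the label -1, which is not a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

results = {}
for eps in (0.3, 0.4, 0.5):
    for min_samples in (5, 10, 20):  # placeholder values
        results[(eps, min_samples)] = dbscan_summary(X, eps, min_samples)
        n_clusters, n_noise = results[(eps, min_samples)]
        print(f"eps={eps}, min_samples={min_samples}: "
              f"clusters={n_clusters}, noise points={n_noise}")
```

If the cluster count stays the same across neighbouring settings while only the noise count drifts, the solution can reasonably be called stable with respect to the parameters.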
Estimated number of clusters: 3
Estimated number of noise points: 612
The initial parameters seem to produce consistent results; even varying them slightly does not impact the solution.
The residue type GLY seems to represent somewhat more of the clusters found in the general case. We see one cluster with phi>0, and we find clusters in both the upper left and the middle left. Some data points also fall in the remaining clusters found in the general case, but these are deemed outliers by the DBSCAN method.
It is important to consider that in previous tasks we found that the GLY residue had the highest number of outliers, by a large factor. This is also visible in the clustering of only GLY residues: there are no clear clusters, there are data points in each quadrant of the graph, and some appear almost randomly scattered.