Assignment 3, question 3 & 4

Johanna Wiberg (jwiberg):18 hours in total both this file and the other one

Oscar Forsberg (oscfors): 18 hours in total both this file and the other one

import numpy as np import pandas as p import math import matplotlib.cm as cm import matplotlib.pyplot as plt import matplotlib from sklearn.cluster import DBSCAN from sklearn.neighbors import NearestNeighbors from collections import Counter p.options.mode.chained_assignment = None d700 = p.read_csv('/work/data_all.csv')

3a)

To start of this part of the assignment we used a k-distance graph in order to get the optimal value for epsilon. (source 1, which can be find further down)

#Source for this section https://www.analyticsvidhya.com/blog/2020/09/how-dbscan-clustering-works/ #Calculates the distance between the all data points and its closes data point. nbrs = NearestNeighbors(n_neighbors=2).fit(d700[['phi','psi']]) distances, indices = nbrs.kneighbors(d700[['phi','psi']]) #Sort the distance array distances = np.sort(distances, axis=0) #Takes all the column on index 1, which is all the distances distances = distances[:,1] #print the graph plt.figure(figsize=(20,10)) plt.plot(distances) plt.title('K-distance Graph',fontsize=20) plt.xlabel('Data Points sorted by distance',fontsize=14) plt.ylabel('Epsilon',fontsize=14) plt.show()

As you can see on the graph, the slope is increasing the most at ~3.5. So an epsilon around there is optimal according to this method. (source 1 & 2) Choosing a minPts is a bit harder since it requires domain knowledge. But we also found sources (3 & 4) saying that the starting point is to pick minPts as at least 4 for 2-dimensional data, but also that minPts can be higher when dealing with large datasets. We assumed our dataset to be large so in the end we went for a minPts = 5. So we start of with testing epsilon = 3.5 and min_Sample = 5.

Source 1: https://www.analyticsvidhya.com/blog/2020/09/how-dbscan-clustering-works/

Source 2: https://towardsdatascience.com/explaining-dbscan-clustering-18eaf5c83b31

Source 3: http://www.sefidian.com/2020/12/18/how-to-determine-epsilon-and-minpts-parameters-of-dbscan-clustering/

Source 4: https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd

3b)

#Source: https://www.youtube.com/watch?v=eq1zKgCFwkk&t=458s #method that lets you chose epsilon and min sample on data when clustering def do(eps,min_samples,dat,string): #use of dbscan and sorting of outliers and clusters dbscan = DBSCAN(eps=eps,min_samples=min_samples).fit(dat[['phi','psi']]) dat['DBSCAN_labels']= dbscan.labels_ outliers = dat[dat['DBSCAN_labels'] == -1] clusters = dat[dat['DBSCAN_labels'] != -1] #colors, all the data points that are positive is not outliers. We color the outliers(-1) black. colors = dat['DBSCAN_labels'] colors_clusters = colors[colors != -1] colors_outliers = 'black' #plot and titles fig = plt.figure() ax = fig.add_axes([.1,.1,1,1]) ax.scatter(clusters['phi'],clusters['psi'], c=colors_clusters,s=30, alpha = 0.5) ax.scatter(outliers['phi'],outliers['psi'], c=colors_outliers,s=30, alpha = 0.5) ax.set_xlabel('Phi') ax.set_ylabel('Psi') plt.title('epsilon: {0}, min_samples: {1}'.format(eps,min_samples)) plt.xlim(-200,200) plt.ylim(-200,200) plt.show #prints print(string) n_noise_ = list(dbscan.labels_).count(-1) print('Number of outliers: %d' % n_noise_) print('Number of clusters: {}'.format(len(Counter(dbscan.labels_))-1)) print('Clusters (colored) and outliers (black)')

do(3.5,5,d700,"Iteration 1")

As you can see on the graph and text above we have 93 clusters, which by looking at the way the data looks doesn't seem reasonable at all. Visually you can see around 3-6 clusters. We therefore want fewer clusters, so we increase min_Sample and try to solve this iteratively.

do(3.5,30,d700,"Iteration 2")

What happens when we increase the min_samples is that we get fewer clusters but a lot of outliers => so we need to increase the epsilon in order to increase the radius of the clusters. We therefore set epsilon to 10.

do(10,30,d700,"Iteration 3")

Now we get a much more realistic result! So this will be our graph! Number of outliers are the black data points in the graph and they are 1168. Below you can see the bar graph with the outliers, the most common one by far is GLY.

dbscan = DBSCAN(eps=10,min_samples=30).fit(d700[['phi','psi']]) d700['DBSCAN_labels']= dbscan.labels_ outliers = d700[d700['DBSCAN_labels'] == -1] outliers1=outliers["residue name"].value_counts() outliers1.plot(kind='bar', color='black')

3c)

The cluster does look pretty different. The biggest difference in our opinion is that the DBSCAN considers outliers/noise. This makes the clusters more dense. The clusters for DBSCAN are also more circle-like compared to the K-means clusters. If you for example look at the graph for k-means where k=5, the "blob" in the top right corner is split half into two clusters.  Which doesn't seem totally realistic. However, with the K-means method it's pretty easy to find the right parameter using the elbow method. But for DBSCAN you have to choose two parameters and they are both quite dependent on each other.

3d)

The clusters are not robust and it seems as though the smaller the cluster the less robust it is. If we for example increase the Min_sample to 35, the dark green clusters disappear and become outliers. This is because it's not a lot of data points around that area so there will not be 35 data points within the epsilon distance. If we instead decrease the min_sample more clusters will appear. If we decrease the epsilon to 7, we get more outliers and more clusters. If we increase the epsilon to 13, we get more clusters and less outliers. The conclusion here is that the DBSCAN is very dependent on the parameters. Changing a parameter with a few steps makes pretty big differences so it's hard to know if you can trust the result.  

4)

#Sorts into datasets for specific proteins GLY = d700[d700['residue name']=='GLY'] PRO = d700[d700['residue name']=='PRO']

do(15,25,GLY,"GLY clusters")

do(15,30,PRO,"PRO clusters")

The GLY is pretty similar to the general clusters. The only big difference is that the big blob in the right top corner is split up into two blobs for GLY. As you can see on the graph with only PRO, it seems as though many of the data points who "connect" the two blobs in the right top corners are PRO data points. The clusters for PRO differ quite a lot for the general graph with clusters. They are pretty centered around -100 to -50 on the Phi axis and spread out on the psi axis, with a tendency to be between -50 to 180.

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}Assignment 3, question 3 &amp; 4

Johanna Wiberg (jwiberg):18 hours in total both this file and the other one

Oscar Forsberg (oscfors): 18 hours in total both this file and the other one

3a)

3b)

3c)

3d)

4)

Assignment 3, question 3 & 4