Assignment 3: Clustering
by Cleo Matzken & Matthieu Moutot - Group 80 (~12 hrs each)
Exercise 1
Distribution of phi and psi combinations using a scatter plot
Let's plot a scatter graph representing every sample, with its phi angle on the x-axis and its psi angle on the y-axis.
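A minimal sketch of this plot, assuming the samples live in a pandas DataFrame with 'phi' and 'psi' columns (the file name and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed input: one row per residue, with 'phi' and 'psi' angle columns.
df = pd.read_csv("data_assignment3.csv")

plt.figure(figsize=(6, 6))
plt.scatter(df["phi"], df["psi"], s=2, alpha=0.3)
plt.xlabel("phi (degrees)")
plt.ylabel("psi (degrees)")
plt.title("Distribution of phi/psi combinations")
plt.show()
```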
On this graph, we can roughly identify 4 clusters: on the top left (the largest one), on the middle left, on the lower left and on the middle right. It seems like the data in the lower left is actually part of the top left cluster, which makes sense considering that the angles are periodic. We'll look more into that in question 2-d.
Using a heat map
Another way to display the different combinations is to use a heat map: the number of samples having the same combination is represented through a color scale.
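A sketch of such a heat map using a 2D histogram (the bin count is a choice, not a value from the assignment):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.hist2d(df["phi"], df["psi"], bins=180, cmap="hot")
plt.colorbar(label="number of samples")
plt.xlabel("phi (degrees)")
plt.ylabel("psi (degrees)")
plt.title("Heat map of phi/psi combinations")
plt.show()
```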
Unlike on the scatter plot, only 2 clusters are clearly identifiable on the heat map (a third, very faint one appears on the middle right).
Exercise 2
Now, let's carry on with the clustering process. First, we import the libraries allowing us to use the clustering tools.
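The exact import list is not shown here; a plausible minimal set, assuming scikit-learn is used for the clustering:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
```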
Choosing the right number of clusters
The first step is to provide the algorithm with an estimated number of clusters. According to our observations above, we could estimate that there are between 2 and 5 clusters. In order to get a better estimate, we can use the Elbow method.
We plot the distortion score (the sum of squared distances from each point to its assigned center) on the y-axis against the number of clusters fed to the K-means algorithm. The point where the curve bends most sharply is the "elbow", from which we deduce the optimal number of clusters.
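A minimal sketch of the Elbow method, using KMeans.inertia_ (the sum of squared distances to the closest centroid) as the distortion score; the range of K values is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Feature matrix: one (phi, psi) pair per sample.
X = df[["phi", "psi"]].to_numpy()

distortions = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_)  # distortion score for this K

plt.plot(ks, distortions, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("distortion score")
plt.show()
```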
Clustering using K-means method
According to the Elbow method, the optimal number of clusters in our case is 3. We can now split the data into 3 clusters thanks to the KMeans tool. The 'k-means++' algorithm is a seeding technique introduced by David Arthur and Sergei Vassilvitskii (http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf) that provides a smarter centroid initialization than the random initialization.
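A sketch of the fit, under the same assumptions as above:

```python
from sklearn.cluster import KMeans

# Final model with the number of clusters suggested by the elbow.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
```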
In order to have a graphical representation of the clustering, we can plot a scatter graph and color every sample belonging to the same cluster, as well as the centroids.
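For example (colors and marker styles are arbitrary choices):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=2, cmap="viridis", alpha=0.4)
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="x", s=100,
            label="centroids")
plt.xlabel("phi (degrees)")
plt.ylabel("psi (degrees)")
plt.legend()
plt.show()
```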
Validation
In order to verify that our clustering is stable, we consider a subset of the initial data set by randomly removing a certain proportion of the data, and then see if the clustering remains mostly the same. First we will remove 25% of the data, and then 40%.
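A sketch of the subsampling step (the helper function is ours, not part of the assignment):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def cluster_subset(X, drop_fraction):
    """Refit K-means on a random subset with drop_fraction of the rows removed."""
    keep = rng.choice(len(X), size=int(len(X) * (1 - drop_fraction)), replace=False)
    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
    return X[keep], km.fit_predict(X[keep])

X75, labels75 = cluster_subset(X, 0.25)  # 25% of the data removed
X60, labels60 = cluster_subset(X, 0.40)  # 40% of the data removed
```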
We can clearly see that the shape of the clusters does not change when the number of samples is reduced, so we can deduce that our clustering with K=3 is stable and relevant for this data set.
According to the additional information provided in the links, only 3 regions of phi and psi combinations actually occur in practice. Hence, the two clusters identified on the heat map are most likely relevant (the third one is also faintly distinguishable), as well as the three biggest clusters found on the scatter plot (but not the fourth one).
Modification of the data set
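The exact modification is not reproduced here; one plausible version, given the periodicity observed in Exercise 1, is to shift the wrapped psi values up by 360 degrees so that the lower-left points rejoin the top-left cluster (the -120 degree threshold is an assumption):

```python
# Shift psi angles that wrapped around the periodic boundary;
# the -120 degree cut-off is an assumed, hand-picked threshold.
df_shifted = df.copy()
df_shifted.loc[df_shifted["psi"] < -120, "psi"] += 360
X = df_shifted[["phi", "psi"]].to_numpy()
```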
The sample distribution looks more consistent now, and we can clearly identify three major clusters and maybe a smaller one on the top right.
We can now repeat the whole clustering procedure in order to see if there is any major change:
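For instance, refitting K-means on the modified data:

```python
from sklearn.cluster import KMeans

# Same procedure as before, now on the modified data set.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
```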
The clustering procedure shows similar results with three clusters, but the samples are now distributed across the three clusters in a more consistent way thanks to the modifications we made to the data set.
Exercise 3
Definition of parameters for a neighborhood
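A common way to choose the neighborhood parameters, sketched here as an assumption about the approach, is the k-distance plot: fix a candidate min_samples, sort every point's distance to its min_samples-th nearest neighbor, and look for a knee in the curve; the distance at the knee is a good candidate for eps.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_samples = 100  # candidate value for the neighborhood size

# Distance from each point to its min_samples-th nearest neighbor.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel("points sorted by neighbor distance")
plt.ylabel(f"distance to {min_samples}-th nearest neighbor")
plt.show()
```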
Clustering with DBSCAN
We can proceed to the clustering with the DBSCAN method using the optimal parameters that we found: eps=10 and min_samples=100.
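A sketch of the fit (noise points get the label -1):

```python
from sklearn.cluster import DBSCAN

db_labels = DBSCAN(eps=10, min_samples=100).fit_predict(X)

# Label -1 marks outliers; the remaining labels are cluster indices.
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_outliers = (db_labels == -1).sum()
```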
Bar chart for outliers' origins
Let's plot a bar chart in order to find out which amino acid residue types produce the most outliers.
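A sketch of the chart, assuming the data set has a 'residue name' column (the column name is an assumption about the data layout):

```python
import matplotlib.pyplot as plt

# Count residue types among the DBSCAN noise points (label -1).
outliers = df_shifted[db_labels == -1]
outliers["residue name"].value_counts().plot(kind="bar")
plt.xlabel("residue type")
plt.ylabel("number of outliers")
plt.show()
```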
We can clearly see that most of the outliers are of the GLY type.
Comparison of DBSCAN results with K-means results
Robustness of DBSCAN to eps and min_samples variations
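One way to probe this is to rerun DBSCAN over a small grid around the chosen values and compare the resulting cluster and outlier counts; a minimal sketch (the grid values are assumptions):

```python
from sklearn.cluster import DBSCAN

for eps in (5, 10, 15):
    for min_samples in (50, 100, 150):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(f"eps={eps:>2} min_samples={min_samples:>3} "
              f"clusters={n_clusters} outliers={(labels == -1).sum()}")
```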
Exercise 4
Clustering samples with residue type PRO
First, we plot the data in order to see where the PRO residue samples are mainly located.
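For example (again assuming a 'residue name' column):

```python
import matplotlib.pyplot as plt

# Keep only the proline residues.
pro = df_shifted[df_shifted["residue name"] == "PRO"]

plt.figure(figsize=(6, 6))
plt.scatter(pro["phi"], pro["psi"], s=4, alpha=0.5)
plt.xlabel("phi (degrees)")
plt.ylabel("psi (degrees)")
plt.title("PRO residues")
plt.show()
```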
We can then proceed to the DBSCAN clustering with the optimal parameters that we just found.
Clustering samples with residue type GLY
We first plot the data in order to have an overview.
Once we have found the relevant values for eps and min_samples, we can proceed with the DBSCAN clustering.