Lab8 Clustering

Methods for choosing the number of clusters fall into two groups.

Direct methods: these optimize a criterion, such as the within-cluster sum of squares or the average silhouette. The corresponding methods are called the elbow method and the silhouette method, respectively.

Statistical testing methods: these compare the evidence against a null hypothesis. The gap statistic is an example.

Common methods

The elbow method: plot the explained variation as a function of the number of clusters, and pick the elbow of the curve (the point after which adding clusters gives little further improvement) as the number of clusters to use.
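The elbow idea can be sketched with a small NumPy-only k-means that reports the within-cluster sum of squares (WCSS) for several values of k. This is a minimal sketch, not the lab's own code: the helper name `kmeans_wcss`, the random initialisation, and the range of k are assumptions made for illustration.

```python
import numpy as np

# The lab's 15 data points, repeated here so the sketch is self-contained.
datas = np.array([
    [5, 7], [15, 12], [16, 18], [6, 6], [16, 11],
    [15, 11], [6, 4], [13, 13], [16, 17], [5, 5],
    [15, 17], [16, 17], [6, 8], [14, 12], [16, 15],
], dtype=float)

def kmeans_wcss(points, k, n_iter=50, seed=0):
    """Run a basic k-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Assumption: initial centers are k rows of the data chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

# WCSS for k = 1..6; the elbow is where the decrease flattens out.
wcss_by_k = {k: kmeans_wcss(datas, k) for k in range(1, 7)}
for k, wcss in wcss_by_k.items():
    print(f"k={k}: WCSS={wcss:.2f}")
```

Plotting `wcss_by_k` against k (for example with matplotlib) gives the curve on which the elbow is read off by eye.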

Optimization of the silhouette coefficient: The component uses a Parameter Optimization Loop that retrains k-means with a different k at each iteration; the k that yields the highest average silhouette coefficient is selected.
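Outside a workflow tool, the same loop over k can be sketched in plain Python. The sketch below is an illustration under assumptions: `kmeans_labels` uses a deterministic farthest-point initialisation chosen only for reproducibility, and `silhouette_score` is a hand-written version of the standard definition s = (b - a) / max(a, b), not a library call.

```python
import numpy as np

# The lab's 15 data points, repeated here so the sketch is self-contained.
datas = np.array([
    [5, 7], [15, 12], [16, 18], [6, 6], [16, 11],
    [15, 11], [6, 4], [13, 13], [16, 17], [5, 5],
    [15, 17], [16, 17], [6, 8], [14, 12], [16, 15],
], dtype=float)

def kmeans_labels(points, k, n_iter=50):
    """Basic k-means; returns the cluster label of every point."""
    # Assumption: farthest-point initialisation, used here for determinism.
    centers = [points[0].copy()]
    while len(centers) < k:
        d = np.linalg.norm(
            points[:, None, :] - np.array(centers)[None, :, :], axis=2
        ).min(axis=1)
        centers.append(points[d.argmax()].copy())
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:  # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Retrain with a different k each time and keep the best-scoring k.
scores = {k: silhouette_score(datas, kmeans_labels(datas, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```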

The gap statistic: The idea of the Gap statistic is to compare the within-cluster dispersion to its expectation under an appropriate null reference distribution (Tibshirani et al., 2001).
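For reference, the definition in Tibshirani et al. (2001) can be written as follows. The notation here is an assumption for illustration: C_r is cluster r with n_r points, d_{ii'} is the pairwise distance between points i and i', and E*_n is the expectation under the reference distribution, estimated in practice from B simulated reference datasets.

```latex
\mathrm{Gap}_n(k) = \mathbb{E}^{*}_{n}\left[\log W_k\right] - \log W_k,
\qquad
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r,
\qquad
D_r = \sum_{i, i' \in C_r} d_{ii'}

% Selection rule: choose the smallest k such that
\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1},
\qquad
s_k = \mathrm{sd}_k \sqrt{1 + 1/B}
```

Here sd_k is the standard deviation of log W_k over the B reference datasets.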

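In K-means, each centroid is the mean of the points in its cluster, and outliers pull the mean toward themselves, so the centroids can be biased by a few extreme points. To reduce this effect, we could remove outliers before applying K-means clustering.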
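A small numeric sketch of this effect follows. The outlier at (40, 40) and the filtering rule (drop points farther than three times the median distance from the coordinate-wise median) are hypothetical choices made for illustration; other outlier rules would work as well.

```python
import numpy as np

# Five points from one tight group in the lab's data.
cluster = np.array([[5, 7], [6, 6], [6, 4], [5, 5], [6, 8]], dtype=float)
# Hypothetical outlier, not part of the lab's data.
outlier = np.array([[40, 40]], dtype=float)
all_points = np.vstack([cluster, outlier])

centroid_clean = cluster.mean(axis=0)
centroid_with_outlier = all_points.mean(axis=0)
print("centroid without outlier:", centroid_clean)
print("centroid with outlier:   ", centroid_with_outlier)

# Assumed filtering rule: measure each point's distance from the
# coordinate-wise median and drop points beyond 3x the median distance.
center = np.median(all_points, axis=0)
dist = np.linalg.norm(all_points - center, axis=1)
keep = dist <= 3 * np.median(dist)
centroid_filtered = all_points[keep].mean(axis=0)
print("centroid after filtering:", centroid_filtered)
```

The median is used for the filter because, unlike the mean, it is barely moved by a single extreme point.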

The algorithm can be stopped when the cluster centers no longer change significantly between iterations. This can be measured using a distance metric, such as the Euclidean distance, between the old and new cluster centers.
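The stopping rule described above can be sketched as a loop that measures how far each centroid moved and stops once the largest movement falls below a tolerance. The function name `kmeans_until_converged`, the tolerance `tol`, and the `max_iter` safety cap are assumptions for illustration; the data and starting centroids are the ones used in this lab.

```python
import numpy as np

datas = np.array([
    [5, 7], [15, 12], [16, 18], [6, 6], [16, 11],
    [15, 11], [6, 4], [13, 13], [16, 17], [5, 5],
    [15, 17], [16, 17], [6, 8], [14, 12], [16, 15],
], dtype=float)
initial_centroids = np.array([[5, 12], [15, 4], [10, 17]], dtype=float)

def kmeans_until_converged(points, centroids, tol=1e-4, max_iter=100):
    """Iterate k-means until no centroid moves more than `tol`."""
    centroids = centroids.astype(float)
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            if np.any(labels == j):
                new_centroids[j] = points[labels == j].mean(axis=0)
        # Euclidean distance between old and new position of each centroid.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels, iteration

final_centroids, final_labels, n_iter = kmeans_until_converged(datas, initial_centroids)
print("stopped after", n_iter, "iterations")
print(final_centroids)
```

The `max_iter` cap is a safeguard so the loop still ends if the tolerance is set too tightly.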

K-means clustering is suitable when the clusters are centroid-based, that is, compact groups that are well summarized by their means. It is not suitable when the clusters are not centroid-based; in those cases a method such as hierarchical clustering or DBSCAN is a better fit.

Here is an example. Suppose we have a dataset of crime incidents in a city, and we want to identify areas where crimes are more likely to occur. The dataset contains the location of each incident, the type of crime, and the time and date of the incident. In this situation, DBSCAN would be better suited than k-means for two reasons.

First, DBSCAN is a density-based clustering algorithm, which means it identifies clusters as regions of higher density in the data. In the case of crime incidents, we are more interested in identifying areas with a higher concentration of crime than areas that simply have similar characteristics.

Second, DBSCAN does not require the user to specify the number of clusters in advance, unlike k-means, where the user must choose the number of clusters before running the algorithm. In the case of crime incidents, it may be difficult to know in advance how many clusters there are or what size they should be.
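To make the density-based idea concrete, here is a minimal from-scratch sketch of DBSCAN in NumPy. It is written for illustration only: the incident coordinates are made up, the values of `eps` and `min_samples` are assumptions, and in practice a library implementation would normally be used instead.

```python
import numpy as np
from collections import deque

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one label per point, -1 meaning noise."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Neighbourhood of each point (the point itself is included).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue  # already assigned, or not dense enough to seed a cluster
        labels[i] = cluster
        queue = deque([i])
        while queue:
            j = queue.popleft()
            if not is_core[j]:
                continue  # border points join a cluster but do not expand it
            for nb in neighbors[j]:
                if labels[nb] == -1:
                    labels[nb] = cluster
                    queue.append(nb)
        cluster += 1
    return labels

# Hypothetical incident locations: two dense areas and one isolated incident.
incidents = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [10, 10], [10, 11], [11, 10], [11, 11],
    [5, 20],
], dtype=float)

labels = dbscan(incidents, eps=1.5, min_samples=3)
n_clusters = int(labels.max()) + 1
print("labels:", labels.tolist())
print("clusters found:", n_clusters)
```

Note that the number of clusters is an output of the procedure, not an input, and the isolated incident is reported as noise instead of being forced into a cluster.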

Hard clustering: each data point belongs to exactly one cluster, e.g. K-means.

Soft clustering: a data point can belong to multiple clusters, each with its own degree of membership.

We should use soft clustering when we want to find how similar an item is to each of a number of given groups.
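The difference can be illustrated by computing both kinds of assignment for the same points. The sketch below uses the fuzzy c-means membership formula with fuzzifier m = 2 as one possible soft assignment; the centers and query points are illustrative values, and `soft_memberships` is a hypothetical helper name.

```python
import numpy as np

def soft_memberships(points, centers, m=2.0, eps=1e-12):
    """Fuzzy c-means style memberships; each row sums to 1."""
    # eps avoids division by zero when a point sits exactly on a center.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2) + eps
    # ratios[i, j, k] = (d_ij / d_ik) ** (2 / (m - 1))
    ratios = (dists[:, :, None] / dists[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=2)

# Illustrative cluster centers (assumed for this example).
centers = np.array([[5.6, 6.0], [14.6, 11.8], [15.8, 16.8]])
# One point close to the first center, one halfway between the other two.
points = np.array([[5.0, 7.0], [15.2, 14.3]])

soft = soft_memberships(points, centers)
hard = soft.argmax(axis=1)  # hard clustering keeps only the strongest membership

for p, u, h in zip(points, soft, hard):
    print(p, "memberships:", np.round(u, 3), "hard label:", h)
```

The second point shows why soft clustering is useful: its membership is split evenly between two groups, information that a single hard label would discard.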

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from scipy.spatial import distance_matrix

# Data points (x, y) and the three initial centroids
datas = np.array([
    [ 5,  7], [15, 12], [16, 18], [ 6,  6], [16, 11],
    [15, 11], [ 6,  4], [13, 13], [16, 17], [ 5,  5],
    [15, 17], [16, 17], [ 6,  8], [14, 12], [16, 15]
])
centroids = np.array([[ 5, 12], [15,  4], [10, 17]])

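# Recalculate centroid positions
def reposition_centroid(x, centroids):
    # Euclidean distance (p=2) from every point in x to every centroid
    dist_matrix = distance_matrix(x, centroids, p=2)
    group = [[] for _ in range(len(centroids))]
    for i in range(len(dist_matrix)):
        # Assign each point to its closest centroid
        closest_centroid = np.argmin(dist_matrix[i])
        group[closest_centroid].append(x[i])
    # Each new centroid is the mean of the points assigned to it
    return np.array([np.average(c, axis=0) for c in group])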

n_iteration = 2
# Slot 0 holds the initial state; later slots hold the state after each update
iterations = np.empty(n_iteration + 1, dtype=object)
for i in range(len(iterations)):
    iterations[i] = pd.concat([
        pd.DataFrame(dict(x=datas[:, 0], y=datas[:, 1], label="data")),
        pd.DataFrame(dict(x=centroids[:, 0], y=centroids[:, 1], label="centroid"))
    ])
    centroids = reposition_centroid(datas, centroids)

# Plot the data and the centroids for each stored iteration
for i in range(len(iterations)):
    sns.scatterplot(data=iterations[i], x="x", y="y", hue="label")\
        .set(title=f"Iteration: {i}" if i != 0 else "Initial values")
    plt.xlim(0, 20)
    plt.ylim(0, 20)
    plt.show()