K-Means Clustering
We will work with the Iris dataset from the sklearn package.
The Iris dataset consists of measurements of sepals and petals of 3 different plant species:
- Iris setosa
- Iris versicolor
- Iris virginica
Each characteristic we are interested in is a feature.
For example, petal length is a feature of this dataset.
The features of the dataset are:
- Column 0: Sepal length
- Column 1: Sepal width
- Column 2: Petal length
- Column 3: Petal width
Now we will load the dataset.
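A minimal sketch of loading the data with scikit-learn's built-in loader; printing the feature names simply confirms the four columns listed above, and the variable name samples is our own choice:
from sklearn import datasets

iris = datasets.load_iris()
samples = iris.data            # shape (150, 4): the four measurement columns

print(iris.feature_names)      # the column names listed above
print(samples.shape)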
What is K-Means
The K-Means algorithm:
1. Place k random centroids for the initial clusters.
2. Assign data samples to the nearest centroid.
3. Update centroids based on the above-assigned data samples.
4. Repeat steps 2 and 3 until convergence.
As there are three species of flower, and therefore three clusters, let's implement K-Means with k = 3.
Using the NumPy library, we will create three random initial centroids and plot them along with our samples.
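A minimal sketch of this step, assuming samples was loaded as above; plotting only the first two columns (sepal length and width) is our choice so the figure stays two-dimensional, not something the dataset requires:
import numpy as np
import matplotlib.pyplot as plt

k = 3
x = samples[:, 0]   # sepal length
y = samples[:, 1]   # sepal width

# Step 1: place k random centroids within the range of the data
centroids_x = np.random.uniform(x.min(), x.max(), size=k)
centroids_y = np.random.uniform(y.min(), y.max(), size=k)

# Plot the samples together with the random centroids
plt.scatter(x, y, alpha=0.5)
plt.scatter(centroids_x, centroids_y, marker='D', s=100)
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.show()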
According to the metadata:
- All the 0s are Iris-setosa
- All the 1s are Iris-versicolor
- All the 2s are Iris-virginica
Let's change these values into the corresponding species names.
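A minimal sketch of that conversion, assuming the numeric labels are stored in iris.target (where scikit-learn keeps the class labels):
import numpy as np

target = iris.target   # array of 0s, 1s and 2s, one per sample

# Index an array of species names with the numeric labels
species_names = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
species = species_names[target]

print(species[:5])     # first five samples, now labeled by species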
How to know the number of clusters
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
samples = iris.data

# Try k from 1 to 8 and record each model's inertia
num_clusters = list(range(1, 9))
inertias = []
for k in num_clusters:
    model = KMeans(n_clusters=k)
    model.fit(samples)
    inertias.append(model.inertia_)

plt.plot(num_clusters, inertias, '-o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()
The goal is to have low inertia with as few clusters as possible.
One way to interpret this graph is to use the elbow method: choose the "elbow" of the inertia plot, the point where inertia begins to decrease more slowly.
In the graph above, 3 is the optimal number of clusters.