Flower Classifier Using K-Means
1. Exploring the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns
iris = sns.load_dataset('iris')
iris['species'].value_counts()
sns.pairplot(iris, hue='species')
sns.heatmap(iris.drop(columns='species').corr(), annot=True)  # drop the non-numeric column before computing correlations
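Before moving on, it is worth a quick sanity check that the dataset is complete; a minimal sketch using standard pandas calls (the seaborn iris dataset should have 150 rows and no missing values):
# Check for missing values and look at basic summary statistics
iris.isna().sum()
iris.describe()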
2. Constructing our ML Model
As we saw in the previous section, the most informative column is petal_length: it separates the species more cleanly than any other feature.
Also, petal_length and petal_width are highly correlated (0.96), so we can safely keep only one of them.
A good candidate for improving our predictions is sepal_length, as the pair plot of petal_length x sepal_length shows.
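To double-check the 0.96 figure quoted above, we can compute the correlation directly; a small sketch using the seaborn iris column names:
# Confirm the petal_length / petal_width correlation shown in the heatmap
iris['petal_length'].corr(iris['petal_width'])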
x = iris[['petal_length', 'sepal_length']].values
2.1. Using the Elbow Method to find the best K
ks = np.arange(2,21)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, max_iter=1000)
    model.fit(x)
    inertias.append(model.inertia_)
# print(ks)
# print(inertias)
plt.title('K vs Inertia')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.plot(ks, inertias, marker='o')
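As a cross-check of the elbow plot, we can also look at the silhouette score for each k. This is only a sketch (it reuses the ks and x defined above) and is not part of the original pipeline; higher silhouette values indicate better-separated clusters:
from sklearn.metrics import silhouette_score

sil_scores = []
for k in ks:
    # Fit a fresh model for each k and score the resulting labels
    labels = KMeans(n_clusters=k, max_iter=1000).fit_predict(x)
    sil_scores.append(silhouette_score(x, labels))

plt.figure()
plt.title('K vs Silhouette Score')
plt.xlabel('K')
plt.ylabel('Silhouette Score')
plt.plot(ks, sil_scores, marker='o')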
2.2. Using the best k to create the model
# Construct our model
model = KMeans(n_clusters=3, max_iter=1000)
model.fit(x)
# Plotting the result
y_predicted = model.predict(x)
sns.scatterplot(x=iris['petal_length'], y=iris['sepal_length'], hue=y_predicted)
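To make the clusters easier to read, we can also overlay the fitted centroids on the same axes; a small sketch, noting that the columns of cluster_centers_ follow the order of x (petal_length, then sepal_length):
# Overlay the three centroids found by KMeans
centers = model.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='centroids')
plt.legend()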
3. Accuracy
y_predicted
# codes = {'virginica':2, 'versicolor':1, 'setosa':0}
# y_expected = np.array([codes[s] for s in iris['species']])
# y_expected
from sklearn import metrics
# adjusted_rand_score measures the similarity between two clusterings,
# ignoring the actual label values: for example [1,1,2,2,3,3] and
# [2,2,3,3,0,0] describe the same grouping, so they score 1.0
accuracy = metrics.adjusted_rand_score(iris['species'], y_predicted)
print(accuracy)
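A complementary way to read this score is to cross-tabulate the predicted clusters against the true species; a sketch using pandas:
# Rows: true species, columns: cluster labels assigned by KMeans
pd.crosstab(iris['species'], y_predicted)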