Background
This project explores a Goodreads dataset to gain a deeper understanding of the relationships between its columns. By exploring average ratings, page counts, author popularity, and language distribution, we can uncover trends and patterns in the book industry and in readership preferences. This kind of analysis can help readers decide which books to read or recommend, and can also provide valuable information for publishers, authors, and booksellers.
In this case, K-Means is used as a machine learning technique to identify patterns and relationships within the data (specifically, the book titles) and to group similar items into clusters. This lets us visualize and analyze the data in a meaningful way, uncovering insights that are not immediately apparent from the raw data.
Load the Libraries
Basic Things
Let's do some cleaning first: some entries in the authors column include a "Visual Artist" contributor, so we need to remove it and keep only the author themselves.
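A minimal sketch of this cleaning step, assuming (as in the Kaggle Goodreads dump) that the column is named `authors` and that co-contributors are separated by `/`:

```python
import pandas as pd

# Toy rows mimicking the Goodreads "authors" column; the column name and
# the "/" separator are assumptions based on the Kaggle Goodreads dataset
df = pd.DataFrame({
    "title": ["The Sandman", "Emma"],
    "authors": ["Neil Gaiman/Visual Artist", "Jane Austen"],
})

# Keep only the first contributor, dropping "Visual Artist" and other co-credits
df["authors"] = df["authors"].str.split("/").str[0]
print(df["authors"].tolist())
```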
Exploratory Data Analysis
Pages vs Avg Ratings Based on Ratings Count
Let's analyze the question "Do more pages mean better ratings, taking ratings count into account?" We use a correlation matrix (via the Seaborn package) to see the relationships between these variables.
As the graph shows, a large number of pages does not guarantee good reviews: many books under 500 pages score above 4.0 and also have higher ratings counts. Let's check the correlation matrix to dig into how each variable relates to the others.
As we can see, the correlation between number of pages and average rating is low. That makes sense: looking at the data, many books under 500 pages have good average ratings. For example, Harry Potter and the Chamber of Secrets (Harry Potter #2) has 341 pages and a rating of 4.42; the correlation between these two variables is only about 0.2. Similarly, ratings count correlates very weakly with both num_pages (0.0079) and average_rating (0.06). One interesting finding is that text_reviews_count has a strong relationship with ratings_count, with a score of 0.84, an almost perfectly positive correlation. This is because people who leave a text review almost always leave a star rating as well.
The conclusion: after digging into the analysis, we understand that a large page count does not always mean a good review. Books under 500 pages can have excellent reviews, and some long books have low ratings, which indicates the two variables are weakly related. Ratings count correlates poorly with both number of pages and average rating, because how many people rate a book has little to do with its length or score; people rate books based on storyline, author, or cover. Finally, the strong positive correlation of 0.84 (nearly 1 if rounded up) between ratings count and text reviews count tells us that people who leave a rating often leave a text review as well.
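The correlation matrix above can be sketched like this. The data here is synthetic (the real dataset is not reproduced), with `text_reviews_count` deliberately constructed to track `ratings_count`, so only the shape of the computation, not the exact numbers, matches the analysis:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the four Goodreads columns discussed above
rng = np.random.default_rng(0)
ratings_count = rng.integers(100, 100_000, size=200)
df = pd.DataFrame({
    "num_pages": rng.integers(100, 1200, size=200),
    "average_rating": rng.normal(4.0, 0.3, size=200).round(2),
    "ratings_count": ratings_count,
    # text reviews closely track ratings, mirroring the strong 0.84 correlation
    "text_reviews_count": (ratings_count * 0.05 + rng.normal(0, 200, size=200)).astype(int),
})

corr = df.corr()
print(corr.round(2))
# With seaborn installed, the matrix can be drawn as a heatmap:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```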
Distribution of Books by Language
From this graph we can see that the majority of the books are in English, with some sub-categorised into English-US, English-UK, and English-CA. This reflects the widespread use of English globally and its dominance as a medium of communication.
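A sketch of how this distribution is computed, assuming the language column is named `language_code` as in the Kaggle Goodreads dump (the sample values here are illustrative):

```python
import pandas as pd

# Toy sample of the language_code column; real data is dominated by "eng"
langs = pd.Series(
    ["eng"] * 8 + ["en-US"] * 3 + ["en-GB"] * 2 + ["spa", "fre", "en-CA"],
    name="language_code",
)

# Count books per language; plotting counts as a bar chart gives the graph above
counts = langs.value_counts()
print(counts)
```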
Most Occurring Books of All Time
As you can see, The Iliad and The Brothers Karamazov have the most occurrences of the same title in the data.
From the list, we can see that most of the books in the chart are either old classics or books commonly assigned in schools. It seems some books do age well and have braved the flow of time.
Additionally, many of these books are assigned as part of curricula in schools, which contributes to their continued popularity.
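The occurrence count behind this chart is a straightforward `value_counts` on the title column; a sketch with toy data (the real dataset contains many reissues of the same classic):

```python
import pandas as pd

# Toy title column; duplicates stand in for multiple editions of one classic
titles = pd.Series(
    ["The Iliad"] * 4 + ["The Brothers Karamazov"] * 4 + ["Dune", "Emma"],
    name="title",
)

# Count how often each exact title appears in the dataset
occurrences = titles.value_counts()
print(occurrences.head())
```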
Which Are The Top 10 Most Rated Books
By taking the mean of the ratings for each book title, we simplify the data and get a general idea of the average rating per book. The high average ratings for Twilight, The Hobbit, and Harry Potter make sense, as they are popular books with large numbers of ratings. Additionally, The Catcher in the Rye, a coming-of-age novel set in post-WWII America, gives us a glimpse into the society of that era, which may have contributed to its popularity and high rating.
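The mean-per-title step can be sketched as a pandas `groupby` (toy ratings here; column names follow the Kaggle Goodreads dump):

```python
import pandas as pd

# Toy sample: several editions of the same title, each with its own rating
df = pd.DataFrame({
    "title": ["Twilight", "Twilight", "The Hobbit", "Emma"],
    "average_rating": [4.6, 4.4, 4.7, 3.9],
})

# Mean rating per title, then the highest-rated entries
top = df.groupby("title")["average_rating"].mean().sort_values(ascending=False).head(10)
print(top)
```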
Which Authors Have the Most Books?
This plot gives us a visual representation of the distribution of book titles in the dataset, and it appears that Stephen King has the most books in the list. However, it's important to note that some of these entries may be different editions of the same title.
The recognition and status of being a classic author can also contribute to having more books in the list, and the hype around a book or author plays a role in its popularity and representation in this dataset.
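Because of those duplicate editions, counting rows per author and counting distinct titles per author can differ. A sketch of both, with toy data:

```python
import pandas as pd

# Toy rows: "It" appears twice, standing in for two editions of one title
df = pd.DataFrame({
    "authors": ["Stephen King", "Stephen King", "Stephen King", "Jane Austen"],
    "title": ["It", "Misery", "It", "Emma"],
})

# Raw row counts per author (inflated by multiple editions of one title)
per_author = df["authors"].value_counts()
# Distinct titles per author, which corrects for duplicate editions
unique_titles = df.groupby("authors")["title"].nunique()
print(per_author.to_dict(), unique_titles.to_dict())
```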
What is The Rating Distribution for The Books
It appears that books with a score of 5 are relatively rare, and most ratings fall in the 3.7 to 4.3 range. This indicates that the majority of books receive average to slightly above-average ratings, with only a few earning the maximum rating of 5.
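One way to verify this is to bucket the ratings; a sketch with toy values and bin edges chosen to match the 3.7 to 4.3 range discussed above:

```python
import pandas as pd

# Toy average_rating values; the real data clusters between roughly 3.7 and 4.3
ratings = pd.Series([3.2, 3.8, 3.9, 4.0, 4.1, 4.1, 4.2, 4.3, 4.6, 5.0])

# Bucket the ratings to approximate the distribution discussed above
binned = pd.cut(ratings, bins=[0, 3.7, 4.3, 5.0]).value_counts().sort_index()
print(binned)
```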
Clustering Books Titles
The Meaning of Clustering
Clustering is the process of grouping similar data points together in a dataset. It's an unsupervised learning method whose aim is to partition the data into different groups (also known as clusters) based on their similarity.
The algorithm used to perform clustering may vary, but popular algorithms include K-Means, Hierarchical Clustering, and Density-Based Clustering.
Today we use K-Means for clustering, because it's simple and among the most popular. K-Means is an iterative algorithm that groups similar data points together, partitioning the data into "K" clusters based on their similarity.
We have a small problem here: we are clustering text data, which is a bit trickier than the usual numeric data. We need to convert the text into numeric data that a machine can understand, a step called text vectorization.
We use TF-IDF, a popular frequency-based vectorization method that is simple but clearly better than a plain counts vectorizer.
Modeling Steps
Transform Titles (Text) into Vectors
Before getting started, let's import the TF-IDF vectorizer from the scikit-learn package. Then we initialise the vectorizer object with the stop-word list set to English; stop words (words like you, me, at, the, with, etc.) are generally not interesting for NLP tasks.
After setting up TF-IDF, let's move on to ngram_range. It's an option of feature-extraction techniques such as the TfidfVectorizer, and it determines the size of the n-grams the feature extraction will create.
Then we use the vectorizer to fit and transform the titles into vectors. X here is essentially a large (sparse) matrix in which each row represents one book title.
Implement K-Means
Before we start, we need to import K-Means from scikit-learn. One small challenge with K-Means is that we have to specify the number of clusters we want to create. We don't know the optimal number in advance, so we have to find it out ourselves.
The strategy: we'll assume a minimum number of clusters, say 2, and a maximum, say 10. For each candidate number of clusters we perform K-Means clustering and calculate the sum of squared distances, or inertia as it's called in the scikit-learn library. This is the sum of squared distances from the data points to their closest cluster centre. Don't forget to append each inertia value to an array.
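A minimal sketch of this loop (toy titles again; `random_state` and `n_init` are set here only for reproducibility and are assumptions, not values from the original notebook):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Harry Potter and the Chamber of Secrets", "Harry Potter and the Goblet of Fire",
    "The Hobbit", "The Lord of the Rings", "Dracula", "Interview with the Vampire",
    "Emma", "Pride and Prejudice", "War and Peace", "The Brothers Karamazov",
    "Dune", "Foundation",
]
X = TfidfVectorizer(stop_words="english").fit_transform(titles)

# Fit K-Means for each candidate k and record the inertia
# (sum of squared distances to the closest cluster centre)
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
print(inertias)
```

Plotting `inertias` against `range(2, 11)` produces the elbow curve discussed next.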
Optimal Number Of Clusters
When we plot the sum of squared distances, you can see that the more clusters we have, the lower it gets. That makes sense: in the extreme case of one cluster per data point, the sum of squared distances would be exactly zero. With the elbow method we look for the point where the curve starts flattening. Around 6 clusters the sum of squared distances begins declining more slowly, so we can take 6 as the optimal number of clusters; keep in mind, however, that K-Means is not deterministic.
Now we choose 6 as the optimal number of clusters, and I'll pass that into the K-Means model to get the prediction labels. It's quite simple: we take the labels out of the model and zip them with the titles in the data, and then we can see which cluster each book belongs to.
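A sketch of the final fit and the label-to-title pairing (toy titles; `random_state`/`n_init` are reproducibility assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Harry Potter and the Chamber of Secrets", "Harry Potter and the Goblet of Fire",
    "The Hobbit", "The Lord of the Rings", "Dracula", "Interview with the Vampire",
    "Emma", "Pride and Prejudice", "War and Peace", "The Brothers Karamazov",
    "Dune", "Foundation",
]
X = TfidfVectorizer(stop_words="english").fit_transform(titles)

# Fit the final model with the chosen number of clusters
model = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X)

# Pair each title with its predicted cluster label
assignments = list(zip(titles, model.labels_))
print(assignments)
```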
Now you might be thinking, "Okay, what do those clusters actually mean?" That's a good question. One way to find out is simply to print the top terms per cluster, though that's a little less interesting visually.
Word Cloud Visualization
I prefer a more visual approach: we can create a word cloud for each book-title cluster. It's very simple: for each cluster, we gather the text from all the book titles within it and build a word cloud out of that text. I've put in some arbitrary arguments here; customise them if you like, then plot the word clouds.
Don't forget to import WordCloud from the wordcloud package. To compare the clouds easily, use subplots with 2 rows and 3 columns, since we have six clusters in total.
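The text-gathering step can be sketched as below; the actual WordCloud and subplot calls are shown as comments, since the wordcloud package may not be installed in every environment (the toy assignments stand in for the zipped labels from the K-Means step):

```python
from collections import defaultdict

# Toy (title, cluster) pairs like those produced by the K-Means step
assignments = [
    ("Harry Potter and the Chamber of Secrets", 0),
    ("Harry Potter and the Goblet of Fire", 0),
    ("Dracula", 1),
    ("Interview with the Vampire", 1),
]

# Gather all titles in each cluster into one text blob per cluster
cluster_text = defaultdict(str)
for title, label in assignments:
    cluster_text[label] += " " + title
print(dict(cluster_text))

# With the wordcloud package installed, each blob becomes one panel, e.g.:
#   from wordcloud import WordCloud
#   import matplotlib.pyplot as plt
#   fig, axes = plt.subplots(2, 3, figsize=(15, 8))  # 2 x 3 grid for 6 clusters
#   for ax, (label, text) in zip(axes.ravel(), cluster_text.items()):
#       ax.imshow(WordCloud(background_color="white").generate(text))
#       ax.axis("off")
```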
This conclusion is drawn from the keywords that appear most frequently in each cluster, which seem to indicate the main topics covered by the books in that cluster.
Conclusion
After conducting an Exploratory Data Analysis (EDA) and applying the K-Means clustering method, we gained valuable insights into the distribution of books based on their titles. We observed the trend of popular authors over time, the relationship between the number of pages and average ratings, and the distribution of books by language. The K-Means method let us cluster the books by title into different groups and understand the similarities and differences between them. By analyzing the clusters, we could see that each was comprised of different types of books, including drama, adventure, science fiction, fantasy fiction, and vampire fiction. This analysis gave us a deeper understanding of the data and allowed us to draw meaningful conclusions about the distribution of books.