Background
This project explores a Goodreads dataset to gain a deeper understanding of the relationships between its columns. By exploring average ratings, page counts, author popularity, and language distribution, we can uncover trends and patterns in the book industry and in readership preferences. This kind of analysis can help readers decide which books to read or recommend, and can also provide valuable information for publishers, authors, and booksellers.
In this case, K-Means is used as a machine learning technique to identify patterns and relationships within the data (specifically, the book titles) and to group similar items into clusters. This lets us visualize and analyze the data in a meaningful way, uncovering insights that are not immediately apparent from the raw data.
Load the Libraries
Basic Things
Let's do some cleaning first: some entries in the authors column include a "Visual Artist" contributor, so we need to remove it and keep only the author themselves.
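A minimal sketch of this cleaning step, assuming (as in the Kaggle Goodreads dump) that the column is named `authors` and that co-contributors are separated by `/`:

```python
import pandas as pd

# Toy rows mimicking the Goodreads "authors" column; the column name and
# the "/" separator are assumptions based on the Kaggle Goodreads dataset
df = pd.DataFrame({
    "title": ["The Sandman", "Emma"],
    "authors": ["Neil Gaiman/Visual Artist", "Jane Austen"],
})

# Keep only the first contributor, dropping "Visual Artist" and other co-credits
df["authors"] = df["authors"].str.split("/").str[0]
print(df["authors"].tolist())
```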
Exploratory Data Analysis
Pages vs Avg Ratings Based on Ratings Count
Let's analyze the question "Do more pages mean better ratings, taking ratings count into account?" We use a correlation matrix (via the Seaborn package) to see the relationships between these variables.
As the graph shows, a large number of pages does not guarantee good reviews: many books under 500 pages score above 4.0 and also have higher ratings counts. Let's check the correlation matrix to dig into how each variable relates to the others.
As we can see, the correlation between number of pages and average rating is low. That makes sense: looking at the data, many books under 500 pages have good average ratings. For example, Harry Potter and the Chamber of Secrets (Harry Potter #2) has 341 pages and a rating of 4.42; the correlation between these two variables is only about 0.2. Similarly, ratings count correlates very weakly with both num_pages (0.0079) and average_rating (0.06). One interesting finding is that text_reviews_count has a strong relationship with ratings_count, with a score of 0.84, an almost perfectly positive correlation. This is because people who leave a text review almost always leave a star rating as well.
The conclusion: after digging into the analysis, we understand that a large page count does not always mean a good review. Books under 500 pages can have excellent reviews, and some long books have low ratings, which indicates the two variables are weakly related. Ratings count correlates poorly with both number of pages and average rating, because how many people rate a book has little to do with its length or score; people rate books based on storyline, author, or cover. Finally, the strong positive correlation of 0.84 (nearly 1 if rounded up) between ratings count and text reviews count tells us that people who leave a rating often leave a text review as well.
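The correlation matrix above can be sketched like this. The data here is synthetic (the real dataset is not reproduced), with `text_reviews_count` deliberately constructed to track `ratings_count`, so only the shape of the computation, not the exact numbers, matches the analysis:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the four Goodreads columns discussed above
rng = np.random.default_rng(0)
ratings_count = rng.integers(100, 100_000, size=200)
df = pd.DataFrame({
    "num_pages": rng.integers(100, 1200, size=200),
    "average_rating": rng.normal(4.0, 0.3, size=200).round(2),
    "ratings_count": ratings_count,
    # text reviews closely track ratings, mirroring the strong 0.84 correlation
    "text_reviews_count": (ratings_count * 0.05 + rng.normal(0, 200, size=200)).astype(int),
})

corr = df.corr()
print(corr.round(2))
# With seaborn installed, the matrix can be drawn as a heatmap:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```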
Distribution of Books by Language
From this graph we can see that the majority of the books are in English, with some sub-categorised into English-US, English-UK, and English-CA. This reflects the widespread use of English globally and its dominance as a medium of communication.
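A sketch of how this distribution is computed, assuming the language column is named `language_code` as in the Kaggle Goodreads dump (the sample values here are illustrative):

```python
import pandas as pd

# Toy sample of the language_code column; real data is dominated by "eng"
langs = pd.Series(
    ["eng"] * 8 + ["en-US"] * 3 + ["en-GB"] * 2 + ["spa", "fre", "en-CA"],
    name="language_code",
)

# Count books per language; plotting counts as a bar chart gives the graph above
counts = langs.value_counts()
print(counts)
```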
Most Occurring Books of All Time
As you can see, The Iliad and The Brothers Karamazov have the most occurrences of the same title in the data.
From the list, we can see that most of the books in the chart are either old classics or books commonly assigned in schools. It seems some books do age well and have braved the flow of time.
Additionally, many of these books are assigned as part of curricula in schools, which contributes to their continued popularity.
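The occurrence count behind this chart is a straightforward `value_counts` on the title column; a sketch with toy data (the real dataset contains many reissues of the same classic):

```python
import pandas as pd

# Toy title column; duplicates stand in for multiple editions of one classic
titles = pd.Series(
    ["The Iliad"] * 4 + ["The Brothers Karamazov"] * 4 + ["Dune", "Emma"],
    name="title",
)

# Count how often each exact title appears in the dataset
occurrences = titles.value_counts()
print(occurrences.head())
```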
Which Are The Top 10 Most Rated Books
By taking the mean of the ratings for each book title, we simplify the data and get a general idea of the average rating per book. The high average ratings for Twilight, The Hobbit, and Harry Potter make sense, as they are popular books with large numbers of ratings. Additionally, The Catcher in the Rye, a coming-of-age novel set in post-WWII America, gives us a glimpse into the society of that era, which may have contributed to its popularity and high rating.
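The mean-per-title step can be sketched as a pandas `groupby` (toy ratings here; column names follow the Kaggle Goodreads dump):

```python
import pandas as pd

# Toy sample: several editions of the same title, each with its own rating
df = pd.DataFrame({
    "title": ["Twilight", "Twilight", "The Hobbit", "Emma"],
    "average_rating": [4.6, 4.4, 4.7, 3.9],
})

# Mean rating per title, then the highest-rated entries
top = df.groupby("title")["average_rating"].mean().sort_values(ascending=False).head(10)
print(top)
```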
Which Authors Have the Most Books?
This plot gives us a visual representation of the distribution of book titles in the dataset, and it appears that Stephen King has the most books in the list. However, it's important to note that some of these entries may be different editions of the same title.
The recognition and status of being a classic author can also contribute to having more books in the list, and the hype around a book or author plays a role in its popularity and representation in this dataset.
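Because of those duplicate editions, counting rows per author and counting distinct titles per author can differ. A sketch of both, with toy data:

```python
import pandas as pd

# Toy rows: "It" appears twice, standing in for two editions of one title
df = pd.DataFrame({
    "authors": ["Stephen King", "Stephen King", "Stephen King", "Jane Austen"],
    "title": ["It", "Misery", "It", "Emma"],
})

# Raw row counts per author (inflated by multiple editions of one title)
per_author = df["authors"].value_counts()
# Distinct titles per author, which corrects for duplicate editions
unique_titles = df.groupby("authors")["title"].nunique()
print(per_author.to_dict(), unique_titles.to_dict())
```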
What is The Rating Distribution for The Books
It appears that books with a score of 5 are relatively rare, and most ratings fall in the 3.7 to 4.3 range. This indicates that the majority of books receive average to slightly above-average ratings, with only a few earning the maximum rating of 5.
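One way to verify this is to bucket the ratings; a sketch with toy values and bin edges chosen to match the 3.7 to 4.3 range discussed above:

```python
import pandas as pd

# Toy average_rating values; the real data clusters between roughly 3.7 and 4.3
ratings = pd.Series([3.2, 3.8, 3.9, 4.0, 4.1, 4.1, 4.2, 4.3, 4.6, 5.0])

# Bucket the ratings to approximate the distribution discussed above
binned = pd.cut(ratings, bins=[0, 3.7, 4.3, 5.0]).value_counts().sort_index()
print(binned)
```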
Clustering Books Titles
The Meaning of Clustering
Clustering is the process of grouping similar data points together in a dataset. It's an unsupervised learning method whose aim is to partition the data into different groups (also known as clusters) based on their similarity.
The algorithm used to perform clustering may vary, but popular algorithms include K-Means, Hierarchical Clustering, and Density-Based Clustering.
Today we use K-Means for clustering, because it's simple and among the most popular. K-Means is an iterative algorithm that groups similar data points together, partitioning the data into "K" clusters based on their similarity.
We have a small problem here: we are clustering text data, which is a bit trickier than the usual numeric data. We need to convert the text into numeric data that a machine can understand, a step called text vectorization.
We use TF-IDF, a popular frequency-based vectorization method that is simple but clearly better than a plain counts vectorizer.
Modeling Steps
Transform Titles (Text) into Vectors
Before getting started, let's import the TF-IDF vectorizer from the scikit-learn package. Then we initialise the vectorizer object with the stop-word list set to English; stop words (words like you, me, at, the, with, etc.) are generally not interesting for NLP tasks.
After setting up TF-IDF, let's move on to ngram_range. It's an option of feature-extraction techniques such as the TfidfVectorizer, and it determines the size of the n-grams the feature extraction will create.
Then we use the vectorizer to fit and transform the titles into vectors. X here is essentially a large (sparse) matrix in which each row represents one book title.
Implement K-Means
Before we start, we need to import K-Means from scikit-learn. One small challenge with K-Means is that we have to specify the number of clusters we want to create. We don't know the optimal number in advance, so we have to find it out ourselves.
The strategy: we'll assume a minimum number of clusters, say 2, and a maximum, say 10. For each candidate number of clusters we perform K-Means clustering and calculate the sum of squared distances, or inertia as it's called in the scikit-learn library. This is the sum of squared distances from the data points to their closest cluster centre. Don't forget to append each inertia value to an array.
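A minimal sketch of this loop (toy titles again; `random_state` and `n_init` are set here only for reproducibility and are assumptions, not values from the original notebook):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Harry Potter and the Chamber of Secrets", "Harry Potter and the Goblet of Fire",
    "The Hobbit", "The Lord of the Rings", "Dracula", "Interview with the Vampire",
    "Emma", "Pride and Prejudice", "War and Peace", "The Brothers Karamazov",
    "Dune", "Foundation",
]
X = TfidfVectorizer(stop_words="english").fit_transform(titles)

# Fit K-Means for each candidate k and record the inertia
# (sum of squared distances to the closest cluster centre)
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
print(inertias)
```

Plotting `inertias` against `range(2, 11)` produces the elbow curve discussed next.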
Optimal Number Of Clusters
When we plot the sum of squared distances, you can see that the more clusters we have, the lower it gets. That makes sense: in the extreme case of one cluster per data point, the sum of squared distances would be exactly zero. With the elbow method we look for the point where the curve starts flattening. Around 6 clusters the sum of squared distances begins declining more slowly, so we can take 6 as the optimal number of clusters; keep in mind, however, that K-Means is not deterministic.
Now we choose 6 as the optimal number of clusters, and I'll pass that into the K-Means model to get the prediction labels. It's quite simple: we take the labels out of the model and zip them with the titles in the data, and then we can see which cluster each book belongs to.
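A sketch of the final fit and the label-to-title pairing (toy titles; `random_state`/`n_init` are reproducibility assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Harry Potter and the Chamber of Secrets", "Harry Potter and the Goblet of Fire",
    "The Hobbit", "The Lord of the Rings", "Dracula", "Interview with the Vampire",
    "Emma", "Pride and Prejudice", "War and Peace", "The Brothers Karamazov",
    "Dune", "Foundation",
]
X = TfidfVectorizer(stop_words="english").fit_transform(titles)

# Fit the final model with the chosen number of clusters
model = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X)

# Pair each title with its predicted cluster label
assignments = list(zip(titles, model.labels_))
print(assignments)
```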
Now you might be thinking, "Okay, what do those clusters actually mean?" That's a good question. One way to find out is simply to print the top terms per cluster, though that's a little less interesting visually.
Word Cloud Visualization
I prefer a more visual approach: we can create a word cloud for each book-title cluster. It's very simple: for each cluster, we gather the text from all the book titles within it and build a word cloud out of that text. I've put in some arbitrary arguments here; customise them if you like, then plot the word clouds.
Don't forget to import WordCloud from the wordcloud package. To compare the clouds easily, use subplots with 2 rows and 3 columns, since we have six clusters in total.
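The text-gathering step can be sketched as below; the actual WordCloud and subplot calls are shown as comments, since the wordcloud package may not be installed in every environment (the toy assignments stand in for the zipped labels from the K-Means step):

```python
from collections import defaultdict

# Toy (title, cluster) pairs like those produced by the K-Means step
assignments = [
    ("Harry Potter and the Chamber of Secrets", 0),
    ("Harry Potter and the Goblet of Fire", 0),
    ("Dracula", 1),
    ("Interview with the Vampire", 1),
]

# Gather all titles in each cluster into one text blob per cluster
cluster_text = defaultdict(str)
for title, label in assignments:
    cluster_text[label] += " " + title
print(dict(cluster_text))

# With the wordcloud package installed, each blob becomes one panel, e.g.:
#   from wordcloud import WordCloud
#   import matplotlib.pyplot as plt
#   fig, axes = plt.subplots(2, 3, figsize=(15, 8))  # 2 x 3 grid for 6 clusters
#   for ax, (label, text) in zip(axes.ravel(), cluster_text.items()):
#       ax.imshow(WordCloud(background_color="white").generate(text))
#       ax.axis("off")
```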
This conclusion is drawn from the keywords that appear most frequently in each cluster, which seem to indicate the main topics covered by the books in that cluster.
Conclusion
After conducting an Exploratory Data Analysis (EDA) and applying the K-Means clustering method, we gained valuable insights into the distribution of books based on their titles. We observed the trend of popular authors over time, the relationship between the number of pages and average ratings, and the distribution of books by language. The K-Means method let us cluster the books by title into different groups and understand the similarities and differences between them. By analyzing the clusters, we could see that each was comprised of different types of books, including drama, adventure, science fiction, fantasy fiction, and vampire fiction. This analysis gave us a deeper understanding of the data and allowed us to draw meaningful conclusions about the distribution of books.