Setting up MySQL in your local machine
You can find detailed instructions on how to setup MySQL workbench on your local machine in the following link - https://ladvien.com/data-analytics-mysql-localhost-setup/
Getting the Goodreads books' data
As of December 8th, 2020, Goodreads no longer issues new developer keys for their public developer API and plans to retire the current version of these tools. Hence it wasn't possible for me to directly use their API to scrape the data. Instead, I got data that was already scraped from Kaggle in the following link - https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks
Since this is just a simple exploratory analysis, we can use the available data to see what we can find. But please note that the actual real-world data might be different from what we see here.
Creating a database and a table to load the Books' data
Set OPT_LOCAL_INFILE=1 under Others in the 'Advanced' tab under the 'localhost' connection to enable loading csv files from anywhere in the system.
Performing exploratory analysis using the loaded data
Basic checks
High Level Info on the data at hand
Since there are 5 different variants of English, we can create a new Language column where we combine them all into a single value. This would help us to get the overall trends for the books by language.
Since the average # of ratings and average # of user text reviews are way higher than the median count, we can clearly say that the data for # ratings and # of user text reviews is positively skewed. Hence in order to objectively look at any insights, it would make sense to go with the median instead of the mean.
Top 15 publishers based on the number of books published
Top 15 books with the highest number of pages
Top 15 Books with the highest number of ratings
Top 15 Books with the highest number of text reviews
Top 15 Authors based on the number of books
Top 15 Books with the highest rating (Only considering those books wth atleast median number of ratings and median number of text reviews)
Top 15 Authors with the highest average rating (Only considering those books wth atleast median number of ratings and median number of text reviews)
Top 15 highest number of collaborations between Author and Publishing house
Deeper exploration
Authors who have published books in more than 1 language
Authors with more than 1 book published
Among authors with more than 1 book, top 15 longest time taken to publish between 2 consecutive books
This is the result we are getting with the dataset we have. A complete dataset with accurate publication dates would give us a much clearer picture. We should ideally remove any anniversary editions and re-releases printed after the author's death to avoid huge difference between years. But that is a project for another day!