Portfolio project 1: Amazon data science books
This is my first project! In it, I analyze the Amazon data science books dataset that is available on Kaggle:
Data preparation
Importing the libraries
Importing the data file & getting a first simple overview
Repairing the column n_reviews
The column n_reviews contains the number of reviews a book has received, but the values include a "," thousands separator, making the column unsuitable to work with as-is. I have therefore decided to remove this "," symbol from the column. Furthermore, there are a lot of NaN values.
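A minimal sketch of that cleaning step, assuming the data has been loaded into a pandas DataFrame called `df` with a string column `n_reviews`:

```python
import pandas as pd

# Assumed: df is the DataFrame loaded from the Kaggle CSV.
# Remove the thousands separator "," and convert to a numeric type;
# entries that cannot be parsed (e.g. missing values) become NaN.
df['n_reviews'] = (
    df['n_reviews']
    .astype(str)
    .str.replace(',', '', regex=False)
)
df['n_reviews'] = pd.to_numeric(df['n_reviews'], errors='coerce')
```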
Do more expensive books have better reviews?
First attempt at creating a scatter plot of Book price vs Reviews
More data prep
We see that there must be a huge outlier in the price data, because the x-axis goes all the way up to a book price of 1400. However, that book doesn't seem to have any reviews, so it does not show up in the plot.
Let's first identify the highest book price, along with the title and the number of reviews of that book:
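A possible way to do this (the column names `price`, `title` and `avg_reviews` are assumptions based on the rest of this notebook):

```python
# Row with the highest book price, plus the columns we care about
idx_max = df['price'].idxmax()
print(df.loc[idx_max, ['title', 'price', 'n_reviews', 'avg_reviews']])
```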
This book indeed has an extraordinary price, but it cannot be plotted because we have no data about the number of reviews, nor about the average review score.
If this is a single outlier, we could simply ignore it. To find out whether that is the case, I sorted the dataframe (as coded in code block [3]) by descending price. It turns out that there are two books, with indices 734 and 638, whose prices are well above the rest of the books. Furthermore, there is no data about their reviews or number of reviews. I have therefore decided to ignore these entries.
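A sketch of that check and the removal of the two outliers (the indices 734 and 638 come from the dataset itself; the column name `price` is an assumption):

```python
# Sort by price in descending order to inspect the most expensive books
print(df.sort_values('price', ascending=False).head(10))

# Drop the two extreme outliers identified above
df = df.drop(index=[734, 638])
```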
Plotting the scatter plot still does not work because of other NaN values in n_reviews or avg_reviews. To get an impression of the size of the problem, I determined the number of books that have no data on the number of reviews or the average review score:
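A short sketch of that count, again assuming the column names `n_reviews` and `avg_reviews`:

```python
# Number of books missing either the number of reviews or the average review score
n_missing_either = df[['n_reviews', 'avg_reviews']].isna().any(axis=1).sum()
print(f"Books without n_reviews or avg_reviews: {n_missing_either}")
```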
Although it is a lot of entries to ignore, I think an average review score without the number of reviews is an unreliable metric. Furthermore, if we just check how many n_reviews values are missing, we get the following:
This probably indicates that if the average review score is missing, the number of reviews is also missing. This suggests that those books simply have no reviews, rather than that there is something wrong with the database. In any case, we cannot know whether these books are any good, so I have decided to drop them as well:
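Dropping those entries could then look like this:

```python
# Keep only books that have both a number of reviews and an average review score
df = df.dropna(subset=['n_reviews', 'avg_reviews'])
print(f"Remaining books: {len(df)}")
```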
Second attempt at scatter plot of Book price vs Reviews
Now we can finally plot the scatter plot, with the marker size representing the number of reviews:
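A minimal version of that plot, assuming matplotlib and the column names used above (the scaling factor for the marker size is arbitrary):

```python
import matplotlib.pyplot as plt

# Scatter plot of price vs. average review score,
# with the marker size proportional to the number of reviews
plt.figure(figsize=(10, 6))
plt.scatter(df['price'], df['avg_reviews'], s=df['n_reviews'] / 50, alpha=0.5)
plt.xlabel('Book price ($)')
plt.ylabel('Average review score')
plt.title('Book price vs. average review score (marker size = number of reviews)')
plt.show()
```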
It looks like there is no correlation between price and the average review score. The R-squared value confirms this:
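The R-squared value can be computed, for example, with `scipy.stats.linregress` (a sketch; the notebook may use a different method):

```python
from scipy import stats

# Simple linear regression of average review score on price
result = stats.linregress(df['price'], df['avg_reviews'])
print(f"R-squared: {result.rvalue ** 2:.4f}")
```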
Therefore we can conclude that, based on this dataset, there is no correlation between the book price and the average review score.
What are the best Python books, what are the best ML books?
Let's find the best Machine learning books:
Analogously, we can find the best Python books:
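Both lookups can be sketched with a small helper that filters on a keyword in the title and sorts by review score and number of reviews (the exact ranking criteria and thresholds used in the notebook may differ):

```python
def best_books(df, keyword, min_reviews=50, top_n=10):
    """Return the top_n books whose title contains `keyword`,
    with at least `min_reviews` reviews, sorted by average review score."""
    mask = df['title'].str.contains(keyword, case=False, na=False)
    subset = df[mask & (df['n_reviews'] >= min_reviews)]
    return subset.sort_values(['avg_reviews', 'n_reviews'], ascending=False).head(top_n)

print(best_books(df, 'machine learning'))
print(best_books(df, 'python'))
```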
Cluster analysis of book names/TF-IDF K-means
To start the cluster analysis with TF-IDF, we first load the required libraries.
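A sketch of the imports used in this section:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
```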
Next, we remove NaN values.
Since a lot of the terms that we're interested in comprise two words (e.g. Data Science, Data Engineering, Deep Learning), I want to see the effect of considering terms of one and two words (unigrams & bigrams) vs. only one word (unigrams).
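A sketch of that comparison, vectorizing the titles with `ngram_range=(1, 1)` vs. `(1, 2)` and comparing inertia and silhouette scores over a range of cluster counts (it assumes the imports above and the cleaned titles in `df['title']`):

```python
titles = df['title'].dropna().tolist()

for ngram_range, label in [((1, 1), 'Unigram'), ((1, 2), 'Unigram & Bigram')]:
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=ngram_range)
    X = vectorizer.fit_transform(titles)
    for k in range(2, 11):
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = km.fit_predict(X)
        score = silhouette_score(X, labels)
        print(f"{label}, k={k}: inertia={km.inertia_:.1f}, silhouette={score:.3f}")
```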
Although the silhouette score for the combined unigram & bigram setting is much lower than for unigrams alone, I have opted for the combined setting. Otherwise, a specific term consisting of two words might end up split across two clusters.
Next, I considered different values for max_df. It is used to filter out terms that occur too frequently across the documents. For example, if max_df is set to 0.8, it means that any term occurring in more than 80% of the documents will be removed.
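A sketch of that sweep, reusing the combined unigram & bigram setting and the `titles` list from above (the candidate max_df values are assumptions):

```python
for max_df in [0.05, 0.1, 0.2, 0.5, 0.8]:
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=max_df)
    X = vectorizer.fit_transform(titles)
    for k in range(2, 11):
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = km.fit_predict(X)
        print(f"max_df={max_df}, k={k}: inertia={km.inertia_:.1f}, "
              f"silhouette={silhouette_score(X, labels):.3f}")
```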
Just as with the unigram vs. unigram & bigram comparison, the graphs above barely show an elbow. However, a max_df of 0.1 generally gives the highest silhouette score, so max_df is set to 0.1. Furthermore, I have chosen 5 clusters, as a minor elbow can be seen in the elbow-method graph.
Unfortunately, the silhouette score is still very close to zero, indicating that the clusters overlap and are not well separated. However, increasing the number of clusters does not markedly improve the silhouette score.
To get a better understanding of the clusters, I will print out a section of each cluster. Furthermore, I will make a word cloud for each cluster. To start, we first need to perform the clustering, of course:
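A sketch of the clustering with the chosen parameters (max_df = 0.1, 5 clusters) and a short printout per cluster:

```python
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.1)
X = vectorizer.fit_transform(titles)

km = KMeans(n_clusters=5, random_state=42, n_init=10)
clusters = km.fit_predict(X)

# Keep the assignments next to the titles for easy inspection
clustered = pd.DataFrame({'title': titles, 'cluster': clusters})
for c in range(5):
    print(f"\n--- Cluster {c} ---")
    print(clustered.loc[clustered['cluster'] == c, 'title'].head(10).to_string(index=False))
```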
I have printed a section of each cluster.
What can be noticed is that cluster 0 groups books around Python and Learning. Cluster 1 primarily contains books about statistics. The entire Dummies series is represented in cluster 2, etc. However, it becomes clear that all clusters contain the word Python. Therefore, I decided to exclude the word Python to see what results that brings.
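Excluding the word can be done by extending the stop-word list passed to the vectorizer, for example:

```python
from sklearn.feature_extraction import text

# Add 'python' to the built-in English stop words so it no longer dominates every cluster
custom_stop_words = list(text.ENGLISH_STOP_WORDS.union({'python'}))

vectorizer = TfidfVectorizer(stop_words=custom_stop_words, ngram_range=(1, 2), max_df=0.1)
X = vectorizer.fit_transform(titles)
```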
Again, 5 clusters with max_df at 0.1 seems to be the most reasonable selection of parameters. We can also see that the silhouette score slightly improves, indicating that the clusters overlap less and that the terms match better within a cluster.
The clusters in this case look like this.
From a visual inspection it does not seem that the clustering has improved. Let's see how the wordcloud looks.
First, we install the wordcloud package, and then we create the word clouds.
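A rough sketch of that step, assuming the `wordcloud` package is installed and that `clustered` holds the latest cluster assignment per title (columns `title` and `cluster`, as in the sketch above):

```python
# pip install wordcloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stop_words = STOPWORDS.union({'python'})  # keep Python excluded, as decided above

fig, axes = plt.subplots(1, 5, figsize=(25, 5))
for c, ax in enumerate(axes):
    # Concatenate all titles in the cluster into one string
    cluster_text = ' '.join(clustered.loc[clustered['cluster'] == c, 'title'])
    wc = WordCloud(width=400, height=400, background_color='white',
                   stopwords=stop_words).generate(cluster_text)
    ax.imshow(wc, interpolation='bilinear')
    ax.set_title(f'Cluster {c}')
    ax.axis('off')
plt.show()
```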
Apparently, the terms Data Analysis and Data Science end up in two different clusters. On the other hand, the terms AWS (Amazon Web Services, a cloud computing provider) and Cloud do end up in the same cluster. I realize that the method used (TF-IDF) does not capture semantic relationships between terms.
Using global vectors
Therefore, I suspect a pre-trained method such as GloVe is more appropriate. GloVe is an unsupervised word embedding technique that stands for Global Vectors for Word Representation. It generates dense vector representations of words based on their co-occurrence in large text datasets. The algorithm constructs a global word-word co-occurrence matrix and learns embeddings by optimizing a specific objective function. This objective function ensures that the dot product of two word vectors approximates the logarithm of their co-occurrence probability. As a result, words with similar meanings have similar vector representations in the embedding space. Pre-trained GloVe models are available, with glove.6B being a popular choice, trained on Wikipedia 2014 and Gigaword 5.
GloVe 6B can be downloaded HERE.
The GloVe dataset contains 4 files with different dimensionalities. The higher the dimensionality, the greater the potential for capturing nuances, often resulting in higher accuracy. However, this comes at the expense of lower computational efficiency. Therefore, I have decided to compare the performance of the 4 different files. Since we have already used the inertia/elbow method and the silhouette score to compare the number of clusters and other parameters for the TF-IDF model, I will do the same here.
This section imports the necessary libraries, including NumPy, pandas, and scikit-learn.
Next, we define the GloVe file names and two utility functions. The load_glove_embeddings function reads the specified GloVe file and creates a dictionary mapping words to their corresponding embeddings. The title_to_glove function converts a given title into a single GloVe vector by averaging the GloVe embeddings of the words present in the title.
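A sketch of those two functions (the file names below are the standard glove.6B files; adjust the paths to wherever the download was extracted):

```python
import numpy as np

glove_files = ['glove.6B.50d.txt', 'glove.6B.100d.txt',
               'glove.6B.200d.txt', 'glove.6B.300d.txt']

def load_glove_embeddings(file_path):
    """Read a GloVe file and return a dict mapping each word to its embedding vector."""
    embeddings = {}
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

def title_to_glove(title, embeddings, dim):
    """Convert a title to a single vector by averaging the embeddings of its words."""
    vectors = [embeddings[w] for w in title.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)  # fall back to a zero vector if no word is in the vocabulary
    return np.mean(vectors, axis=0)
```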
We continue with the following steps:
Initialize parameters and prepare plotting
Initialize two empty lists, inertias and silhouette_scores, which will store the inertia and silhouette score values for each number of clusters.
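A sketch of the loop that fills those lists for one GloVe file (glove.6B.100d.txt is used here as an example; the same loop is repeated for the other dimensionalities, and the cluster range is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

dim = 100
embeddings = load_glove_embeddings('glove.6B.100d.txt')
title_vectors = np.vstack([title_to_glove(t, embeddings, dim) for t in titles])

inertias, silhouette_scores = [], []
cluster_range = range(2, 11)
for k in cluster_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(title_vectors)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(title_vectors, labels))
```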