Portfolio project 1: Amazon data science books
This is my first project! In it, I analyze the Amazon data science books dataset that is available on Kaggle:
Data preparation
Importing the libraries
Importing the data file & getting a first simple overview
Repairing the column n_reviews
The column n_reviews contains the number of reviews a book has received, but the values include a "," thousands separator, making the column unsuitable to work with as-is. I have therefore decided to remove this "," symbol from the column. Furthermore, there are a lot of NaN values.
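A minimal sketch of that cleaning step, assuming the data has been loaded into a pandas DataFrame called `df` with a string column `n_reviews`:

```python
import pandas as pd

# Assumed: df is the DataFrame loaded from the Kaggle CSV.
# Remove the thousands separator "," and convert to a numeric type;
# entries that cannot be parsed (e.g. missing values) become NaN.
df['n_reviews'] = (
    df['n_reviews']
    .astype(str)
    .str.replace(',', '', regex=False)
)
df['n_reviews'] = pd.to_numeric(df['n_reviews'], errors='coerce')
```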
Do more expensive books have better reviews?
First attempt at creating a scatter plot of Book price vs Reviews
More data prep
We see that there must be a huge outlier in the price data, because the x-axis goes all the way up to a book price of 1400. However, that book doesn't seem to have any reviews, so it does not show up in the plot.
Let's first identify the highest book price, along with the title and the number of reviews of that book:
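A possible way to do this (the column names `price`, `title` and `avg_reviews` are assumptions based on the rest of this notebook):

```python
# Row with the highest book price, plus the columns we care about
idx_max = df['price'].idxmax()
print(df.loc[idx_max, ['title', 'price', 'n_reviews', 'avg_reviews']])
```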
This book indeed has an extraordinary price, but it cannot be plotted because we have no data about the number of reviews, nor about the average review score.
If this is a single outlier, we could simply ignore it. To find out whether that is the case, I sorted the dataframe (as coded in code block [3]) by descending price. It turns out that there are two books, with indices 734 and 638, whose prices are well above the rest of the books. Furthermore, there is no data about their reviews or number of reviews. I have therefore decided to ignore these entries.
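A sketch of that check and the removal of the two outliers (the indices 734 and 638 come from the dataset itself; the column name `price` is an assumption):

```python
# Sort by price in descending order to inspect the most expensive books
print(df.sort_values('price', ascending=False).head(10))

# Drop the two extreme outliers identified above
df = df.drop(index=[734, 638])
```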
Plotting the scatter plot still does not work because of other NaN values in n_reviews or avg_reviews. To get an impression of the size of the problem, I determined the number of books that have no data on the number of reviews or the average review score:
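A short sketch of that count, again assuming the column names `n_reviews` and `avg_reviews`:

```python
# Number of books missing either the number of reviews or the average review score
n_missing_either = df[['n_reviews', 'avg_reviews']].isna().any(axis=1).sum()
print(f"Books without n_reviews or avg_reviews: {n_missing_either}")
```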
Although it is a lot of entries to ignore, I think an average review score without the number of reviews is an unreliable metric. Furthermore, if we just check how many n_reviews values are missing, we get the following:
This probably indicates that if the average review score is missing, the number of reviews is also missing. This suggests that those books simply have no reviews, rather than that there is something wrong with the database. In any case, we cannot know whether these books are any good, so I have decided to drop them as well:
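Dropping those entries could then look like this:

```python
# Keep only books that have both a number of reviews and an average review score
df = df.dropna(subset=['n_reviews', 'avg_reviews'])
print(f"Remaining books: {len(df)}")
```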
Second attempt at scatter plot of Book price vs Reviews
Now we can finally plot the scatter plot, with the marker size representing the number of reviews:
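A minimal version of that plot, assuming matplotlib and the column names used above (the scaling factor for the marker size is arbitrary):

```python
import matplotlib.pyplot as plt

# Scatter plot of price vs. average review score,
# with the marker size proportional to the number of reviews
plt.figure(figsize=(10, 6))
plt.scatter(df['price'], df['avg_reviews'], s=df['n_reviews'] / 50, alpha=0.5)
plt.xlabel('Book price ($)')
plt.ylabel('Average review score')
plt.title('Book price vs. average review score (marker size = number of reviews)')
plt.show()
```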
It looks like there is no correlation between price and the average review score. The R-squared value confirms this:
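The R-squared value can be computed, for example, with `scipy.stats.linregress` (a sketch; the notebook may use a different method):

```python
from scipy import stats

# Simple linear regression of average review score on price
result = stats.linregress(df['price'], df['avg_reviews'])
print(f"R-squared: {result.rvalue ** 2:.4f}")
```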
Therefore we can conclude that, based on this dataset, there is no correlation between the book price and the average review score.
What are the best Python books, what are the best ML books?
Let's find the best Machine learning books:
Analogously, we can find the best Python books:
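Both lookups can be sketched with a small helper that filters on a keyword in the title and sorts by review score and number of reviews (the exact ranking criteria and thresholds used in the notebook may differ):

```python
def best_books(df, keyword, min_reviews=50, top_n=10):
    """Return the top_n books whose title contains `keyword`,
    with at least `min_reviews` reviews, sorted by average review score."""
    mask = df['title'].str.contains(keyword, case=False, na=False)
    subset = df[mask & (df['n_reviews'] >= min_reviews)]
    return subset.sort_values(['avg_reviews', 'n_reviews'], ascending=False).head(top_n)

print(best_books(df, 'machine learning'))
print(best_books(df, 'python'))
```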
Cluster analysis of book names/TF-IDF K-means
To start the cluster analysis with TF-IDF, we first load the required libraries.
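A sketch of the imports used in this section:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
```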
Next, we remove NaN values.
Since a lot of the terms that we're interested in comprise two words (e.g. Data Science, Data Engineering, Deep Learning), I want to see the effect of considering terms of one and two words (unigrams & bigrams) vs. only one word (unigrams).
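A sketch of that comparison, vectorizing the titles with `ngram_range=(1, 1)` vs. `(1, 2)` and comparing inertia and silhouette scores over a range of cluster counts (it assumes the imports above and the cleaned titles in `df['title']`):

```python
titles = df['title'].dropna().tolist()

for ngram_range, label in [((1, 1), 'Unigram'), ((1, 2), 'Unigram & Bigram')]:
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=ngram_range)
    X = vectorizer.fit_transform(titles)
    for k in range(2, 11):
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = km.fit_predict(X)
        score = silhouette_score(X, labels)
        print(f"{label}, k={k}: inertia={km.inertia_:.1f}, silhouette={score:.3f}")
```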
Although the silhouette score for the combined unigram & bigram setting is much lower than for unigrams alone, I have opted for the combined setting. Otherwise, a specific term consisting of two words might end up split across two clusters.
Next, I considered different values for max_df. It is used to filter out terms that occur too frequently across the documents. For example, if max_df is set to 0.8, it means that any term occurring in more than 80% of the documents will be removed.
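A sketch of that sweep, reusing the combined unigram & bigram setting and the `titles` list from above (the candidate max_df values are assumptions):

```python
for max_df in [0.05, 0.1, 0.2, 0.5, 0.8]:
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=max_df)
    X = vectorizer.fit_transform(titles)
    for k in range(2, 11):
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = km.fit_predict(X)
        print(f"max_df={max_df}, k={k}: inertia={km.inertia_:.1f}, "
              f"silhouette={silhouette_score(X, labels):.3f}")
```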
Just as with the unigram vs. unigram & bigram comparison, the graphs above barely show an elbow. However, a max_df of 0.1 generally gives the highest silhouette score, so max_df is set to 0.1. Furthermore, I have chosen 5 clusters, as a minor elbow can be seen in the elbow-method graph.
Unfortunately, the silhouette score is still very close to zero, indicating that the clusters overlap and are not well separated. However, increasing the number of clusters does not markedly improve the silhouette score.
To get a better understanding of the clusters, I will print out a section of each cluster. Furthermore, I will make a word cloud for each cluster. To start, we first need to perform the clustering, of course:
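A sketch of the clustering with the chosen parameters (max_df = 0.1, 5 clusters) and a short printout per cluster:

```python
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.1)
X = vectorizer.fit_transform(titles)

km = KMeans(n_clusters=5, random_state=42, n_init=10)
clusters = km.fit_predict(X)

# Keep the assignments next to the titles for easy inspection
clustered = pd.DataFrame({'title': titles, 'cluster': clusters})
for c in range(5):
    print(f"\n--- Cluster {c} ---")
    print(clustered.loc[clustered['cluster'] == c, 'title'].head(10).to_string(index=False))
```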
I have printed a section of each cluster.
What can be noticed is that cluster 0 groups books around Python and Learning. Cluster 1 primarily contains books about statistics. The entire Dummies series is represented in cluster 2, etc. However, it becomes clear that all clusters contain the word Python. Therefore, I decided to exclude the word Python to see what results that brings.
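Excluding the word can be done by extending the stop-word list passed to the vectorizer, for example:

```python
from sklearn.feature_extraction import text

# Add 'python' to the built-in English stop words so it no longer dominates every cluster
custom_stop_words = list(text.ENGLISH_STOP_WORDS.union({'python'}))

vectorizer = TfidfVectorizer(stop_words=custom_stop_words, ngram_range=(1, 2), max_df=0.1)
X = vectorizer.fit_transform(titles)
```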
Again, 5 clusters with max_df at 0.1 seems to be the most reasonable selection of parameters. We can also see that the silhouette score slightly improves, indicating that the clusters overlap less and that the terms match better within a cluster.
The clusters in this case look like this.
From a visual inspection it does not seem that the clustering has improved. Let's see how the wordcloud looks.
First, we install the wordcloud package, and then we create the word clouds.
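A rough sketch of that step, assuming the `wordcloud` package is installed and that `clustered` holds the latest cluster assignment per title (columns `title` and `cluster`, as in the sketch above):

```python
# pip install wordcloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stop_words = STOPWORDS.union({'python'})  # keep Python excluded, as decided above

fig, axes = plt.subplots(1, 5, figsize=(25, 5))
for c, ax in enumerate(axes):
    # Concatenate all titles in the cluster into one string
    cluster_text = ' '.join(clustered.loc[clustered['cluster'] == c, 'title'])
    wc = WordCloud(width=400, height=400, background_color='white',
                   stopwords=stop_words).generate(cluster_text)
    ax.imshow(wc, interpolation='bilinear')
    ax.set_title(f'Cluster {c}')
    ax.axis('off')
plt.show()
```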
Apparently, the terms Data Analysis and Data Science end up in two different clusters. On the other hand, the terms AWS (Amazon Web Services, a cloud computing provider) and Cloud do end up in the same cluster. I realize that the method used (TF-IDF) does not capture semantic relationships between terms.
Using global vectors
Therefore, I suspect a pre-trained method such as GloVe is more appropriate. GloVe is an unsupervised word embedding technique that stands for Global Vectors for Word Representation. It generates dense vector representations of words based on their co-occurrence in large text datasets. The algorithm constructs a global word-word co-occurrence matrix and learns embeddings by optimizing a specific objective function. This objective function ensures that the dot product of two word vectors approximates the logarithm of their co-occurrence probability. As a result, words with similar meanings have similar vector representations in the embedding space. Pre-trained GloVe models are available, with glove.6B being a popular choice, trained on Wikipedia 2014 and Gigaword 5.
GloVe 6B can be downloaded HERE.
The GloVe dataset contains 4 files with different dimensionalities. The higher the dimensionality, the greater the potential for capturing nuances, often resulting in higher accuracy. However, this comes at the expense of lower computational efficiency. Therefore, I have decided to compare the performance of the 4 different files. Since we have already used the inertia/elbow method and the silhouette score to compare the number of clusters and other parameters for the TF-IDF model, I will do the same here.
This section imports the necessary libraries, including NumPy, pandas, and scikit-learn.
Next, we define the GloVe file names and two utility functions. The load_glove_embeddings function reads the specified GloVe file and creates a dictionary mapping words to their corresponding embeddings. The title_to_glove function converts a given title into a single GloVe vector by averaging the GloVe embeddings of the words present in the title.
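A sketch of those two functions (the file names below are the standard glove.6B files; adjust the paths to wherever the download was extracted):

```python
import numpy as np

glove_files = ['glove.6B.50d.txt', 'glove.6B.100d.txt',
               'glove.6B.200d.txt', 'glove.6B.300d.txt']

def load_glove_embeddings(file_path):
    """Read a GloVe file and return a dict mapping each word to its embedding vector."""
    embeddings = {}
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

def title_to_glove(title, embeddings, dim):
    """Convert a title to a single vector by averaging the embeddings of its words."""
    vectors = [embeddings[w] for w in title.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)  # fall back to a zero vector if no word is in the vocabulary
    return np.mean(vectors, axis=0)
```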
We continue with the following steps:
Initialize parameters and prepare plotting
Initialize two empty lists, inertias and silhouette_scores, which will store the inertia and silhouette score values for each number of clusters.
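A sketch of the loop that fills those lists for one GloVe file (glove.6B.100d.txt is used here as an example; the same loop is repeated for the other dimensionalities, and the cluster range is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

dim = 100
embeddings = load_glove_embeddings('glove.6B.100d.txt')
title_vectors = np.vstack([title_to_glove(t, embeddings, dim) for t in titles])

inertias, silhouette_scores = [], []
cluster_range = range(2, 11)
for k in cluster_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(title_vectors)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(title_vectors, labels))
```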