📚 Data Science Book Analysis

A review of the most popular Data Science books on Amazon: Combining Exploratory Analysis, Clustering and NLP

Author: Francesca Fuentes

Dataset: The data comes from the following link 🔗 Amazon Data Science Books Dataset

Introduction

Data science has risen sharply in popularity, and the market for books on the subject has grown with it. This surge in interest raises the question of how market dynamics, such as demand, affect book pricing and consumer reviews. In this analysis, we explore several facets of data science literature: whether the cost of data science books correlates with their user ratings, how book length factors into pricing, and which titles rank as the 'best' within specific subcategories such as Python programming, Machine Learning, and Deep Learning. Through this exploration, we intend to shed light on the value these educational resources offer to learners at different stages of their data science journey.

Methodology

We began by obtaining a dataset of data science books from Kaggle, which served as the foundation for our analysis. Initial data cleaning and preprocessing included handling missing values and normalizing fields for consistency. An exploratory data analysis (EDA) followed, using statistical methods to uncover patterns and distributions within the data. For textual data, we applied TF-IDF (Term Frequency-Inverse Document Frequency) to weight the importance of words across book titles, which gave us the features used for clustering.

The K-means algorithm was employed to segment books into distinct groups based on their textual features, allowing us to identify clusters related to specific subtopics within data science. We extended our analysis to user-generated content by scraping reviews from Amazon. These reviews were then condensed with a BERT (Bidirectional Encoder Representations from Transformers) based extractive summarizer, providing insight into user opinions. To illustrate our methodology, we include select code snippets and graphical representations that highlight key steps and findings in our data processing and analysis pipeline.

To do

Exploratory data analysis: Do more expensive books get better reviews?

EDA: Are longer books priced higher?

What are the best Python books? What are the best ML books?

Cluster analysis of books names / TF-IDF and K-means

Amazon review scraping & Book review summary

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# Load the dataset
df_books = pd.read_csv('/work/final_book_dataset_kaggle2.csv')

# Keywords used to keep only the books related to data science topics.
# Note: matching is a case-insensitive substring test, so a short keyword
# such as 'AI' also matches inside longer words (for example "training").
data_science_keywords = [
    'data science', 'machine learning', 'data analysis', 'statistics',
    'big data', 'data mining', 'deep learning', 'AI',
    'artificial intelligence', 'neural networks', 'predictive modeling',
    'data visualization',
]

# Check whether a book's title contains any of the data science keywords.
# Takes a book title as input and returns True if the title contains any
# of the keywords defined above, otherwise False.
def is_data_science_related(title):
    return any(keyword.lower() in title.lower() for keyword in data_science_keywords)

df_books = df_books[df_books['title'].apply(is_data_science_related)]
df_books

🔦 Exploratory Data Analysis (EDA) on DS books

In the EDA, we delved into the relationship between book prices and the reviews they garner. We utilized scatter plots to discern if higher-priced books correlate with more favorable or numerous reviews. To deepen our understanding, we calculated correlation coefficients and employed regression models to gauge the strength and statistical significance of this relationship.

We also investigated the connection between a book's length, measured by page count, and its price. Scatter plots and linear regression models were insightful here as well. Variance analysis and statistical tests were considered to ascertain if there are significant price differences across book length categories.

For identifying the 'best' books within specific domains like Python, Machine Learning, and Deep Learning, we based our criteria on review quantity and average ratings. This was further enriched by a sentiment analysis of the reviews to assess the quality, providing a more layered understanding of the books' reception.

💰 Price vs reviews

We're exploring the potential correlation between the pricing of data science books and the user reviews they receive. Scatter plots are leveraged to visualize whether more expensive books tend to garner better or more reviews. To delve deeper, correlation coefficients and regression models are computed to assess the strength and statistical significance of this relationship.

# Convert column n_reviews to a numeric type; values that cannot be parsed become NaN
df_books['n_reviews'] = pd.to_numeric(df_books['n_reviews'], errors='coerce')

# Drop rows where n_reviews is missing
df_books = df_books.dropna(subset=['n_reviews'])

# Scatter plot of price vs avg_reviews, with marker size = n_reviews
px.scatter(df_books, x="price", y="avg_reviews", size="n_reviews")
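The correlation coefficients and regression mentioned above are not shown in the notebook. A minimal sketch of how they could be computed follows, assuming SciPy is available. The `demo` DataFrame is synthetic and only stands in for `df_books`; the column names `price` and `avg_reviews` follow the notebook, everything else is illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for df_books (hypothetical data, not the real dataset).
rng = np.random.default_rng(0)
price = rng.uniform(10, 80, size=200)
demo = pd.DataFrame({
    "price": price,
    "avg_reviews": np.clip(4.4 + rng.normal(0, 0.3, size=200), 1, 5),
})

# Pearson measures linear association; Spearman measures monotonic
# association and is less sensitive to a few very expensive outliers.
pearson_r, pearson_p = stats.pearsonr(demo["price"], demo["avg_reviews"])
spearman_r, spearman_p = stats.spearmanr(demo["price"], demo["avg_reviews"])

# Ordinary least-squares fit: avg_reviews = intercept + slope * price
fit = stats.linregress(demo["price"], demo["avg_reviews"])

print(f"Pearson r  = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman r = {spearman_r:.3f} (p = {spearman_p:.3f})")
print(f"slope = {fit.slope:.4f}, intercept = {fit.intercept:.3f}, p = {fit.pvalue:.3f}")
```

On the real data, the same calls would take `df_books["price"]` and `df_books["avg_reviews"]` after both columns have been converted to numeric types and rows with missing values dropped.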

💰 Price vs book length

Our analysis investigates whether there is a relationship between the length of a book (potentially measured in number of pages) and its price. Again, scatter plots are used.
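The comparison across book length categories described in the EDA overview could look like the sketch below. It assumes a page-count column named `pages` (a hypothetical name, since the notebook does not show it), uses synthetic data in place of `df_books`, and picks bin edges purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for df_books; 'pages' is an assumed column name.
rng = np.random.default_rng(1)
pages = rng.integers(100, 1200, size=300)
demo = pd.DataFrame({
    "pages": pages,
    "price": 15 + 0.03 * pages + rng.normal(0, 8, size=300),
})

# Bucket books into length categories (bin edges are illustrative).
demo["length_cat"] = pd.cut(
    demo["pages"],
    bins=[0, 300, 600, np.inf],
    labels=["short", "medium", "long"],
)

# Median price per category: a quick, outlier-robust comparison.
median_price = demo.groupby("length_cat", observed=True)["price"].median()

# One-way ANOVA: do mean prices differ across the length categories?
groups = [g["price"].to_numpy() for _, g in demo.groupby("length_cat", observed=True)]
f_stat, p_value = stats.f_oneway(*groups)

print(median_price)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value would suggest that at least one length category differs in mean price, but it does not say which one or by how much, so the medians are worth reading alongside it.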

🏆 Best Python books

We identify the best Python books by focusing on titles specifically related to Python programming, while excluding those that delve into Machine Learning and Deep Learning. Our selection criteria are based on the number of reviews and the average rating of reviews. The top 10 Python books are then highlighted, giving readers insight into the most popular and well-regarded Python resources in the field.

# Best Python books (excluding Machine Learning and Deep Learning books)
python_keywords = ['Python', 'Py']
exclude_keywords = ['Machine Learning', 'ML', 'Deep Learning', 'Data Mining', 'Neural Networks']

python_books = df_books[
    df_books['title'].str.contains("|".join(python_keywords), case=False, na=False)
    & ~df_books['title'].str.contains("|".join(exclude_keywords), case=False, na=False)
]

# Python books with the most reviews, ties broken by average rating
best_python_books = python_books.nlargest(10, ['n_reviews', 'avg_reviews'])
print("🏆 Best Python books:")
display(best_python_books)
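One caveat of the keyword filter above is that `str.contains` matches substrings, so short keywords such as 'ML' or 'Py' can also match inside unrelated words. The sketch below contrasts substring matching with whole-word matching on a few made-up titles. `keyword_pattern` is a hypothetical helper introduced here for illustration and is not part of the original notebook.

```python
import re
import pandas as pd

# Made-up titles chosen to expose the substring problem.
titles = pd.Series([
    "Python Crash Course",
    "HTML and CSS for Beginners",
    "Introduction to ML with R",
    "Copywriting Secrets",
])

def keyword_pattern(keywords):
    """Build a regex that matches any keyword only as a whole word."""
    escaped = "|".join(re.escape(k) for k in keywords)
    return rf"\b(?:{escaped})\b"

# Substring matching, as in the notebook.
substring_ml = titles.str.contains("ML", case=False, na=False)
substring_py = titles.str.contains("Python|Py", case=False, na=False)

# Whole-word matching with \b word boundaries.
whole_ml = titles.str.contains(keyword_pattern(["ML"]), case=False, na=False)
whole_py = titles.str.contains(keyword_pattern(["Python", "Py"]), case=False, na=False)

print(pd.DataFrame({
    "title": titles,
    "substring_ml": substring_ml,
    "whole_ml": whole_ml,
    "substring_py": substring_py,
    "whole_py": whole_py,
}))
```

Switching the notebook's filters to whole-word patterns would change which books are selected, so the top 10 lists would need to be regenerated.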

🏆 Best Machine Learning books

Our methodology pinpoints the leading books in Machine Learning by filtering out general Python programming and Deep Learning titles. We then rank these books by the volume of reviews and their average ratings to present the top 10 Machine Learning books. This approach helps readers find the most authoritative and valuable texts for advancing their Machine Learning expertise.

# Best Machine Learning books (excluding general Python and Deep Learning books)
ml_keywords = ['Machine Learning', 'ML']
exclude_keywords = ['Python', 'Deep Learning', 'Py', 'Neural Networks']

ml_books = df_books[
    df_books['title'].str.contains("|".join(ml_keywords), case=False, na=False)
    & ~df_books['title'].str.contains("|".join(exclude_keywords), case=False, na=False)
]

# ML books with the most reviews, ties broken by average rating
best_ml_books = ml_books.nlargest(10, ['n_reviews', 'avg_reviews'])
print("🏆 Best Machine Learning books:")
display(best_ml_books)

🏆 Best Deep Learning books

In the case of Deep Learning books, we include those that may also cover Python, given its significance in the Deep Learning space. We sort these books by the quantity and quality of their reviews to showcase the top 10 Deep Learning books. This list serves as a guide for those seeking the most impactful sources of knowledge on Deep Learning.

# Best Deep Learning books (can include Python as it's often used in Deep Learning)
deep_learning_keywords = ['Deep Learning', 'Neural Networks']

deep_learning_books = df_books[
    df_books['title'].str.contains("|".join(deep_learning_keywords), case=False, na=False)
]

# Deep Learning books with the most reviews, ties broken by average rating
best_deep_learning_books = deep_learning_books.nlargest(10, ['n_reviews', 'avg_reviews'])
print("🏆 Best Deep Learning books:")
display(best_deep_learning_books)

🧐 Cluster Analysis of Book Titles

In this study, we conducted a cluster analysis to discover different types of data science books based on their titles. Employing the TF-IDF (Term Frequency-Inverse Document Frequency) technique, we weighted the importance of words in the corpus, emphasizing less frequent but potentially more indicative words in book titles. This step was crucial for highlighting unique themes within the data science literature.

Next, the K-means clustering algorithm was applied to divide the book titles into coherent clusters. The number of clusters was chosen with the elbow method, which looks for the point at which adding another cluster yields only a small further reduction in the within-cluster sum of squared distances (inertia). The resulting clusters provided a meaningful categorization of the books, reflecting the various niches and areas of interest within the field of data science.

TF-IDF: This method weights the importance of each word within the corpus, giving more weight to words that appear in fewer titles, which can be crucial for identifying unique topics.

K-means: This algorithm partitions the titles into K clusters; we determine the number of clusters with the elbow method.

💡 What are the main types of Data Science books?

# Import the TfidfVectorizer class from scikit-learn's feature_extraction.text module.
# It transforms text data into TF-IDF (Term Frequency-Inverse Document Frequency) features.
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance of TfidfVectorizer
# - 'stop_words' is set to 'english' to remove common English words that don't carry much meaning
# - 'ngram_range' is set to (1, 2) to consider both individual words and 2-word phrases (bigrams)
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))

# Apply the vectorizer to the 'title' column of the df_books DataFrame.
# This transforms the text data into a matrix of TF-IDF features.
X = vectorizer.fit_transform(df_books['title'])

from sklearn.cluster import KMeans
from kneed import KneeLocator
import matplotlib.pyplot as plt

# Fit K-means for each candidate k and record the inertia
# (sum of squared distances of samples to their closest centroid).
sum_squared_distances = []
K = range(2, 10)
for k in K:
    km = KMeans(n_clusters=k, max_iter=600, n_init=10)
    km = km.fit(X)
    sum_squared_distances.append(km.inertia_)

# We use KneeLocator to automatically identify the "bend" in the curve.
kneedle = KneeLocator(K, sum_squared_distances, curve='convex', direction='decreasing')
optimal_k = kneedle.knee

# We plot inertia for each value of k.
plt.plot(K, sum_squared_distances, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')

# If KneeLocator found an "elbow", we highlight it with a red dot on the graph.
# K starts at 2, so the inertia for k sits at index k - 2.
if optimal_k is not None:
    plt.scatter(optimal_k, sum_squared_distances[optimal_k - 2], color='red', s=100, zorder=5)

plt.show()

The elbow method, located with KneeLocator, suggests 5 clusters.

# Get clusters
# Set the desired number of clusters to 5
true_k = 5

# Initialize the KMeans algorithm with the following parameters
# - n_clusters=true_k: 5 clusters will be used.
# - init='k-means++': initialization method to select the initial centroids.
# - max_iter=600: maximum number of iterations of the algorithm.
# - n_init=10: the algorithm will be run 10 times with different initial centroids.
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=600, n_init=10)

# Fit the model to the dataset X
model.fit(X)

# Get the cluster label assigned to each row of X
labels = model.labels_

# Create a DataFrame that combines the book titles with their respective cluster labels.
book_clusters = pd.DataFrame(list(zip(df_books['title'], labels)), columns=['title', 'cluster'])

# Print the books sorted by their cluster label
print(book_clusters.sort_values(by=['cluster']))

# Create a word cloud for each cluster
from wordcloud import WordCloud

for k in range(true_k):
    text = book_clusters[book_clusters['cluster'] == k]['title'].values
    wordcloud = WordCloud(max_font_size=40, max_words=100, background_color="white").generate(' '.join(text))

    # Place each cluster in a grid of 2 rows and 3 columns
    plt.subplot(2, 3, k+1).set_title(f'Cluster {k}')
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")

# Adjust spacing between subplots to avoid overlapping
plt.tight_layout()
plt.show()

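# Books in the selected cluster
# cluster_num holds the label of the cluster to inspect (0 to true_k - 1);
# it must be set before running this cell.
book_clusters[book_clusters.cluster == int(cluster_num)]
#book_clusters[book_clusters.cluster == 1]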

🕵🏼‍♀️ Amazon Review Scraping & Summary

Review Scraping Overview

We automated the extraction of review data from Amazon, ensuring the process managed multiple pages and handled any potential errors or disruptions efficiently. Our methods were in strict compliance with ethical scraping guidelines, including observance of robots.txt and maintaining a non-disruptive request rate.
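The practices listed above (checking robots.txt, pacing requests, and continuing after errors) can be sketched in a site-agnostic way. The code below is a minimal illustration, not the scraper used in this project: `build_robot_rules`, `fetch_all`, the `demo-bot` user agent, and the example URLs are all hypothetical, and the `fetch` callable is injected so the example runs without network access.

```python
import time
from urllib import robotparser

def build_robot_rules(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

def fetch_all(urls, fetch, rules, user_agent="demo-bot", delay_s=1.0, max_retries=2):
    """Fetch each allowed URL, pausing between requests and retrying on errors.

    `fetch` is any callable that takes a URL and returns the page text,
    for example a thin wrapper around requests.get.
    """
    results = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            results[url] = None  # disallowed by robots.txt: skip it
            continue
        for attempt in range(max_retries + 1):
            try:
                results[url] = fetch(url)
                break
            except OSError:
                # Network-style failure: record it and back off before retrying.
                results[url] = None
                time.sleep(delay_s * (attempt + 1))
        time.sleep(delay_s)  # keep a non-disruptive request rate
    return results

# Usage with a fake fetcher (no real requests are made).
rules = build_robot_rules("User-agent: *\nDisallow: /private/\n")

def fake_fetch(url):
    if "broken" in url:
        raise OSError("simulated network error")
    return f"<html>{url}</html>"

pages = fetch_all(
    [
        "https://example.com/reviews?page=1",
        "https://example.com/private/x",
        "https://example.com/broken",
    ],
    fake_fetch,
    rules,
    delay_s=0.0,
)
print({url: (text is not None) for url, text in pages.items()})
```

A failed or disallowed URL is stored as `None` rather than raising, so one bad page does not interrupt the rest of the run.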

Summary with BERT

For review summarization, we leveraged BERT, a pre-trained language model, renowned for its ability to distill informative elements from extensive text. We navigated challenges like condensing lengthy reviews into coherent summaries and addressing discrepancies between machine and human text comprehension. The aim was to provide succinct yet comprehensive representations of customer opinions, harnessing BERT's natural language understanding capabilities to capture the essence of each review.

# Generate the URL of a product's review page on Amazon from its product URL.
# Takes a product URL as input and returns the corresponding review page URL,
# or None if the URL cannot be converted.
def get_review_url(product_url):
    try:
        split_url = product_url.split('dp')
        product_number = split_url[1].split('/')[1]
        review_url = split_url[0] + 'product-reviews/' + product_number + "/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
    except (AttributeError, IndexError):
        # Non-string input (e.g. NaN) or a URL without the expected 'dp' segment
        review_url = None
    # Return the constructed review URL, or None if an error occurred
    return review_url

# Create review URLs for each book in the dataset.
df_books['review_urls'] = df_books['complete_link'].apply(get_review_url)

# Keep only the books with a valid review URL in a new DataFrame
df_reviews = df_books.loc[~df_books['review_urls'].isnull()].reset_index()

df_reviews

Review Scraping Overview

To build a repository of consumer reviews, we extracted reviews from Amazon's product pages. The approach rests on transforming each product URL into the URL of its review page, which means modifying the product URL structure to point directly at the section where users have left their opinions about that product. The method is adapted to the needs of this project from a script shared on GitHub by @jrjames83. If a URL cannot be converted correctly, it is recorded as missing and handled without interrupting the scraping process.

Source: amazon_review_scraper.py

Review Aggregation

Reviews for each book are combined into a single text entry, enabling us to analyze collective feedback rather than individual comments. This step simplifies the dataset and prepares it for the summarization process.

# `out` is assumed to be the DataFrame of scraped reviews returned by the
# scraping step, which is not shown here.
book_reviews = out

# Group the reviews by title and concatenate each book's reviews into a single string.
book_reviews['review_text'] = book_reviews['review_text'].astype(str)
book_reviews_agg = book_reviews.groupby(['title'], as_index=False).agg({'review_text': ' '.join})
book_reviews_agg

Summarization with BERT

In this phase, we use Summarizer, a tool that leverages BERT to compress the aggregated reviews into concise summaries. The ratio parameter is critical: a value of 0.2 asks the model to keep roughly 20% of the original sentences, striking a balance between brevity and substance. Because the summarization is extractive, the summary is made of sentences taken from the reviews themselves. It is not a mere truncation of the text: the model selects the sentences it considers most representative, giving us a condensed but rich version of the collective opinion.

Install bert-extractive-summarizer if not done already!

# Extractive summarizer for book reviews
from summarizer import Summarizer

# Initialize the BERT-based model used to summarize the aggregated reviews.
bert_model = Summarizer()

# Generate a summary for the third item (index 2) in the 'review_text' column.
# ratio=0.2 asks for a summary of about 20% of the original sentences.
# ''.join() converts the summary output to a single string if it's not already one.
bert_summary = ''.join(bert_model(book_reviews_agg.review_text[2], ratio=0.2))

# Show the generated summary.
print(bert_summary)

print(book_reviews_agg.review_text[2])

from IPython.display import display, Markdown

display(Markdown(book_reviews_agg.review_text[2]))
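To show what "extractive" means in practice, the sketch below implements a simplified stand-in for the summarizer: it represents each sentence as a vector, clusters the vectors with K-means, and keeps the sentence closest to each centroid. TF-IDF vectors are swapped in for BERT sentence embeddings so that the example runs without downloading a model. The function name `extractive_summary` and the sample review text are hypothetical, and this is not the library's actual implementation.

```python
import re
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text, ratio=0.2):
    """Pick roughly `ratio` of the sentences, keeping their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n_keep = max(1, round(len(sentences) * ratio))
    if n_keep >= len(sentences):
        return sentences

    # TF-IDF vectors stand in for BERT sentence embeddings here.
    vectors = TfidfVectorizer().fit_transform(sentences).toarray()
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(vectors)

    # For each cluster, keep the sentence nearest to the centroid.
    chosen = set()
    for centroid in km.cluster_centers_:
        distances = np.linalg.norm(vectors - centroid, axis=1)
        chosen.add(int(np.argmin(distances)))
    return [sentences[i] for i in sorted(chosen)]

# Made-up aggregated review text with ten sentences.
reviews = (
    "The book explains pandas clearly. The examples are easy to follow. "
    "I liked the chapter on data cleaning. The printing quality is poor. "
    "Some pages arrived damaged. The binding fell apart quickly. "
    "The exercises helped me practice. Code samples run without errors. "
    "The index is hard to use. Overall a solid introduction to data analysis."
)

summary = extractive_summary(reviews, ratio=0.2)
print(" ".join(summary))
```

Every sentence in the output appears verbatim in the input, which is the defining property of extractive summarization and the reason the summary can be checked directly against the original reviews.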

Interpreting Summarization Outputs

The result lets us compare two texts: the algorithmically generated summary and the original text of the reviews. The comparison reveals the ability (or inability) of the model to capture the nuances of the comments. Reading the summary alongside the original text allows us to check the effectiveness of the automatic synthesis: whether the essence of the text has been preserved or whether divergences in comprehension have occurred.

Conclusion

This project examined the interplay between book price, book length, and consumer reviews, offering insights into market dynamics. Our exploratory data analysis showed nuanced but significant relationships between these factors and a book's market performance. Clustering analysis revealed the existence of distinct categories within book titles, demonstrating the diversity of topics in the field of data science.

The use of NLP techniques, such as BERT to summarize reviews, demonstrated the transformative potential of machine learning to extract meaningful information from large data sets. Although the scope of this project was limited to a data snapshot, the methodologies applied here pave the way for further exploratory studies to decipher the intricate patterns of the publishing industry.

In closing, we recognize the richness of the data available to us and the myriad opportunities they present for future analysis. This project has not only highlighted current trends, but has also laid the groundwork for predictive modeling and trend analysis in the literary field.