Data Science Book Analysis
A review of the most popular Data Science books on Amazon: Combining Exploratory Analysis, Clustering and NLP
Author: Francesca Fuentes
Dataset: The data comes from the following link 🔗 Amazon Data Science Books Dataset
Introduction
The domain of data science has experienced a meteoric rise in popularity, paralleled by a burgeoning market for literature in the field. This surge in interest raises the question of how market dynamics, such as demand, affect book pricing and consumer reviews. In this analysis, we aim to explore several facets of data science literature. We seek to uncover whether there is a correlation between the cost of data science books and their user ratings, how book length may factor into pricing, and which titles are heralded as the 'best' within specific subcategories such as Python programming, Machine Learning, and Deep Learning. Through this exploration, we intend to shed light on the value proposition these educational resources offer to learners at different stages of their data science journey.
Methodology
We began by obtaining a dataset of data science books from Kaggle, which served as the foundation for our analysis. Initial data cleaning and preprocessing included handling missing values and normalizing fields for consistency. An exploratory data analysis (EDA) followed, using statistical methods to uncover patterns and distributions within the data. For textual data, we implemented TF-IDF (Term Frequency-Inverse Document Frequency) to weigh the importance of words across the book text fields, which gives the clustering step a more informative representation than raw word counts.
The K-means algorithm was employed to segment books into distinct groups based on their textual features, allowing us to identify clusters related to specific subtopics within data science. We extended our analysis to user-generated content by scraping reviews from relevant online sources. These reviews were then analyzed using the BERT (Bidirectional Encoder Representations from Transformers) model to extract sentiment scores, providing insight into user opinions. To illustrate our methodology, we include select code snippets and graphical representations that highlight key steps and findings in our data processing and analysis pipeline.
To do
🔦 Exploratory Data Analysis (EDA) on DS books
In the EDA, we delved into the relationship between book prices and the reviews they garner. We utilized scatter plots to discern if higher-priced books correlate with more favorable or numerous reviews. To deepen our understanding, we calculated correlation coefficients and employed regression models to gauge the strength and statistical significance of this relationship.
We also investigated the connection between a book's length, measured by page count, and its price. Scatter plots and linear regression models were insightful here as well. Variance analysis and statistical tests were considered to ascertain if there are significant price differences across book length categories.
For identifying the 'best' books within specific domains like Python, Machine Learning, and Deep Learning, we based our criteria on review quantity and average ratings. This was further enriched by a sentiment analysis of the reviews to assess the quality, providing a more layered understanding of the books' reception.
💰 Price vs reviews
We're exploring the potential correlation between the pricing of data science books and the user reviews they receive. Scatter plots are leveraged to visualize whether more expensive books tend to garner better or more reviews. To delve deeper, correlation coefficients and regression models are computed to assess the strength and statistical significance of this relationship.
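As a minimal sketch of the correlation and regression step described above: the column names (price, avg_reviews, n_reviews) and the toy values are assumptions made for illustration, not the actual dataset or its results.

```python
import pandas as pd
from scipy import stats

# Toy rows standing in for the real dataset; names and values are invented.
books = pd.DataFrame({
    "price": [19.99, 34.50, 45.00, 27.95, 59.99, 39.99],
    "avg_reviews": [4.2, 4.5, 4.6, 4.3, 4.7, 4.4],
    "n_reviews": [120, 340, 95, 410, 60, 220],
})


def price_relationship(df, target):
    """Pearson r, p-value and fitted slope for price vs. one review column."""
    clean = df[["price", target]].dropna()
    fit = stats.linregress(clean["price"], clean[target])
    return {"r": fit.rvalue, "p_value": fit.pvalue, "slope": fit.slope}


for column in ("avg_reviews", "n_reviews"):
    print(column, price_relationship(books, column))
```

With a simple linear fit, the p-value reported by linregress tests whether the slope differs from zero, so one call covers both the strength (r) and the significance of the relationship.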
💰 Price vs book length
Our analysis investigates whether there is a relationship between the length of a book (potentially measured in number of pages) and its price. Again, scatter plots are used.
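The sketch below illustrates one way to quantify the length/price relationship alongside the scatter plots: a least-squares line plus a comparison of median prices across length buckets. The column names, bucket boundaries and values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy rows; "pages" and "price" are assumed column names with invented values.
books = pd.DataFrame({
    "pages": [180, 250, 320, 410, 560, 720, 850],
    "price": [18.0, 24.5, 29.0, 35.0, 42.0, 55.0, 61.5],
})

# Least-squares line: price is approximated by slope * pages + intercept.
slope, intercept = np.polyfit(books["pages"], books["price"], deg=1)

# Bucket books by length (hypothetical cut points) and compare typical prices.
books["length_bin"] = pd.cut(
    books["pages"],
    bins=[0, 300, 600, np.inf],
    labels=["short", "medium", "long"],
)
median_price = books.groupby("length_bin", observed=True)["price"].median()

print(f"estimated price increase per extra page: {slope:.3f}")
print(median_price)
```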
🏆 Best Python books
We identify the best Python books by focusing on titles specifically related to Python programming, while excluding those that delve into Machine Learning and Deep Learning. Our selection criteria are based on the number of reviews and the average rating of reviews. The top 10 Python books are then highlighted, giving readers insight into the most popular and well-regarded Python resources in the field.
🏆 Best Machine Learning books
Our methodology pinpoints the leading books in Machine Learning by filtering out general Python programming and Deep Learning titles. We then rank these books by the volume of reviews and their average ratings to present the top 10 Machine Learning books. This approach helps readers find the most authoritative and valuable texts for advancing their Machine Learning expertise.
🏆 Best Deep Learning books
In the case of Deep Learning books, we include those that may also cover Python, given its significance in the Deep Learning space. We sort these books by the quantity and quality of their reviews, enabling us to showcase the top 10 Deep Learning books. This list serves as a guide for those seeking the most impactful sources of knowledge on the topic of Deep Learning.
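The three rankings above share the same shape: filter titles by keyword, then sort by review count and average rating. A possible sketch follows. The helper top_books, the column names, the keyword lists and the sort priority (review count first, rating as tie-breaker) are all assumptions; the titles and numbers are invented.

```python
import re

import pandas as pd

# Invented titles and figures, used only to exercise the filter.
books = pd.DataFrame({
    "title": [
        "Python Basics",
        "Python for Machine Learning",
        "Deep Learning in Python",
        "Intermediate Python",
        "Machine Learning Foundations",
        "Deep Learning Primer",
    ],
    "n_reviews": [900, 700, 1200, 2500, 3100, 1900],
    "avg_reviews": [4.7, 4.4, 4.5, 4.3, 4.8, 4.3],
})


def top_books(df, include, exclude=(), n=10):
    """Top-n titles mentioning any `include` keyword and no `exclude` keyword."""
    def mentions(keywords):
        pattern = "|".join(re.escape(k) for k in keywords)
        return df["title"].str.contains(pattern, case=False, regex=True, na=False)

    keep = mentions(include)
    if exclude:
        keep &= ~mentions(exclude)
    ranked = df[keep].sort_values(["n_reviews", "avg_reviews"], ascending=False)
    return ranked.head(n)


python_books = top_books(books, ["python"], ["machine learning", "deep learning"])
ml_books = top_books(books, ["machine learning"], ["deep learning"])
dl_books = top_books(books, ["deep learning"])  # Python overlap is allowed here
```

Keeping the include and exclude lists as arguments makes the category definitions explicit and easy to adjust if, for example, a combined score should replace the two-column sort.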
🧐 Cluster Analysis of Book Titles
In this study, we conducted a cluster analysis to discover different types of data science books based on their titles. Employing the TF-IDF (Term Frequency-Inverse Document Frequency) technique, we weighted the importance of words in the corpus, emphasizing less frequent but potentially more indicative words in book titles. This step was crucial for highlighting unique themes within the data science literature.
Next, the K-means clustering algorithm was applied to divide the titles into coherent clusters. The optimal number of clusters was determined using the elbow method, which helped us identify the point beyond which adding more clusters yields only marginal reductions in within-cluster variance, indicating an appropriate number of clusters for our analysis. The resulting clusters provided a meaningful categorization of the books, reflecting the various niches and areas of interest within the field of data science.
TF-IDF: With this method we can see how the importance of words within the corpus is weighted, giving more weight to less frequent words across titles, which can be crucial for identifying unique topics.
K-means: We will be able to see how this algorithm partitions the titles into K clusters, and how we determined the optimal number of clusters with the elbow method.
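The two steps can be sketched together as follows. Note the substitution: this project locates the elbow with KneeLocator, whereas the sketch uses a simple chord-distance heuristic (the hypothetical helper elbow) so that it runs with only NumPy and scikit-learn. It is not the same algorithm and may select a different k. The titles are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [  # invented examples
    "python programming for beginners",
    "learning python the practical way",
    "python data analysis cookbook",
    "machine learning with scikit-learn",
    "practical machine learning projects",
    "introduction to machine learning",
    "deep learning with neural networks",
    "deep learning for computer vision",
    "statistics for data science",
    "practical statistics and probability",
]

X = TfidfVectorizer(stop_words="english").fit_transform(titles)

# Within-cluster sum of squares (inertia) for each candidate k.
ks = list(range(1, 7))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks
]


def elbow(ks, inertias):
    """Return the k whose point lies farthest below the chord joining the endpoints."""
    x = (np.array(ks) - ks[0]) / (ks[-1] - ks[0])
    y = (np.array(inertias) - inertias[-1]) / (inertias[0] - inertias[-1])
    # After normalization the chord runs from (0, 1) to (1, 0), i.e. y = 1 - x.
    return ks[int(np.argmax(1 - x - y))]


best_k = elbow(ks, inertias)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
print("chosen k:", best_k)
```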
💡 What are the main types of Data Science books?
The elbow method, computed with KneeLocator, recommends 5 clusters.
🕵🏼♀️ Amazon Review Scraping & Summary
Review Scraping Overview
We automated the extraction of review data from Amazon, ensuring the process managed multiple pages and handled any potential errors or disruptions efficiently. Our methods were in strict compliance with ethical scraping guidelines, including observance of robots.txt and maintaining a non-disruptive request rate.
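As an illustration of the two safeguards mentioned above, the sketch below checks URLs against robots.txt rules with the standard library's urllib.robotparser and enforces a minimum delay between requests. The robots.txt content, the user agent name, the example.com URLs and the Throttle class are hypothetical; this is not the project's scraper.

```python
import time
from urllib import robotparser

# Hypothetical robots.txt content, parsed locally so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = robotparser.RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


class Throttle:
    """Enforce a minimum number of seconds between consecutive requests."""

    def __init__(self, min_delay, sleep=time.sleep, clock=time.monotonic):
        self.min_delay = min_delay
        self._sleep = sleep
        self._clock = clock
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


for url in ("https://example.com/books", "https://example.com/private/data"):
    allowed = rules.can_fetch("review-bot", url)
    print(url, "allowed" if allowed else "blocked by robots.txt")
```

In a real loop, each request would be preceded by a can_fetch check and a call to Throttle.wait(), so that disallowed pages are skipped and the request rate stays bounded.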
Summary with BERT
For review summarization, we leveraged BERT, a pre-trained language model, renowned for its ability to distill informative elements from extensive text. We navigated challenges like condensing lengthy reviews into coherent summaries and addressing discrepancies between machine and human text comprehension. The aim was to provide succinct yet comprehensive representations of customer opinions, harnessing BERT's natural language understanding capabilities to capture the essence of each review.
Review Scraping Overview
To build a repository of consumer reviews, we extracted reviews from Amazon's product pages. The approach transforms each Amazon product URL into the URL of its review page, modifying the URL structure to directly access the section where users have left their opinions about that product. The method is adapted from code shared on GitHub by @jrjames83 and tailored to the specific needs of this project. If a URL cannot be converted correctly, it is noted and handled without interrupting the scraping process.
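The URL transformation can be sketched as below. The URL layout (a 10-character product identifier after /dp/ or /gp/product/, and a /product-reviews/ path for the review page) is an assumption about Amazon's URL scheme, and the function names are hypothetical; the adapted script in amazon_review_scraper.py may work differently.

```python
import re

# Assumed layout: a 10-character identifier follows /dp/ or /gp/product/.
ID_PATTERN = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?:[/?]|$)")


def to_review_url(product_url):
    """Return the review-page URL, or None if no product identifier is found."""
    match = ID_PATTERN.search(product_url)
    if match is None:
        return None
    return f"https://www.amazon.com/product-reviews/{match.group(1)}"


def convert_all(product_urls):
    """Convert every URL, collecting failures instead of raising."""
    converted, failed = [], []
    for url in product_urls:
        review_url = to_review_url(url)
        if review_url is None:
            failed.append(url)  # noted for later inspection; the loop continues
        else:
            converted.append(review_url)
    return converted, failed


converted, failed = convert_all([
    "https://www.amazon.com/Some-Book/dp/1234567890/ref=sr_1_1",  # made-up ID
    "https://www.amazon.com/s?k=data+science",                    # not a product page
])
print(converted)
print(failed)
```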
Source: amazon_review_scraper.py
Review Aggregation
Reviews for each book are combined into a single text entry, enabling us to analyze collective feedback rather than individual comments. This step simplifies the dataset and prepares it for the summarization process.
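A minimal sketch of this aggregation with pandas, assuming one row per review and hypothetical column names title and review:

```python
import pandas as pd

# Invented reviews; one row per individual review.
reviews = pd.DataFrame({
    "title": ["Book A", "Book A", "Book B"],
    "review": ["Clear examples.", "  Too basic for me. ", "Great reference."],
})

aggregated = (
    reviews.dropna(subset=["review"])                      # skip empty reviews
    .assign(review=lambda d: d["review"].str.strip())      # tidy whitespace
    .groupby("title")["review"]
    .agg(" ".join)                                         # one text per book
    .reset_index()
)
print(aggregated)
```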
Summarization with BERT
In this phase, we use Summarizer, a tool that leverages BERT to compress the aggregated reviews into concise summaries. The ratio parameter is critical: set to 0.2, it instructs the model to keep roughly 20% of the original sentences, striking a balance between brevity and substance. This extractive summarization process is not a mere truncation of the text. It selects the sentences that best represent the whole, aiming to retain the most salient points and thus providing a condensed but rich version of the collective opinion.
Install bert-extractive-summarizer if not done already, for example with pip install bert-extractive-summarizer.
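A hedged sketch of how the summarization call might be wrapped. The helper summarize_books and its model argument are hypothetical. The default branch loads Summarizer from the bert-extractive-summarizer package, which typically downloads a pre-trained model on first use, so the example only defines the function and shows the call in a comment.

```python
def summarize_books(aggregated, ratio=0.2, model=None):
    """Summarize each book's combined reviews.

    aggregated: dict mapping a book title to its concatenated review text.
    model: any callable accepting (text, ratio=...); defaults to BERT Summarizer.
    """
    if model is None:
        # Imported lazily so the module loads even without the package installed.
        from summarizer import Summarizer
        model = Summarizer()
    return {title: model(text, ratio=ratio) for title, text in aggregated.items()}


# Example call (not executed here because it loads the BERT model):
# summaries = summarize_books({"Book A": "First review. Second review. ..."})
```

Accepting the model as an argument lets one instance be reused across all books and makes it possible to substitute a lightweight stand-in when experimenting.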
Interpreting Summarization Outputs
The result presents us with a dichotomy: the algorithmically generated summary versus the actual text of the reviews. It reveals the ability (or inability) of the model to capture the nuances of the comments. Analyzing the summary together with the original text allows us to check the effectiveness of the automatic synthesis: whether the essence of the text has been preserved or whether divergences in comprehension have occurred.
Conclusion
This project has explored what could be a complex interplay between book price, book length, and consumer reviews, revealing key insights into market dynamics. Our exploratory data analysis showed nuanced but significant relationships between these factors and a book's market performance. Clustering analysis revealed the existence of distinct categories within book titles, demonstrating the diversity of topics in the field of data science.
The use of NLP techniques, such as BERT to summarize reviews, demonstrated the transformative potential of machine learning to extract meaningful information from large data sets. Although the scope of this project was limited to a data snapshot, the methodologies applied here pave the way for further exploratory studies to decipher the intricate patterns of the publishing industry.
In closing, we recognize the richness of the data available to us and the myriad opportunities they present for future analysis. This project has not only highlighted current trends, but has also laid the groundwork for predictive modeling and trend analysis in the literary field.