Applying VADER Sentiment Analysis and KNN Classification on Amazon Reviews for Automated Seller Recommendations
With many versions of popular products, as well as new innovations, being sold exclusively on Amazon, people are beginning to generate income simply by selling their own or bulk-ordered products on the platform. However, once reviews begin to pile up, it is difficult for a single person to sift through them all and decide which aspects of a product are strong and which need improvement to boost sales. Using VADER sentiment analysis, we can score the positivity/negativity of a sentence, and then, using either word buckets or KNN, assign it a topic of concern. With this information, sellers can pinpoint which areas of their individual product need improvement and which can be marketed.
Testing VADER's Accuracy in Predicting Positivity/Negativity of Amazon Reviews
Step 1: Import Necessary Modules and Read in Dataset
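A minimal sketch of this setup in Python; the file name "amazon_reviews.csv" and its columns are stand-ins for the actual Kaggle files used in the project:

```python
# Sketch only: "amazon_reviews.csv" and its column names are assumptions.
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

reviews = pd.read_csv("amazon_reviews.csv")   # Kaggle Amazon review dataset
analyzer = SentimentIntensityAnalyzer()       # VADER sentiment scorer
```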
By comparing VADER's predictions against the labeled ratings in the dataset pulled from Kaggle, we calculated the percentage of reviews that were correctly analyzed.
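That comparison could look roughly like this, assuming columns named "rating" (1-5 stars) and "review_text", and treating 4-5 stars as positive and 1-2 stars as negative:

```python
# VADER's "compound" score ranges from -1 (most negative) to +1 (most positive).
def vader_is_positive(text):
    return analyzer.polarity_scores(str(text))["compound"] >= 0

labeled = reviews[reviews["rating"] != 3]         # drop ambiguous 3-star reviews
predicted = labeled["review_text"].apply(vader_is_positive)
actual = labeled["rating"] >= 4                   # assumption: 4-5 stars = positive
print(f"Agreement with star ratings: {(predicted == actual).mean():.1%}")
```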
Applying VADER to a Simple Word-Bucket Algorithm
Our results are highlighted in two graphs that inform the producer which areas of the "SHUMEI Custom MacBook Air 13 inch Case Model A1369/A1466" can be improved and which areas are already successful.
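A sketch of the word-bucket pass itself: each sentence is scored with VADER, and its score is added to every topic whose hand-picked signal words (from the CSVs described in the limitations section) appear in it. The variable product_reviews is a stand-in for one product's scraped reviews:

```python
import nltk
nltk.download("punkt", quiet=True)                # sentence tokenizer models

# Load the hand-picked signal words for each topic (one word per line assumed).
buckets = {
    topic: set(pd.read_csv(path, header=None)[0].str.lower())
    for topic, path in [("price", "Price.csv"), ("quality", "Quality.csv"),
                        ("shipping", "Shipping.csv"),
                        ("as advertised", "as-advertised.csv")]
}
topic_scores = {topic: [] for topic in buckets}

product_reviews = reviews["review_text"]          # stand-in for one product's reviews
for review in product_reviews:
    for sentence in nltk.sent_tokenize(str(review)):
        compound = analyzer.polarity_scores(sentence)["compound"]
        words = set(sentence.lower().split())
        for topic, signal_words in buckets.items():
            if words & signal_words:              # sentence mentions this topic
                topic_scores[topic].append(compound)

for topic, scores in topic_scores.items():
    if scores:
        print(f"{topic}: average sentiment {sum(scores) / len(scores):+.2f}")
```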
Attempting to Improve Word-Buckets with KNN Classification
The word-bucket algorithm is not efficient, and it can also classify a single sentence under more than one topic, so a classification algorithm like KNN could be more effective.
Step 1: Build the Testing and Training dataset
Import extra NLTK packages to get the most frequent words
Tokenize 10,000 rows of the large dataset to separate out all the words
Create a loop to count the frequency of each word and store the counts in a dictionary
Use the heapq package to get a list of the 500 most frequent words in "all" Amazon reviews (these steps are sketched below)
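A sketch of these four steps, continuing from the setup above and again assuming a "review_text" column:

```python
import heapq
import nltk
nltk.download("punkt", quiet=True)

# Count word frequencies across the first 10,000 reviews.
word_counts = {}
for review in reviews["review_text"].head(10000):
    for word in nltk.word_tokenize(str(review).lower()):
        if word.isalpha():                        # skip punctuation and numbers
            word_counts[word] = word_counts.get(word, 0) + 1

# heapq.nlargest finds the top 500 without sorting the whole dictionary.
most_frequent = heapq.nlargest(500, word_counts, key=word_counts.get)
```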
In order to expand the scope of the word-recognition system, stem words can be useful in increasing accuracy. Read in the stem.csv table and use it to find the unique stems of the 500 most frequent words.
These 500 stems will be the column titles of our large features table, used to train a KNN classification tool.
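A sketch of the stemming step, assuming stem.csv has "word" and "stem" columns (the real table's layout may differ):

```python
# Map each frequent word to its stem; unmapped words fall back to themselves.
stem_table = pd.read_csv("stem.csv")
word_to_stem = dict(zip(stem_table["word"], stem_table["stem"]))
stems = sorted({word_to_stem.get(w, w) for w in most_frequent})
```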
The next lines of code are commented out because they were used to create the training set and write it to a permanent CSV file, so this process does not have to be repeated.
The final result of this process is a fully filled table of how often each of the 500 most frequent words appears in each of the 450 testing and training Amazon reviews.
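Such a table could be built along these lines, with labeled_reviews standing in for the 450 hand-chosen reviews:

```python
# Count how often each stem appears in one review's text.
def featurize(text):
    counts = dict.fromkeys(stems, 0)
    for word in nltk.word_tokenize(str(text).lower()):
        stem = word_to_stem.get(word, word)
        if stem in counts:
            counts[stem] += 1
    return counts

labeled_reviews = reviews.head(450)               # stand-in for the 450 chosen reviews
features = pd.DataFrame([featurize(t) for t in labeled_reviews["review_text"]])
# features.to_csv("all_features.csv", index=False)  # written once, labeled by hand,
#                                                   # then reread as all_features_labeled.csv
```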
With a team of 4, it took just half an hour to manually label each of these reviews as related to "price", "quality", "as advertised", or "shipping". The Excel sheet is then imported back as "all_features_labeled.csv".
Step 2: Create the KNN model
Method 1: Use Data 8 Manual Calculation of Nearest Neighbors via Euclidean Distance
In this method, around 20 words are selected by intuition to be the classifying features of our KNN model
Split the labeled dataset into testing and training sets, and also create a features-only table for each
Define function fast_distances that calculates the Euclidean distance between two rows, and a most_common function that finds the most common label in a table
Define a general classification function for a single test row of features.
Define a specific classification function with a specific number of neighbors and the specific table of features we are using
Run the classification on each row of the test data set!
Test the correctness of the predictions against the real categories for each review (this whole method is sketched below)
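A NumPy-based sketch of the whole method; the five stems below are illustrative stand-ins for the ~20 intuition-picked features, and the "Category" column name is an assumption:

```python
import numpy as np

labeled_features = pd.read_csv("all_features_labeled.csv")
selected = ["price", "cheap", "ship", "broke", "describ"]   # illustrative subset

# Split the labeled data into training and testing sets.
train = labeled_features.sample(frac=0.8, random_state=0)
test = labeled_features.drop(train.index)
train_X = train[selected].to_numpy(dtype=float)
train_labels = train["Category"].to_numpy()

def fast_distances(test_row, train_rows):
    """Euclidean distance from one test row to every training row."""
    return np.sqrt(((train_rows - test_row) ** 2).sum(axis=1))

def most_common(labels):
    """Most frequent label in an array of labels."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

def classify(test_row, k=5):
    """Label a test row by majority vote among its k nearest training neighbors."""
    nearest = np.argsort(fast_distances(test_row, train_X))[:k]
    return most_common(train_labels[nearest])

predictions = np.array([classify(row) for row in test[selected].to_numpy(dtype=float)])
print("Accuracy:", np.mean(predictions == test["Category"].to_numpy()))
```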
Method 2: Convert to pandas and Plug into scikit-learn's KNN Model
Import the necessary packages for seaborn and scikit-learn
Create a new features array that also includes "Categories" among the selected columns.
Convert the labeled dataset into a pandas DataFrame, and clean it into a scikit-learn-friendly format by encoding each category as a number.
Using seaborn, plot pairwise (1v1) scatter plots of each feature to see if we can find the best features to use
Set the x-data and y-data to put the features into a scikit-learn-friendly format.
Run scikit-learn's KNeighborsClassifier on the data to get predictions for the y variable, i.e., the category.
Use scikit-learn's metrics to produce an accuracy report on the KNN classification of the test set (this whole method is sketched below).
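A sketch of this scikit-learn version, reusing the assumed names from the manual method above:

```python
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

df = labeled_features.copy()
df["Category_num"] = df["Category"].astype("category").cat.codes  # encode topics as numbers

# Pairwise scatter plots of the selected features, colored by category:
# sns.pairplot(df[selected + ["Category"]], hue="Category")

X, y = df[selected], df["Category_num"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(metrics.classification_report(y_test, knn.predict(X_test)))
```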
Conclusions and Limitations
Conclusions
1. With 95% confidence, VADER correctly predicts the positive/negative sentiment of Amazon reviews 66.6%-72.5% of the time, which is significantly better than random chance.
Consequences of this: VADER is a quick solution for classifying the sentiment of text, but it is not the most accurate on Amazon reviews. A different algorithm, or an entirely different sentiment analysis tool, might achieve higher accuracy.
2. We can use a word-bucket algorithm to collect all reviews of a single product and calculate the sentiment towards different aspects of the product, such as price, shipping, quality, and description.
Consequences of this: This is an inelegant but logically sound algorithm that provides insight into every sentence of a product's reviews and identifies core aspects that can be improved or marketed.
3. Using a training set, we were able to manually train a KNN classifier to classify the topic of a review as price, quality, shipping, or description issues/perks.
Consequences of this: We were able to classify the test set of reviews with 58% accuracy, better than the 25% expected from guessing at random among the 4 categories.
4. Using the same training set, a scikit-learn KNN classifier was able to predict the topic of a review in a much more elegant manner.
Consequences of this: The scikit-learn classifier classified the test set with 67% accuracy, significantly better than the 58% achieved by our manual KNN classifier.
Limitations
Limitations with KNN model:
The reviews used for our KNN training consisted mostly of music and movie reviews, which can cause inaccuracies when analyzing reviews for other types of products. This could be improved with a larger training set (our test used just 450 reviews) and a greater variety of product reviews.
The word-bucket algorithm is intuitively better than the KNN because it analyzes by sentence and can assign multiple topics to a single sentence, while the KNN assigns only one topic to each entire review. Additionally, MonkeyLearn could have been a more accurate way to classify the topics of each sentence.
The KNN model's features are chosen by "intuition", but in a perfect world, we would individually analyze all of the scatter plots in the pair plot to decide which words best distinguish between the categories.
Limitations with the Word-Bucket Algorithm:
The signal words for each category in "Price.csv", "Quality.csv", "Shipping.csv", and "as-advertised.csv" are hand-picked, so they may not be a good representation of which words correlate with which feature.
Feature selection could be implemented to pick the most important words, and the stem-word method from the KNN section could be used to expand the scope of each word-bucket to include typos, tenses, and other branches of the same word.
Limitations with the software's user-friendliness:
Our operational code example does not include automated web scraping to collect all current reviews for a user-inputted Amazon link. We used a web-scraping extension to collect reviews, but a possible extension could be the use of an Amazon API to easily access product-specific data.