Applying VADER Sentiment Analysis and KNN Classification on Amazon Reviews for Automated Seller Recommendations
With many versions of popular products, as well as new innovations, being sold exclusively on Amazon, people are beginning to generate income simply by selling their own or bulk-ordered products on the platform. However, once reviews begin to pile up, it is difficult for a single person to sift through them all and decide which aspects of a product are strong and which need improvement to boost sales. Using VADER sentiment analysis, we can score the positivity/negativity of a sentence, and then, using either word buckets or KNN, assign it a topic of concern. With this information, sellers can pinpoint which areas of their individual product need improvement and which can be marketed.
Testing VADER's Accuracy in Predicting Positivity/Negativity of Amazon Reviews
Step 1: Import Necessary Modules and Read in Dataset
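A minimal sketch of this setup in Python; the file name "amazon_reviews.csv" and its columns are stand-ins for the actual Kaggle files used in the project:

```python
# Sketch only: "amazon_reviews.csv" and its column names are assumptions.
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

reviews = pd.read_csv("amazon_reviews.csv")   # Kaggle Amazon review dataset
analyzer = SentimentIntensityAnalyzer()       # VADER sentiment scorer
```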
By comparing VADER's predictions against the labeled ratings in the dataset pulled from Kaggle, we calculated the percentage of reviews that were correctly analyzed.
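That comparison could look roughly like this, assuming columns named "rating" (1-5 stars) and "review_text", and treating 4-5 stars as positive and 1-2 stars as negative:

```python
# VADER's "compound" score ranges from -1 (most negative) to +1 (most positive).
def vader_is_positive(text):
    return analyzer.polarity_scores(str(text))["compound"] >= 0

labeled = reviews[reviews["rating"] != 3]         # drop ambiguous 3-star reviews
predicted = labeled["review_text"].apply(vader_is_positive)
actual = labeled["rating"] >= 4                   # assumption: 4-5 stars = positive
print(f"Agreement with star ratings: {(predicted == actual).mean():.1%}")
```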
Applying VADER to a Simple Word-Bucket Algorithm
Our results are highlighted in two graphs that inform the producer which areas of the "SHUMEI Custom MacBook Air 13 inch Case Model A1369/A1466" can be improved and which areas are already successful.
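A sketch of the word-bucket pass itself: each sentence is scored with VADER, and its score is added to every topic whose hand-picked signal words (from the CSVs described in the limitations section) appear in it. The variable product_reviews is a stand-in for one product's scraped reviews:

```python
import nltk
nltk.download("punkt", quiet=True)                # sentence tokenizer models

# Load the hand-picked signal words for each topic (one word per line assumed).
buckets = {
    topic: set(pd.read_csv(path, header=None)[0].str.lower())
    for topic, path in [("price", "Price.csv"), ("quality", "Quality.csv"),
                        ("shipping", "Shipping.csv"),
                        ("as advertised", "as-advertised.csv")]
}
topic_scores = {topic: [] for topic in buckets}

product_reviews = reviews["review_text"]          # stand-in for one product's reviews
for review in product_reviews:
    for sentence in nltk.sent_tokenize(str(review)):
        compound = analyzer.polarity_scores(sentence)["compound"]
        words = set(sentence.lower().split())
        for topic, signal_words in buckets.items():
            if words & signal_words:              # sentence mentions this topic
                topic_scores[topic].append(compound)

for topic, scores in topic_scores.items():
    if scores:
        print(f"{topic}: average sentiment {sum(scores) / len(scores):+.2f}")
```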
Attempting to Improve Word-Buckets with KNN Classification
The word-bucket algorithm is not efficient, and it can also classify a single sentence under more than one topic, so a classification algorithm like KNN could be more effective.
Step 1: Build the Testing and Training dataset
Import extra NLTK packages to get the most frequent words
Tokenize 10,000 rows of the large dataset to separate out all the words
Create a loop to count the frequency of each word and store the counts in a dictionary
Use the heapq package to get a list of the 500 most frequent words in "all" Amazon reviews (these steps are sketched below)
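A sketch of these four steps, continuing from the setup above and again assuming a "review_text" column:

```python
import heapq
import nltk
nltk.download("punkt", quiet=True)

# Count word frequencies across the first 10,000 reviews.
word_counts = {}
for review in reviews["review_text"].head(10000):
    for word in nltk.word_tokenize(str(review).lower()):
        if word.isalpha():                        # skip punctuation and numbers
            word_counts[word] = word_counts.get(word, 0) + 1

# heapq.nlargest finds the top 500 without sorting the whole dictionary.
most_frequent = heapq.nlargest(500, word_counts, key=word_counts.get)
```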
In order to expand the scope of the word-recognition system, stem words can be useful in increasing accuracy. Read in the stem.csv table and use it to find the unique stems of the 500 most frequent words.
These 500 stems will be the column titles of our large features table, used to train a KNN classification tool.
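A sketch of the stemming step, assuming stem.csv has "word" and "stem" columns (the real table's layout may differ):

```python
# Map each frequent word to its stem; unmapped words fall back to themselves.
stem_table = pd.read_csv("stem.csv")
word_to_stem = dict(zip(stem_table["word"], stem_table["stem"]))
stems = sorted({word_to_stem.get(w, w) for w in most_frequent})
```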
The next lines of code are commented out because they were used to create the training set and write it to a permanent CSV file, so this process does not have to be repeated.
The final result of this process is a fully filled table of how often each of the 500 most frequent words appears in each of the 450 testing and training Amazon reviews.
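Such a table could be built along these lines, with labeled_reviews standing in for the 450 hand-chosen reviews:

```python
# Count how often each stem appears in one review's text.
def featurize(text):
    counts = dict.fromkeys(stems, 0)
    for word in nltk.word_tokenize(str(text).lower()):
        stem = word_to_stem.get(word, word)
        if stem in counts:
            counts[stem] += 1
    return counts

labeled_reviews = reviews.head(450)               # stand-in for the 450 chosen reviews
features = pd.DataFrame([featurize(t) for t in labeled_reviews["review_text"]])
# features.to_csv("all_features.csv", index=False)  # written once, labeled by hand,
#                                                   # then reread as all_features_labeled.csv
```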
With a team of 4, it took just half an hour to manually label each of these reviews as related to "price", "quality", "as advertised", or "shipping". The Excel sheet is then imported back as "all_features_labeled.csv".
Step 2: Create the KNN model
Method 1: Use Data 8 Manual Calculation of Nearest Neighbors via Euclidean Distance
In this method, around 20 words are selected by intuition to be the classifying features of our KNN model
Split the labeled dataset into testing and training sets, and also create a features-only table for each
Define function fast_distances that calculates the Euclidean distance between two rows, and a most_common function that finds the most common label in a table
Define a general classification function for a single test row of features.
Define a specific classification function with a specific number of neighbors and the specific table of features we are using
Run the classification on each row of the test data set!
Test the correctness of the predictions against the real categories for each review (this whole method is sketched below)
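A NumPy-based sketch of the whole method; the five stems below are illustrative stand-ins for the ~20 intuition-picked features, and the "Category" column name is an assumption:

```python
import numpy as np

labeled_features = pd.read_csv("all_features_labeled.csv")
selected = ["price", "cheap", "ship", "broke", "describ"]   # illustrative subset

# Split the labeled data into training and testing sets.
train = labeled_features.sample(frac=0.8, random_state=0)
test = labeled_features.drop(train.index)
train_X = train[selected].to_numpy(dtype=float)
train_labels = train["Category"].to_numpy()

def fast_distances(test_row, train_rows):
    """Euclidean distance from one test row to every training row."""
    return np.sqrt(((train_rows - test_row) ** 2).sum(axis=1))

def most_common(labels):
    """Most frequent label in an array of labels."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

def classify(test_row, k=5):
    """Label a test row by majority vote among its k nearest training neighbors."""
    nearest = np.argsort(fast_distances(test_row, train_X))[:k]
    return most_common(train_labels[nearest])

predictions = np.array([classify(row) for row in test[selected].to_numpy(dtype=float)])
print("Accuracy:", np.mean(predictions == test["Category"].to_numpy()))
```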
Method 2: Convert to pandas and Plug into scikit-learn's KNN Model
Import the necessary packages for seaborn and scikit-learn
Create a new features array that also includes "Categories" among the selected columns.
Convert the labeled dataset into a pandas DataFrame, and clean it into a scikit-learn-friendly format by encoding each category as a number.
Using seaborn, plot pairwise (1v1) scatter plots of each feature to see if we can find the best features to use
Set the x-data and y-data to put the features into a scikit-learn-friendly format.
Run scikit-learn's KNeighborsClassifier on the data to get predictions for the y variable, i.e., the category.
Use scikit-learn's metrics to produce an accuracy report on the KNN classification of the test set (this whole method is sketched below).
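A sketch of this scikit-learn version, reusing the assumed names from the manual method above:

```python
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

df = labeled_features.copy()
df["Category_num"] = df["Category"].astype("category").cat.codes  # encode topics as numbers

# Pairwise scatter plots of the selected features, colored by category:
# sns.pairplot(df[selected + ["Category"]], hue="Category")

X, y = df[selected], df["Category_num"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(metrics.classification_report(y_test, knn.predict(X_test)))
```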
Conclusions and Limitations
Conclusions
1. With 95% confidence, VADER correctly predicts the positive/negative sentiment of Amazon reviews 66.6%-72.5% of the time, which is significantly better than random chance.
Consequences of this: VADER is a quick solution for classifying the sentiment of text, but it is not the most accurate on Amazon reviews. A different algorithm, or an entirely different sentiment analysis tool, might achieve higher accuracy.
2. We can use a word-bucket algorithm to collect all reviews of a single product and calculate the sentiment towards different aspects of the product, such as price, shipping, quality, and description.
Consequences of this: This is an inelegant but logically sound algorithm that provides insight into every sentence of a product's reviews and identifies core aspects that can be improved or marketed.
3. Using a training set, we were able to manually train a KNN classifier to classify the topic of a review as price, quality, shipping, or description issues/perks.
Consequences of this: We were able to classify the test set of reviews with 58% accuracy, better than the 25% expected from guessing at random among the 4 categories.
4. Using the same training set, a scikit-learn KNN classifier was able to predict the topic of a review in a much more elegant manner.
Consequences of this: The scikit-learn classifier classified the test set with 67% accuracy, significantly better than the 58% achieved by our manual KNN classifier.
Limitations
Limitations with KNN model:
The reviews used for our KNN training consisted mostly of music and movie reviews, which can cause inaccuracies when analyzing reviews for other types of products. This could be improved with a larger training set (our test used just 450 reviews) and a greater variety of product reviews.
The word-bucket algorithm is intuitively better than the KNN because it analyzes by sentence and can assign multiple topics to a single sentence, while the KNN assigns only one topic to each entire review. Additionally, MonkeyLearn could have been a more accurate way to classify the topics of each sentence.
The KNN model's features are chosen by "intuition", but in a perfect world, we would individually analyze all of the scatter plots in the pair plot to decide which words best distinguish between the categories.
Limitations with the Word-Bucket Algorithm:
The signal words for each category in "Price.csv", "Quality.csv", "Shipping.csv", and "as-advertised.csv" are hand-picked, so they may not be a good representation of which words correlate with which feature.
Feature selection could be implemented to pick the most important words, and the stem-word method from the KNN section could be used to expand the scope of each word-bucket to include typos, tenses, and other branches of the same word.
Limitations with the software's user-friendliness:
Our operational code example does not include automated web scraping to collect all current reviews for a user-inputted Amazon link. We used a web-scraping extension to collect reviews, but a possible extension could be the use of an Amazon API to easily access product-specific data.