Enhancing Guest Experiences through Sentiment Analysis of European Hotel Reviews
Introduction
"Enhancing Guest Experiences through Sentiment Analysis of European Hotel Reviews" was initiated with the goal of leveraging machine learning to analyze customer feedback from hotel reviews. In an industry where guest satisfaction is paramount, understanding the sentiment behind customer reviews is crucial for making informed business decisions. This project involved the integration of a large-scale hotel review dataset stored on Google Drive, comprehensive data preparation, and the development of a sentiment analysis model using a Naive Bayes classifier. The model was designed to accurately predict whether reviews were positive or negative, ultimately aiding hotel management in identifying strengths and addressing areas needing improvement. With an achieved accuracy of 92.7% on the testing set, the project demonstrates the potential for data-driven enhancements in guest experiences.
Problem Statement
The project aimed to enhance the guest experience by accurately predicting the sentiment of hotel reviews. This sentiment analysis is crucial for hotel management to understand customer feedback, identify areas of improvement, and make data-driven decisions to boost customer satisfaction.
Achievements
- **Data Integration:** Successfully integrated Google Drive for data storage and processing within the environment.
- **Data Preparation:** Efficiently separated and merged datasets to prepare a comprehensive dataset for analysis.
- **Model Development:** Built and trained a Naive Bayes sentiment analysis model, achieving an accuracy of 92.7% on the test set.
- **Model Deployment:** Deployed the model, enabling it to predict sentiment polarity for future hotel reviews and aid in improving guest experiences.
Data Acquisition and Preparation
The hotel reviews dataset stored on Google Drive is accessed by integrating Google Drive file storage into the environment; mounting the drive makes the dataset's storage directory available. Formatting the dataset involved separating it into two datasets (positive and negative reviews) that were later merged vertically into a single dataset. The modified dataset is then saved back to Google Drive storage.
Mounting Google Drive
Loading the review dataset
Studying the structure of the dataset and separating it
Merging the positive and negative review datasets vertically
Saving the merged dataset to Google Drive
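The report does not include the code for these steps. A minimal sketch is shown below, assuming a Google Colab environment; the file paths and the column names `Positive_Review`, `Negative_Review`, and `Sentiment` are illustrative, not taken from the report.

```python
import pandas as pd
from google.colab import drive

# Mount Google Drive so the dataset directory is visible in the environment.
drive.mount('/content/drive')

# Load the raw review dataset (path is illustrative).
raw = pd.read_csv('/content/drive/MyDrive/Hotel_Reviews.csv')

# Separate the data into positive and negative review frames.
pos = pd.DataFrame({'Review': raw['Positive_Review'], 'Sentiment': 'positive'})
neg = pd.DataFrame({'Review': raw['Negative_Review'], 'Sentiment': 'negative'})

# Merge the two frames vertically into a single dataset and save it to Drive.
df = pd.concat([pos, neg], ignore_index=True)
df.to_csv('/content/drive/MyDrive/merged_hotel_reviews.csv', index=False)
```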
Exploratory Data Analysis (EDA)
We conduct exploratory analysis on the dataset to understand its structure, distribution, and characteristics; visualize key features such as review sentiments, word frequency distributions, and any patterns in the data; and identify potential challenges or biases in the dataset.
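A few typical checks, sketched under the same assumptions as above (a merged dataframe `df` with a 'Review' column and an assumed 'Sentiment' label column):

```python
from collections import Counter

# Class balance of the sentiment labels.
print(df['Sentiment'].value_counts())

# Distribution of review lengths, a quick check for very short or empty reviews.
print(df['Review'].str.split().str.len().describe())

# Most frequent words in a sample of the corpus, to spot dominant terms
# and potential biases (sampling keeps memory use modest).
word_counts = Counter(' '.join(df['Review'].astype(str).head(10000)).split())
print(word_counts.most_common(20))
```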
Data Preprocessing or Preparation for Sentiment Analysis
This includes dropping columns that are not needed for the analysis.
Encoding of categorical review labels to numerical values.
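A minimal sketch of these two steps; the retained column names and the label mapping are assumptions, not taken from the report:

```python
# Keep only the columns needed for sentiment analysis.
df = df[['Review', 'Sentiment']]

# Encode the categorical labels as numbers: negative -> 0, positive -> 1.
df['Sentiment'] = df['Sentiment'].map({'negative': 0, 'positive': 1})
```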
Text Preprocessing for Sentiment Analysis
This imports the `spacy` and `nltk` libraries for natural language processing tasks, and also imports the `re` library for regular expression operations. It then runs a command to download the `en_core_web_sm` model, a small English language model for spaCy, and loads this model into the `nlp` variable for further text processing tasks.
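A sketch of the setup described above:

```python
import re
import nltk
import spacy

# Download the small English pipeline (in a notebook this is often run as
# `!python -m spacy download en_core_web_sm`).
spacy.cli.download("en_core_web_sm")

# Load the model into `nlp` for later tokenization and lemmatization.
nlp = spacy.load("en_core_web_sm")
```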
Text cleaning
This cleans the text data in the 'Review' column of the dataframe by:
1. Removing special characters, keeping only words (alphanumeric characters) and spaces.
2. Removing numbers from the text.
3. Removing punctuation, though this step is redundant since special characters (including punctuation) are already removed in the first step.
4. Replacing multiple consecutive white spaces with a single space, to ensure text consistency.
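One way to express these steps with pandas string methods; the exact patterns are assumptions based on the description above:

```python
# Drop anything that is not a word character or space (this also removes
# punctuation), strip digits, then collapse runs of whitespace.
df['Review'] = (
    df['Review']
    .str.replace(r'[^\w\s]', ' ', regex=True)  # special characters / punctuation
    .str.replace(r'\d+', ' ', regex=True)      # numbers
    .str.replace(r'\s+', ' ', regex=True)      # multiple spaces -> single space
    .str.strip()
)
```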
Removing Stopwords
This removes stopwords from the 'Review' column of a dataframe using the `spacy` library. It iterates over each review, splits it into individual words, filters out the words that are in the predefined list of stopwords, and then joins the remaining words back into a single string.
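A sketch of this step using spaCy's built-in stopword list:

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Split each review, drop stopwords, and rejoin the remaining words.
df['Review'] = df['Review'].apply(
    lambda text: ' '.join(w for w in text.split() if w.lower() not in STOP_WORDS)
)
```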
Tokenization of Text Reviews
This tokenizes the text data in the 'Review' column of a dataframe by splitting each review into individual words (tokens) and then joining them back together into a single string. Essentially, this process transforms each review by separating and then immediately recombining the words, effectively leaving the text unchanged. This action might be intended to clean up whitespace or ensure consistent spacing between words, but as presented, it does not alter the textual content beyond potentially normalizing whitespace.
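In code, the split-and-rejoin amounts to a one-liner:

```python
# Split each review into tokens and immediately rejoin them; as noted above,
# this leaves the content unchanged apart from normalizing whitespace.
df['Review'] = df['Review'].apply(lambda text: ' '.join(text.split()))
```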
Lemmatization
This performs lemmatization on the 'Review' column of a dataframe. It processes each review using the spaCy NLP model loaded into the `nlp` variable, identifies the base form (lemma) of each word in the review, and then combines these lemmatized words back into a single string. Thus, it transforms each review by replacing words with their root form, standardizing variations of a word to its core meaning.
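The report does not show the exact loop; a common pattern uses `nlp.pipe` to process the reviews in batches:

```python
# Map each word to its lemma with the loaded spaCy pipeline. nlp.pipe
# processes texts in batches, which is far faster than calling nlp() per row.
df['Review'] = [
    ' '.join(token.lemma_ for token in doc)
    for doc in nlp.pipe(df['Review'].astype(str), batch_size=1000)
]
```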
Feature Engineering or Extraction
Converting processed review text to numeric features
This converts the processed review text into numerical values using the Term Frequency-Inverse Document Frequency (TF-IDF) method. It imports the `TfidfVectorizer` class from `sklearn.feature_extraction.text`, initializes an instance of `TfidfVectorizer`, and then fits the vectorizer to the 'Review' column of the dataframe `df` and transforms that column in one step. This produces a TF-IDF matrix named `reviews` in which each row corresponds to a document (here, a review) and each column represents a unique word in the text corpus, with values that quantify the importance of each word in each document.
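A sketch of this step; displaying the resulting matrix produces the summary interpreted below:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights, then transform the reviews into
# a sparse TF-IDF matrix in one step.
vectorizer = TfidfVectorizer()
reviews = vectorizer.fit_transform(df['Review'])
print(repr(reviews))  # prints the sparse-matrix summary discussed below
```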
This output provides a summary of a sparse matrix. Here's what each part means:
- **1031476x69920 sparse matrix of type 'numpy.float64':** The matrix has 1,031,476 rows and 69,920 columns. It is sparse, meaning most of its values are zero and only the non-zero values are stored. The dtype is `numpy.float64`, so the stored values are floating-point numbers.
- **with 8923827 stored elements:** Of the total number of elements in the matrix (1,031,476 × 69,920), only 8,923,827 are non-zero and therefore stored.
- **in Compressed Sparse Row format:** This indicates the storage format. Compressed Sparse Row (CSR) is a commonly used format for storing sparse matrices efficiently: it saves the values of the non-zero elements along with their row and column indices instead of the whole matrix.

In natural language processing, such matrices typically represent a collection of text documents transformed into numerical values using techniques like TF-IDF, where each row corresponds to a document and each column represents a unique word in the corpus. Sparse matrices are a memory-efficient way to handle such data: text corpora often contain a large number of unique words (high dimensionality), but each individual document contains only a small subset of them, hence the sparsity.
Model Training-Naive Bayes
Splitting data into Training & Testing Sets
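The report does not state the split ratio or random seed; a typical sketch, with both as assumptions:

```python
from sklearn.model_selection import train_test_split

# Hold out part of the data for testing; an 80/20 split is a common default.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, df['Sentiment'], test_size=0.2, random_state=42
)
```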
Naive Bayes Model Training
This imports the `MultinomialNB` class from `sklearn.naive_bayes`, initializes an instance of `MultinomialNB`, and fits this model to the training data `X_train` and `y_train`. The purpose is to train a Naive Bayes classifier on the given training dataset.
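A sketch matching that description:

```python
from sklearn.naive_bayes import MultinomialNB

# Fit a multinomial Naive Bayes classifier on the TF-IDF training features.
clf = MultinomialNB()
clf.fit(X_train, y_train)
```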
Model Evaluation
Evaluation on training dataset
This evaluates the performance of a trained Naive Bayes classifier on the training dataset by calculating its accuracy. It uses the `accuracy_score` function from `sklearn.metrics` to compare the classifier's predictions (`y_pred`) on the training data (`X_train`) with the true labels (`y_train`). The `predict` method of the classifier (`clf`) is used to generate predictions for each sample in the training set, and then the accuracy score is computed as the proportion of correct predictions.
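A sketch of this evaluation:

```python
from sklearn.metrics import accuracy_score

# Accuracy on the data the model was trained on (reported as ~93.0%).
y_pred = clf.predict(X_train)
print(accuracy_score(y_train, y_pred))
```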
Evaluation on Testing dataset
This evaluates the performance of a trained classifier on the testing dataset by calculating its accuracy. It imports the `accuracy_score` function to compute the proportion of correct predictions made by the classifier (identified as `clf`) on the testing data (`X_test`) against the true labels (`y_test`). The classifier's `predict` method generates predictions (`y_pred`) for each sample in the testing set, and the accuracy score is then derived from these predictions.
This imports the `confusion_matrix` function from `sklearn.metrics` and applies it to summarize the performance of the trained model by calculating its confusion matrix from the predictions (`y_pred`) and the actual labels (`y_test`) of the test data. It then prints the confusion matrix, providing a summary of the prediction results in which each row represents the instances of an actual class and each column represents the instances of a predicted class.
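A sketch covering both the test-set accuracy and the confusion matrix:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Accuracy on the held-out test set (reported as 92.7%).
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Confusion matrix: rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
```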
Model Deployment and Future Use
Saving the final trained model for future use on new data
This imports the `joblib` library and then uses its `dump` function to save the trained model `clf` to a file named `'sentiment_model.pkl'`.
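A sketch of the save step; the reload line and the note about the vectorizer are additions, not from the report:

```python
import joblib

# Persist the trained classifier to disk for later reuse.
joblib.dump(clf, 'sentiment_model.pkl')

# Later, the model can be restored with:
#   clf = joblib.load('sentiment_model.pkl')
# Note: to score raw text, the fitted TfidfVectorizer would also need to be
# saved, since new reviews must be transformed with the same vocabulary.
```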
Conclusion
Throughout the task, a dataset containing hotel reviews was analyzed. The analysis involved several steps:
- Loading and examining the structure of the dataset.
- Cleaning the review text data by removing special characters, numbers, multiple spaces, and stopwords, performing tokenization, and lemmatizing the words.
- Converting the cleaned review text into numerical format using the TF-IDF method.
- Splitting the data into a training set and a testing set.
- Training and evaluating a Naive Bayes classifier on the training data.
- Testing the classifier's performance on the testing data.
- Saving the trained model for future use.

Following these steps allowed the creation of a model that can predict the sentiment of a review (whether it is positive or negative) based on its content. The performance of this model was evaluated using accuracy as a metric, and it was found to perform satisfactorily on both the training and testing data, indicating its potential effectiveness in real-world sentiment analysis tasks.
Interpretation of model performance
The performance of the model was evaluated using accuracy on both the training and testing datasets. On the training dataset, the model yielded an accuracy of approximately 93.0%; on the unseen testing data, it achieved a similar accuracy of approximately 92.7%. This indicates that the model has learned and generalized well from the training data, performing nearly as well on new, unseen data. It suggests the model is a good fit for the data: it is neither overfitting (learning the training data so well that it performs poorly on new data) nor underfitting (not learning enough from the training data to make accurate predictions).

However, while accuracy is a useful metric, it is not the only factor to consider when evaluating model performance. It may be beneficial to examine other aspects such as precision, recall, or the F1 score, particularly when the dataset is imbalanced.

Looking at the confusion matrix, which shows the numbers of true positive, true negative, false positive, and false negative predictions, the model produces large numbers of both true positives and true negatives, suggesting it is effective at predicting both classes. In summary, the Naive Bayes classifier provides a good fit for this text sentiment analysis task.