Enhancing Guest Experiences through Sentiment Analysis of European Hotel Reviews
Introduction
"Enhancing Guest Experiences through Sentiment Analysis of European Hotel Reviews" was initiated with the goal of leveraging machine learning to analyze customer feedback from hotel reviews. In an industry where guest satisfaction is paramount, understanding the sentiment behind customer reviews is crucial for making informed business decisions. This project involved the integration of a large-scale hotel review dataset stored on Google Drive, comprehensive data preparation, and the development of a sentiment analysis model using a Naive Bayes classifier. The model was designed to accurately predict whether reviews were positive or negative, ultimately aiding hotel management in identifying strengths and addressing areas needing improvement. With an achieved accuracy of 92.7% on the testing set, the project demonstrates the potential for data-driven enhancements in guest experiences.
Problem Statement
The project aimed to enhance the guest experience by accurately predicting the sentiment of hotel reviews. This sentiment analysis is crucial for hotel management to understand customer feedback, identify areas of improvement, and make data-driven decisions to boost customer satisfaction.
Achievements
- **Data Integration:** Successfully integrated Google Drive for data storage and processing within the environment.
- **Data Preparation:** Efficiently separated and merged datasets to prepare a comprehensive dataset for analysis.
- **Model Development:** Built and trained a Naive Bayes sentiment analysis model, achieving an accuracy of 92.7% on the test set.
- **Model Deployment:** Deployed the model, enabling it to predict sentiment polarity for future hotel reviews and aid in improving guest experiences.
Data Acquisition and Preparation
The hotel reviews dataset stored on Google Drive is accessed by integrating Google Drive file storage into the environment; mounting the drive makes the dataset's storage directory available. Formatting the dataset involved separating it into two datasets (positive and negative reviews) that were later merged vertically into a single dataset. The modified dataset is then saved back to Google Drive storage.
Mounting Google Drive
Loading the review dataset
Studying the structure of the dataset and separating it
Merging the positive and negative review datasets vertically
Saving the merged dataset to Google Drive
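The report does not include the code for these steps. A minimal sketch is shown below, assuming a Google Colab environment; the file paths and the column names `Positive_Review`, `Negative_Review`, and `Sentiment` are illustrative, not taken from the report.

```python
import pandas as pd
from google.colab import drive

# Mount Google Drive so the dataset directory is visible in the environment.
drive.mount('/content/drive')

# Load the raw review dataset (path is illustrative).
raw = pd.read_csv('/content/drive/MyDrive/Hotel_Reviews.csv')

# Separate the data into positive and negative review frames.
pos = pd.DataFrame({'Review': raw['Positive_Review'], 'Sentiment': 'positive'})
neg = pd.DataFrame({'Review': raw['Negative_Review'], 'Sentiment': 'negative'})

# Merge the two frames vertically into a single dataset and save it to Drive.
df = pd.concat([pos, neg], ignore_index=True)
df.to_csv('/content/drive/MyDrive/merged_hotel_reviews.csv', index=False)
```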
Exploratory Data Analysis (EDA)
We conduct exploratory analysis on the dataset to understand its structure, distribution, and characteristics; visualize key features such as review sentiments, word frequency distributions, and any patterns in the data; and identify potential challenges or biases in the dataset.
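A few typical checks, sketched under the same assumptions as above (a merged dataframe `df` with a 'Review' column and an assumed 'Sentiment' label column):

```python
from collections import Counter

# Class balance of the sentiment labels.
print(df['Sentiment'].value_counts())

# Distribution of review lengths, a quick check for very short or empty reviews.
print(df['Review'].str.split().str.len().describe())

# Most frequent words in a sample of the corpus, to spot dominant terms
# and potential biases (sampling keeps memory use modest).
word_counts = Counter(' '.join(df['Review'].astype(str).head(10000)).split())
print(word_counts.most_common(20))
```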
Data Preprocessing or Preparation for Sentiment Analysis
This includes dropping columns that are not needed for the analysis.
Encoding of categorical review labels to numerical values.
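A minimal sketch of these two steps; the retained column names and the label mapping are assumptions, not taken from the report:

```python
# Keep only the columns needed for sentiment analysis.
df = df[['Review', 'Sentiment']]

# Encode the categorical labels as numbers: negative -> 0, positive -> 1.
df['Sentiment'] = df['Sentiment'].map({'negative': 0, 'positive': 1})
```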
Text Preprocessing for Sentiment Analysis
This imports the `spacy` and `nltk` libraries for natural language processing tasks, and also imports the `re` library for regular expression operations. It then runs a command to download the `en_core_web_sm` model, a small English language model for spaCy, and loads this model into the `nlp` variable for further text processing tasks.
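A sketch of the setup described above:

```python
import re
import nltk
import spacy

# Download the small English pipeline (in a notebook this is often run as
# `!python -m spacy download en_core_web_sm`).
spacy.cli.download("en_core_web_sm")

# Load the model into `nlp` for later tokenization and lemmatization.
nlp = spacy.load("en_core_web_sm")
```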
Text cleaning
This cleans the text data in the 'Review' column of the dataframe by:
1. Removing special characters, keeping only words (alphanumeric characters) and spaces.
2. Removing numbers from the text.
3. Removing punctuation, though this step is redundant since special characters (including punctuation) are already removed in the first step.
4. Replacing multiple consecutive white spaces with a single space, to ensure text consistency.
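One way to express these steps with pandas string methods; the exact patterns are assumptions based on the description above:

```python
# Drop anything that is not a word character or space (this also removes
# punctuation), strip digits, then collapse runs of whitespace.
df['Review'] = (
    df['Review']
    .str.replace(r'[^\w\s]', ' ', regex=True)  # special characters / punctuation
    .str.replace(r'\d+', ' ', regex=True)      # numbers
    .str.replace(r'\s+', ' ', regex=True)      # multiple spaces -> single space
    .str.strip()
)
```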
Removing Stopwords
This removes stopwords from the 'Review' column of a dataframe using the `spacy` library. It iterates over each review, splits it into individual words, filters out the words that are in the predefined list of stopwords, and then joins the remaining words back into a single string.
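A sketch of this step using spaCy's built-in stopword list:

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Split each review, drop stopwords, and rejoin the remaining words.
df['Review'] = df['Review'].apply(
    lambda text: ' '.join(w for w in text.split() if w.lower() not in STOP_WORDS)
)
```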
Tokenization of Text Reviews
This tokenizes the text data in the 'Review' column of a dataframe by splitting each review into individual words (tokens) and then joining them back together into a single string. Essentially, this process transforms each review by separating and then immediately recombining the words, effectively leaving the text unchanged. This action might be intended to clean up whitespace or ensure consistent spacing between words, but as presented, it does not alter the textual content beyond potentially normalizing whitespace.
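In code, the split-and-rejoin amounts to a one-liner:

```python
# Split each review into tokens and immediately rejoin them; as noted above,
# this leaves the content unchanged apart from normalizing whitespace.
df['Review'] = df['Review'].apply(lambda text: ' '.join(text.split()))
```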
Lemmatization
This performs lemmatization on the 'Review' column of a dataframe. It processes each review using the spaCy NLP model loaded into the `nlp` variable, identifies the base form (lemma) of each word in the review, and then combines these lemmatized words back into a single string. Thus, it transforms each review by replacing words with their root form, standardizing variations of a word to its core meaning.
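The report does not show the exact loop; a common pattern uses `nlp.pipe` to process the reviews in batches:

```python
# Map each word to its lemma with the loaded spaCy pipeline. nlp.pipe
# processes texts in batches, which is far faster than calling nlp() per row.
df['Review'] = [
    ' '.join(token.lemma_ for token in doc)
    for doc in nlp.pipe(df['Review'].astype(str), batch_size=1000)
]
```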
Feature Engineering or Extraction
Converting processed review text to numeric features
This converts the processed review text into numerical values using the Term Frequency-Inverse Document Frequency (TF-IDF) method. It imports the `TfidfVectorizer` class from `sklearn.feature_extraction.text`, initializes an instance of `TfidfVectorizer`, and then fits the vectorizer to the 'Review' column of the dataframe `df` and transforms that column in one step. This produces a TF-IDF matrix named `reviews` in which each row corresponds to a document (here, a review) and each column represents a unique word in the text corpus, with values that quantify the importance of each word in each document.
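A sketch of this step; displaying the resulting matrix produces the summary interpreted below:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights, then transform the reviews into
# a sparse TF-IDF matrix in one step.
vectorizer = TfidfVectorizer()
reviews = vectorizer.fit_transform(df['Review'])
print(repr(reviews))  # prints the sparse-matrix summary discussed below
```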
This output provides a summary of a sparse matrix. Here's what each part means:
- **1031476x69920 sparse matrix of type 'numpy.float64':** The matrix has 1,031,476 rows and 69,920 columns. It is sparse, meaning most of its values are zero and only the non-zero values are stored. The dtype is `numpy.float64`, so the stored values are floating-point numbers.
- **with 8923827 stored elements:** Of the total number of elements in the matrix (1,031,476 × 69,920), only 8,923,827 are non-zero and therefore stored.
- **in Compressed Sparse Row format:** This indicates the storage format. Compressed Sparse Row (CSR) is a commonly used format for storing sparse matrices efficiently: it saves the values of the non-zero elements along with their row and column indices instead of the whole matrix.

In natural language processing, such matrices typically represent a collection of text documents transformed into numerical values using techniques like TF-IDF, where each row corresponds to a document and each column represents a unique word in the corpus. Sparse matrices are a memory-efficient way to handle such data: text corpora often contain a large number of unique words (high dimensionality), but each individual document contains only a small subset of them, hence the sparsity.
Model Training-Naive Bayes
Splitting data into Training & Testing Sets
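The report does not state the split ratio or random seed; a typical sketch, with both as assumptions:

```python
from sklearn.model_selection import train_test_split

# Hold out part of the data for testing; an 80/20 split is a common default.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, df['Sentiment'], test_size=0.2, random_state=42
)
```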
Naive Bayes Model Training
This imports the `MultinomialNB` class from `sklearn.naive_bayes`, initializes an instance of `MultinomialNB`, and fits this model to the training data `X_train` and `y_train`. The purpose is to train a Naive Bayes classifier on the given training dataset.
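A sketch matching that description:

```python
from sklearn.naive_bayes import MultinomialNB

# Fit a multinomial Naive Bayes classifier on the TF-IDF training features.
clf = MultinomialNB()
clf.fit(X_train, y_train)
```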
Model Evaluation
Evaluation on training dataset
This evaluates the performance of a trained Naive Bayes classifier on the training dataset by calculating its accuracy. It uses the `accuracy_score` function from `sklearn.metrics` to compare the classifier's predictions (`y_pred`) on the training data (`X_train`) with the true labels (`y_train`). The `predict` method of the classifier (`clf`) is used to generate predictions for each sample in the training set, and then the accuracy score is computed as the proportion of correct predictions.
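A sketch of this evaluation:

```python
from sklearn.metrics import accuracy_score

# Accuracy on the data the model was trained on (reported as ~93.0%).
y_pred = clf.predict(X_train)
print(accuracy_score(y_train, y_pred))
```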
Evaluation on Testing dataset
This evaluates the performance of a trained classifier on the testing dataset by calculating its accuracy. It imports the `accuracy_score` function to compute the proportion of correct predictions made by the classifier (identified as `clf`) on the testing data (`X_test`) against the true labels (`y_test`). The classifier's `predict` method generates predictions (`y_pred`) for each sample in the testing set, and the accuracy score is then derived from these predictions.
This imports the `confusion_matrix` function from `sklearn.metrics` and applies it to summarize the performance of the trained model by calculating its confusion matrix from the predictions (`y_pred`) and the actual labels (`y_test`) of the test data. It then prints the confusion matrix, providing a summary of the prediction results in which each row represents the instances of an actual class and each column represents the instances of a predicted class.
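A sketch covering both the test-set accuracy and the confusion matrix:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Accuracy on the held-out test set (reported as 92.7%).
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Confusion matrix: rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
```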
Model Deployment and Future Use
Saving the final trained model for future use on new data
This imports the `joblib` library and then uses its `dump` function to save the trained model `clf` to a file named `'sentiment_model.pkl'`.
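A sketch of the save step; the reload line and the note about the vectorizer are additions, not from the report:

```python
import joblib

# Persist the trained classifier to disk for later reuse.
joblib.dump(clf, 'sentiment_model.pkl')

# Later, the model can be restored with:
#   clf = joblib.load('sentiment_model.pkl')
# Note: to score raw text, the fitted TfidfVectorizer would also need to be
# saved, since new reviews must be transformed with the same vocabulary.
```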
Conclusion
Throughout the task, a dataset containing hotel reviews was analyzed. The analysis involved several steps:
- Loading and examining the structure of the dataset.
- Cleaning the review text data by removing special characters, numbers, multiple spaces, and stopwords, performing tokenization, and lemmatizing the words.
- Converting the cleaned review text into numerical format using the TF-IDF method.
- Splitting the data into a training set and a testing set.
- Training and evaluating a Naive Bayes classifier on the training data.
- Testing the classifier's performance on the testing data.
- Saving the trained model for future use.

Following these steps allowed the creation of a model that can predict the sentiment of a review (whether it is positive or negative) based on its content. The performance of this model was evaluated using accuracy as a metric, and it was found to perform satisfactorily on both the training and testing data, indicating its potential effectiveness in real-world sentiment analysis tasks.
Interpretation of model performance
The performance of the model was evaluated using accuracy on both the training and testing datasets. On the training dataset, the model yielded an accuracy of approximately 93.0%; on the unseen testing data, it achieved a similar accuracy of approximately 92.7%. This indicates that the model has learned and generalized well from the training data, performing nearly as well on new, unseen data. It suggests the model is a good fit for the data: it is neither overfitting (learning the training data so well that it performs poorly on new data) nor underfitting (not learning enough from the training data to make accurate predictions).

However, while accuracy is a useful metric, it is not the only factor to consider when evaluating model performance. It may be beneficial to examine other aspects such as precision, recall, or the F1 score, particularly when the dataset is imbalanced.

Looking at the confusion matrix, which shows the numbers of true positive, true negative, false positive, and false negative predictions, the model produces large numbers of both true positives and true negatives, suggesting it is effective at predicting both classes. In summary, the Naive Bayes classifier provides a good fit for this text sentiment analysis task.