import calendar from collections import Counter import enchant import h3 import matplotlib.pyplot as plt import nltk from nltk.corpus import stopwords from nltk.stem import WordNetLemmatizer import numpy as np import os import pandas as pd import plotly.express as px import plotly.figure_factory as ff import plotly.graph_objects as go from plotly.offline import init_notebook_mode, iplot import random import re import seaborn as sns from sklearn import metrics from sklearn.cluster import DBSCAN from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer from sklearn.metrics import accuracy_score, confusion_matrix, classification_report from sklearn.model_selection import train_test_split import textblob import sys from textblob.sentiments import NaiveBayesAnalyzer from textblob import TextBlob, Word from wordcloud import STOPWORDS, WordCloud import xgboost as xgb from xgboost import XGBRegressor import warnings warnings.filterwarnings('ignore')

Introduction.

As a freelancer and Michelin trained chef for around 25 years, I worked in over 200 different hotels, restaurants, and private family residences such as castles, yachts and mansions, so it's fair to say I have seen my share of hotels of all sizes and qualities (for my sins). Admittedly I was never a huge fan of review sites because I know that people can and will leave negative reviews for things that are out of a stakeholder's control, but I am also interested in human nature and curious about the motivations behind customers who leave reviews of a certain polarity as well as being appreciative of the more accurate, constructive criticism some of the review sites provide amid the rough. I will probably concentrate more on the negative reviews in a bid to uproot issues, or at least explore possible business optimisation opportunities via the medium of NLP.

Thanks for reading, and please note that this isn't a professional project, this is me truffling through data for fun when I was learning NLP, as well as investigating for no other reason than to see what I can come up with. This means there will be little in the way of structure, and probably even less in the way of professionalism & common sense.

nltk.download('averaged_perceptron_tagger') nltk.download('punkt') nltk.download('stopwords') nltk.download('vader_lexicon') nltk.download('wordnet')

df = pd.read_csv("/work/Datafiniti_Hotel_Reviews_Jun19.csv")

The data.

print(f'Dataframe size: {round(sys.getsizeof(df) / (1024 * 1024), 1)} MB. ')

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

A quick peek.

A quick look at the dataframe shows a whole bunch of columns I want to drop. The URLS could come in handy for other projects but for this one I would rather stay on topic and drop them from this project. My first thought was that I could perhaps take the company name from the URL to use in this project but some URLs are full of utter chaff that will result in way too much work for a fun side project

All of the column labels.

1000 missing features exist in a column I would rather not keep anyway, as well as 2 missing in another column I initially didn't want to keep. There is only one value missing in the 'title' column.

df = df.drop(['id', 'address', 'country', 'dateAdded', 'dateUpdated', 'categories', 'primaryCategories', 'keys', 'reviews.dateAdded', 'reviews.dateSeen', 'reviews.sourceURLs', 'reviews.userProvince', 'sourceURLs', 'websites'], axis=1)

Cleaning the column names - changing camelCase because this isn't JS :-) and getting rid of full stops:

df.columns = [col.replace('.', '_').replace(' ', '_').lower() for col in df.columns]

Statistical description.

The mean review score across the dataset isn't too bad, sitting at 4.084 I dare say there isn't going to be a massive amount of negative reviews.

Pulling the review with the empty title column.

df[df['reviews_title'].isnull() & df['reviews_text'].notnull()][['reviews_title', 'reviews_text']]

As it's a positive review I will fill in the title with something positive, containing similar generic words relative to the review text.

df.reviews_title.fillna('Excellent service and staff, would stay again.', inplace=True)

Date and time cleaning.

df['reviews_date'] = df['reviews_date'].str.split('T').str[0] df['reviews_date'] = pd.to_datetime(df['reviews_date'])

df['reviews_year'] = df['reviews_date'].dt.year df['reviews_month'] = df['reviews_date'].dt.month df['reviews_day'] = df['reviews_date'].dt.day_name() df['reviews_quarter'] = df['reviews_date'].dt.quarter

df["reviews_month"] = df["reviews_month"].apply(lambda x: calendar.month_abbr[x])

df.insert(11, 'reviews_date', df.pop('reviews_date'))

Categorical description.

I saved the categorical description until after the datetime cleaning so that those metrics would show up in better detail.

• The Hyatt House Seattle is the most frequently occurring hotel name.

• San Diego is the top city.

• CA is the top province.

• New York is the top user city.

• 'Great Location' is the top review title.

• Monday is the most frequent day for writing reviews.

• July is the most frequent month.

• And 'Michael M' is our most frequent intrepid explorer.

Analysis.

First up, a hotel review map box with the hotel /review rating as the marker colour (5* reviews as the light yellow hue, 1* reviews the darker purple hue), and the review title on popup.

Provinces with the highest count of hotels.

California tips the scale for the hotel count by province with a respectable 2647 hotels, over double that of Florida's 1277 hotels in second place, then Georgia holds the third-largest volume of data with 844 hotels:

Rating distribution.

• The most common review rating is 5, with a count of 4840.

• Then we see rating 4 with its count of 2849.

• Rating 3 is third with a count of 1190.

• Rating 1 is the fourth-most common with a count of 567.

• And finally rating 2 comes in last with a count of 554.

It's actually quite rare for people to leave 2* reviews, people are either likely to cut a hotel a bit of slack or dunk on it completely (if it isn't positive then it's either neutral (3*) or negative (1*)).

Review distribution by weekday.

• Monday is the most common day for leaving reviews, then it's Tuesday, Wednesday, Sunday, Thursday, Friday and Saturday in that order:

Review distribution by month.

• July is the most common month for reviews, followed swiftly by May and August.

So that's the two peak summer months and then May, the first month of the year that sees anything like decent weather as well as being a month that includes a busy half-term.

The highest rated hotels.

• The Hyatt House in Seattle and the Hotel Emma are the two hotels with the highest average rating as seen in blue.

• The three highest rated hotels in-tow are the French Market Inn, the Grand Hyatt Seattle and the Drury Inn & Suites (New Orleans).

The highest average rated provinces.

The 20 provinces with the highest average ratings. The top three provinces here are New Mexico, Arkansas and Utah.

The lowest average rated provinces.

And the 20 provinces with the lowest average rating.

The bottom three provinces here are Rhode Island, New Jersey and Kentucky.

This is quite unfair to RI in all honesty, their value count is one single review, which was a 1* review. NJ has 50 value counts and KY has 60 value counts. MS has 17 value counts, so it may be worthwhile to remove RI from the top (or bottom) 3 and include MS instead:

The top twenty "super-reviewers".

A starburst chart of the top 20 users leaving the most reviews along with the name of the hotels they reviewed, plus the ratings they assigned to those hotels (click on the username in the inner sunburst to see more of that user's reviews).

The main reviewer locations / user cities and their average ratings.

The highest average reviews were left by people living in Austin, Plainview, Enfield, Lander and Taylorville, to name but a handful of reviewers in the top 30 or so.

The top four reviewer locations / user cities.

And a count of the values of reviews by user city shows the majority of reviewers coming from NY City, LA, Chicago and Houston.

Top provinces visited for reviewers living in top user city 1 - NYC.

• California takes the lion's share of data at 32%.

• Florida is the second-most common destination with 14%.

• That is followed by Louisiana with 10%.

• And then we see Philly in fourth with 9%.

A list of the hotels reviewed by these customers include Philly Airport's Fair

field Inn, the Mandarin at Miami, the Hotel Diva and the Annex at the Chelsea.

Top provinces visited for reviewers living in top user city 2 - LA.

• For reviewers living in L.A. we see a much greater percentage of data for California as the top holiday destination at 59%.

• That is followed by Washington State at 11.5%.

• Florida at 5%.

• And Louisiana at around 4%.

A list of the hotels reviewed by travellers living in LA includes the Pearl Hotel, the Estancia La Jolla, Galleria Park Hotel and the Best Western Seven Seas.

Top provinces visited for reviewers living in top user city 3 - Chicago.

More on the staycation front here with reviewers living in Chicago, where we see:

• The majority (24%) staying in Illinois for their holidays.

• California is the second-most common destination at 17.5%.

• Washington State is in third place at 12%.

• And Florida is fourth at 9%.

Chicago residents' destination hotels consist of the Conrad Chicago, the Hyatt Place Chicago, the Grand Hyatt Seattle and the Kimpton Hotel Allegro:

Top provinces visited for reviewers living in top user city 4 - Houston.

A running theme with American residents, but I can't say I blame them living in such a large and varied / beautiful country! Like Californians and Illinois... ians (?!):

• Texans mostly seem to stay in Texas for their holidays to the tune of 35%.

• That is followed by Louisiana at 20%.

• Cali is then the third-most common destination at 9%.

• And finally Georgia takes fourth place at a figure of 9%.

Houston residents' destination hotel list primarily consists of the Hotel Emma, the Hyatt House Downtown, the St. james and the Best Western Plus French Quarter.

Review counts.

Sorting the review counts per hotel, grouping and visualising the hotels with more than 20 reviews.

Top hotels by review distribution.

• As seen in the categorical description, the Hyatt House Seattle/Downtown is the #1 most reviewed hotel.

• That is followed by the French Market Inn (with almost 10K less reviews as the Hyatt House) and the Grand Hyatt Seattle.

NLP.

Average review length.

The average review length by day sees customers leaving shorter reviews on a Monday, and longer reviews on a Friday.

Maybe the reviewers were in a rush to drop a quick, positive review on the first day of a week-long holiday which would be a Monday, leaving longer reviews on a Friday on their way home or once they arrive home? After giving it plenty of thought, nothing quite gels, and this is bugging me mildly.

Investigating the most common hotels for reviews written on Monday, Hotel Emma is the most common entry.

I checked the reviews for Hotel Emma online and found they were actually pretty short in content. Most reviewers seemed to be somewhat speechless with the experience, so i'm wondering if people write short reviews when the experience is positive? One would assume that there'd be more to write about if there were a host of complaints...

Well I'll be a monkey's uncle, people do leave longer reviews when there are things to complain about. Which is natural and pretty obvious if I think about it, but I wouldn't have even considered it until I saw the reviews by day and dug deeper on a hunch. Apparently the average length of 1* reviews is 36% longer than that of 5* reviews! So an important nugget of information for hoteliers here: All you have to do to know if your hotel makes the grade is count the average review length.

End of project in one line of code.

for x in range(1,6): print(f'Average review length for {x}* reviews: {round(df[df.reviews_rating == x].word_count.mean())} chars. ')

Exclamation marks.

Creating a new column for features with 3 or more exclamation marks, pre-cleaning, to be examined after the polarity analysis.

def exclamation(text): if text.count('!') >= 3: return 1 else: return 0 df['exclamation'] = df['reviews_text'].apply(exclamation)

Data preprocessing.

df['reviews_text'] = df['reviews_text'].apply(lambda x: re.sub(r'(?<=[.,])(?=[^\s])', r' ', x)) # adding whitespace to the full stops in case some reviewers didn't. df['reviews_text'] = df['reviews_text'].apply(lambda x: re.sub('[^a-zA-Z\s]', '', x.lower())) df['reviews_text'] = df['reviews_text'].apply(lambda x: re.sub(r'\n', '', x)) df['reviews_text'] = df['reviews_text'].apply(lambda x: re.sub(r'http\S+', '', x))

I would like to analyse some top keywords / phrases. Instead of truffling through the dataframe i'll go with bigram & trigram analysis to see if bigrams such 'top floor' and 'car park' exist for negativity.

Bigram and trigram functions.

def get_bigram(string): review_string = string.lower() review_string = string.replace(' ', '_') bigram = df.loc[df['name'] == string, 'reviews_text'].values bigram = bigram[0] bigram = bigram.split() bigram = list(nltk.bigrams(bigram)) return random.sample(bigram, 20)

def get_trigram(string): review_string = string.lower() review_string = string.replace(' ', '_') trigram = df.loc[df['name'] == string, 'reviews_text'].values trigram = trigram[0] trigram = trigram.split() trigram = list(nltk.trigrams(trigram)) return random.sample(trigram, 20)

Polarity functions.

def get_polarity(text): return TextBlob(text).sentiment.polarity df['polarity'] = df['reviews_text'].apply(get_polarity)

def get_analysis(score): if score < 0: return 'Negative' elif score == 0: return 'Neutral' else: return 'Positive' df['analysis'] = df['polarity'].apply(get_analysis)

Visualisation functions.

def visualise_positive_sentiment(df, word): df['polarity'] = df['reviews_text'].apply(get_polarity) df['analysis'] = df['polarity'].apply(get_analysis) df['count'] = df['reviews_text'].str.count(word) df = df[df['count'] > 0] df = df[df['analysis'] == 'Positive'] fig = px.histogram(df, x = 'polarity', title = f"The positive sentiment associated with the word '{word}' in df.reviews_text", color_discrete_sequence = ['darkkhaki']) fig.show()

def visualise_negative_sentiment(df, word): df['polarity'] = df['reviews_text'].apply(get_polarity) df['analysis'] = df['polarity'].apply(get_analysis) df['count'] = df['reviews_text'].str.count(word) df = df[df['count'] > 0] df = df[df['analysis'] == 'Negative'] fig = px.histogram(df, x = 'polarity', title = f"The negative sentiment associated with the word '{word}' in df.reviews_text", color_discrete_sequence = ['darkkhaki']) fig.show()

Stop-word removal.

stop = stopwords.words('english') df['reviews_text'] = df['reviews_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

Back to the analysis.

With my polarity analysis set only to pos, neg and neutral, the odd 'negative' review may be classed as positive so this won't be 100% accurate, but it will be close.

Looking at the polarity for the posts containing three or more exclamation marks now, overall it's looking just ok. Not overly positive but certainly not negative. Possibly some seriously positive comments vs. some not so great ones evening things out into the pos < neutral scale.

df = df.drop(['exclamation_polarity'], axis=1)

To be on the safe side (and to save me a ton of code, like, "for x in [polarity column] where reviews_rating == 5(just in case): write a ton of code here"), I thought I would better check the average polarity scores for the reviews rating! There will be low-scoring reviews with decent polarity here and there but as long as the averages look good then I can say with a degree of certainty that my NLP analysis is as good as it would be if I were looking at review rating scores.

Hotels by polarity score.

1: Positive.

• The top 20 hotels with the highest average polarity score shows Roseberry's Inn holding the most positives on average, warranted, as it looks like a lovely place to stay.

• Gordon beach Inn sits at second-most positive on average, then the Lodge at Lolo Hot Springs:

2: Negative.

• The Knights Inn holds the two highest values for negative polarity.

• The Super 8 Dubois comes in at an equally, but only mildly disappointing third place.

Pos / neg polarity distribution by province.

It seems like the negative sentiment isn't too far out of line in most areas, growing in conjunction with the amount of positive reviews / reviewer counts with the odd outlier here and there, such as Louisiana (looking like a good state to travel to with only 2% of its total polarity scoring as negative).

Besides NJ (with 18 negative and 32 positive), there doesn't appear to be any surprises as far as negative polarity goes.

Polarity distribution by day.

1: Positive.

The distribution of positive reviews per day is exactly the same as the overall reviews left per day, no surprises considering the majority of reviews are positive.

Again, Monday is the most common day for leaving reviews, then it's Tuesday, Wednesday, Sunday, Thursday, Friday and Saturday in that order.

2: Negative.

A slight difference in the negative reviews per day here, Tuesday and Monday have swapped places, Wednesday is still the third-most common day for negative as it is for positive, but now Saturday is the fourth-most common day which is quite surprising.

Friday and Thursday have swapped places for the 5th and 6th most common days, and Sunday is the least common day for negative reviews.

Polarity distribution by month.

1: Positive.

Polarity by month tells a similar story to the best review scores (naturally), with July, May and August being the top three most positive review months.

2: Negative.

But, the negative polarity by month chart sees August, June and September as the most common months for negative reviews. As these are counts as opposed to averages, August and July will score highly for both positive and negative polarity due to the amount of holidaymakers in those months.

Polarity distribution by year.

1: Positive.

The positive and negative polarities by year don't see too many differences besides 2015 where the negative polarity jumps up slightly in accordance to the positive polarity. A record number of Americans travelled abroad in 2015 so this maybe had some impact. 2005's negative polarity is also worthy of note, possibly due to Hurricane Katrina, oil price surges and the like.

2: Negative.

Top provinces for negative sentiment in 2015.

Looking at the top provinces for negative sentiment 2015 shows Cali holding the number one spot, in accordance with its level of tourism, so let's check some hotel names.

These numbers are in keeping with the total amount of holidaymakers but we may see the odd outlier here.

There are a few hotels causing that spike in 2015's negativity, primarily the Best Western Seven Seas. 2015 saw some very hot / dry weather in Cali along with some droughts so I will give the hotel some credit and suppose the words 'critters', 'centipedes', 'spiders', 'crawling' etc. in the data for the Seven Seas for this year are a result of that, with the critters seeking refuge from the heat within the confines of the hotel. I checked the hotel online and it mostly looks like a case of a great hotel being let-down by unreliable management, which is a shame for what could be an amazing place. Beautiful surroundings, perfectly situated for the local attractions, a nice building, the decor isn't bad considering almost every facet of the hotel has been taken for granted. Neglect is still the primary reason it's being plagued by bad reviews in 2023, a few of the reviews I have seen still mention the same odor in some of the rooms that was mentioned years ago. Granted, the current-era odour could be a different odour to the odour of yore, but the presence of any untreated 'stank' usually reflects a certain degree of neglect either way.

This is a common issue with hotels in many popular tourism destinations I've worked in over the years; the hotelier won't have a hospitality background but will be aware that the repeat custom will come rolling in irrespective of negative reviews due to the location of the hotel and the reasonable room pricing, resulting in some good financial returns for next to no investment / an acceptable reputation risk for the business owner which the location and affordability factors outweigh. Proved again here, by the lack of staff training in the reviews. Considering the location, if this hotel owner spent [x] amount on reviving the hotel, they could easily quadruple that investment in a few years' time once they'd (justifiably) adjusted the room prices to reflect the refurb cost, but as it stands, the customers will keep booking and it's those customers who suffer. Quite a pity in all honesty, the main product the customer pays for is peace of mind, and customers who purchase hotel stays should be expecting a bare minimum quality of 'comfortable' at the very least, no matter what the base price is.

One quick edit to add my favourite review of this hotel:

"A FREAKING BEE WAS CHASING MY SON AROUND THE ROOM..."

Not particularly the fault of the hotelier in all fairness...

Analysing hotel-specific ngrams.

(removed for brevity's sake).

Following some ngram investigation:

Polarity for the keyword "weather".

df['weather'] = df['reviews_text'].apply(lambda x: 1 if 'weather' in x else 0)

It looks like North Dakota, Oklahoma and Massachusetts score the highest in the negative weather reviews. Seeing as I'm from a tiny northern town in England that nobody has heard of & have never been to ND, I will look into this online and knock-up a word cloud for further investigation.

First, a bar chart visualising the negative polarity rate where the word 'weather' was mentioned in the reviews (orange = not great):

text_by_weather_neg = pd.DataFrame(df[(df['weather'] == 1) & (df['polarity'] < 0)]['reviews_text'].dropna())

The general rule of thumb is to assume that not all of the text in the reviews where negative sentiment regarding the weather will purely be because of the weather. Miserable weather makes people miserable and this will influence their review polarity, although I won't accept these results as completely conclusive based on the fact that so much can go wrong in hotels, leaving many things besides the weather to complain about.

Although, the main province for negative polarity where the weather is concerned is North Dakota, a state that does have some pretty extreme conditions, humid summers, pretty terrible winters and lots of rain in-between. Some of the words in the word cloud back that up, with 'mildew', 'winter', 'water', 'lung' present. On the other end of the spectrum we see, 'air', 'conditioner', 'hot', etc., amid other words such as 'stranded' and 'rude'. So, a mixture of things, but primarily bad staff, rooms and food. Whether (no pun intended) the hot or cold conditions were the sole deciding factor behind the negative reviews is then up for debate considering all of the other unrelated words in the word cloud, but I don't think they helped matters.

Negative reviews containing the word "weather" by month.

The majority of the negative reviews where the weather is related were written in February, March, October and December. The next outlier month that isn't a cold month is July.

Personally, I wouldn't like to go to ND - or anywhere for that matter - in Feb or March, and October is usually monsoon season. Each to their own.

Polarity for the keyword "beds".

df['bed'] = df['reviews_text'].apply(lambda x: 1 if 'bed' in x else 0)

It appears that only three provinces are responsible for 100% positive polarity as far as the beds go, and they're: 'RI' (Rhode island), 'ND' (North Dakota) and 'DE' (Delaware). Nice to see the beds in ND making-up for the shabby weather.

The worst provinces for negative bed reviews are Massachusetts, New Jersey and Nebraska.

bed_negative_polarity = pd.DataFrame(df[(df['bed'] == 1) & (df['polarity'] < 0)]['name'].dropna())

Top hotels for negative polarity mentioning the word "bed".

A histogram of the negative 'bed' polarity by hotel name clearly shows the Days Inn by Windham, the Annex at the Chelsea, the Econo Lodge and the Best Western Orlando East holding the largest share of negative bed reviews.

Polarity for the keyword "cheap".

df['cheap'] = df['reviews_text'].apply(lambda x: 1 if 'cheap' in x else 0)

Here we see Delaware, New Joisey and Massachusetts take the three top spots for 'cheap' mentions in the negative reviews.

cheap_negative_polarity = pd.DataFrame(df[(df['cheap'] == 1) & (df['polarity'] < 0)]['name'].dropna())

With the Baymont by Windham Florida Mall and the Best Western Seven Seas each holding two counts.

Moving on... The top 10 provinces by polarity.

def province_by_positive_review(df): df = df[df['reviews_rating'] == 5] df = df.groupby(['province'])['reviews_rating'].count().reset_index() df = df.sort_values(by=['reviews_rating'], ascending=False) df = df.rename(columns={'reviews_rating': 'positive_reviews'}) df = df.reset_index(drop=True) return df[:10] def province_by_negative_review(df): df = df[df['reviews_rating'] < 5] df = df.groupby(['province'])['reviews_rating'].count().reset_index() df = df.sort_values(by=['reviews_rating'], ascending=False) df = df.rename(columns={'reviews_rating': 'negative_reviews'}) df = df.reset_index(drop=True) return df[:10]

The top (or bottom, depending on how you look at it -.-) 10 provinces for negative polarity.

Here we see GA positioned slightly higher in the negative review table as it was in the positive review table. HI pops up for the negative reviews but is nowhere to be seen in the positive review table.

Let's have a look to see what's going on in Georgia!

ga_negative_polarity = pd.DataFrame(df[(df['province'] == 'GA') & (df['polarity'] < 0)]['reviews_text'].dropna())

The standout, but not outstanding words here are 'Disappointed', 'Run down', 'mattresses terrible', 'pool closed', 'nastiness' etc.. I'll grab some hotel names and see what pops up in a word cloud once I've narrowed it down to specific hotels:

Grabbing the hotel names in GA where the polarity is negative, the Wingate tops the list followed swiftly by the Residence Inn /Peachtree at 17th.

negative_ga_hotel_names = df[(df.province == 'GA') & (df.polarity < 0)].groupby('name').polarity.count().sort_values(ascending=False) negative_ga_hotel_names.head(10)

Wingate reviews.

Grabbing the Wingate reviews now.

wingate_neg_reviews = pd.DataFrame(df[(df.name == 'Wingate By Wyndham Atlanta Galleria Center') & (df.polarity < 0)]['reviews_text'].dropna()) wingate_neg_reviews

Wingate word cloud.

Some words in the negative polarity word cloud worthy of note for the Wingate are, 'pool', 'shocked', 'management', 'staff', 'failed', 'respiratory problems', 'housekeeping', 'smoke', 'bugs', 'incident', 'nasty' and 'dirty'.

The Peachtree at 17th reviews.

peachtree_neg_reviews = pd.DataFrame(df[(df.name == 'Residence Inn Atlanta Midtown/Peachtree at 17th') & (df.polarity < 0)]['reviews_text'].dropna()) peachtree_neg_reviews

For the secondmost common hotel in GA with negative polarity, the Residence Inn Atlanta Midtown/Peachtree at 17th, we see, 'service', 'disappointing', 'noises', 'broken', 'dirty', 'filthy', 'problems-terrible', 'elevators' (although possibly not associated with negativity, but we do also see 'maintenance'), 'slow', 'unimpressed' and 'shock'.

So, as a business analysis it isn't too difficult to continue on in this vein and uproot hotel-specific issues in a bid to bring further attention to them, but I will stop there because the curious hacker in me could happily go on forever.

Now, let's look at some review polarities by username. First I will create a new dataframe containing the review usernames, polarity scores and a count of total reviews left by each user.

df_reviews_user = df[['reviews_username', 'polarity']] df_reviews_user = df_reviews_user.groupby('reviews_username').agg({'polarity': 'sum'}) df_reviews_user['reviews_count'] = df['reviews_username'].value_counts() df_reviews_user.head()

Polarity by username.

Sorting the username values by max positive polarity.

And sorting them again by review count per user.

The users responsible for the most negative reviews.

Just in case they're fake accounts.

You're welcome.

The least active users (minimum review count by username).

The distribution of hotel reviews left by the top positive reviewers.

(Hover on the bars for extra information if you're still awake at this point)

Hospitality companies in operation.

In a bid to see which hotel companies are the leaders, I will parse the first two words (excluding the word 'the') from the features in df.name, then I will merge / count them up, drop duplicates and visualise them.

df.name = df.name.str.replace('-', ' ')

df['company_name'] = df['name'].str.split().str[:3].str.join(' ')

df['company_name'] = df['company_name'].apply(lambda x: x[4:] if x[:3] == 'The' else x)

For a 30 second strategy, this looks almost passable-ish.

I will remove some words such as 'san' from the hotel title but I won't go too far with it. If I were to be doing this as per a paid employee with time to spare as opposed to off my own back for fun, I would remove any county / state names or franchise-specific words from the titles (such as 'san', 'garden' etc.) and narrow it down a lot more. So this method isn't completely fair on some brands owning different styled hotels in different towns, but if they are such a large chain then they will appear here anyway. I think that if 'Hilton Garden' is at number 3 in the top 20 hotel brands then the other hotels with 'Hilton' as the precursor will only serve to keep that chain in the top three hotel brands in this list regardless. Additionally, single hotels not associated with a brand such as Hotel Emma will exist in this part of the data for obvious reasons.

Had this been a professional project I would filter these out by using the company names in the URLs at the beginning of the EDA.

df['company_count'] = df['company_name'].apply(lambda x: len(x))

Removing the words 'san', 'by', as well as commas and ampersands from the end of each string in company_name.

items_to_remove = ['san', 'by', ',', '&'] df['company_name'] = df['company_name'].apply(lambda x: ' '.join([word for word in x.split() if word not in (items_to_remove)]))

The 20 most commonly rated hotel companies.

Best Western, Hilton Garden and Hampton Inn are the top three most rated brands.

top_20_companies = df.company_name.value_counts() top_20_companies = pd.DataFrame(top_20_companies) top_20_companies = top_20_companies.reset_index() top_20_companies.columns = ['company_name', 'company_count'] top_20_companies = top_20_companies.head(20)

Returning a table consisting of the amount of times a hotel brand appears in each exact geolocation. As far as i'm concerned the most logical strategy would be to combine the latitude & longitude coordinates and link them to a company name in the next column, this way I can also count how many times the company name or hotel at those coordinates comes up and add those counts to another column called 'review_counts' (or something similar).

First, combining both coordinate features (latitude and longitude) into one single tuple, in a new column named 'lat_long'.

df['lat_long'] = df[['latitude', 'longitude']].apply(tuple, axis=1)

Creating a new df called 'geo_groups' consisting of those geographical coordinate tuples along with the corresponding company name, a count of how many occurrences of the company name appears per unique lat_long feature (hotel_count), the province column, and a review count column (how many instances of each tuple exists per company_name feature).

geo_groups = df.groupby(['lat_long', 'company_name', 'province']).size().reset_index(name='review_count')

geo_groups['hotel_count'] = geo_groups.groupby(['company_name', 'province'])['lat_long'].transform('nunique')

The features in hotel_count (max, descending) per province in geo_groups.

Best Western seems to have the greatest land share in a couple of provinces so I'll have a quick look into how many appear here exactly. Best Western have 138 hotels in Cali alone as of 2023 but they won't all be in the dataset. However, I'm willing to eat my hat if they don't have the majority count here anyway.

count_best_western = 0 for i in geo_groups.company_name: if i.split()[0] == 'Best' and i.split()[1] == 'Western': count_best_western += 1 print(f'There are {count_best_western} Best Westerns in the geo_groups dataset. ')

Distance travelled per reviewer (data not currently present).

Ok so this one hurt a little. I wanted to figure out the coordinates for the reviews_usercity features to get a sum-total distance travelled to the hotel's location for each username. I tried a few APIs but they all had limits and / or timed-out due to the data download limit, so because I didn't wish to split the dataframe in half and max the API out over a few days, I decided to do it manually.

What I did was:

Scraped an American government website that was full of latitude and longitude coordinates for a bunch of cities in America.

Created a copy of the original dataframe and named it "df_city".

Placed df.city's values in a list (list_1) and placed df_city.reviews_usercity's values in a second list (list_2). There weren't many matching cities between the two though.

So I dropped from list_2 what wasn't in list_1 to get a complete list with no missing values.

Then combined the latitude and longitude columns into one column called 'city_coordinates' (as per df.lat_long above) in the df_city dataframe.

After that, I created a new df column 'df['reviews_usercity_coords']' which I filled with the values in df_city.city_coordinates where df.reviews_username.and city_df.reviews_username were the same values using this absolute unit of a one-liner:

# df['reviews_usercity_coords'] = df.apply(lambda x: df_city[df_city.reviews_username == x.reviews_username].city_coordinates.values[0] if x.reviews_username in df_city.reviews_username.values else np.nan, axis=1).

Now time to split the coordinates in usercity_loc_df into usercity_loc_df['user_latitude'] and usercity_loc_df['user_longitude'] columns and create another column called df['reviewer_distance_travelled'] for the values in latitude and longitude that are present for a df.reviews_username using h3, sorting and users by travel distance.

# usercity_df = pd.read_csv(r"")

# fig = px.bar(usercity_df.sort_values(by='reviewer_distance_travelled', ascending=False).head(50), # x='reviews_username', y='reviewer_distance_travelled', color='reviews_username', # title='Top 50 users with the longest travel distances to hotels') # fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)', # marker_line_width=1.5, opacity=0.6) # fig.show()

DBSCAN for business recommendations and optimisation.

data = pd.read_csv('/work/Datafiniti_Hotel_Reviews_Jun19.csv') geo_data = data[['latitude', 'longitude']].dropna() eps = 0.1 min_samples = 5 cluster_model = DBSCAN(eps=eps, min_samples=min_samples, metric='haversine').fit(geo_data) geo_data['cluster'] = cluster_model.labels_ data['cluster'] = geo_data['cluster'] data.head()

num_clusters = geo_data['cluster'].nunique() num_noise = sum(geo_data['cluster'] == -1) num_clusters_excluding_noise = num_clusters - (1 if -1 in geo_data['cluster'].unique() else 0) num_clusters_excluding_noise, num_noise

DBSCAN has formed 87 distinct clusters, and 214 additional data points have been labeled as noise. These were labeled as noise due to their low review counts (noise hotels have a minimum review count of 1 and a maximum review count of 46, with the majority having lower review counts (mean = ~2.33)) because in some situations it would be preferable to remove these from a model or analysis, however, they're still businesses, and still deserve to be included in this data.

Hotel clusters via DBSCAN:

Those clusters and their rating counts, plus mean ratings:

Hotel clusters grouped by geographic location and average rating, for customers looking for a fuss-free getaway:

• 1. Cluster 66: Located at latitude 44.056925, longitude -121.312379 with an average rating of 4.92 (13 reviews). • 2. Cluster 60: Located at latitude 29.442860, longitude -98.481722 with an average rating of 4.88 (185 reviews). • 3. Cluster 27: Located at latitude 37.078506, longitude -82.163895 with an average rating of 4.80 (20 reviews). • 4. Cluster 47: Located at latitude 38.622334, longitude -84.502387 with an average rating of 4.78 (36 reviews). • 5. Cluster 36: Located at latitude 34.525010, longitude -95.621731 with an average rating of 4.76 (38 reviews). • 6. Cluster 43: Located at latitude 33.178687, longitude -81.814432 with an average rating of 4.70 (44 reviews). • 7. Cluster 59: Located at latitude 38.749407, longitude -112.804436 with an average rating of 4.63 (64 reviews). • 8. Cluster 83: Located at latitude 27.966700, longitude -82.549500 with an average rating of 4.62 (13 reviews). • 9. Cluster 62: Located at latitude 40.872631, longitude -99.300003 with an average rating of 4.47 (15 reviews). • 10. Cluster 38: Located at latitude 40.098297, longitude -75.406751 with an average rating of 4.41 (17 reviews).

Recommending highest-rated hotels in common areas based on the customers' home location data (user_city). Although not a precise science - ergo: simply because you live in Austin does not mean you'll automatically want a holiday in Seattle - it does highlight certain trends and perhaps a holiday in Seattle might be what you want if you do live in Austin, who knows (!?), it's 50/50.

Alternatively, Booking.com's recommendation algo targeted customers in Austin for hotels in Seattle, as it does for whatever reason. Either way:

The top hotels (based on aggregated positive keywords and mean rating re: cleanliness and service) are: • 1. Hyatt House Seattle/Downtown (Cleanliness/Service Score: 92, Avg. Rating: 4.30, Reviews: 209) • 2. Drury Inn & Suites New Orleans (Cleanliness/Service Score: 83, Avg. Rating: 4.61, Reviews: 132) • 3. Best Western Seven Seas (Cleanliness/Service Score: 82, Avg. Rating: 3.59, Reviews: 132) • 4. Homewood Suites by Hilton Lake Buena Vista-Orlando (Cleanliness/Service Score: 80, Avg. Rating: 4.12, Reviews: 122) • 5. Hilton Garden Inn Orlando Airport (Cleanliness/Service Score: 78, Avg. Rating: 4.21, Reviews: 105) • 6. Anaheim Del Sol Inn (Cleanliness/Service Score: 70, Avg. Rating: 3.99, Reviews: 96) • 7. Hampton Inn San Diego Del Mar (Cleanliness/Service Score: 65, Avg. Rating: 4.09, Reviews: 100) • 8. Hampton Inn & Suites Orlando at SeaWorld (Cleanliness/Service Score: 63, Avg. Rating: 4.42, Reviews: 89) • 9. Best Western Mission Bay (Cleanliness/Service Score: 60, Avg. Rating: 3.78, Reviews: 69) • 10. The Orchard Garden Hotel (Cleanliness/Service Score: 58, Avg. Rating: 4.49, Reviews: 73)

Assessing under-performing hotels / areas for further investment.

Top regions requiring attention:

• Cluster 69: Latitude 39.709414, Longitude -83.061636, Avg. Rating 3.14 (196 reviews). • Cluster 49: Latitude 37.565964, Longitude -97.091410, Avg. Rating 3.25 (144 reviews). • Cluster 17: Latitude 33.625679, Longitude -84.453801, Avg. Rating 3.29 (196 reviews). • Cluster 55: Latitude 28.990516, Longitude -83.324659, Avg. Rating 3.30 (729 reviews). • Cluster 32: Latitude 41.781290, Longitude -116.329420, Avg. Rating 3.38 (169 reviews).

And for the purposes of getting something else that's mildly tangible from this project, a simulation adjusted hotel performance based on higher cleanliness scores and slight improvements in review ratings.

Top hotels under these changes are: • 1. Hyatt House Seattle/Downtown: Cleanliness/Service Score 1844, Avg. Rating 5.08, Reviews 183. • 2. Hotel Emma: Cleanliness/Service Score 1844, Avg. Rating 5.08, Reviews 183. • 3. French Market Inn: Cleanliness/Service Score 1497, Avg. Rating 4.66, Reviews 144. • 4. St. James Hotel - an Ascend Hotel Collection Member: Cleanliness/Service Score 1404, Avg. Rating 4.51, Reviews 136. • 5. Drury Inn & Suites New Orleans: Cleanliness/Service Score 1403, Avg. Rating 4.81, Reviews 132.

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}Introduction.

The data.

A quick peek.

All of the column labels.

Statistical description.

Date and time cleaning.

Categorical description.

Analysis.

Provinces with the highest count of hotels.

Rating distribution.

Review distribution by weekday.

Review distribution by month.

The highest rated hotels.

The highest average rated provinces.

The lowest average rated provinces.

The top twenty "super-reviewers".

The main reviewer locations / user cities and their average ratings.

The top four reviewer locations / user cities.

Top provinces visited for reviewers living in top user city 1 - NYC.

Top provinces visited for reviewers living in top user city 2 - LA.

Top provinces visited for reviewers living in top user city 3 - Chicago.

Top provinces visited for reviewers living in top user city 4 - Houston.

Review counts.

Top hotels by review distribution.

NLP.

Average review length.

Exclamation marks.

Data preprocessing.

Bigram and trigram functions.

Polarity functions.

Visualisation functions.

Stop-word removal.

Back to the analysis.

Hotels by polarity score.

Pos / neg polarity distribution by province.

Polarity distribution by day.

Polarity distribution by month.

Polarity distribution by year.

Top provinces for negative sentiment in 2015.

Analysing hotel-specific ngrams.

Polarity for the keyword "weather".

Negative reviews containing the word "weather" by month.

Polarity for the keyword "beds".

Top hotels for negative polarity mentioning the word "bed".

Polarity for the keyword "cheap".

Moving on... The top 10 provinces by polarity.

The top (or bottom, depending on how you look at it -.-) 10 provinces for negative polarity.

Wingate reviews.

Wingate word cloud.

The Peachtree at 17th reviews.

Polarity by username.

The users responsible for the most negative reviews.

The least active users (minimum review count by username).

The distribution of hotel reviews left by the top positive reviewers.

Hospitality companies in operation.

The 20 most commonly rated hotel companies.

Distance travelled per reviewer (data not currently present).

DBSCAN for business recommendations and optimisation.

Fin.

Introduction.