import calendar from imblearn.over_sampling import SMOTE import matplotlib.pyplot as plt import numpy as np from numpy import set_printoptions import os import pandas as pd from plotly.subplots import make_subplots import plotly.express as px import plotly.graph_objects as go from scipy.stats import zscore import seaborn as sns from sklearn.compose import make_column_transformer from sklearn.ensemble import RandomForestClassifier from sklearn.feature_selection import SelectKBest, f_classif from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV from sklearn.preprocessing import OneHotEncoder from sklearn.svm import SVC import xgboost as xgb from xgboost import XGBClassifier import warnings warnings.filterwarnings('ignore')

Introduction.

This data was collected and compiled by Guarav Dutta, distributed courtesy of Kaggle. I created this as a capstone (read: 'old') project for my data analysis education.

Problem statement:

"Customer behavior and booking possibilities have been radically changed by online hotel reservation channels. Cancellations or no-shows cause a significant number of hotels to lose money. Cancellations can be caused by a variety of factors, such as scheduling conflicts, changes in plans, etc. In many cases, this is made easier by the possibility of doing so free or at a low cost, which is beneficial for hotel guests but less desirable and possibly revenue-diminishing for hotels."

train = pd.read_csv("/work/train__dataset.csv")

A peek at the data.

The things dreams are made of: No null values. 18 columns consisting of 18,137 entries totaling 326,466 values.

• The mean number of adults is around 2, with a 25%, 50% and 75% of 2 and a max of 4. So, there will mostly be parties of two, very little in the way of single customers and less parties of three than there are singletons.

• The "number of children" count has a max of 9 with no other values and it looks as though the proportion of customers with no children is a good bit greater than those who do have children.

• Almost certainly due to there being more days of the week than weekend days, there is a much higher count of weekday nights than weekend nights. Also we see lower prices through the week which could be attracting more customers on weekdays.

• The majority of customers didn't require a parking space, with a mean of around 0.03.

• Similar with "room_type_reserved", the majority customers did not require a reservation.

• "Lead time" (the wait time between booking and arrival) sees a mean of 85, a max of 443 and a std of 86. So, sparsely spread out across a whole bunch of values with what will likely be a large cluster in the lower echelons of the dataset.

• The data is spread across all months of the year but leaning more toward the latter part of the year, over two years; 2017 and 2018. Most guests recorded in the dataset, if not almost all of the guests, are not repeat customers.

• The number of previous cancellations will mostly be 0 as that also has a very low mean and a max of 13.

• And the same can be said for the number of previous bookings not cancelled column. With a standard deviation of 1.7 and a max of 58, that data will be spread out quite far but with some tight clusters somewhere.

• The average price per room has a mean of 35.47 and a max of 540, with an obvious high std.

train.rename(columns = {'no_of_previous_bookings_not_canceled': 'no_of_previous_bookings_not_cancelled'}, inplace = True)

train = train[train.no_of_adults != 0]

Analysis.

The data shows almost half of the bookings falling under the bracket of "cancellation":

Histograms for every feature in the dataset.

The statistical description held some good information so there should be no unpleasant surprises here.

• 16,767 bookings out of 18,070 have no kids present.

• The arrivals by month are primarily in October.

• Weekday bookings outperform weekend bookings.

• The lead time sees plenty of right-skew with a long tail.

• Most bookings didn't require a reservation.

• The number of previous cancellations and the parking required features are also minor.

Correlations.

We see the strongest correlation to cancelled bookings in the lead time column. I have no firm assumptions here. My initial train of thought is that the grey hue in the lead time column pertains to long booking times where something came up in the time between booking and arrival, or short booking times where the customer is travelling around quickly / booking things in a rush, both of those have happened in my experience but I wouldn't like to bank money on either just yet without knowing more about where the hotel is located (in a business district or holiday destination), but I doubt there will be an in-between.

As well as the "lead_time" column bearing a strong correlation to booking status, we see the "average_price_per_room" column. Why this is highly correlated could be due to many reasons depending on the data, but the main reason here will simply be that every room booked will have had a price.

Scanning the chart, I can see that 'Room type reserved' is one feature that makes sense; long wait times == rooms reserved, or short wait times == no reservation. Either or both will result in a correlation.

With the strength of "lead_time"'s correlation to the target variable and the fact there seem to be plenty of 0 - low figures in the dataset, SMOTE and Shap will likely have to be used in the final model. I would expect "lead_time" to override several other features which will mean plenty of experimentation with different algorithms and good EDA.

Looking at the absolute correlation. There are a couple of columns that are weighted toward the features that are heavily correlated with cancellation, such as "no_of_special_requests", where there is a strong correlation to "average price per room". Possibly due to the room price rising per special request figure(?!).

1 - Lead time.

• The majority of the lead_time data is '0', meaning the majority of bookings were placed on the same day as arrival.

• The cancellations appear to be more prominent in the lead_time column after the 150 day marker, visualised in the horizontal boxplot at the top of the figure where we see cancellations sitting at a median of 122 days, a Q3 of 207 and an upper fence of 433 - slightly over 2x that of the upper fence for customers who did not cancel, so this goes some way to eliminate the possibility of short lead times being a key factor in cancellation rates.

• The data for the average lead time by cancellation further shows the spike in average cancellations around the lead_time_150 value. There are a handful of troughs around the 275-350 mark but it looks like the majority of features after 150 days are cancellations:

Lead time distributions.

• 188 days (or around 6 months) is the most common lead time for cancellations.

• 166 days (around 5.5 months) is the second-most common lead time.

• 245 days (around 8 months) is the third-most common lead time.

Bearing in mind the line chart and histogram above, the key takeaway should be that there is a higher probability of cancellation by a customer who books a room at least 3 - 6 months in advance.

2 - Number of adults cancelled.

Sum.

• A sum of 4586 couples cancelled.

• A sum of 892 single bookings cancelled.

• A sum of 440 parties of three cancelled.

• And a sum of 4 parties of four cancelled.

Average.

• 38% of parties of three cancelled.

• 34% of couples cancelled.

• 23% of single bookings cancelled.

• 14% of parties of four cancelled.

3 - Number of children present.

Sum.

• At a sum total of 5424, cancellations by customers with no children present are the majority value.

• A sum of 283 bookings with one child present cancelled.

• A sum of 210 bookings with two children present cancelled.

4 - Arrival month.

The majority of cancellations are for room bookings in October, followed by September, then August.

Considering that the majority of cancellations || lead_time features reside in the >=6 <=8 month realm, logic would suggest that the customer contacted the hotel in winter to place their booking Customers booking in the colder months for stays in the warmer months would then lead me to believe that this hotel is based in a tourist district.

Arrival month by arrival date.

Here we can see the main cancellation days for each month, and those daily distributions have around 1% - 2% difference from one another.

Also we have more distinct idea of why October is the most common month for cancellations. For instance, October is the only month with a max daily cancellation figure of 2% (on the 16th).

With regard to important days / dates, we see a handful of occurrences for the 15th and the 30th spread across several months.

5 - Arrival date.

The top five of the ten most common arrival dates for cancellations.

• The 15th (12%).

• The 30th (10/5%).

• The 16th (10.5%).

• The 4th (10%).

• The 12th (10%).

Creating some new features.

I thought there would be more to insights to gain by tying a few of these metrics together, so, I will add the number of adults to the number of children for each row and create a new column from that data called "arrival_party_size", add the number of week nights along with the number of weekend nights for each row and create a new column from that data called "night_sum", then create a new subset dataframe using that data, include the booking status and arrival date features from the original dataframe before dropping booking_status == 0 to see how the customers and nights cancelled check out.

The subset dataframe:

Cancellations by total head count by weekday.

Now, hopefully we will get a clearer image of the total amount of customers cancelling by day / date.

• Arrival dates 0-4: A sum of bookings for 8 people at a total of 20 nights.

• Arrival dates 5-10: A sum of bookings for 13 people at a total of 30 nights.

• Arrival dates 11-15: A sum of bookings for 8 people at a total of 17 nights.

• Arrival dates 16-20: A sum of bookings for 9 people at a total of 15 nights.

• Arrival dates 21-25: A sum of bookings for 11 people at a total of 23 nights.

• Arrival dates 26-30: A sum of bookings for 12 people at a total of 16 nights.

• Arrival date 31: A sum of bookings for 1 person at a total of 3 nights.

6 - Number of week nights.

The average cancellations for week nights are spread out across a wide range of values, with the majority in a region above five nights.

'5 week nights booked' has an average cancellation rate of 39% where '6 week nights booked' has an average of 51%. I don't know much about how this hotel works so this is a wild guess; this could *possibly* be due to the 5 night stays being booked in one big slot (Mon - Fri) where 6 nights will be 2 x 3 night stays / 3 x 2 night stays with a lead time gap in between.

Week nights >5 booked.

• All three bookings for 13 week nights cancelled.

• The one single booking for 16 week nights cancelled.

• 86% of the bookings for 10 week nights cancelled.

• 83% of the bookings for 11 week nights cancelled.

• 83% of the bookings for 15 weeknights cancelled.

• 80% of the bookings for 12 and 14 weeknights cancelled.

We still see relatively high averages in the groups for the lower counts of week nights booked, with the lowest average cancellation rate being 47% (for 7 week nights booked).

Investigating some of the lead times for these features shows some vast differences with no real discernable patterns:

• All three cancelled bookings for 13 week days have lead times of 5, 12 and 130 days.

• The one cancelled booking for 16 week nights has a lead time of 31 days.

Looking at a month-by-month distribution of cancellations, of the twenty six cancelled bookings for 10 week nights, the majority cluster of lead times are within the first month of booking:

• The five cancelled of six bookings for 11 week nights have lead times of 5, 31, 68, 152 and 137 days.

• The four cancelled of five bookings for 15 week days have lead times of 8, 12, 90 and 150 lead days. The one booking that did not cancel had a lead time of 17 days.

7 - Number of weekend nights.

• The one single booking for 7 weekend nights cancelled.

• 85% of bookings for 6 weekend nights cancelled, and 84% of bookings for 5 weekend nights cancelled.

From there, we see quite a good drop in cancellations; 69.6% for 4 weekend nights, decreasing exponentially with the number of weekend nights booked, down to 34.6% and 34.3% for one and two weekend nights.

Along with the lead time features, I added a sum of the size of each "no_of_weekend_nights" feature to this chart. Although there is only one count where [["no_of_weekend_nights"] == 7], it is easy to see that this one entry has a long lead time (188 days).

So the average week night cancellation figures are quite a bit higher than the weekend night cancellation figures.

• On average, bookings for a number of weekend nights above 4 have a higher probability of cancelling than a number of weekend nights below 4.

• Bookings for two weekend nights have the lowest chance of cancellation.

• Bookings for a number of week nights above 5 have a higher probability of cancelling than those below 5.

It is difficult to tell if the > 3 weekend nights are split bookings as opposed to block bookings (hotels classing 4 weekend nights as Fri-Tues), something I won't make a guess on that without the presence of any additional data.

8 - Meal plan.

The majority of meal plan cancellations relate to [["meal_plan"] == 0]. Meal plan 3 holds no cancellations.

9 - Market segment type.

• Market segment type 4 has the lowest cancellation figure of 0% of 195 bookings.

• That is followed by market segment type 2 with 11% cancelled of 1029 bookings.

• Then segment type 3 sees 26% of 50 bookings cancelled.

• Segment type 0 sees 29% of 5238 bookings cancelled.

• And the largest share of cancellations belongs to the market segment type value holding the most data - segment 1 - with 36% cancelled of 11,558 bookings:

10 - Repeat guest.

Repeat guests aren't too common for this business, only 455 are repeat guests and 17,615 customers are first-timers.

repeat_guest_values = train.groupby('repeated_guest')['booking_status'].agg(['mean', 'size']).reset_index() repeat_guest_values.sort_values(by='mean', ascending=False)

11 - Number of previous cancellations.

Another '0' value which will cause the model to better classify 0's over any other value, the number of previous cancellations column.

• There are only two values for "13 previous cancellations" both of which cancelled.

• The next-largest percentage (32%) belongs to 0 previous cancellations at a size of almost 18K values.

So it is safe to say that people who have not cancelled previously are more likely to cancel.

12 - Number of previous bookings not cancelled.

A similar story with the number of previous bookings not cancelled:

We see 0 as the majority value and the largest average.

Although here we see a slightly higher percentage for 0: 33.4%, and a figure of 14% for 12 previous bookings not cancelled.

13 - Average price per room.

There doesn't seem to be anything worth noting here besides the large spike in bookings not cancelled around the 0-1.99 average price mark. 1.99 is a strange price for a room so I'm guessing these prices have been converted from one currency to another. Anyway, the average price range of 50 or 60 to 100 appears to bear the most cancellations based on an eyeball of this chart.

Binning the price groups.

train['average_price_0_50'] = train['avg_price_per_room'].apply(lambda x: 1 if x <= 50 else 0) train['average_price_50_100'] = train['avg_price_per_room'].apply(lambda x: 1 if x > 50 and x <= 100 else 0) train['average_price_100_150'] = train['avg_price_per_room'].apply(lambda x: 1 if x > 100 and x <= 150 else 0) train['average_price_150_200'] = train['avg_price_per_room'].apply(lambda x: 1 if x > 150 and x <= 200 else 0) train['average_price_200_250'] = train['avg_price_per_room'].apply(lambda x: 1 if x > 200 and x <= 250 else 0)

A more in-depth look at that data confirms the average price range of 100-150 as the most common price range, with 41% of 7018 bookings cancelled. The price range 50-100 also holds a good percentage of cancellations, with 26% of 8953 bookings cancelled.

There are some cancellation values in the average room price above 200 that I would like to look into a little further.

average_price_hi = train.groupby(['booking_status', 'avg_price_per_room', 'no_of_weekend_nights', 'no_of_week_nights'])['booking_status'].agg(['mean', 'size']).reset_index() average_price_hi = average_price_hi[average_price_hi.booking_status == 1] average_price_hi = average_price_hi[average_price_hi.avg_price_per_room > 200]

A closer look at the average price per room > 200.

It looks like the most popular average room price for this range is 210-229, followed by 190-209. The higher average room prices are mostly spread across week nights 1-5, at a value of 93 cancellations.

And a closer look into the cancellation values where the average price per room is less than 200.

Again, mostly week nights. The week night figure is around +2200 on the weekend nights.

average_price_lo = train.groupby(['booking_status', 'avg_price_per_room', 'no_of_weekend_nights', 'no_of_week_nights'])['booking_status'].agg(['mean', 'size']).reset_index() average_price_lo = average_price_lo[average_price_lo.booking_status == 1] average_price_lo = average_price_lo[average_price_lo.avg_price_per_room < 200]

14 - Number of special requests.

Cancellations by number of special requests shows us that zero special requests are the dominant count at 72%.

Bookings making one special request are then 23% likely to cancel, followed by only 5% of bookings making two special requests.

The average number of special requests per average room price bracket.

• The mean special request figure for average_price_0_50 where the customer cancelled is 59.

• The mean special request figure for average_price_50_100 where the customer cancelled is 48.

• The mean special request figure for average_price_100_150 where the customer cancelled is 71.

• The mean special request figure for average_price_150_200 where the customer cancelled is 93.

• The mean special request figure for average_price_200_250 where the customer cancelled is 96.

Special requests vs. lead time.

Darker hues represent the cancelled bookings.

As we saw in the chart above, there are a lot of customers making three and four special requests judging by the size of their markers. But here we can see that those customers did not cancel one single booking.

As per the pie chart two cells up, it is the customers making one and two special requests who are cancelling. And as we have already seen in the lead time portion of the EDA, it is those with lead times over 150 days doing the cancelling who are mostly making the requests.

Average special requests by date.

The 29th bears the largest request average for those who cancelled, followed by the 18th, then the 23rd.

15 - Room type reserved.

(All "cancelled" data).

Here we can see a good distribution of room types reserved (and cancelled) right across the lead time ranges, but with room type 0 and room type 2 the only room types being cancelled over a lead time of 350 days.

The majority of cancellations for room types 1, 2 and 3 specifically appear to pick up quite a lot around the lead time 140 day mark.

With no bookings for room types 4, 5 and 6 from 280 days onwards, we still see cancellations pick up the pace into the higher lead time ranges for the remaining room types.

Better visual evidence of cancellations increasing as the lead time increases, with most room types falling foul of those cancellations besides room type 5, whose majority cancellations reside in the lead time range below 140.

Cancellations in the darker hues:

Modeling.

y = train['booking_status'] X = train.drop(['booking_status', 'average_price_0_50', 'average_price_50_100', 'average_price_100_150', 'average_price_150_200', 'average_price_200_250'], axis=1)

Dummifying the special requests column after binning the other data.

X = pd.get_dummies(X, columns=['no_of_special_requests'])

Using a 70 / 30 split.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

xgb_model = XGBClassifier( n_estimators=random_search.best_params_['n_estimators'], max_depth=random_search.best_params_['max_depth'], learning_rate=random_search.best_params_['learning_rate'], min_child_weight=random_search.best_params_['min_child_weight'], gamma=random_search.best_params_['gamma'], colsample_bytree=random_search.best_params_['colsample_bytree'], eta=random_search.best_params_['eta'], objective=random_search.best_params_['objective'] ) xgb_model.fit(X_train, y_train) xgb_pred = xgb_model.predict(X_test)

Model explanation.

• The lead time is by far the most important factor in booking cancellation. As seen multiple times in the EDA, the farther away the booking day / the higher the value, the higher the probability of cancellation.

• The average price per room feature is another important variable. With the cancellation distribution following a very similar pattern to the rooms booked and not cancelled in the same price ranges, it's safe to say that this data is important in the way I assumed in the beginning of this project where the room prices are among the most important data because the rooms were booked. That being said, the price range 50-150 is the most common for cancellations.

• Although the high values (likely segment type 1) aren't too unhealthy for the model, market segment type is an important feature here. Segments 1, 0 and 3 are especially worthy of investigation.

• Number of week nights is a highly important feature impacting the model.

• Number of weekend nights - as seen in the EDA - is as important but doesn't bear as much weight as the week nights.

• Type of meal plan "0" is another important feature highly worthy of investigation for the hotelier.