Hotel Customers

Import libraries.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import plotly.express as px
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pip install openpyxl

Read the data.

The data was obtained from Kaggle. "The data comprehends three full years of customer personal, behavioral, demographic, and geographical information."

https://www.kaggle.com/datasets/nantonio/a-hotels-customers-dataset

file_name = "HotelCustomersDataset.xlsx"  # File name
sheet_name = 0  # 1st sheet
header = 0  # The header is the 1st row
df = pd.read_excel(file_name, sheet_name=sheet_name, header=header)

df=pd.DataFrame(df)

The data frame head.

df.head()

Explore

df.shape

df.isnull().sum()

Age has null values. We must delete those rows.

df.nunique().to_frame(name = 'Number of unique values')

df.describe().T

We can see that Age and AverageLeadTime contain negative values. We must delete those rows.

df.info()

The Dtype of ID is int64. Since ID is an identifier and not a quantity, we change it to str.

We make the changes in the dataframe.

df = df[df['Age'] > 0]
df = df[df['AverageLeadTime'] >= 0]
# Reassign instead of using inplace=True, so we never modify a filtered slice in place
df = df.dropna()

# Change Type
df['ID'] = df['ID'].astype(str)

df.reset_index(inplace=True)
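As a minimal sketch of the cleaning steps above, the following chains the same filters on a small made-up frame. The values are hypothetical and only the column names mirror the dataset; the chained style is one possible way to write it, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the dataset.
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Age": [34.0, -7.0, np.nan, 51.0, 28.0],
        "AverageLeadTime": [10, 45, 3, -1, 0],
    }
)

clean = (
    toy[(toy["Age"] > 0) & (toy["AverageLeadTime"] >= 0)]  # keep valid rows only
    .dropna()                  # drop any remaining rows with nulls
    .astype({"ID": str})       # ID is a label, not a number
    .reset_index(drop=True)    # renumber rows from 0 without keeping the old index
)
print(clean)
```

Passing drop=True to reset_index discards the old index instead of storing it in a new column named index.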

We drop the columns which are not necessary.

cols = ['DocIDHash', 'NameHash', 'index']
df.drop(cols, axis=1, inplace=True)

Age

Now, we are going to explore the Age of the Hotel Customers.

age_groups = df["Age"].unique()
print("Age Groups:", age_groups)

age_groups.max()

First, we make a dict that maps age ranges to category labels, so that we can plot Age by category.

age_dict = {
    range(0, 35): "Under 35",
    range(35, 45): "35-44",
    range(45, 55): "45-54",
    range(55, 65): "55-64",
    range(65, 75): "65-74",
    range(75, 123): "75 or Older",
}
age_d = df["Age"].replace(age_dict)
age_d.head(10)

age_d_value_counts = age_d.value_counts()
age_d_value_counts.plot(
    kind="bar",
    xlabel="Age Group",
    ylabel="Frequency (count)",
    title="Hotel Customers Age Groups"
);
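An alternative way to build the same age groups is pd.cut, which bins a numeric column directly. This is a sketch on made-up ages, not the method used above; the bin edges are chosen to match the ranges in age_dict.

```python
import pandas as pd

# Hypothetical ages chosen to hit every bin and its boundaries.
ages = pd.Series([22, 34, 35, 44, 45, 60, 70, 75, 90], name="Age")

# Left-closed bins [0, 35), [35, 45), ..., [75, 123) mirror the ranges in age_dict.
bins = [0, 35, 45, 55, 65, 75, 123]
group_names = ["Under 35", "35-44", "45-54", "55-64", "65-74", "75 or Older"]
age_binned = pd.cut(ages, bins=bins, labels=group_names, right=False)

# sort=False keeps the categories in age order instead of ordering by frequency.
print(age_binned.value_counts(sort=False))
```

Because pd.cut works on intervals, it also assigns non-integer ages such as 34.5 to a group, and the resulting categories keep their natural order in plots.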

We can see that the category with the highest number of customers is Under 35.

Now, we plot a histogram of Age.

df["Age"].hist(bins=10)
plt.xlabel("Age")
plt.ylabel("Frequency (count)")
plt.title("Hotel Customers Age Distribution");

Market Segment

df["MarketSegment"].unique()

# normalize=True returns proportions between 0 and 1, not percentages
market_segment = df["MarketSegment"].value_counts(normalize=True)
market_segment.plot(kind="barh")
plt.xlim((0, 1))
plt.xlabel("Frequency (proportion)")
plt.ylabel("Market Segment")
plt.title("Hotel Customers: Market Segment");

We can see that Other is the largest category, followed by Travel Agent / Operator.

Nationalities

# value_counts() without normalize returns raw counts
nationalities = df["Nationality"].value_counts().sort_values().tail(15)
nationalities.plot(kind="barh")
plt.xlabel("Frequency (count)")
plt.ylabel("Nationalities")
plt.title("Hotel Customers: Nationalities");

In this graph we can see the 15 most common nationalities of hotel guests.

Features used for the model

Now we choose the features that we are going to use for clustering. For this, we select the numerical features that have the largest variance.

First we select the columns with a numeric type and create a new data frame called df_number.

df_number = df.select_dtypes(include="number")

We calculate the variance of the features. We choose the 10 features with the largest variance.

top_ten_var = df_number.var().sort_values().tail(10)
top_ten_var

Now we are going to make a horizontal bar chart of the ten variables that have the largest variance.

# Create horizontal bar chart of `top_ten_var`
fig = px.bar(
    x=top_ten_var,
    y=top_ten_var.index,
    title="High Variance Features"
)
fig.update_layout(xaxis_title="Variance", yaxis_title="Feature")
fig.show()

Now we create a boxplot of Lodging Revenue to see if the values are skewed.

fig = px.box(
    data_frame=df_number,
    x="LodgingRevenue",
    title="Distribution of LodgingRevenue"
)
fig.update_layout(xaxis_title="")
fig.show()

We can see that the data is right-skewed and that there are outliers. Because outliers inflate the ordinary variance, we use the trimmed variance instead.

For this reason, we calculate the trimmed variance of the numeric features in df_number, excluding the lowest 10% and the highest 10% of the values of each feature. With this we create a series top_ten_trim_var with the 10 features with the largest trimmed variance.

# Calculate trimmed variance
top_ten_trim_var = df_number.apply(trimmed_var, limits=(0.1, 0.1)).sort_values().tail(10)
top_ten_trim_var
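To illustrate why trimming matters, here is a small sketch that compares the ordinary variance with the trimmed variance on made-up values containing one extreme outlier. The numbers are hypothetical and are not taken from the dataset.

```python
import numpy as np
from scipy.stats.mstats import trimmed_var

# Hypothetical revenue-like values: nine typical values and one extreme outlier.
values = np.array(
    [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0, 180.0, 10_000.0]
)

plain = float(values.var(ddof=1))  # sample variance over all points
# Mask the lowest 10% and highest 10% of the values before computing the variance.
trimmed = float(trimmed_var(values, limits=(0.1, 0.1)))

print(f"variance:         {plain:,.0f}")
print(f"trimmed variance: {trimmed:,.0f}")
```

The single outlier dominates the ordinary variance, while the trimmed variance reflects the spread of the typical values. This is why a feature can rank very differently in top_ten_var and top_ten_trim_var.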

Now, we make a horizontal bar chart of top_ten_trim_var.

fig = px.bar(
    x=top_ten_trim_var,
    y=top_ten_trim_var.index,
    title="High Variance Features"
)
fig.update_layout(xaxis_title="Trimmed Variance", yaxis_title="Feature")
fig.show()

Now we make a list with the five features with the highest trimmed variance.

high_var_cols = top_ten_trim_var.tail(5).index.to_list()
high_var_cols

Split

We create a matrix called X that contains the columns in high_var_cols.

X = df[high_var_cols]
print("X shape:", X.shape)
X.head()

Build Model

Iterate

Now we are going to put all the features in X on the same scale.

In the next table, we can see the information that we are going to standardize.

X_summary = X.aggregate(["mean", "std"]).astype(int)
X_summary

We create a StandardScaler and use it to transform the data in X. Then we put the transformed data in X_scaled.

ss = StandardScaler()
X_scaled_data = ss.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled_data, columns=X.columns)
print("X_scaled shape:", X_scaled.shape)
X_scaled.head()

Now all the features use the same scale.

X_scaled.aggregate(["mean", "std"]).astype(int)
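As a sketch of what the scaler does, the following reproduces StandardScaler by hand on a tiny made-up frame. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature frame with very different scales.
toy = pd.DataFrame(
    {"revenue": [100.0, 300.0, 500.0, 700.0], "nights": [1.0, 2.0, 3.0, 6.0]}
)

# StandardScaler computes z = (x - mean) / std using the population std (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(np.allclose(manual, scaled))
```

Note that pandas computes std with ddof=1 by default, so the std reported for scaled data by aggregate is slightly above 1 and is truncated to 1 by astype(int).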

Now we are going to calculate the inertia and the silhouette score for different numbers of clusters.

n_clusters = range(2, 13)
inertia_errors = []
silhouette_scores = []
for k in n_clusters:
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
    # Fit the model
    model.fit(X)
    inertia_errors.append(model.named_steps["kmeans"].inertia_)
    silhouette_scores.append(
        silhouette_score(X, model.named_steps["kmeans"].labels_)
    )
print("Inertia:", inertia_errors[:3])
print()
print("Silhouette Scores:", silhouette_scores[:3])
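The loop above passes the unscaled X to silhouette_score, while KMeans inside the pipeline clusters the standardized data. A possible variant, shown here as a sketch and not as the method used in this notebook, scores the clusters in the same standardized space. The data is synthetic (three well-separated blobs from make_blobs) and the names X_demo, scores and best_k are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X: three well-separated blobs (an assumption for illustration).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    pipe = make_pipeline(
        StandardScaler(), KMeans(n_clusters=k, n_init=10, random_state=42)
    )
    pipe.fit(X_demo)
    # Transform with the fitted scaler so the score uses the space KMeans clustered in.
    X_demo_scaled = pipe.named_steps["standardscaler"].transform(X_demo)
    scores[k] = silhouette_score(X_demo_scaled, pipe.named_steps["kmeans"].labels_)

# The k with the highest silhouette score is a reasonable first candidate.
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On real data the maximum is rarely this clear, so the silhouette score is usually read together with the elbow of the inertia curve.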

Now we plot the inertia error vs the number of clusters.

fig = px.line(
    x=n_clusters,
    y=inertia_errors,
    title="K-Means Model: Inertia vs Number of Clusters"
)
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Inertia")
fig.show()

Now we plot the silhouette scores vs the number of clusters.

# Create a line plot of `silhouette_scores` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=silhouette_scores,
    title="K-Means Model: Silhouette Score vs Number of Clusters"
)
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Silhouette Score")
fig.show()

Analyzing both graphs, we can see that the best number of clusters is 5. So now we build and train a new k-means model named final_model.

final_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=5, random_state=42)
)
final_model.fit(X)

Communicate

Now we extract the labels of final_model and assign them to the variable labels.

labels = final_model.named_steps["kmeans"].labels_
print(labels[:5])

Now we create a dataframe named xgb. It contains the mean value of each feature in X for each cluster of our final model.

xgb = X.groupby(labels).mean()
xgb
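The groupby call above groups by an array and not by a column. As a small sketch on made-up values (the frame and labels here are hypothetical), this is how it behaves:

```python
import numpy as np
import pandas as pd

# Hypothetical features and cluster labels, one label per row.
toy = pd.DataFrame(
    {
        "LodgingRevenue": [100.0, 200.0, 900.0, 1100.0],
        "OtherRevenue": [10.0, 30.0, 80.0, 120.0],
    }
)
toy_labels = np.array([0, 0, 1, 1])

# Grouping by an array matches labels to rows by position, not by index value.
means = toy.groupby(toy_labels).mean()
print(means)
```

Because the match is positional, the rows must be in the same order as when the model was fitted, which holds here since labels comes from fitting on X itself.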

Finally, we make a side-by-side bar chart from xgb. It shows the mean of each feature in X for each of the clusters in the final model.

# Create side-by-side bar chart of `xgb`
fig = px.bar(
    xgb,
    barmode="group",
    title="Hotel Clients by Cluster"
)
fig.update_layout(xaxis_title="Cluster", yaxis_title="Value [$]")
fig.show()

We can see that Lodging Revenue is the highest in cluster 2. So we can analyze what happens with Other Revenue in the clusters.

# Work on a copy so that adding the labels column does not modify df
df_labels = df.copy()
df_labels["labels"] = labels
df_labels_or = df_labels.groupby("labels")["OtherRevenue"].mean()

fig = px.bar(
    df_labels_or,
    barmode="group",
    title="Hotel Clients by Cluster"
)
fig.update_layout(xaxis_title="Cluster", yaxis_title="Value [$]")
fig.show()

In the graph we can see that the mean Other Revenue is also the highest in cluster 2.

To understand why this is happening, we can look at the other features.

df_labels_summary = (
    df_labels.select_dtypes(include="number")
    .groupby("labels")
    .aggregate(["mean", "std"])
    .astype(int)
)

In the next table, we can see the mean and the standard deviation of the numerical features.

df_labels_summary

If we look at cluster 2, we can see that its mean for Persons Nights is the highest. This could be the reason why Lodging Revenue and Other Revenue are the highest for cluster 2.

Now we create a PCA transformer to reduce the dimensionality of X to two columns, PC1 and PC2.

pca = PCA(n_components=2, random_state=42)
X_t = pca.fit_transform(X)
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])
print("X_pca shape:", X_pca.shape)
X_pca.head()
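To judge how much information the two components keep, one option is to inspect explained_variance_ratio_ on the fitted transformer. The sketch below uses random made-up data in place of X; the name X_demo and the feature scales are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical data: 200 rows, 5 features, the first with a much larger scale.
X_demo = rng.normal(size=(200, 5)) * np.array([100.0, 10.0, 1.0, 1.0, 1.0])

pca_demo = PCA(n_components=2, random_state=42)
X_demo_t = pca_demo.fit_transform(X_demo)

# Fraction of the total variance captured by each retained component.
ratios = pca_demo.explained_variance_ratio_
print("Shape:", X_demo_t.shape)
print("Explained variance ratio:", ratios.round(3))
print("Total explained:", round(float(ratios.sum()), 3))
```

PCA is sensitive to scale: in this toy data the first feature dominates PC1 simply because its values are larger. If that is not desired, the data can be standardized before applying PCA.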

Finally we are going to make a scatter plot with the data we obtain with the PCA transformer.

fig = px.scatter(
    data_frame=X_pca,
    x="PC1",
    y="PC2",
    color=labels.astype(str),
    title="PCA Representation of Clusters"
)
fig.update_layout(xaxis_title="PC1", yaxis_title="PC2")
fig.show()
