Hotel Customers
Import libraries.
Read the data.
The data was obtained from Kaggle. "The data comprehends three full years of customer personal, behavioral, demographic, and geographical information."
The data frame head.
Explore
We must drop the rows where Age is null.
Age and AverageLeadTime also contain negative values, which are not valid, so we must remove those rows as well.
The dtype of ID is int64. Because ID is an identifier and not a quantity, we change it to str.
We apply these changes to the data frame.
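The cleaning steps above can be sketched as follows. This is a minimal illustration on a made-up data frame, not the original notebook cell: only the column names ID, Age and AverageLeadTime come from the text, and the values are invented.

```python
import numpy as np
import pandas as pd

# Tiny invented frame; only the column names come from the text.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [34.0, np.nan, -7.0, 51.0, 28.0],
    "AverageLeadTime": [10, 45, 3, -1, 90],
})

# Drop rows with a missing Age.
df = df.dropna(subset=["Age"])

# Keep only rows where both Age and AverageLeadTime are non-negative.
# .copy() gives an independent frame so the assignment below is unambiguous.
df = df[(df["Age"] >= 0) & (df["AverageLeadTime"] >= 0)].copy()

# ID is an identifier, so store it as a string instead of an integer.
df["ID"] = df["ID"].astype(str)

print(df)
```

Only the rows with ID 1 and 5 survive here: one row has a null Age, one a negative Age, and one a negative AverageLeadTime.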
We drop the columns that are not needed.
Age
Now we explore the age of the hotel customers.
First, we build a dict that groups the ages into categories, so we can plot the number of customers in each category.
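One possible way to build such a dict is with pd.cut. In this sketch the bin edges and every label except "Under 35" are assumptions made for illustration, and the ages are invented.

```python
import pandas as pd

# Invented ages for illustration.
ages = pd.Series([22, 34, 35, 47, 58, 63, 71, 29])

# Assumed bin edges and labels; only "Under 35" is named in the text.
bins = [0, 35, 50, 65, 200]
labels = ["Under 35", "35-49", "50-64", "65 and over"]

# right=False makes each bin [lower, upper), so age 35 falls in "35-49".
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Count customers per category (in category order) and store the result as a dict.
age_counts = age_group.value_counts(sort=False).to_dict()

print(age_counts)
```

The resulting dict can be passed straight to a bar chart, with the keys as category names and the values as bar heights.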
We can see that the category with the highest number of customers is Under 35.
Now we plot a histogram of Age.
Market Segment
The category with the most customers is Other, followed by Travel Agent / Operator.
Nationalities
This graph shows the 15 most common nationalities among hotel guests.
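The counts behind a chart like this can be obtained with value_counts. This is a sketch only: the column name Nationality is an assumption, and the country codes are invented.

```python
import pandas as pd

# "Nationality" is an assumed column name; the codes below are invented.
df = pd.DataFrame(
    {"Nationality": ["PRT", "FRA", "PRT", "DEU", "FRA", "PRT", "ESP"]}
)

# value_counts sorts from most to least frequent; head(15) keeps at most 15.
top_nationalities = df["Nationality"].value_counts().head(15)

print(top_nationalities)
```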
Features used for the model
Next we choose the features to use for clustering. We select the numerical features that have the largest variance.
First we select the columns of numeric type and store them in a new data frame called df_number.
We calculate the variance of each feature and keep the 10 features with the largest variance.
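A minimal sketch of these two steps, assuming a small invented data frame. The column spellings (LodgingRevenue, OtherRevenue, PersonsNights) and the name top_ten_var are assumptions; df_number is the name used in the text.

```python
import pandas as pd

# Invented data; the column spellings are assumed.
df = pd.DataFrame({
    "ID": ["1", "2", "3", "4"],
    "LodgingRevenue": [100.0, 250.0, 0.0, 900.0],
    "OtherRevenue": [10.0, 30.0, 0.0, 80.0],
    "PersonsNights": [2, 4, 0, 10],
})

# Keep only the numeric columns (the string ID column is dropped).
df_number = df.select_dtypes("number")

# Variance per column, sorted ascending; tail(10) keeps the ten largest.
top_ten_var = df_number.var().sort_values().tail(10)

print(top_ten_var)
```

Sorting in ascending order puts the largest variance last, which is convenient for a horizontal bar chart because the longest bar ends up at the top.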
Now we make a horizontal bar chart of the ten features with the largest variance.
Now we create a boxplot of Lodging Revenue to see if the values are skewed.
The data is right-skewed and contains outliers. Outliers inflate the ordinary variance, so we use the trimmed variance instead, which discards the most extreme values before the variance is computed.
We therefore calculate the trimmed variance of the numeric features in df_number and create a series, top_ten_trim_var, with the 10 features that have the largest trimmed variance.
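To make the idea concrete, here is a hand-written sketch of a trimmed variance. The helper trimmed_var below is hypothetical, and both the 10% trimming proportion and the use of the population variance (ddof=0) are assumptions, since the text does not state them. The data is invented.

```python
import numpy as np
import pandas as pd


def trimmed_var(values, proportion=0.1):
    """Variance after dropping `proportion` of the values from each tail."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * len(ordered))  # number of values cut from each end
    trimmed = ordered[k: len(ordered) - k]
    return trimmed.var()  # population variance (ddof=0)


# Invented data with one extreme outlier in LodgingRevenue.
df_number = pd.DataFrame({
    "LodgingRevenue": [100, 120, 110, 130, 90, 105, 115, 125, 95, 5000],
    "PersonsNights": [2, 4, 3, 5, 1, 3, 4, 5, 2, 6],
})

# Trimmed variance per column, sorted ascending; keep the ten largest.
top_ten_trim_var = df_number.apply(trimmed_var).sort_values().tail(10)

print(top_ten_trim_var)
print(df_number["LodgingRevenue"].var())  # ordinary variance, for comparison
```

With the single outlier of 5000 removed by trimming, the variance of LodgingRevenue drops by several orders of magnitude, which is why the ranking of features can change once the trimmed variance is used.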
Now we make a horizontal bar chart of top_ten_trim_var.
Then we make a list, high_var_cols, with the five features that have the highest trimmed variance.
Split
We create a matrix called X that contains the columns in high_var_cols.
Build Model
Iterate
Now we put all the features in X on the same scale.
The next table summarizes the features that we are going to standardize.
We create a StandardScaler and use it to transform the data in X. We store the result in X_scaled.
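The standardization step can be sketched like this, using scikit-learn's StandardScaler on an invented feature matrix. The column spellings are assumptions; X and X_scaled follow the names used in the text.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented feature matrix; the column spellings are assumed.
X = pd.DataFrame({
    "LodgingRevenue": [100.0, 250.0, 0.0, 900.0, 400.0],
    "PersonsNights": [2.0, 4.0, 0.0, 10.0, 6.0],
})

# Fit the scaler and transform X so each column has mean 0 and unit variance.
scaler = StandardScaler()
X_scaled_data = scaler.fit_transform(X)

# Wrap the NumPy result in a data frame to keep column names and index.
X_scaled = pd.DataFrame(X_scaled_data, columns=X.columns, index=X.index)

print(X_scaled.describe())
```

Scaling matters for k-means because the algorithm relies on Euclidean distance, so a feature measured in large units would otherwise dominate the clustering.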
Now all the features use the same scale.
Now we calculate the inertia and the silhouette score for different numbers of clusters.
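A sketch of such a loop, under stated assumptions: the data is synthetic (three well-separated blobs), the range of 2 to 7 clusters is chosen for illustration, and the list names inertia_errors and silhouette_scores are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three blobs of 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

n_clusters = range(2, 8)  # assumed range of k values
inertia_errors = []
silhouette_scores = []

for k in n_clusters:
    # Fit k-means with k clusters on the scaled data.
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X_scaled)
    # Inertia: sum of squared distances to the nearest cluster center.
    inertia_errors.append(model.inertia_)
    # Silhouette score: in [-1, 1], higher means better-separated clusters.
    silhouette_scores.append(silhouette_score(X_scaled, model.labels_))

for k, inertia, sil in zip(n_clusters, inertia_errors, silhouette_scores):
    print(k, round(inertia, 2), round(sil, 3))
```

Inertia tends to decrease as k grows, so we look for the "elbow" where the improvement flattens, and we use the silhouette score as a second opinion.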
Now we plot inertia against the number of clusters.
Then we plot the silhouette score against the number of clusters.
From both plots we can see that the best number of clusters is 5, so we build and train a new k-means model named final_model.
Communicate
Now we extract the labels from final_model and assign them to the variable labels.
Then we create a data frame named xgb, which contains the mean value of each feature in X for each cluster of the final model.
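The last few steps can be sketched together. The data here is random, the column spellings are assumed, and fitting on scaled features while averaging in the original units is an assumption about how the notebook is wired; final_model, labels and xgb follow the names in the text.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random, right-skewed data for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "LodgingRevenue": rng.gamma(2.0, 150.0, size=200),
    "OtherRevenue": rng.gamma(2.0, 40.0, size=200),
})

# Assumed wiring: cluster on the standardized features.
X_scaled = StandardScaler().fit_transform(X)
final_model = KMeans(n_clusters=5, n_init=10, random_state=42)
final_model.fit(X_scaled)

# One integer label (0 to 4) per row of X.
labels = final_model.labels_

# Mean of each original (unscaled) feature within each cluster.
xgb = X.groupby(labels).mean()

print(xgb)
```

Grouping the unscaled X keeps the cluster means in the original units, which makes the bar chart easier to read.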
Finally, we make a side-by-side bar chart from xgb. It shows the mean of each feature in X for each of the clusters in the final model.
We can see that cluster 2 has the highest mean Lodging Revenue, so next we look at what happens with Other Revenue across the clusters.
In this chart we can see that cluster 2 also has the highest mean Other Revenue.
To understand why this happens, we look at the other features.
In the next table, we can see the mean and the standard deviation of the numerical features.
Looking at cluster 2, we can see that its mean for Persons Nights is the highest of all clusters. This could be the reason why Lodging Revenue and Other Revenue are the highest for cluster 2.
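A per-cluster table of means and standard deviations like the one discussed above can be produced with groupby and agg. This sketch assumes the table is grouped by cluster label; the data, the labels and the name summary are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented numeric features and cluster labels.
df_number = pd.DataFrame({
    "PersonsNights": [2, 4, 10, 12, 3],
    "LodgingRevenue": [100.0, 200.0, 900.0, 1100.0, 150.0],
})
labels = np.array([0, 0, 1, 1, 0])

# Mean and standard deviation of every feature within each cluster.
summary = df_number.groupby(labels).agg(["mean", "std"])

print(summary)
```

The result has one row per cluster and a two-level column index, so a single value can be read with, for example, summary.loc[1, ("PersonsNights", "mean")].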
Now we create a PCA transformer to reduce the dimensionality of X to two columns, PC1 and PC2.
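A minimal sketch of this step with scikit-learn's PCA. The input is random data with five hypothetical feature columns, and the name X_pca is an assumption; PC1 and PC2 are the column names given in the text.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random data with five hypothetical feature columns.
rng = np.random.default_rng(1)
X = pd.DataFrame(
    rng.normal(size=(100, 5)),
    columns=[f"feature_{i}" for i in range(5)],
)

# Project X onto its first two principal components.
pca = PCA(n_components=2, random_state=42)
X_t = pca.fit_transform(X)

# Wrap the result in a data frame with the column names PC1 and PC2.
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"], index=X.index)

print(X_pca.head())
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

Checking explained_variance_ratio_ is a quick way to judge how much of the original variation the two-dimensional view preserves.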
Finally, we make a scatter plot of the data obtained from the PCA transformer.