Cambria USA Clustering Analysis
Authored by Max Grabowski
We aim to understand the distinct behaviors and needs of primary user types: trade visitors, end users, and everything in between. Currently, we lack detailed insights into how these groups interact differently with our website. This segmentation will enable us to customize our marketing strategies, improve user experience, and ultimately, boost conversions and customer satisfaction.
Big Query Integration
'Client' Data Table
Data Inspection
The dataset contains 56 columns. Here is a brief summary of the first few rows:
client_id: This seems to be a unique identifier for each client. Some values are NaN, indicating missing data. country: This is the country in which the client is located. state: This is the state in which the client is located. city: This is the city in which the client is located. device: This indicates the device type (e.g., mobile, desktop) used by the client. brand_name: This indicates the brand of the device used by the client. model_name: This indicates the model of the device used by the client. operating_system: This indicates the operating system of the device used by the client. operating_system_version: This indicates the version of the operating system of the device used by the client. total_event_count: This indicates the total number of events associated with the client. ...: There are numerous other columns, including various metrics related to the user's engagement, bounce rate, average time spent on different types of pages, and a categorical variable indicating the level of engagement.
The dataset contains both numerical and categorical variables. Some variables, such as client_id, state, city, brand_name, and country, have missing values that we might need to handle depending on our analysis.
For numerical features, we can generate histograms or boxplots to understand their distribution. For categorical features, we can use bar plots or pie charts to visualize the number of occurrences for each category.
Correlation heatmaps might be useful for understanding the relationships between numerical features. However, keep in mind that clustering algorithms such as K-Means work based on distances between data points in a multi-dimensional space, and correlations may not always provide useful insights for such algorithms.
Before we proceed with the visualization, it might be useful to deal with the missing values. Depending on the nature of the missing data, we could either fill them with a default value, use a method like mean or median imputation, or drop the rows/columns with missing values altogether. The strategy would depend on the context and the amount of missing data.
Considering the size of the dataset and the fact that we're only performing exploratory data analysis at this stage, I suggest we drop the rows with missing values for the sake of simplicity.
Cleaning the Data
EDA
Heat Map
Heatmap of Feature Correlations: This heatmap shows the correlations between the numerical features in the dataset. Lighter colors indicate a strong positive correlation, while darker colors indicate a strong negative correlation. It's important to note that correlation does not imply causation, and these correlations might not necessarily be helpful for clustering purposes.
Categorical EDA
Visuals to show CUSA users operating system and device type
After grouping by the necessary columns to create new datasets I created bar charts to dive deeper into understanding the categorical data. first grouping the data by Operating System, and then by Device type.
Numerical EDA
It is clear that numeric data is the primary way that we separate consumers from one another so we spent more time exploring the numbers before going through with the clustering, Including:
Consumer Specific: Dealer Locator, Planning, Finance, and Inspiration Pageviews
Professional Pageviews: Professional, Commercial, and Specify Pageviews
Commercial or Professional: Cambia Style and Samples Pageviews
Key Takeaways: professionals tend to use desktops; whereas, consumers lean more towards mobile devices. More purchases are made using Desktops by professionals
Sessions
Average Sessions by Operating System
Events
Purchases
Purchase by Device
Desktops remain the clear leader in most measures. However, the smart tv activity is what peaks my interest the most; especially because this data excludes employees
Pageviews
Pageviews is an important indicator of who is visiting the CUSA website, this is because not every page is designed for every user; some pages are there purely for end user consumers and some are there for professionals such as trade professionals or designers. With that in mind we looked at several different metrics that we felt strongly indicated one user or the other.
Average Pageviews per session
Consumer Specific: Dealer Locator, Planning, Finance, and Inspiration Pageviews
Professional Pageviews: Professional, Commercial, and Specify Pageviews
Commercial or Professional: Cambia Style and Samples Pageviews
Pageviews by Device
Average Pageviews per Session
Consumer Specific Pageviews
Professional Specific Pageviews
Processing
Earlier in the code we identified the data types of each column. In order to successfully run a k-means cluster analysis we also need to identify what data is categorical vs numerical; this will help in a few steps when we prepare the separate categorical and numerical data columns for the clustering
Feature Importance
Principal Component Analysis (PCA):
PCA is a dimensionality reduction technique that can help identify which features account for the most variance in the data. While the principal components themselves may not directly correspond to the original features, the magnitude of the PCA loadings can give insights into which features are most important.
In summary, each component captures different aspects of the data:
Component 1 primarily emphasizes features related to session engagement, pageviews, and event counts. Features like engaged_sessions, total_pageviews, and total_event_count have strong positive associations. On the other hand, bounce_rate has a notable negative association.
Component 2 seems to capture variance related to user properties, particularly those in the United States using Apple devices. The presence of features like country_United States, brand_name_Apple, and device_mobile suggests this component might be distinguishing between device brands and geographic regions.
Component 3 focuses on specific pages and engagement times, with features like avg_time_on_pro_pages and commercial_pages_engagement_time having strong positive associations.
In summary, each component captures different aspects of the data:
Component 1: Overall user engagement metrics. Component 2: User properties and device characteristics. Component 3: Engagement on specific types of pages.
Clusters Give the Data Meaning
In order to find if there are groups of users that stand out in certain areas of the CUSA Website we decided to use K-Means clustering. K-Means clustering is simple to implement, easy to use on large sets of data and assures convergence. A cluster of four was chosen as the optimal number of clusters as shown by the K-Means Inertia Elbow Chart on Web User Data. The following cluster_groups data frame was created to continue investigation in to what makes these clusters unique.
Assign Clusters to Dataframe
Silhouette Score
'Pageview' Data Table
Data Inspection
The dataset contains the following columns:
client_id: This seems to be a unique identifier for each client. Some values are NaN, indicating missing data. country: This is the country in which the client is located. state: This is the state in which the client is located. city: This is the city in which the client is located. device: This indicates the device type (e.g., mobile) used by the client. device_operating_system_version: This indicates the version of the operating system of the device used by the client. Device_Browser: This represents the browser used by the client. page_location: This indicates the URL of the page visited by the client. trade_lead_form_submit: This is a count of trade lead form submissions by the client. residentialconsumer_lead_form_submit: This is a count of residential consumer lead form submissions by the client. contactus_form_submit: This is a count of contact us form submissions by the client. purchase: This is a count of purchases made by the client. custom_session_number: This seems to be a count of sessions by the client. pageview_timestamp: This is the timestamp of the pageview. page_count: This is a count of pages viewed by the client. Traffic_Name: This represents the traffic name for the client's session. Some values are NaN, indicating missing data. Traffic_Source: This represents the traffic source for the client's session. Some values are NaN, indicating missing data. Traffic_Medium: This represents the traffic medium for the client's session. Some values are NaN, indicating missing data.
The dataset contains a mix of data types including numerical (int64, float64) and categorical (object) variables. Some columns, such as client_id, country, state, city, device_operating_system_version, Traffic_Name, Traffic_Source, and Traffic_Medium, contain missing values.
We can start our exploratory data analysis by plotting the distribution of some numerical and categorical features.
Since some columns have missing values, it's important to decide how to handle these before proceeding with the analysis. As you've previously mentioned, we could drop the rows with missing data for specific columns and replace the missing values in other columns with '0'. However, in this case, given the nature of the missing data (mostly categorical), replacing missing values with '0' might not be appropriate. We could replace them with a specific value like 'Unknown' instead.
Let's start by handling the missing data and then proceed with the visualizations.
Cleaning The Newly Clustered Data
Importing Data to Zip files and CSV files
I imported the two datasets into CSV files so that the clusters are saved; importing to CSV files means the clusters are saved and I do not have to run the k means clustering code every time I open the notebook.
Next we take the clustered dataset and do another round of exploratory data analysis that will help point out and assign characteristics that define each of the four clusters
Distribution of different Devices, Brand Names, Operating Systems, and Engagement Levels across the clusters
Distribution of Total Event Count and Total Sessions across clusters
Device Types
Browser Type
General information to be used later
this table displays numerical data grouped by cluster that can help indicate what to create visuals for in order to tell the best story
Engagement Level
RadViz
A RadViz Plot is a multivariate data visualization algorithm that places each feature dimension uniformly around the circumference of a circle. It then plots points on the interior of the circle, normalizing their values along the axes extending from the center to each arc. This approach enables the visualization of numerous dimensions that can fit on the circle, significantly expanding the visualization's dimensionality. We employed this method to identify separability among clusters. Data points exhibiting extreme characteristics will appear on the outer edges of the circle, while those with less pronounced differences will mostly be located closer to the circle's center.
Consumer Pageviews
Professional Pageviews
In order to help identify the clusters we used Engagement Time with the understanding that professionals are going to have a more straight line approach to using the website; consumers are going to have a more randomized approach because they often aren't as familiar with the website, so they take more time to explore
Pageviews
Now that we have found a pattern in the engagement time of each cluster, we can look at pageviews to start assigning characteristics to the clusters.
Assigning Characteristics With GA4 Metrics
The first subplot shows metrics are averaged to eliminate duplicates from after clustering. the second subplot are metrics that could be totaled because there was not threat of duplicates