Cambria USA Clustering Analysis

Authored by Max Grabowski

We aim to understand the distinct behaviors and needs of primary user types: trade visitors, end users, and everything in between. Currently, we lack detailed insights into how these groups interact differently with our website. This segmentation will enable us to customize our marketing strategies, improve user experience, and ultimately, boost conversions and customer satisfaction.

!pip install watermark
%reload_ext watermark

# Necessary installations
!pip install plotnine

# Basic libraries
import pandas as pd
import numpy as np
import datetime as dt
import re
import collections
from collections import defaultdict
import json

# Visualization libraries
from plotnine import *
import plotnine
import plotnine as p9
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from datetime import timedelta
import plotly.graph_objects as go
import plotly.subplots as sp

# Clustering libraries
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import silhouette_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Clustering EDA
from pandas.plotting import radviz

# Ignore warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

# Format scientific notation from pandas
pd.set_option('display.float_format', lambda x: '%.3f' % x)

print("Import Libraries - Complete")

Big Query Integration

'Client' Data Table

SELECT *
FROM `analytics_329819782`.`skunkworks_client_level`
WHERE formatted_first_touch > '2023-04-29 00:00:00'
ORDER BY RAND()
LIMIT 250000

Data Inspection

The dataset contains 56 columns. Here is a brief summary of the first few rows:

The dataset contains both numerical and categorical variables. Some variables, such as client_id, state, city, brand_name, and country, have missing values that we might need to handle depending on our analysis.

For numerical features, we can generate histograms or boxplots to understand their distribution. For categorical features, we can use bar plots or pie charts to visualize the number of occurrences for each category.

Correlation heatmaps might be useful for understanding the relationships between numerical features. However, keep in mind that clustering algorithms such as K-Means work based on distances between data points in a multi-dimensional space, and correlations may not always provide useful insights for such algorithms.

Before we proceed with the visualization, it might be useful to deal with the missing values. Depending on the nature of the missing data, we could either fill them with a default value, use a method like mean or median imputation, or drop the rows/columns with missing values altogether. The strategy would depend on the context and the amount of missing data.

Considering the size of the dataset and the fact that we're only performing exploratory data analysis at this stage, I suggest we drop the rows with missing values for the sake of simplicity.
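A minimal sketch of that row-dropping step. The toy frame `df` and its values are made up for illustration; `cleaned_data` matches the name used later in this notebook, but the exact cleaning code is an assumption.

```python
import pandas as pd

# Toy stand-in for the client-level table (values are illustrative only)
df = pd.DataFrame({
    "client_id": ["a1", "b2", None, "d4"],
    "state": ["MN", None, "WI", "IA"],
    "total_sessions": [3, 1, 2, 5],
})

# Share of missing values per column, to see what dropping rows will cost
missing_share = df.isna().mean()

# Drop every row that has at least one missing value
cleaned_data = df.dropna().reset_index(drop=True)

print(missing_share)
print(f"Rows kept: {len(cleaned_data)} of {len(df)}")
```

Checking `missing_share` first is a cheap way to confirm that dropping rows does not discard most of the sample.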

Cleaning the Data

EDA

Heat Map

Heatmap of Feature Correlations: This heatmap shows the correlations between the numerical features in the dataset. Lighter colors indicate a strong positive correlation, while darker colors indicate a strong negative correlation. It's important to note that correlation does not imply causation, and these correlations might not necessarily be helpful for clustering purposes.
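As a rough illustration, the matrix behind such a heatmap can be computed with pandas alone. The toy data below is invented; only the column names are borrowed from this notebook.

```python
import pandas as pd

# Toy data: pageviews rise with sessions, bounce rate falls with sessions
toy = pd.DataFrame({
    "total_sessions": [1, 2, 3, 4, 5],
    "total_pageviews": [2, 4, 6, 8, 10],
    "bounce_rate": [0.9, 0.7, 0.5, 0.3, 0.1],
    "device": ["mobile", "desktop", "mobile", "tablet", "desktop"],
})

# Keep numeric columns only; Pearson correlation is not defined for strings
corr = toy.select_dtypes(include="number").corr()

print(corr.round(2))
```

The resulting `corr` frame can be passed to `sns.heatmap(corr, annot=True)` to draw the chart. Which end of the color scale is light or dark depends on the colormap chosen.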

Categorical EDA

Visuals to show CUSA users operating system and device type

After grouping by the necessary columns to create new datasets, I created bar charts to better understand the categorical data, grouping first by operating system and then by device type.

Device with the highest average total event count, pageviews, and sessions: Desktops

Browser with the highest average total event count, pageviews, and sessions: Tizen browser (Samsung)
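The grouping step might look like the sketch below. The names `df_device` and `df_ops` are the ones the later plotting cells use, but how they were actually built is not shown in this notebook, so the aggregation (a per-group mean) and the toy values are assumptions.

```python
import pandas as pd

# Toy user-level rows (values are illustrative only)
df = pd.DataFrame({
    "device": ["desktop", "mobile", "desktop", "tablet", "mobile"],
    "operating_system": ["Windows", "iOS", "Macintosh", "iOS", "Android"],
    "total_event_count": [40, 10, 60, 20, 14],
    "total_pageviews": [12, 3, 18, 6, 5],
    "total_sessions": [4, 1, 6, 2, 3],
})
metrics = ["total_event_count", "total_pageviews", "total_sessions"]

# One row per device / operating system, holding the average of each metric
df_device = df.groupby("device", as_index=False)[metrics].mean()
df_ops = df.groupby("operating_system", as_index=False)[metrics].mean()

# Device with the highest average number of sessions
top_device = df_device.sort_values("total_sessions", ascending=False).iloc[0]["device"]

print(df_device)
print("Highest average sessions:", top_device)
```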

Numerical EDA

Numeric data is the primary way we separate consumers from one another, so we spent more time exploring the numbers before clustering, including:

Total sessions

Total Events

Total Purchases

Total Pageviews: average pageviews per session, plus consumer-specific and professional-specific pageviews, grouped according to our hypotheses going into the analysis.

Consumer Specific: Dealer Locator, Planning, Finance, and Inspiration Pageviews

Professional Pageviews: Professional, Commercial, and Specify Pageviews

Commercial or Professional: Cambria Style and Samples Pageviews

Key Takeaways: Professionals tend to use desktops, whereas consumers lean more towards mobile devices. More purchases are made by professionals using desktops.
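One way to turn the page groups above into per-user features is sketched below. The column names come from the client table; the mapping of "Planning" and "Finance" to `total_planningcare_pageviews` and `total_financebycambria_pageviews`, the derived columns, and the toy values are assumptions made for illustration.

```python
import pandas as pd

consumer_cols = [
    "total_dealerlocator_pageviews",
    "total_planningcare_pageviews",
    "total_financebycambria_pageviews",
    "total_inspiration_pageviews",
]
professional_cols = [
    "total_professionals_pageviews",
    "total_commercial_pageviews",
    "total_specify_pageviews",
]

# Two toy users: the first browses consumer pages, the second professional pages
users = pd.DataFrame(
    [[3, 1, 0, 2, 0, 0, 0],
     [0, 0, 0, 1, 4, 2, 3]],
    columns=consumer_cols + professional_cols,
)

# Hypothetical derived features: pageviews summed within each page group
users["consumer_pageviews"] = users[consumer_cols].sum(axis=1)
users["professional_pageviews"] = users[professional_cols].sum(axis=1)
users["leans_professional"] = users["professional_pageviews"] > users["consumer_pageviews"]

print(users[["consumer_pageviews", "professional_pageviews", "leans_professional"]])
```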

Sessions

df_device_sorted = df_device.sort_values('total_sessions', ascending=False)

plt.figure(figsize=(8, 4.8))

# Create the bar plot
plt.bar(df_device_sorted['device'], df_device_sorted['total_sessions'], color='#c59617')

# Add text labels
for i, v in enumerate(df_device_sorted['total_sessions']):
    if v >= 1e6:
        formatted_label = f"{v/1e6:.1f}m"
    elif v >= 1e3:
        formatted_label = f"{v/1e3:.1f}k"
    else:
        formatted_label = str(v)
    plt.text(i, v, formatted_label, ha='center', va='bottom', size=10)

# Set labels and title
plt.title('Sessions by Device Type')
plt.xlabel('Device')
plt.ylabel('Total Sessions')
plt.gca().set_yticklabels([])
plt.show()

Average Sessions by Operating System

top_5_df = df_ops.sort_values(by='total_sessions', ascending=False).head(5)

# Set the figure size
plt.figure(figsize=(8, 4.8))

# Create the bar plot
plt.bar(top_5_df['operating_system'], top_5_df['total_sessions'], color='#c59617')

# Add data labels above the bars
for i, v in enumerate(top_5_df['total_sessions']):
    formatted_label = f"{v:.2f}"  # Format the label to display only 2 decimal places
    plt.text(i, v, formatted_label, ha='center', va='bottom')

# Set labels and title
plt.title('Top 5 Average Total Sessions by Operating System')
plt.xlabel('Operating System')
plt.ylabel('Total Sessions')

# Remove y-axis labels
plt.gca().set_yticklabels([])

# Show the plot
plt.show()

Events

Purchases

Purchase by Device

Desktops remain the clear leader in most measures. However, the smart TV activity is what piques my interest the most, especially because this data excludes employees.

Pageviews

Pageviews are an important indicator of who is visiting the CUSA website, because not every page is designed for every user: some pages are aimed purely at end-user consumers, and others at professionals such as trade professionals or designers. With that in mind, we looked at several metrics that we felt strongly indicated one user type or the other.

Average Pageviews per session

Consumer Specific: Dealer Locator, Planning, Finance, and Inspiration Pageviews

Professional Pageviews: Professional, Commercial, and Specify Pageviews

Commercial or Professional: Cambria Style and Samples Pageviews

Pageviews by Device

device_names = df_device['device']
total_pageviews = df_device['total_pageviews']
sorted_indices = total_pageviews.argsort()[::-1]
sorted_device_names = device_names[sorted_indices]
sorted_total_pageviews = total_pageviews[sorted_indices]

fig, ax = plt.subplots(figsize=(8, 4.8))
bar_plot = plt.bar(sorted_device_names, sorted_total_pageviews, color='#c59617')

for i, v in enumerate(sorted_total_pageviews):
    if v >= 1e6:
        formatted_label = f"{v/1e6:.1f}m"
    elif v >= 1e3:
        formatted_label = f"{v/1e3:.1f}k"
    else:
        formatted_label = str(v)
    plt.text(i, v, formatted_label, ha='center', va='bottom', size=10)

ax.set_title('Pageviews by Device')
ax.set_xlabel('Device')
ax.set_ylabel('Total Pageviews')
ax.set_xticks(range(len(sorted_device_names)))
ax.yaxis.set_major_locator(mticker.NullLocator())
ax.yaxis.grid(False)
plt.tight_layout()
plt.show()

Average Pageviews per Session

# Mean pageviews per session for each device, sorted from highest to lowest.
# Sorting the grouped Series directly keeps every value paired with its own
# device label, instead of borrowing labels from a differently ordered frame.
avg_pageviews_s = df.groupby('device')['avg_pageviews_per_session'].mean()
avg_pageviews_s_sorted = avg_pageviews_s.sort_values(ascending=False)

# Create the bar chart using Matplotlib
plt.bar(avg_pageviews_s_sorted.index, avg_pageviews_s_sorted.values, color='#c59617')

# Add data labels
for i, v in enumerate(avg_pageviews_s_sorted.values):
    plt.text(i, v, f"{v:.2f}", ha='center', va='bottom')

plt.xlabel('Device Type')
plt.ylabel('Average Pageviews per Session')
plt.title('Average Pageviews per Session by Device Type')

# Remove y-axis numbers
plt.gca().yaxis.set_major_locator(plt.NullLocator())
plt.tight_layout()

# Display the bar chart
plt.show()

Consumer Specific Pageviews

dl_pageviews = df.groupby('device')['total_dealerlocator_pageviews'].sum()
dl_pageviews_sorted = dl_pageviews.sort_values(ascending=False)

# Create the bar chart using Matplotlib
plt.bar(dl_pageviews_sorted.index, dl_pageviews_sorted.values, color='#c59617')

# Add data labels
for i, v in enumerate(dl_pageviews_sorted.values):
    if v >= 1000:
        formatted_label = f"{v/1000:.1f}k"  # "#.#k" format for values >= 1000
    else:
        formatted_label = str(v)  # Leave the label as is for values < 1000
    plt.text(i, v, formatted_label, ha='center', va='bottom')

plt.xlabel('Device Type')
plt.ylabel('Dealer Locator Pageviews by Device')
plt.title('Dealer Locator Pageviews by Device')

# Remove y-axis numbers
plt.gca().yaxis.set_major_locator(mticker.NullLocator())
plt.tight_layout()

# Display the bar chart
plt.show()

# Sum the planning pageviews per device. The original cell used .count(), which
# counts rows (users) per device rather than pageviews and does not match the
# chart title or the dealer locator chart above.
tpl_pageviews = df.groupby('device')['total_planningcare_pageviews'].sum()
tpl_pageviews_sorted = tpl_pageviews.sort_values(ascending=False)

# Create the bar chart using Matplotlib
plt.bar(tpl_pageviews_sorted.index, tpl_pageviews_sorted.values, color='#c59617')

# Add data labels above the bars
for i, v in enumerate(tpl_pageviews_sorted.values):
    if v >= 1000:
        formatted_label = f"{v/1000:.1f}k"  # "#.#k" format for values >= 1000
    else:
        formatted_label = str(v)  # Leave the label as is for values < 1000
    plt.text(i, v, formatted_label, ha='center', va='bottom', color='black')

plt.xlabel('Device Type')
plt.ylabel('Total Planning Pageviews by Device')
plt.title('Planning Pageviews by Device')

# Remove y-axis labels
plt.gca().set_yticklabels([])
plt.tight_layout()

# Display the bar chart
plt.show()

Professional Specific Pageviews

Processing

Earlier in the code we identified the data type of each column. To run a k-means cluster analysis we also need to know which columns are categorical and which are numerical, since the two groups are prepared separately for clustering.

# Identify the categorical and numerical columns
numeric_cols = cleaned_data.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = cleaned_data.select_dtypes(include=['object']).columns

# Report positions within cleaned_data, the same frame the columns were selected from
print('Numeric columns:')
for col in numeric_cols:
    print(f'Column name: {col}, Position: {cleaned_data.columns.get_loc(col)}')

print('\nCategorical columns:')
for col in categorical_cols:
    print(f'Column name: {col}, Position: {cleaned_data.columns.get_loc(col)}')

numeric_cols = [
    'total_event_count', 'total_sessions', 'total_pageviews', 'total_purchases',
    'total_trade_lead_form_submit', 'total_residentialconsumer_lead_form_submit',
    'total_contactus_form_submit', 'total_dealerlocator_pageviews',
    'total_professionals_pageviews', 'total_commercial_pageviews',
    'total_specify_pageviews', 'total_case_studies_pageviews',
    'total_pro_locations_pageviews', 'total_design_palette_pageviews',
    'total_inspiration_pageviews', 'total_financebycambria_pageviews',
    'total_planningcare_pageviews', 'total_cambriastyle_pageviews',
    'total_samples_pageviews', 'total_galleries_pageviews',
    'pro_pages_engagement_time', 'commercial_pages_engagement_time',
    'specify_pages_engagement_time', 'case_studies_pages_engagement_time',
    'pro_locations_pages_engagement_time', 'design_palette_pages_engagement_time',
    'inspiration_pages_engagement_time', 'financebycambria_pages_engagement_time',
    'planningcare_pages_engagement_time', 'samples_pages_engagement_time',
    'galleries_pages_engagement_time', 'dealerlocator_pages_engagement_time',
    'cambriastyle_pages_engagement_time', 'total_engagement_time',
    'engaged_sessions', 'formatted_engagement_time', 'avg_pageviews_per_session',
    'engagement_rate', 'bounces', 'bounce_rate', 'avg_time_on_pro_pages',
    'avg_time_on_design_pages', 'avg_time_on_decision_pages',
    'avg_time_on_loyalty_pages',
]

categorical_cols = [
    'client_id', 'country', 'state', 'city', 'device', 'brand_name',
    'model_name', 'operating_system', 'operating_system_version',
    'engagement_level',
]

Feature Importance

# 1. Load dataset and define preprocessing steps
numerical_features = cleaned_data.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = cleaned_data.select_dtypes(include=['object']).columns.tolist()

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# 2. Apply preprocessing
data_preprocessed = preprocessor.fit_transform(cleaned_data)

# 3. TruncatedSVD
svd_reduced = TruncatedSVD(n_components=500)
svd_data_reduced = svd_reduced.fit_transform(data_preprocessed)

# 4. Extract the top features for the first few components
feature_names = numerical_features + list(
    preprocessor.named_transformers_['cat']
    .named_steps['onehot']
    .get_feature_names_out(categorical_features)
)

num_top_features = 10
num_components_to_display = 3
top_features_per_component = {}

for component_num in range(num_components_to_display):
    loadings = svd_reduced.components_[component_num]
    sorted_idx = np.argsort(np.abs(loadings))[::-1]  # Sort by absolute value
    top_features = [feature_names[i] for i in sorted_idx[:num_top_features]]
    top_loadings = loadings[sorted_idx[:num_top_features]]
    top_features_per_component[f"Component {component_num + 1}"] = list(zip(top_features, top_loadings))

# 5. Visualization
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(15, 15))

for idx, (component, features_loadings) in enumerate(top_features_per_component.items()):
    features, loadings = zip(*features_loadings)
    axes[idx].barh(features, loadings, color='#c59617')
    axes[idx].set_title(component)
    axes[idx].set_xlabel('Loading')
    axes[idx].invert_yaxis()  # Display the highest loading at the top

plt.tight_layout()
plt.show()

Principal Component Analysis (PCA):

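PCA is a dimensionality reduction technique that can help identify which features account for the most variance in the data. The code above uses scikit-learn's TruncatedSVD, a closely related decomposition that does not center the data and therefore works directly on the sparse one-hot encoded matrix. While the components themselves may not directly correspond to the original features, the magnitude of the loadings can give insight into which features are most important.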

The top loadings for the first three components suggest the following:

Component 1 primarily emphasizes features related to session engagement, pageviews, and event counts. Features like engaged_sessions, total_pageviews, and total_event_count have strong positive associations. On the other hand, bounce_rate has a notable negative association.

Component 2 seems to capture variance related to user properties, particularly those in the United States using Apple devices. The presence of features like country_United States, brand_name_Apple, and device_mobile suggests this component might be distinguishing between device brands and geographic regions.

Component 3 focuses on specific pages and engagement times, with features like avg_time_on_pro_pages and commercial_pages_engagement_time having strong positive associations.

In summary, each component captures different aspects of the data:

Component 1: Overall user engagement metrics.
Component 2: User properties and device characteristics.
Component 3: Engagement on specific types of pages.

Clusters Give the Data Meaning

To find out whether groups of users stand out in certain areas of the CUSA website, we used K-Means clustering. K-Means is simple to implement, scales to large datasets, and is guaranteed to converge, though only to a local optimum that depends on the initial centroids. We chose four clusters based on the K-Means inertia elbow chart for the web user data. The following cluster_groups data frame was created to continue investigating what makes these clusters unique.

# Separate numerical and categorical columns
numerical_cols = cleaned_data.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = cleaned_data.select_dtypes(include=['object']).columns

# Create the preprocessing pipelines for both numerical and categorical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Preprocess the data
preprocessed_data = preprocessor.fit_transform(cleaned_data)
preprocessed_data.shape

# Define TruncatedSVD object
svd = TruncatedSVD(n_components=50)

# Apply TruncatedSVD to the data
svd_data = svd.fit_transform(preprocessed_data)

# Check the explained variance ratio
explained_variance_ratio_svd = svd.explained_variance_ratio_
explained_variance_ratio_svd, svd_data.shape

# Define the range of potential clusters
clusters_range = range(1, 10)

# Initialize list to store the SSE for each k
sse = []

# Loop over the range of clusters
for k in clusters_range:
    # Define KMeans model
    kmeans = KMeans(n_clusters=k, random_state=42)
    # Fit the model
    kmeans.fit(svd_data)
    # Append the SSE for k to the sse list
    sse.append(kmeans.inertia_)

# Plot the Elbow Method graph
plt.figure(figsize=(10, 6))
plt.plot(clusters_range, sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared errors')
plt.title('Elbow Method')
plt.show()

# Define KMeans model with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42)

# Fit the model and predict the clusters
clusters = kmeans.fit_predict(svd_data)
clusters[:10]

Assign Clusters to Dataframe

# Add the cluster assignments to the original dataframe
cleaned_data['cluster'] = clusters

# Check the distribution of clusters
cluster_distribution = cleaned_data['cluster'].value_counts().sort_index()

# Display the first few rows of the dataframe with clusters
data_with_clusters = cleaned_data.copy()
data_with_clusters.head(), cluster_distribution

# Calculate the mean for numerical features for each cluster
numerical_summary_by_cluster = data_with_clusters.groupby('cluster')[numerical_cols].mean()

# Calculate the mode for categorical features for each cluster
categorical_summary_by_cluster = data_with_clusters.groupby('cluster')[categorical_cols].agg(lambda x: pd.Series.mode(x)[0])

numerical_summary_by_cluster, categorical_summary_by_cluster

# Create a scatter plot of the two principal components
plt.figure(figsize=(10, 8))
plt.scatter(svd_data[:, 0], svd_data[:, 1], c=clusters, cmap='copper', alpha=0.5)
plt.title('User Clusters (Principal Component plot)')
plt.colorbar(label='Cluster ID')
plt.show()

Silhouette Score

# Determine the size of the sample you want to use. Here we take 20% of the data.
sample_size = int(preprocessed_data.shape[0] * 0.20)

# Use numpy's random.choice to get a random subset of your data
random_indices = np.random.choice(preprocessed_data.shape[0], size=sample_size, replace=False)
data_sample = preprocessed_data[random_indices, :]

# Fit the KMeans model to the data_sample.
# Note: this fits a new model on a sample of preprocessed_data rather than on
# svd_data, so the score describes this refit model, not the cluster labels
# assigned earlier.
kmeans = KMeans(n_clusters=4)  # Define the number of clusters you want
kmeans.fit(data_sample)

# Predictions
y_pred_sample = kmeans.predict(data_sample)

# Compute the silhouette score based on the sample
silhouette_score_sample = silhouette_score(data_sample, y_pred_sample)
print('Silhouette Score: %.3f' % silhouette_score_sample)
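An alternative is to score the labels that were actually assigned, using the built-in subsampling of `silhouette_score`. The sketch below runs on synthetic blobs so that it is self-contained; the data, the blob centers, and the sample size are assumptions made for illustration, not results from this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the reduced feature matrix: four well-separated blobs
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(200, 2)) for c in centers])

# Fit once and keep the labels that will be used downstream
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Score those same labels on a random subsample to keep the cost manageable
score = silhouette_score(X, labels, sample_size=200, random_state=42)
print(f"Silhouette Score: {score:.3f}")
```

In this notebook's terms, the equivalent call would be along the lines of `silhouette_score(svd_data, clusters, sample_size=..., random_state=42)`, which evaluates the clustering stored in the data frame instead of a separately fitted model.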

'Pageview' Data Table

Data Inspection

The dataset contains the following columns:

client_id: This seems to be a unique identifier for each client. Some values are NaN, indicating missing data.
country: This is the country in which the client is located.
state: This is the state in which the client is located.
city: This is the city in which the client is located.
device: This indicates the device type (e.g., mobile) used by the client.
device_operating_system_version: This indicates the version of the operating system of the device used by the client.
Device_Browser: This represents the browser used by the client.
page_location: This indicates the URL of the page visited by the client.
trade_lead_form_submit: This is a count of trade lead form submissions by the client.
residentialconsumer_lead_form_submit: This is a count of residential consumer lead form submissions by the client.
contactus_form_submit: This is a count of contact us form submissions by the client.
purchase: This is a count of purchases made by the client.
custom_session_number: This seems to be a count of sessions by the client.
pageview_timestamp: This is the timestamp of the pageview.
page_count: This is a count of pages viewed by the client.
Traffic_Name: This represents the traffic name for the client's session. Some values are NaN, indicating missing data.
Traffic_Source: This represents the traffic source for the client's session. Some values are NaN, indicating missing data.
Traffic_Medium: This represents the traffic medium for the client's session. Some values are NaN, indicating missing data.

The dataset contains a mix of data types including numerical (int64, float64) and categorical (object) variables. Some columns, such as client_id, country, state, city, device_operating_system_version, Traffic_Name, Traffic_Source, and Traffic_Medium, contain missing values.

We can start our exploratory data analysis by plotting the distribution of some numerical and categorical features.

Since some columns have missing values, we need to decide how to handle them before proceeding with the analysis. One option is to drop the rows with missing data in specific columns and replace the missing values in other columns with '0'. However, because the missing data here is mostly categorical, replacing missing values with '0' might not be appropriate; a specific value such as 'Unknown' is a better fit.

Let's start by handling the missing data and then proceed with the visualizations.
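A minimal sketch of that approach on a toy pageview frame. The choice to drop rows without a `client_id` and to fill only the listed text columns is an assumption; `cleaned_data_pages` is the name the later merge cell expects.

```python
import pandas as pd

# Toy stand-in for the pageview table (values are illustrative only)
pages = pd.DataFrame({
    "client_id": ["a1", None, "c3"],
    "city": ["Minneapolis", None, "Chicago"],
    "Traffic_Source": ["google", None, None],
    "page_count": [3, 1, 2],
})

# Rows without a client_id cannot be joined back to a user, so drop them
cleaned_data_pages = pages.dropna(subset=["client_id"]).copy()

# Label the remaining gaps in text columns explicitly instead of using '0'
text_cols = ["city", "Traffic_Source"]
cleaned_data_pages[text_cols] = cleaned_data_pages[text_cols].fillna("Unknown")

print(cleaned_data_pages)
```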

Cleaning The Newly Clustered Data

merged_data = pd.merge(
    cleaned_data_pages,
    data_with_clusters[[
        'client_id', 'cluster', 'engagement_level',
        'total_event_count', 'total_sessions', 'total_pageviews', 'total_purchases',
        'total_trade_lead_form_submit', 'total_residentialconsumer_lead_form_submit',
        'total_contactus_form_submit', 'total_dealerlocator_pageviews',
        'total_professionals_pageviews', 'total_commercial_pageviews',
        'total_specify_pageviews', 'total_case_studies_pageviews',
        'total_pro_locations_pageviews', 'total_design_palette_pageviews',
        'total_inspiration_pageviews', 'total_financebycambria_pageviews',
        'total_planningcare_pageviews', 'total_cambriastyle_pageviews',
        'total_samples_pageviews', 'total_galleries_pageviews',
        'pro_pages_engagement_time', 'commercial_pages_engagement_time',
        'specify_pages_engagement_time', 'case_studies_pages_engagement_time',
        'pro_locations_pages_engagement_time', 'design_palette_pages_engagement_time',
        'inspiration_pages_engagement_time', 'financebycambria_pages_engagement_time',
        'planningcare_pages_engagement_time', 'samples_pages_engagement_time',
        'galleries_pages_engagement_time', 'dealerlocator_pages_engagement_time',
        'cambriastyle_pages_engagement_time', 'total_engagement_time',
        'engaged_sessions', 'formatted_engagement_time', 'avg_pageviews_per_session',
        'engagement_rate', 'bounces', 'bounce_rate', 'avg_time_on_pro_pages',
        'avg_time_on_design_pages', 'avg_time_on_decision_pages',
        'avg_time_on_loyalty_pages',
    ]],
    on='client_id',
    how='left',
)
merged_data.head(15)

Exporting Data to Zip files and CSV files

I exported the two datasets to CSV files so that the clusters are saved and I do not have to rerun the K-Means clustering code every time I open the notebook.
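Saving a clustered frame to disk can be sketched as follows. The file names and the temporary directory are placeholders chosen for illustration, and the two-row frame stands in for the real `data_with_clusters`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the clustered client table
data_with_clusters = pd.DataFrame({"client_id": ["a1", "b2"], "cluster": [0, 3]})

# Placeholder output location; the notebook's real paths are not shown
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "data_with_clusters.csv")
zip_path = os.path.join(out_dir, "data_with_clusters.zip")

# Plain CSV, and the same CSV written inside a zip archive
data_with_clusters.to_csv(csv_path, index=False)
data_with_clusters.to_csv(zip_path, index=False, compression="zip")

# Reload later without rerunning the clustering
reloaded = pd.read_csv(zip_path)
print(reloaded)
```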

Next, we take the clustered dataset and do another round of exploratory data analysis to identify the characteristics that define each of the four clusters.

Distribution of different Devices, Brand Names, Operating Systems, and Engagement Levels across the clusters

Distribution of Total Event Count and Total Sessions across clusters

Device Types

Browser Type

General information to be used later

This table displays numerical data grouped by cluster, which helps indicate which visuals to create in order to tell the best story.

Engagement Level

grouped_data = data_with_clusters.groupby('cluster')
engagement_level1 = grouped_data['engagement_level'].value_counts()
print(engagement_level1)

RadViz

A RadViz Plot is a multivariate data visualization algorithm that places each feature dimension uniformly around the circumference of a circle. It then plots points on the interior of the circle, normalizing their values along the axes extending from the center to each arc. This approach enables the visualization of numerous dimensions that can fit on the circle, significantly expanding the visualization's dimensionality. We employed this method to identify separability among clusters. Data points exhibiting extreme characteristics will appear on the outer edges of the circle, while those with less pronounced differences will mostly be located closer to the circle's center.
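To make the projection concrete, here is a small NumPy sketch of how RadViz coordinates are commonly computed: each feature is min-max scaled, given an anchor on the unit circle, and each point is placed at the weighted average of the anchors. The function name `radviz_coordinates` and the toy rows are made up for illustration; the notebook itself relies on `pandas.plotting.radviz`.

```python
import numpy as np

def radviz_coordinates(values):
    """Project rows of a numeric matrix into the unit circle, RadViz-style."""
    values = np.asarray(values, dtype=float)
    n_features = values.shape[1]

    # Min-max scale each feature to [0, 1]; constant columns are left at 0
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0
    scaled = (values - col_min) / col_range

    # One anchor per feature, evenly spaced around the circle
    angles = 2 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Each point is the average of the anchors, weighted by its scaled values
    weights = scaled.sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0  # rows of all zeros stay at the center
    return scaled @ anchors / weights

# Three rows dominated by one feature each, and one balanced row
toy = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
coords = radviz_coordinates(toy)
print(coords.round(3))
```

Rows dominated by a single feature land on that feature's anchor at the edge of the circle, while the balanced row lands at the center. With pandas, `radviz(frame, 'cluster')` draws this kind of projection for a frame whose remaining columns are numeric.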

Consumer Pageviews

Professional Pageviews

To help identify the clusters, we used engagement time, on the assumption that professionals take a more direct path through the website, whereas consumers, who are often less familiar with it, take a more exploratory path and spend more time browsing.

Pageviews

Now that we have found a pattern in the engagement time of each cluster, we can look at pageviews to start assigning characteristics to the clusters.

df_cluster2 = data_with_clusters.groupby('cluster').agg({
    'total_trade_lead_form_submit': 'sum',
    'total_residentialconsumer_lead_form_submit': 'sum',
    'total_contactus_form_submit': 'sum'
}).reset_index()
df_cluster2

Assigning Characteristics With GA4 Metrics

The first subplot shows metrics that were averaged to eliminate the duplicates introduced after clustering. The second subplot shows metrics that could be totaled because there was no threat of duplicates.
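The toy example below illustrates why averaging and totaling behave differently. It assumes the duplicates come from a one-to-many merge in which user-level metrics are repeated on every pageview row; the frame and its values are invented.

```python
import pandas as pd

# Toy merged rows: one row per pageview, with user-level metrics repeated
merged = pd.DataFrame({
    "client_id": ["a1", "a1", "a1", "b2", "b2"],
    "cluster": [0, 0, 0, 1, 1],
    "total_sessions": [4, 4, 4, 2, 2],  # user-level value, duplicated per row
    "page_count": [1, 1, 1, 1, 1],      # row-level value, safe to total
})

summary = merged.groupby("cluster").agg(
    avg_total_sessions=("total_sessions", "mean"),    # recovers the user-level value
    summed_total_sessions=("total_sessions", "sum"),  # inflated by the duplicates
    total_pages=("page_count", "sum"),
)
print(summary)
```

In cluster 0 the sum reports 12 sessions for a user who had 4, while the mean still reports 4. With several users per cluster, a row-level mean weights heavier users more, so deduplicating on `client_id` before aggregating is another option.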

total_purchases_by_cluster = data_with_clusters.groupby('cluster')['total_purchases'].sum().reset_index()
total_purchases_sum = total_purchases_by_cluster['total_purchases'].sum()
total_percent_purchase_sorted = total_purchases_by_cluster.sort_values('total_purchases', ascending=False)

# Function to add percentage labels to the bars
def add_percentage_labels(ax):
    for bar in ax.patches:
        percentage = f"{(bar.get_height() / total_purchases_sum) * 100:.2f}%"
        ax.annotate(percentage,
                    (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                    ha='center', va='bottom', fontsize=12, color='black')

# Plot the bar chart
fig, ax = plt.subplots(figsize=(18, 10))
bars = ax.bar(total_purchases_by_cluster['cluster'], total_purchases_by_cluster['total_purchases'], color=cluster_colors)
ax.set_title('Total Purchases', fontsize=30)
ax.set_xlabel('Cluster', fontsize=25)
ax.set_ylabel('Purchases', fontsize=25)
ax.set_xticks(total_purchases_by_cluster['cluster'])
ax.set_xticklabels(total_purchases_by_cluster['cluster'])
ax.axhline(y=0, color='black', linewidth=0.5)

# Add percentage labels to the bars
add_percentage_labels(ax)

plt.tight_layout()
plt.show()

Data Mining

# Data Loading
data_path = "/work/merged_data.csv"
data_pages = pd.read_csv(data_path)
data_pages = data_pages.iloc[:, :19]
data_pages.head()

# Data Cleaning
def clean_url(url):
    url = url.replace('#!', '')
    url = url.split('?')[0]
    url = re.sub(r'(?<!:)//', '/', url)
    return url

data_pages['cleaned_page_location'] = data_pages['page_location'].astype(str).apply(clean_url)

# Frequency Analysis (Apriori)
def apriori_custom(transactions, min_support=0.01):
    # Count how many times each item appears across all transactions
    item_frequency = {}
    for trans in transactions:
        for item in trans:
            if item in item_frequency:
                item_frequency[item] += 1
            else:
                item_frequency[item] = 1

    # Keep only the items that meet the minimum support
    n_transactions = len(transactions)
    item_frequency = {k: v for k, v in item_frequency.items() if v / n_transactions >= min_support}

    # Count co-occurrences for every pair of frequent items
    frequent_pairs = {}
    items = list(item_frequency.keys())
    for i in range(len(items)):
        for j in range(i+1, len(items)):
            pair_count = sum([1 for trans in transactions if items[i] in trans and items[j] in trans])
            if pair_count / n_transactions >= min_support:
                frequent_pairs[(items[i], items[j])] = pair_count

    # Combine single items and pairs into one table sorted by support
    results = []
    for item, freq in item_frequency.items():
        results.append([set([item]), freq])
    for pair, freq in frequent_pairs.items():
        results.append([set(pair), freq])

    df_results = pd.DataFrame(results, columns=['itemsets', 'support'])
    df_results['support'] = df_results['support'] / n_transactions
    df_results = df_results.sort_values(by='support', ascending=False).reset_index(drop=True)
    return df_results

# Sequence Analysis
def find_common_sequences(transactions, sequence_length=2):
    sequence_counts = defaultdict(int)
    for trans in transactions:
        if len(trans) < sequence_length:
            continue
        for i in range(len(trans) - sequence_length + 1):
            sequence = tuple(trans[i:i+sequence_length])
            sequence_counts[sequence] += 1
    sorted_sequences = sorted(sequence_counts.items(), key=lambda x: x[1], reverse=True)
    return sorted_sequences

def remove_consecutive_repeats(transactions):
    refined_transactions = []
    for trans in transactions:
        refined_trans = [trans[0]]
        for i in range(1, len(trans)):
            if trans[i] != trans[i-1]:
                refined_trans.append(trans[i])
        refined_transactions.append(refined_trans)
    return refined_transactions

# Group the data by client_id and custom_session_number so each row is one session,
# holding the list of pages visited in that session
sessions = (
    data_pages
    .groupby(['client_id', 'custom_session_number'])['cleaned_page_location']
    .apply(list)
    .reset_index()
)

# Apply both analyses to each cluster
unique_clusters = data_pages['cluster'].dropna().unique()
apriori_results = {}
sequence_results = {}
sequence_lengths = [4, 5, 6, 7]

for cluster in unique_clusters:
    # Filter the sessions frame on its own client_id column. A boolean mask built
    # from data_pages has a different index and does not align with sessions.
    cluster_client_ids = data_pages.loc[data_pages['cluster'] == cluster, 'client_id']
    cluster_data = sessions.loc[sessions['client_id'].isin(cluster_client_ids), 'cleaned_page_location']

    # Frequency analysis
    apriori_results[cluster] = apriori_custom(cluster_data)

    # Sequence analysis: top 5 sequences for each length
    refined_cluster_data = remove_consecutive_repeats(cluster_data)
    sequence_results[cluster] = {
        f'{n}-sequence': find_common_sequences(refined_cluster_data, sequence_length=n)[:5]
        for n in sequence_lengths
    }

# Create a DataFrame to present the results
results_df = pd.DataFrame(
    columns=["Cluster", "Top 10 Pages (Support)"]
    + [f"Top 3 {n}-Page Sequences (Count)" for n in sequence_lengths]
)

arrow = "\n → "
for cluster in unique_clusters:
    # Top 10 pages from the frequency analysis
    top_pages = apriori_results[cluster]['itemsets'].head(10).apply(lambda x: ', '.join(list(x))).tolist()
    top_pages_support = apriori_results[cluster]['support'].head(10).values
    top_pages_combined = [f"{page} ({support:.2%})" for page, support in zip(top_pages, top_pages_support)]

    # Top 3 sequences of each length, each paired with its own count
    # (the original paired the 7-page sequences with the 6-page counts)
    top_sequences_combined = []
    for n in sequence_lengths:
        top_n = sequence_results[cluster][f'{n}-sequence'][:3]
        formatted = [f"{arrow.join(pages)} ({count})" for pages, count in top_n]
        top_sequences_combined.append("\n".join(formatted))

    # Append one row per cluster
    results_df.loc[len(results_df)] = [cluster, "\n".join(top_pages_combined)] + top_sequences_combined

results_df
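For intuition about what the sequence counts mean, here is a condensed, standard-library restatement of the same idea on three toy sessions. The page paths and the helper names `drop_consecutive_repeats` and `count_sequences` are made up for illustration.

```python
from collections import Counter
from itertools import groupby

# Toy sessions: each list is the ordered pages of one visit (hypothetical paths)
sessions = [
    ["/", "/", "/designs", "/samples", "/dealer-locator"],
    ["/", "/designs", "/samples", "/cart"],
    ["/designs", "/samples"],
]

def drop_consecutive_repeats(pages):
    # Collapse runs such as "/", "/" into a single "/"
    return [page for page, _ in groupby(pages)]

def count_sequences(sessions, length):
    # Slide a window of the given length over every session and count each window
    counts = Counter()
    for pages in sessions:
        pages = drop_consecutive_repeats(pages)
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts

top = count_sequences(sessions, 3).most_common(1)
print(top)
```

Sessions shorter than the window contribute nothing, which is why longer sequence lengths produce lower counts.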

Dependencies

%watermark --iversions
