Businesses are bound to have customers whose basic needs are the same. But, being human, those customers will have different preferences, come from different geographic regions and age groups, and therefore expect different things from your products and services. The key to serving the needs of these varied groups effectively is to clearly understand their preferences in every part of your business: the product, communication, service, and so on. This helps a business meet its customers' expectations efficiently and build strong relationships with them.
We are going to use a pre-processed FMCG (Fast Moving Consumer Goods) dataset, which contains customer details of 2000 individuals as well as their purchase activities.
Pairwise correlation of variables
With this heat map, we see that there is a strong positive correlation between age and education, as well as between occupation and income.
These inter-variable correlations will be important in the feature selection of the segmentation process.
Hierarchical Clustering
We’ll use the dendrogram and linkage functions from SciPy’s hierarchy module to cluster the dataset. A dendrogram is a tree-like hierarchical representation of data points, most commonly used to visualize hierarchical clustering. Linkage, on the other hand, is the function that actually performs the clustering. Here we need to specify the method used to compute the distance between two clusters, and on completion, the linkage function returns the hierarchical clustering encoded as a linkage matrix.
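To make the step concrete, here is a minimal sketch, assuming the standardized customer features are stored in a variable called df_std (Ward linkage is one common choice for the distance method):

```python
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Ward linkage: at each step, merge the pair of clusters that minimizes
# the increase in total within-cluster variance.
hier_clust = linkage(df_std, method='ward')

# Plot the dendrogram; truncating keeps the tree readable for 2000 points.
plt.figure(figsize=(12, 6))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Observations')
plt.ylabel('Distance')
dendrogram(hier_clust, truncate_mode='level', p=5, no_labels=True)
plt.show()
```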
At the bottom of the plot on the x-axis, we have the observations. These are the 2000 individual customer data points. On the y-axis, we see the distance between points or clusters represented by the vertical lines. The smaller the distance between points, the further down in the tree they’ll be grouped together.
Identify the number of clusters manually
The heuristic is to find the tallest vertical line that is not crossed by any horizontal line (when those horizontal lines are extended across the plot) and slice through it horizontally. The number of vertical lines the slice intersects is then a good estimate of the number of clusters. In the following diagram, we see two candidate vertical lines; between them, “candidate 2” is the taller, so we cut through that line, which produces 4 clusters underneath.
Hierarchical clustering is very simple to implement and suggests a sensible number of clusters for the data, but because it scales poorly, it is rarely practical on large, real-world datasets. Instead, we often use K-means clustering. Next, we’ll see how to implement K-means clustering and try to optimize it with PCA.
K-means Clustering
Hierarchical clustering is great for small datasets, but as the dataset grows, its computational demands grow rapidly, which makes it impractical for large datasets.
K-means clustering can segment an unlabeled dataset into a pre-determined number of groups.
The elbow method
We use the ‘elbow method’ to identify the value of ‘k’ (the number of clusters). This is essentially a brute-force approach: for each candidate value of ‘k’ (e.g. 2–10), we calculate the sum of the squared distances between each member of a cluster and its centroid, and plot ‘k’ against that total. As the number of clusters increases, the average distortion decreases, meaning the cluster centroids move closer to the data points. In the plot, this produces an elbow-like shape, hence the name ‘elbow method’. We choose the value of ‘k’ at the bend of the curve, beyond which adding more clusters yields only a marginal decrease.
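A sketch of that brute-force loop with scikit-learn, again assuming the standardized features are in df_std:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Within-cluster sum of squares (inertia) for k = 2..10.
wcss = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', random_state=42)
    km.fit(df_std)
    wcss.append(km.inertia_)

plt.plot(range(2, 11), wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster sum of squares')
plt.title('The Elbow Method')
plt.show()
```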
As we can see in the above graph, the line declines steeply until we reach 4 clusters and flattens out after that. This means our elbow is at 4, and that is the optimal number of clusters for us. It also aligns with the output of the hierarchical clustering we did earlier.
Now let’s perform k-means clustering with 4 clusters and add the segment number to our data frame.
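Something along these lines, reusing the df_std assumption and keeping the unscaled dataframe df for interpretation:

```python
from sklearn.cluster import KMeans

# Final model with k = 4, chosen from the elbow plot.
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(df_std)

# Attach the segment labels to the original (unscaled) dataframe df,
# so the cluster profiles stay easy to interpret.
df_segm = df.copy()
df_segm['Segment'] = kmeans.labels_
```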
We’ve segmented our customers into 4 groups. Now let’s try to understand the characteristics of each group. First, let’s look at the mean value of each feature by cluster:
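In pandas this is a single groupby on the labeled dataframe from the sketch above:

```python
# Average feature values per segment: a quick profile of each cluster.
segment_profile = df_segm.groupby('Segment').mean()
print(segment_profile)
```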
Visualize our customer segments
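The plots discussed below could be produced roughly like this; the column names ('Age', 'Income', 'Education') are assumptions about how the customer dataframe is laid out:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Age vs. income, colored by segment.
sns.scatterplot(x='Age', y='Income', hue='Segment', data=df_segm, palette='tab10')
plt.title('Age vs Income by segment')
plt.show()

# Income distribution per education level, split by segment.
sns.violinplot(x='Education', y='Income', hue='Segment', data=df_segm)
plt.title('Education vs Income by segment')
plt.show()
```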
Conclusion
We observe in the ‘age vs income’ scatter plot that older customers with higher income (the well-off) are clearly separated, but the other three segments are not as distinguishable.
In the second observation, the ‘education vs income’ violin plot, we see that customers with no educational record have lower income and those who graduated have higher income, but the other segments are not as separable.
Following these observations, we can conclude that k-means did a decent job separating the data into clusters. However, the outcome is not that satisfactory.
Principal Component Analysis (PCA)
When multiple features in a dataset are highly correlated, the redundant information they carry can skew the outcome of the model. That is what happened with our k-means model; this is known as the multicollinearity problem. We can address it by reducing dimensionality.
The correlation matrix we saw earlier showed that Age and Education are correlated, and that Income and Occupation are also correlated. We will tackle this using Principal Component Analysis (PCA), a dimensionality reduction method.
Identifying principal components
The explained_variance_ratio_ property of the pca object lists seven components that together explain 100% of the variability of our dataset. The first component explains about 36% of the variability, and the 2nd and 3rd components explain 26% and 19%, respectively.
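A sketch of that step, fitting PCA on the same standardized features df_std:

```python
from sklearn.decomposition import PCA

# Fit PCA with all components and inspect the share of variance
# captured by each of the seven components.
pca = PCA()
pca.fit(df_std)
print(pca.explained_variance_ratio_)
```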
The rule of thumb is to pick the number of components that retains 70–80% of the variability. If we select the top three components, they already hold more than 80% of the variability, and if we pick four, they retain almost 90%. Let’s pick three components and fit our pca model. Then we create a dataframe of the three principal components, using the columns from our original dataset. Notice that all values in the dataframe are between negative one and one, as they are essentially correlations.
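Roughly like this; taking the loading column names from the original customer dataframe df is an assumption about how the data is stored:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Re-fit PCA keeping only the top three components.
pca = PCA(n_components=3)
pca.fit(df_std)

# Loadings of each original feature on the three components;
# column names come from the original customer dataframe df.
df_pca_comp = pd.DataFrame(
    data=pca.components_,
    columns=df.columns,
    index=['Component 1', 'Component 2', 'Component 3'],
)
```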
The new correlation matrix
Component one has a positive correlation with age, income, occupation, and settlement size. These features are related to the career of a person.
On the other hand, sex, marital status, and education are the most prominent determinants of the second component. We can also see that the career-related features are essentially uncorrelated with this component. Therefore, it doesn’t reflect an individual’s profession but rather their education and lifestyle.
For the third component, we observe that age, marital status, and occupation are the most prominent determinants. Marital status and occupation weigh negatively but are still important.
K-Means Clustering with PCA
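The clustering itself works exactly as before, only now it runs on the PCA scores instead of the raw standardized features. A minimal sketch, reusing the fitted pca object and the df / df_std assumptions from earlier:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Project the standardized data onto the three principal components.
scores_pca = pca.transform(df_std)

# K-means on the PCA scores, keeping k = 4 as before.
kmeans_pca = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans_pca.fit(scores_pca)

# Combine the original features, the component scores, and the labels.
df_segm_pca = pd.concat(
    [df.reset_index(drop=True),
     pd.DataFrame(scores_pca, columns=['Component 1', 'Component 2', 'Component 3'])],
    axis=1,
)
df_segm_pca['Segment'] = kmeans_pca.labels_
```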
Analyze segmentation results
We’ve established that component one represents career, component two represents education & lifestyle, and component three represents life or work experience.
❏ Segment 0: low career and experience values with high education and lifestyle values. Label: Standard
❏ Segment 1: high career but low education, lifestyle, and experience. Label: Career-focused
❏ Segment 2: low career, education, and lifestyle, but high life experience. Label: Fewer opportunities
❏ Segment 3: high career, education, and lifestyle, as well as high life experience. Label: Well-off
Let’s visualize the segments with respect to the first two components.
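For example, with the labeled dataframe from the previous sketch:

```python
import matplotlib.pyplot as plt

# One scatter series per segment in the plane of the first two components.
for seg in sorted(df_segm_pca['Segment'].unique()):
    subset = df_segm_pca[df_segm_pca['Segment'] == seg]
    plt.scatter(subset['Component 2'], subset['Component 1'],
                s=10, label=f'Segment {seg}')

plt.xlabel('Component 2')
plt.ylabel('Component 1')
plt.legend()
plt.title('Customer segments in the space of the first two components')
plt.show()
```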
As you can see, the four segments are now distinctly identifiable. Although the standard and fewer-opportunity segments overlap somewhat, the overall result is far better than the previous outcome.
Conclusion
So far, we have divided our customers into four distinct and clearly identifiable groups. With this, we have completed the “segmentation” part of the STP framework. Since “targeting” mostly involves business decisions about which customer segment to focus on, we’ll move on to “positioning”.
Positioning
Positioning is a crucial part of marketing strategy, especially when the firm operates in a highly competitive market. We need to understand how consumers perceive the product offering and how it differs from competing offerings. Pricing and discounts play a vital role in shaping customer purchase decisions.
We’ll be working with a dataset that represents the customer purchase activity of a retail shop. It is linked to the customer details dataset we have already worked with.
Let’s start our data exploration with the number of unique customers per segment. We see that the biggest segment is the “fewer-opportunity” segment (38%), while the other segments carry almost equal weight (around 20% each). We are working with a well-balanced dataset.
Each segment also looks quite balanced in terms of the number of actual purchase records (Incidence = 1).
Now, when we look at the average price of each segment, they look very similar. All of them are at around the 2.0 level. But as we narrow our focus and look at the actual purchase price, we can see the “well-off” segment has a higher average price point (2.2), and the “career-focused” customers are at an even higher price point (2.65).
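These summaries come from a few groupby calls; the dataframe and column names below (df_purchase, 'ID', 'Segment', 'Incidence', 'Mean_Price', 'Purchase_Price') are hypothetical stand-ins for whatever the purchase data actually uses:

```python
# Share of unique customers per segment.
segment_share = (df_purchase.groupby('Segment')['ID'].nunique()
                 / df_purchase['ID'].nunique())

# Average shelf price seen by each segment vs. price actually paid
# on purchase occasions (Incidence == 1).
avg_price_seen = df_purchase.groupby('Segment')['Mean_Price'].mean()
avg_price_paid = (df_purchase[df_purchase['Incidence'] == 1]
                  .groupby('Segment')['Purchase_Price'].mean())
```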
Logistic Regression Model
From the model coefficient, it’s evident that there is an inverse relationship between average price and a purchase event: if the average price decreases, the purchase probability increases.
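For reference, a minimal sketch of how such an incidence model could be fit, with 'Mean_Price' and 'Incidence' as assumed column names in df_purchase:

```python
from sklearn.linear_model import LogisticRegression

# Purchase incidence (0/1) modeled as a function of the average candy price.
X = df_purchase[['Mean_Price']]
y = df_purchase['Incidence']

model_purchase = LogisticRegression()
model_purchase.fit(X, y)

# A negative coefficient means: higher average price, lower purchase probability.
print(model_purchase.coef_)
```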
We see that the price of candy (across all brands) ranges from 1.1 to 2.8. So, keeping some buffer, let’s take a price range from 0.5 to 3.5, increasing by 0.01 at a time, and check how the probability of a purchase event changes.
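Sketched with the model above (same naming assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Grid of hypothetical prices from 0.5 to 3.5 in steps of 0.01.
price_range = np.arange(0.5, 3.5, 0.01)
df_price = pd.DataFrame({'Mean_Price': price_range})

# Probability of a purchase event at each price point.
purchase_proba = model_purchase.predict_proba(df_price)[:, 1]

plt.plot(price_range, purchase_proba)
plt.xlabel('Price')
plt.ylabel('Purchase probability')
plt.show()
```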
Conclusion
As expected, we observe that as the mean price increases, the chance of purchase decreases.
We shall continue our price elasticity experiment and learn how much we can increase the price without hurting demand.
Predicting Price Elasticity of Demand
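A common way to get elasticity out of a binary logit model like ours is the relationship E = beta_price × price × (1 − P(purchase)), which follows from differentiating the logistic curve. The sketch below reuses the model and price grid from the previous section, so the same naming assumptions apply:

```python
import matplotlib.pyplot as plt

# Price elasticity of purchase probability for the logistic model:
# dP/dprice = beta * P * (1 - P), so E = (dP/dprice) * price / P
#           = beta * price * (1 - P).
beta_price = model_purchase.coef_[0][0]
elasticity = beta_price * price_range * (1 - purchase_proba)

plt.plot(price_range, elasticity)
plt.axhline(-1, linestyle='--')   # |E| < 1: inelastic; |E| > 1: elastic
plt.xlabel('Price')
plt.ylabel('Elasticity of purchase probability')
plt.show()
```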
Compare Price Elasticity by Segment
We see in the graph (and in the dataframe) that the average inelastic price of the well-off and career-focused groups is about 14 percent higher than that of the fewer-opportunity and standard segments. This allows us to increase the price for those two higher-paying segments as long as the PED remains inelastic.
Another deduction we can make is that the steepness of the lines varies on the right side of the graph (beyond the 1.5 price point). This indicates different elasticity levels per segment and tells us how demand would react to a price increase. We have to tune the price carefully to maintain a healthy level of demand while the PED is elastic. It seems the fewer-opportunity group is more sensitive to price than the other segments.