Identify Customer Segments
The goal of this notebook is to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments are then used to direct marketing campaigns toward the audiences with the highest expected return. The real data used to build the segments was provided by Bertelsmann Arvato Analytics.
Loading the Data
There are four files associated with this notebook:
Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
Data_Dictionary.md: Detailed information file about the features in the provided datasets.
AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns
Convert Missing Value Codes to NaNs
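The feature summary file lists, for each feature, the codes that encode missing or unknown values. A minimal sketch of the conversion, assuming the summary has `attribute` and `missing_or_unknown` columns (as in AZDIAS_Feature_Summary.csv) where the latter holds strings like `'[-1,0]'`; the helper name is mine:

```python
import numpy as np
import pandas as pd

def decode_missing(df, feat_summary):
    """Replace each feature's missing-value codes with np.nan."""
    df = df.copy()
    for _, row in feat_summary.iterrows():
        col = row["attribute"]
        if col not in df.columns:
            continue
        # Parse the bracketed list; codes may be ints or strings like 'X'
        codes = [c for c in row["missing_or_unknown"].strip("[]").split(",") if c]
        parsed = []
        for c in codes:
            try:
                parsed.append(int(c))
            except ValueError:
                parsed.append(c)
        df[col] = df[col].replace(parsed, np.nan)
    return df
```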
Assessing Missing Data in Each Column
Names of the 6 columns that were removed because more than 18% of their data was missing:
Names and descriptions of the 5 remaining columns with the highest percentage of missing values (after removing the outlier columns):
- 'KKK': Purchasing power in region
- 'REGIOTYP': Neighborhood typology
- 'W_KEIT_KIND_HH' : Likelihood of children in household
- 'MOBI_REGIO' : Movement patterns
- 'KBA05_ANTG4' : Number of 10+ family houses in the microcell
It is interesting to see which types of features are missing the most data for each individual. After taking a peek at their descriptions, I can imagine that these particular features are quite specific in comparison to more general features that may be publicly available.
Assessing Missing Data in Each Row
I first compiled a list of the features that had no missing data. Then, after splitting the dataset into rows with fewer than 10 missing values and rows with more than 10, I could see that the feature distributions differed significantly between the two groups.
The only feature that remained similar across the comparison was ANREDE_KZ, which represents gender. This makes sense to me: when surveying individuals, a categorical variable with only two values is easy to collect, so I would not expect its missingness to trend with the other missing features.
The distributions of the financial and personality-trait features differed significantly between the two segments of the data (<10 missing values vs. >10 missing values).
Selecting and Re-Encoding Features
Re-Encoding Categorical Features
Where to begin!
My first task was to count the frequency of each data type among the dataset's features. I discovered quite a few categorical features, but not all of them were correctly encoded. By counting the unique values in the categorical columns, I was able to separate the binary categorical columns from the multi-level categorical columns. Only one feature needed extra work: OST_WEST_KZ contained the unique values 'W' and 'O', which I mapped to 1 and 0, finishing the cleanup of the binary features.
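A sketch of the binary/multi-level split and the OST_WEST_KZ re-encoding described above (the helper names are mine):

```python
import pandas as pd

def split_categoricals(df, cat_cols):
    """Separate binary from multi-level categorical columns."""
    binary, multi = [], []
    for col in cat_cols:
        n = df[col].nunique(dropna=True)
        (binary if n <= 2 else multi).append(col)
    return binary, multi

def encode_ost_west(df):
    """Map the one non-numeric binary feature to 0/1."""
    df = df.copy()
    df["OST_WEST_KZ"] = df["OST_WEST_KZ"].map({"W": 1, "O": 0})
    return df
```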
As for the multi-level features, to keep things straightforward I dropped them from the cleaned and trimmed dataset we are working with.
Engineering Mixed-Type Features
Cleaning the mixed-type columns was a bit more time-consuming because we had to create a mapping from old values to new ones. Thankfully, the data dictionary showed how many recurring categories there were in each column.
I created two new features, 'DECADE' and 'MOVEMENT', out of the original 'PRAEGENDE_JUGENDJAHRE' feature. I dropped the original feature because its mixed values might confuse the model.
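A possible implementation of that split; note that the code-to-decade and mainstream/avantgarde mappings below are my reading of Data_Dictionary.md and should be double-checked against your copy:

```python
import numpy as np
import pandas as pd

# Decade of youth per PRAEGENDE_JUGENDJAHRE code (assumed mapping; verify
# against Data_Dictionary.md before using).
DECADE_MAP = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60,
              8: 70, 9: 70, 10: 80, 11: 80, 12: 80, 13: 80, 14: 90, 15: 90}
MAINSTREAM = {1, 3, 5, 8, 10, 12, 14}  # mainstream -> 0, avantgarde -> 1

def engineer_praegende(df):
    df = df.copy()
    df["DECADE"] = df["PRAEGENDE_JUGENDJAHRE"].map(DECADE_MAP)
    df["MOVEMENT"] = df["PRAEGENDE_JUGENDJAHRE"].map(
        lambda v: np.nan if pd.isna(v) else (0 if v in MAINSTREAM else 1))
    return df.drop(columns=["PRAEGENDE_JUGENDJAHRE"])
```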
I similarly created two new features, 'WEALTH' and 'LIFE_STAGE', out of the original 'CAMEO_INTL_2015' feature. With those two features in place, there was no need to keep the mixed original, so I dropped it as well.
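CAMEO_INTL_2015 codes are two-digit, with the tens digit describing wealth and the units digit the life stage, so the split reduces to integer arithmetic (a sketch; the helper name is mine):

```python
import pandas as pd

def engineer_cameo(df):
    """Split the two-digit CAMEO_INTL_2015 code into WEALTH and LIFE_STAGE."""
    df = df.copy()
    code = pd.to_numeric(df["CAMEO_INTL_2015"], errors="coerce")
    df["WEALTH"] = code // 10      # tens digit: wealth tier
    df["LIFE_STAGE"] = code % 10   # units digit: life stage
    return df.drop(columns=["CAMEO_INTL_2015"])
```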
I decided to keep the other mixed type features.
Completing Feature Selection
Creating a Cleaning Function
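A condensed version of such a cleaning function might look like this; it is a sketch that mirrors the steps above, with thresholds and the drop list left as parameters rather than the notebook's exact choices:

```python
import numpy as np
import pandas as pd

def clean_data(df, missing_codes, col_threshold=0.18, row_threshold=10,
               drop_cols=()):
    """Minimal end-to-end cleaner.

    missing_codes: dict mapping column -> list of codes meaning 'missing'.
    """
    df = df.copy()
    # 1. Convert per-column missing codes to NaN
    for col, codes in missing_codes.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    # 2. Drop columns exceeding the missing-data threshold
    sparse = df.columns[df.isnull().mean() > col_threshold]
    df = df.drop(columns=sparse)
    # 3. Drop any multi-level categoricals excluded from the analysis
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # 4. Keep only rows with few missing values
    df = df[df.isnull().sum(axis=1) < row_threshold]
    # 5. Re-encode the one non-numeric binary feature
    if "OST_WEST_KZ" in df.columns:
        df["OST_WEST_KZ"] = df["OST_WEST_KZ"].map({"W": 1, "O": 0})
    return df
```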
Applying Feature Scaling
Discussion 2.1: Apply Feature Scaling
To make sure the dataset was clear of NaN values, I created an imputer using the 'most_frequent' strategy, which according to scikit-learn's docs 'replaces missing values using the most frequent value along each column. Can be used with strings or numeric data.'
Once I fit and transformed the dataset using the imputer, I scaled the data using a StandardScaler, which scales each feature to mean 0 and standard deviation 1.
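A sketch of the impute-then-scale step on toy data; note the notebook's original `Imputer` class has since been replaced by `SimpleImputer` in modern scikit-learn:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [1.0, 4.0]])

# Fill NaNs with each column's most frequent value, then standardize
imputer = SimpleImputer(strategy="most_frequent")
X_imputed = imputer.fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_imputed)
# every column of X_scaled now has mean 0 and standard deviation 1
```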
Step 2.2: Perform Dimensionality Reduction
It was actually a little tricky deciding how many components to keep. The more components I keep, the more variance is accounted for, but fewer components might do just as well. There was no distinct point at which the variance stopped increasing, as you would expect from the elbow of a scree plot.
That being said, I chose to keep 33 components, which account for roughly 90% of the variance; that seemed enough to represent the dataset.
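The component-count choice can be automated by scanning the cumulative explained variance (a sketch on synthetic data; the 0.90 target matches the text):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)  # a correlated pair

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# smallest number of components whose cumulative variance reaches 90%
n_components = int(np.searchsorted(cumvar, 0.90) + 1)
pca_final = PCA(n_components=n_components).fit(X)
```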
Interpreting Principal Components
Analysis of first principal component
The top-weighted features are listed below, sorted by weight:
- PLZ8_ANTG3: 0.220215
- PLZ8_ANTG4: 0.213657
- WEALTH: 0.202482
It's interesting that 'Number of 6-10 family houses in the PLZ8 region' (PLZ8_ANTG3) and 'Number of 10+ family houses in the PLZ8 region' (PLZ8_ANTG4) are the largest weights in the first principal component. They seem to be positively correlated!
When I saw that the WEALTH feature was also positively weighted, I initially thought wealth increased as the number of larger family houses increased. However, looking back at the data dictionary, the larger the value of the WEALTH feature, the lower the overall income, which in retrospect makes sense.
The larger the number of 6-10 family houses, the poorer the households tend to be.
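The per-component weight listings above come from pairing `pca.components_` with the column names; a small helper for that (the name is mine):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def component_weights(pca, feature_names, i):
    """Features sorted by their weight in the i-th principal component."""
    return pd.Series(pca.components_[i], index=feature_names).sort_values(
        ascending=False)
```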
Analysis of second principal component
- SEMIO_ERL: 0.228761
- FINANZ_VORSORGER: 0.221815
- SEMIO_KAEM: -0.177064
The second principal component had some interesting findings. Event orientation (SEMIO_ERL) appears positively correlated with financial preparedness (FINANZ_VORSORGER), which makes sense and has interesting psychological implications: how financially prepared an individual is correlates with how often they attend events or social gatherings (possibly charity events), since those generally cost money. Financial preparedness, however, appears negatively correlated with combative attitudes (SEMIO_KAEM), implying that those who are more financially prepared tend to be less combative.
Analysis of third principal component
- SEMIO_VERT: 0.348044
- DECADE: -0.107699
- SEMIO_FAM: 0.250927
The third principal component revealed that the decade of an individual's dominant youth movement (DECADE) and dreamfulness (SEMIO_VERT) are negatively correlated. However, the higher the dreamfulness value, the lower the intensity, so as the decade increases (reaching younger and younger individuals), the dreamfulness value goes down, indicating more dreamfulness. This makes a lot of sense: youth tend to be more dreamy and hopeful! It was an interesting find because I had to check the direction in which each feature's values increase or decrease to interpret the results accurately.
Another intuitive result is that dreamfulness had a negative correlation with family-mindedness (SEMIO_FAM). As the dreamfulness value increased (meaning less dreamful), individuals became more family-minded. As expected, as you get older you are more concerned with family matters.
Making sure to look at the direction of the values in the data dictionary to interpret results is key during this analysis.
Applying Clustering to General Population
After testing different numbers of clusters, I decided to move forward with 5, because the SSE curve tends to flatten out at around 5 clusters. I am hopeful that the SSE is low enough while still avoiding splitting the data into so many clusters that the SSE barely changes between them.
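The elbow search can be sketched with `KMeans` and its `inertia_` attribute (the SSE) on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs in 5 dimensions
X = np.vstack([rng.normal(loc=c, size=(100, 5)) for c in (0.0, 5.0, 10.0)])

sse = {}
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to the nearest centroid
# plotting k against sse[k] shows the elbow where the curve flattens
```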
Applying All Steps to the Customer Data
Step 3.3: Comparing Customer Data to Demographics Data
The results of the clustering are pretty intuitive, which is always a good sign!
Let's take a look at how some feature values differ between the cluster overrepresented in the customer data (cluster #4) and the underrepresented cluster:
- ALTERSKATEGORIE_GROB: 3.302811 (46-60 years old) in the overrepresented cluster
- ALTERSKATEGORIE_GROB: 1.674215 (under 30) in the underrepresented cluster

Takeaway: It's clear that the mail-order company is more popular with older individuals.
- WEALTH (originally part of CAMEO_INTL_2015): 1.489416 (more affluent households) in the overrepresented cluster
- LIFE_STAGE: 2.384001 (families with school-age children / older families & mature couples) in the overrepresented cluster
- WEALTH: 3.387284 (less affluent households) in the underrepresented cluster
- LIFE_STAGE: 1.039959 (young couples with children) in the underrepresented cluster

Takeaway: Individuals who are less financially well-off are less likely to be customers of this mail-order company. It is also the case that adults with school-age or older children are more likely to be customers, while young couples with young children are not. This makes sense: younger couples tend to be less financially established and have less purchasing power.
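The over/under-representation comparison reduces to comparing cluster proportions between the two datasets; a sketch (helper names are mine):

```python
import numpy as np
import pandas as pd

def cluster_proportions(labels, n_clusters):
    """Share of points falling into each cluster."""
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / counts.sum()

def representation(pop_labels, cust_labels, n_clusters):
    """Ratio > 1 means the cluster is overrepresented among customers."""
    pop = cluster_proportions(pop_labels, n_clusters)
    cust = cluster_proportions(cust_labels, n_clusters)
    return pd.DataFrame({"population": pop, "customers": cust,
                         "ratio": cust / np.where(pop > 0, pop, np.nan)})
```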