Profile report for the dataset.
• Credit score: 1000 values, 352 distinct. A max of 850 and a min of 350. An IQR of 131 and a std of 96.6, plus skew of 0.024. So, reasonably well-balanced with a slight bias toward the upper-middle of the credit score range (around700) due to its most frequent value being 850. The strongest associated column to this one is age, and the weakest is tenure.
• Country: 1000 values, 3 distinct. Those are France (51%), Spain (26%) and Germany (23%). This column provides the most information on churn vs. all other columns, while bearing strong correlations with balance, estimated_salary, age & credit_score for obvious reasons.
• Gender: 1000 values, 2 distinct: 56% Male, 44% Female. These features give information on churn and also possess strong correlations with balance, tenure, credit_score and estimated_salary.
• Age: 1000 values, 58 distinct. A max of 92 and a min of 18. An IQR of 12 and a std of 10.3, plus a skew of 0.906. This column sees positive associations with credit_score and estimated_salary, negative associations with tenure and balance.
• Tenure: 1000 values, 11 distinct. A max of 10, a min of 0. A std of 2.91, skew of 0.098 and a most frequent value of 135. This column's most positive association is with the estimated_salary column, while the negative associations are with age, balance and credit_score.
• Balance: 1000 values, 630 distinct. A max of 222K, a min of 0, and a median of 96K. There is a bit of negative skew here, at a value of -0.094 thanks to its most occurring value of 0. Features in the balance column bear positive numerical association to credit_score, with negatives being tenure, age and estimated_salary. The positive categorical associations with this column exist in the products_number, country, churn, gender, credit_card, and active_member.
• Products_number: 1000 values, 4 distinct. Product 1 holds 53% of the data, product 2 holds 45% of the data, product 3 2% of the data and product 4 holds <1%. Categorical associations belong to churn and credit_card. These features give quite a bit of information on churn as well as credit_card. So there could be instances where certain credit card products are responsible for churn. This columns' numerical associations belong to the balance, age, estimated_salary, tenure and credit_score columns.
• Credit_card: 1000 values, 2 distinct. A binary column likely suggesting the simple presence of a credit card ("1" == 72%) or not ("0" == 28%). This column gives information on products_number, while its numerical correlations exist within the balance, tenure, credit_score and age columns.
• Active_member: 1000 values, 2 distinct. Once again a binary column representing figures for yes ("1" == 51%) or no ("0" == 49%). These features give information on churn which is good, as well as bearing high numerical association to the customer's age, credit_score and balance.
• Estimated_salary: 1000 values, 1000 distinct. A max of 200K, a min of 0K, and a median of 104K. Another slight negative skew here at -0.063, but a healthy sum of 101.7M. This column's features bear positive numerical association with tenure and age, negative associations with balance. The categorical associations are in the products_number, country, churn and gender columns.
• Churn: 1000 values, another binary yes (21%) or no (79%) column. The churn column provides information on - as we've already seen above analysing the associated features - products_number, country, active_member and gender. There are strong correlations with age, balance and tenure, with estimated_salary and credit_score being the least correlated but still very much worth a mention.
Clicking on the associations tab reveals strong associations between balance, products_number and country. Also, age and churn are mildly correlated here.
I would first like to look into the age column as the mean is 36 and the max age is 92. There is a good deal of data in the later ages containing healthy balances and average incomes, so these will not be treated as outliers.
Average churn rate by country.
• German customers have an average age of 39.7, with Germany having the highest churn rates overall (0.32%).
• Spain's average age of 38.9 follows, as does its churn rate of 0.17%.
• France's average age of 38.5 is the third most common, along with its churn rate of 0.16%.
Those higher ages, Germany and larger balances appear in the correlation matrix also.
Balance and age by churn.
Balance with age vs. churn shows the majority of churn data existing primarily between the 40-60 age range, with a good spread of balances ranging from 0 - 250K respectively.
Product and churn.
• Customers who signed up to use product number 1 saw the most churn (1409), followed by product number 2 (348).
• Product number 3 saw the highest disproportion of churn, with a rate of 220 vs. 46 who remained.
• Product number 4 saw 100% of customers churn, albeit a figure of only 60 customers.
Average churn rate per tenure length.
Average churn rates per tenure shows the early tenures (0 - 1 years) tend to lean towards greater churn figures.
Years 2, 4, 5, 6, 7, 8 & 10 appear to be the sweet spot for customer retention.
That leaves years 4, 5, 6, 7 & 8 as a valid long-standing customer retention pattern should there be a system considered for successfully onboarding and retaining new customers in the initial short-term.
Balance and churn.
A significant number of customers with a balance in the region of 2500 churned, but around six times that amount of customers with the same balance remained.
The majority of the churn data exists between the 85K - 160K range, with 107K - 127K being the most prominent of that specific range.
Average balance by gender vs. churn.
The average bank balance by gender vs. churn rates suggest that both Males and Females holding larger balances churned, meaning a fast response to this issue is required.
• Female, Not Churned: 71,183.25 • Female, Churned: 89,036.64 • Male, Not Churned: 73,911.61 • Male, Churned: 93,736.48
Credit score and churn.
A clear visible pattern here; the majority of churn happening around the lower credit score figures, especially under 500.
There are mild peaks and troughs which should average out to make up almost as much churn over the rest of the credit score range, but nothing outstanding to note, although the spike in churn around the credit score of 830-840 will be worthy of further investigation.
Adding the estimated salary data shows that there is a good range of salaries in the 850 credit score & churn bracket, making the issue a little difficult to pinpoint from just this data. Although here it is clearer to see that the customers in the 350 credit score bracket responsible for large churn figures have high salaries.
Adding the product data shows us that product number 1 has the highest churn rate for those with lower credit scores as well as displaying relatively high churn rates across the credit score board, where products 3 and 4 (with very high churn rates) appear to have been used by / targeted to those with higher credit scores. The credit score should be quite a balanced feature in that case, representing churn in both the low and high score margins.
Product number by age and churn.
Here we see dark hues representing high churn rates in products 3 and 4 across all age ranges with reasonably high average balances.
The customers with balances in the 80K - 130K between the ages of 40 and 65 in product 1, and the customers with balances between 45K - 85K between the ages of 49 and 75 in product 2 will be highly significant.
So from the above charts we've seen those with high salaries and low credit scores in the product 1 category are churning as well as those with a range of salaries and high credit scores in the products 3 & 4 categories. There is significant overlap between all features in the age ranges of 50-70, making this a highly important age bracket.
Credit card and churn.
1421 credit card owners churned, 613 non-holders churned. The average figures show a slight variation where +0.63% of credit card holders were retained when compared to those not holding cards:
• Non-holders: 20.81% churned. • Holders: 20.18% churned.
Active member and churn || Gender and churn.
The majority of churn came from customers who weren't active members, an average of 0.27 versus 0.14.
The majority gender responsible for churn is Female (0.25 vs. 0.16).
KMeans clustering.
I will implement both a clustering method and the Cox Proportional Hazards Model to see what customer attributes are likely to affect one another.
KMeans has returned three clusters, which can be interpreted as follows: Cluster 0: • Average credit score: 650.73 • Average age: 38.04 • Average tenure: 5.15 years • Average balance: 10,020.77 • Average number of products: 1.96 • Average estimated salary: 98,648.34 • Churn rate: 14.82% Cluster 1: • Average credit score: 651.69 • Average age: 38.97 • Average tenure: 2.31 years • Average balance: 111,972.64 • Average number of products: 1.29 • Average estimated salary: 95,852.19 • Churn rate: 23.78% Cluster 2: • Average credit score: 649.12 • Average age: 39.87 • Average tenure: 7.58 years • Average balance: 116,114.29 • Average number of products: 1.29 • Average estimated salary: 105,998.77 • Churn rate: 23.22% Cluster 0 has the lowest churn rate (14.82%).
Characteristics which constitute a low churn rate: A low average age, low average balance, a mid-range tenure, the highest average number of products (AKA: the most values for product number 2, considering most of product 3 and all of product 4 churned), the highest average of credit cards, a mid-range active_member count and a mid-range salary.
Cluster 1 and Cluster 2 have similar churn rates, but Cluster 2 (with its slightly lower churn rate) has a higher average balance and tenure.
Cox Proportional Hazards Model
The Cox Proportional Hazards Model is a technique used to explore the relationship between the survival time of subjects and one or more predictor variables. Used predominantly in medical research (for analysing survival data), I think it would provide some good insights for a BI use case today.
Some key points from the model: • Concordance: 0.72 (indicates a good predictive ability) • Partial AIC: 32983.27 • Log-likelihood ratio test: 1156.95 on 10 degrees of freedom Significant variables (p < 0.05): • Credit Score: Slightly significant with a negative coefficient. • Age: Highly significant with a positive coefficient. • Balance: Significant with a positive coefficient. • Active Member: Highly significant with a negative coefficient. • Country (Germany): Highly significant with a positive coefficient. • Gender (Male): Highly significant with a negative coefficient, thus Female. These results verify a lot of what we've witnessed so far from a couple of analysis methods, indicating that higher age ranges, higher bank balances, not being an active member, being from Germany, and of the Female gender are significant factors in predicting high churn rates.
A more detailed analysis of the factors contributing to customer churn.
• 1. Credit Score: - Coefficient: -0.00 - Interpretation: A slightly negative coefficient indicates that higher credit scores are associated with a marginally lower risk of churn. However, the effect is minimal. • 2. Age: - Coefficient: 0.05 - Interpretation: A positive coefficient indicates that older customers are at a higher risk of churn. This aligns with the earlier analysis showing higher churn rates among older age groups. • 3. Balance: - Coefficient: 0.00 - Interpretation: A positive coefficient indicates that higher balances are associated with a higher risk of churn. This is again consistent with the observation that customers with higher balances tend to churn more. • 4. Active Member: - Coefficient: -0.74 - Interpretation: A highly negative coefficient indicates that active members are significantly less likely to churn. This suggests that engagement and activity are crucial in retaining customers. • 5. Country (Germany): - Coefficient: 0.49 - Interpretation: A positive coefficient indicates that customers from Germany are at a higher risk of churn compared to those from other countries. • 6. Gender (Male): - Coefficient: -0.39 - Interpretation: A negative coefficient indicates that male customers are less likely to churn compared to female customers. This is consistent with the observation that females have higher churn rates.
Additionally...
Following on from the above information using insights found earlier in the EDA:
• 7. Product Number: - Product number 1 has the highest churn rate for those with lower credit scores. Products 3 and 4 have very high churn rates and are used by customers with higher credit scores. • 8. Tenure: - Early tenures (0 - 1 years) tend to lean towards greater churn figures. Years 2, 4, 5, 6, 7, 8 & 10 show better customer retention. • 9. Credit Card: - Credit card holders have a slightly lower churn rate compared to non-holders. • 10. Clusters: - Cluster 0: Lowest churn rate (14.82%). Characteristics: Low average age, low average balance, mid-range tenure, highest average number of products, highest average of credit cards, mid-range active_member count, and mid-range salary. - Cluster 1 and Cluster 2 have similar churn rates, but Cluster 2 has a higher average balance and tenure.