Problem Statement
In the rapidly evolving banking sector, understanding the diverse customer base is crucial for creating personalized services, improving customer satisfaction, and enhancing profitability. With over 1 million transactions by more than 800,000 customers, this project aims to develop a customer segmentation model that identifies distinct customer groups based on their demographics and transaction behaviors. These insights will enable the bank to tailor marketing strategies, optimize product offerings, and improve overall customer engagement.
Objective
The primary objective of this project is to segment the bank's customers into meaningful clusters using their demographic and transactional data. The outcomes will help the bank:
- Identify high-value customer segments for targeted marketing.
- Understand regional trends and customer behavior.
- Enhance customer experience by tailoring services based on segment profiles.
- Optimize product offerings and improve customer retention strategies.
Dataset
This dataset consists of over 1 million transactions from more than 800,000 customers of an Indian bank, including customer demographics (age, location, gender) and transaction details (account balance, transaction amount, etc.).
Requirements
- Pandas: For data manipulation and analysis.
- Matplotlib and Seaborn: For data visualization.
- Scikit-learn: For machine learning tasks, including:
  - StandardScaler: For standardizing features.
  - PCA (Principal Component Analysis): For dimensionality reduction.
  - KMeans: For clustering.
- Google Drive: For data storage and retrieval.
Achievements
The achievement of this project is the successful segmentation of the bank's customers into meaningful clusters based on their demographic and transactional data. This segmentation provides valuable insights into different customer groups, enabling the bank to tailor marketing strategies, optimize product offerings, and improve overall customer engagement. Key achievements include:
- Data Cleaning and Preprocessing: Handled missing values and outliers, and performed feature engineering to create new relevant features.
- Dimensionality Reduction: Applied PCA to reduce the dimensionality of the data, making it easier to visualize and cluster.
- Customer Segmentation: Used K-Means clustering to segment customers into four distinct clusters: Low Value Customers, High Value Customers, Moderate Value Customers, and Emerging Customers.
- Cluster Analysis: Generated detailed summaries and visualizations of each customer segment, highlighting key characteristics such as average age, account balance, and transaction behavior.
- Marketing Strategies: Developed targeted marketing strategies for each customer segment to enhance engagement, retention, and growth.
- Business Opportunities: Identified key business opportunities for each cluster to drive customer value and improve financial outcomes.
- Data Export: Saved the segmented dataset and individual cluster files for further analysis and use by the bank.

These achievements provide a comprehensive understanding of the customer base, enabling the bank to make data-driven decisions to improve customer satisfaction and profitability.
Data Exploration
CONNECT TO GOOGLE DRIVE STORAGE
IMPORTING DATA
Exploratory Data Analysis
DATA INFO
CHECKING MISSING VALUES
SUMMARY STATISTICS
DISTRIBUTION OF CATEGORICAL VARIABLES
DUPLICATES
There are no duplicate rows in the dataset. Next, let's check for outliers, the shape of the dataset, and the data types.
SHAPE
CHECKING DATA TYPE
CHECKING OUTLIERS
INSIGHTS FROM EDA
Here are the insights from the Exploratory Data Analysis (EDA):
1. Missing Values:
   - `CustomerDOB`: 3397 missing values.
   - `CustGender`: 1100 missing values.
   - `CustLocation`: 151 missing values.
   - `CustAccountBalance`: 2369 missing values.
2. Data Types: Most columns are of type `object`, except for `CustAccountBalance` and `TransactionAmount (INR)`, which are `float64`, and `TransactionTime`, which is `int64`.
3. Distribution of Categorical Variables:
   - `CustGender`: The majority of customers are male.
   - `CustLocation`: The top location is Mumbai, followed by other cities.
4. Distribution of Numerical Variables:
   - `CustAccountBalance`: The distribution is highly skewed, with a few customers having very high account balances.
   - `TransactionAmount (INR)`: The distribution is also skewed, with a few transactions having very high amounts.
5. Outliers: Significant outliers are present in both `CustAccountBalance` and `TransactionAmount (INR)`.
6. Duplicates: There are no duplicate rows in the dataset.
7. Shape: The dataset contains 1,048,567 rows and 9 columns.

These insights will guide the next steps in data cleaning and preprocessing before building the customer segmentation model.
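The checks summarized above can be reproduced with a handful of pandas calls. The following is a minimal sketch on a tiny made-up DataFrame; the column names come from the insights above, but the values are illustrative only and the real notebook may have used different calls.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transactions table. Column names follow the EDA
# insights; the values are invented for illustration.
df = pd.DataFrame({
    "CustGender": ["M", "F", np.nan, "M"],
    "CustLocation": ["MUMBAI", "DELHI", "MUMBAI", np.nan],
    "CustAccountBalance": [17819.05, np.nan, 2270.69, 866503.21],
    "TransactionAmount (INR)": [25.0, 27999.0, 459.0, 2060.0],
})

missing_per_column = df.isnull().sum()          # NaN count per column
column_types = df.dtypes                        # dtype of each column
gender_counts = df["CustGender"].value_counts() # categorical distribution
duplicate_rows = int(df.duplicated().sum())     # fully duplicated rows
n_rows, n_cols = df.shape                       # dataset dimensions

print(missing_per_column)
print(column_types)
print(gender_counts)
print(f"duplicates: {duplicate_rows}, shape: {n_rows} x {n_cols}")
```

On the full dataset the same calls would yield the counts listed in the insights, assuming the data is loaded into a DataFrame named `df`.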
DATA CLEANING
CHECKING FOR MISSING VALUES IN CLEANED DATASET
All missing values have been successfully handled. The dataset is now clean with no missing values.
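The notebook text does not say how the missing values were handled. One common option, shown here purely as an assumption, is to drop incomplete rows, which is reasonable when the missing counts are small relative to the dataset size. Imputation would be an equally valid alternative.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in the four columns that had missing values in the EDA.
df = pd.DataFrame({
    "CustomerDOB": ["1994-01-10", np.nan, "1988-11-26", "1973-09-14"],
    "CustGender": ["F", "M", np.nan, "M"],
    "CustLocation": ["JAMSHEDPUR", "MUMBAI", "MUMBAI", np.nan],
    "CustAccountBalance": [17819.05, 2270.69, 17874.44, np.nan],
})

# Assumption: rows with any missing value are dropped. The original
# notebook may have imputed values instead.
df_clean = df.dropna().reset_index(drop=True)

# Verify that nothing is missing afterwards.
remaining_missing = int(df_clean.isnull().sum().sum())
print(f"rows kept: {len(df_clean)}, missing values left: {remaining_missing}")
```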
REMOVING OUTLIERS
The outliers in 'CustAccountBalance' and 'TransactionAmount (INR)' have been successfully removed. The dataset is now ready for further analysis and modeling.
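The removal rule is not stated in the text. A frequently used choice is the interquartile range (IQR) rule, sketched below as an assumption. The helper name `remove_iqr_outliers` and the multiplier `k=1.5` are illustrative, not taken from the notebook.

```python
import pandas as pd

def remove_iqr_outliers(frame, columns, k=1.5):
    """Keep rows whose values lie in [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

# Toy data: the last balance is far outside the range of the others.
df = pd.DataFrame({
    "CustAccountBalance": [1000.0, 1200.0, 1100.0, 1300.0, 900.0, 500000.0],
    "TransactionAmount (INR)": [50.0, 60.0, 55.0, 65.0, 45.0, 70.0],
})

df_no_outliers = remove_iqr_outliers(
    df, ["CustAccountBalance", "TransactionAmount (INR)"]
)
print(df_no_outliers)
```

Because both columns are heavily skewed, the IQR rule can discard a sizeable share of rows. A log transform or percentile capping would be gentler alternatives.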
ONE HOT ENCODING
FEATURE ENGINEERING
CREATE NEW FEATURES
CREATING NEW FEATURES
The new features `Age`, `DaysSinceLastTransaction`, and `TransactionHour` have been successfully created and added to the dataset.
EXPLAINING THE NEW FEATURES AND THEIR IMPORTANCE
1. Age: This feature represents the age of the customer, calculated from their date of birth (CustomerDOB). Age is an important demographic variable that can influence customer behavior and preferences. For example, younger customers might prefer digital banking services, while older customers might value personalized customer service.
2. DaysSinceLastTransaction: This feature represents the number of days since the customer's last transaction, calculated from the TransactionDate. This feature is crucial for understanding customer engagement and activity levels. Customers with a high number of days since their last transaction might be at risk of churn.
3. TransactionHour: This feature represents the hour of the day when the transaction occurred, extracted from the TransactionTime. This feature can help identify peak transaction times and customer behavior patterns throughout the day.
4. CLV (Customer Lifetime Value): This feature represents the total value of transactions made by the customer, calculated as the sum of TransactionAmount (INR) for each customer. CLV is a key metric for identifying high-value customers who contribute significantly to the bank's revenue.
5. Recency: This feature represents the number of days since the customer's most recent transaction. It is part of the RFM (Recency, Frequency, Monetary) metrics and helps identify how recently a customer has engaged with the bank.
6. Frequency: This feature represents the total number of transactions made by the customer. It is another component of the RFM metrics and helps identify how often a customer engages with the bank.
7. Monetary: This feature represents the total monetary value of transactions made by the customer. It is the final component of the RFM metrics and helps identify the overall spending behavior of the customer.

The RFM metrics (Recency, Frequency, Monetary) are widely used in customer segmentation to identify different customer segments based on their transaction behavior. These metrics help in understanding customer loyalty, engagement, and value, which are essential for targeted marketing and personalized services.
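The feature definitions above can be sketched in pandas as follows. Several details are assumptions made for illustration: the `CustomerID` column name, ISO-formatted date strings, `TransactionTime` stored as an HHMMSS integer, a fixed `reference_date`, and age approximated with 365-day years. The original notebook may differ on each point.

```python
import pandas as pd

# Toy transactions; values are invented for illustration.
df = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "CustomerDOB": ["1990-05-20", "1990-05-20", "1975-12-01"],
    "TransactionDate": ["2016-08-02", "2016-09-10", "2016-08-15"],
    "TransactionTime": [143207, 91500, 235959],  # assumed HHMMSS integers
    "TransactionAmount (INR)": [250.0, 100.0, 1200.0],
})
df["CustomerDOB"] = pd.to_datetime(df["CustomerDOB"])
df["TransactionDate"] = pd.to_datetime(df["TransactionDate"])

# Hypothetical reference date; the notebook may have used the current date.
reference_date = pd.Timestamp("2016-10-01")

# Age in whole years (approximation using 365-day years).
df["Age"] = (reference_date - df["CustomerDOB"]).dt.days // 365
# Days elapsed between each transaction and the reference date.
df["DaysSinceLastTransaction"] = (reference_date - df["TransactionDate"]).dt.days
# Hour of day, taken from the leading digits of the HHMMSS integer.
df["TransactionHour"] = df["TransactionTime"] // 10000

# Per-customer RFM metrics; CLV is defined above as the same sum as Monetary.
rfm = df.groupby("CustomerID").agg(
    Recency=("DaysSinceLastTransaction", "min"),
    Frequency=("TransactionAmount (INR)", "count"),
    Monetary=("TransactionAmount (INR)", "sum"),
)
rfm["CLV"] = rfm["Monetary"]
print(rfm)
```

Note that, as defined in this notebook, CLV and Monetary carry the same information, so including both in clustering effectively weights total spend twice.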
STANDARDIZING OR NORMALIZING FEATURES
The features have been successfully standardized. The dataset is now ready for further analysis and modeling.
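Standardization with scikit-learn's `StandardScaler` rescales each feature to zero mean and unit variance, so that large-valued columns such as account balance do not dominate distance-based clustering. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature matrix; values are invented for illustration.
features = pd.DataFrame({
    "CustAccountBalance": [1000.0, 25000.0, 18000.0, 40000.0],
    "TransactionAmount (INR)": [50.0, 1200.0, 300.0, 700.0],
    "Age": [25, 43, 37, 52],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),  # (x - mean) / std, column by column
    columns=features.columns,
    index=features.index,
)

# The fitted scaler can map standardized values back to original units.
restored = scaler.inverse_transform(scaled)

print(scaled.round(3))
```

Keeping the fitted `scaler` around is useful later: `inverse_transform` is one way to return cluster summaries to interpretable units such as INR and years.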
PCA TO REDUCE DIMENSIONALITY OF DATA
The dimensionality of the data has been successfully reduced using PCA (Principal Component Analysis). The resulting DataFrame contains two principal components, `PCA1` and `PCA2`. These components capture the most significant variance in the data, making it easier to visualize and perform clustering. Next, we can proceed with clustering the data using these principal components.
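A minimal sketch of this step with scikit-learn's `PCA`, using random data in place of the standardized customer features. The column names `PCA1` and `PCA2` follow the text; everything else is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the standardized feature matrix (200 rows, 5 features).
rng = np.random.default_rng(0)
X_scaled = StandardScaler().fit_transform(rng.normal(size=(200, 5)))

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
pca_df = pd.DataFrame(pca.fit_transform(X_scaled), columns=["PCA1", "PCA2"])

# Share of total variance retained by each component.
explained = pca.explained_variance_ratio_
print(pca_df.head())
print(f"variance retained: {explained.sum():.1%}")
```

Checking `explained_variance_ratio_` is worthwhile before clustering: if two components retain only a small share of the variance, clusters found in the reduced space may not reflect structure in the full feature set.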
Customer Segmentation Using K-Means Clustering
DETERMINING THE OPTIMAL NUMBER OF CLUSTERS WITH THE ELBOW METHOD
The Elbow method is used to determine the optimal number of clusters for K-Means clustering. It involves plotting the Within-Cluster Sum of Squares (WCSS) against the number of clusters. The WCSS is the sum of squared distances between each point and the centroid of its cluster. Because WCSS always decreases as clusters are added, you look for an "elbow point" in the plot where the decrease starts to slow down. This point indicates a suitable number of clusters, balancing a low WCSS against the risk of overfitting with too many clusters.
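The WCSS curve described above can be computed as follows. This is a sketch on synthetic blobs, not the bank data. In scikit-learn, the WCSS of a fitted model is exposed as `inertia_`; the range of `k` values and `random_state` are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with four groups, standing in for the PCA output.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

k_values = list(range(1, 9))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this k

for k, value in zip(k_values, wcss):
    print(f"k={k}: WCSS={value:.1f}")
```

Plotting `wcss` against `k_values` (for example with `matplotlib.pyplot.plot`) gives the elbow chart. The elbow is judged visually, so it is often paired with a second criterion such as the silhouette score.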
APPLYING K-MEANS CLUSTERING WITH 4 CLUSTERS
The K-Means clustering has been applied to the PCA-transformed data, and the cluster labels have been added to the DataFrame. The dataset now includes a `Cluster` column indicating the cluster assignment for each data point. Next, let's visualize the clusters.
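A minimal sketch of this step, with random points standing in for the PCA-transformed customer data. The `Cluster` column name follows the text; `n_init` and `random_state` are illustrative settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random stand-in for the two principal components.
rng = np.random.default_rng(1)
pca_df = pd.DataFrame(rng.normal(size=(400, 2)), columns=["PCA1", "PCA2"])

# Fit K-Means with four clusters and attach the labels to the DataFrame.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
pca_df["Cluster"] = kmeans.fit_predict(pca_df[["PCA1", "PCA2"]])

print(pca_df["Cluster"].value_counts().sort_index())
print(kmeans.cluster_centers_)
```

Fixing `random_state` keeps the label numbering stable between runs, which matters when cluster IDs are later given names such as "High Value Customers".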
VISUALIZING CLUSTERS
The scatter plot visualizes the customer segmentation using K-Means clustering. Each point represents a customer, and the colors indicate the different clusters identified by the K-Means algorithm. The x-axis and y-axis represent the two principal components (PCA1 and PCA2) obtained from the PCA transformation, which capture the most significant variance in the data. The clusters are well-separated, indicating that the K-Means algorithm has successfully grouped similar customers together based on their demographic and transactional data.
ADDING CLUSTERS TO DATAFRAME
DENORMALIZING DATA
SUMMARY STATISTICS OF CLUSTERS
Here is the detailed summary of each customer segment identified through K-Means clustering:

Cluster 0:
- Average Gender: 0.76 (approximately 76% male)
- Average Account Balance: 18,740.24 INR
- Average Transaction Amount: 209.52 INR
- Average Age: 36.27 years
- Average Days Since Last Transaction: 2900.99 days
- Average Transaction Hour: 14.96 (approximately 3 PM)
- Average Customer Lifetime Value (CLV): 232.73 INR
- Average Recency: 2896.86 days
- Average Frequency: 1.13 transactions
- Average Monetary: 232.73 INR

Cluster 1:
- Average Gender: 0.70 (approximately 70% male)
- Average Account Balance: 32,090.49 INR
- Average Transaction Amount: 1466.18 INR
- Average Age: 43.32 years
- Average Days Since Last Transaction: 2947.97 days
- Average Transaction Hour: 16.06 (approximately 4 PM)
- Average Customer Lifetime Value (CLV): 1995.74 INR
- Average Recency: 2922.29 days
- Average Frequency: 1.63 transactions
- Average Monetary: 1995.74 INR

Cluster 2:
- Average Gender: 0.72 (approximately 72% male)
- Average Account Balance: 27,788.10 INR
- Average Transaction Amount: 712.11 INR
- Average Age: 40.31 years
- Average Days Since Last Transaction: 2916.50 days
- Average Transaction Hour: 15.83 (approximately 4 PM)
- Average Customer Lifetime Value (CLV): 960.24 INR
- Average Recency: 2897.64 days
- Average Frequency: 1.51 transactions
- Average Monetary: 960.24 INR

Cluster 3:
- Average Gender: 0.75 (approximately 75% male)
- Average Account Balance: 20,894.71 INR
- Average Transaction Amount: 352.26 INR
- Average Age: 37.40 years
- Average Days Since Last Transaction: 3070.48 days
- Average Transaction Hour: 15.29 (approximately 3 PM)
- Average Customer Lifetime Value (CLV): 369.47 INR
- Average Recency: 3063.79 days
- Average Frequency: 1.06 transactions
- Average Monetary: 369.47 INR

These summaries highlight the key characteristics of each customer segment, providing insights into their demographics and transaction behaviors.
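Per-cluster averages like these are typically obtained with a `groupby` on the cluster label. The sketch below uses a tiny made-up table. The column `CustGender_M` is a hypothetical 0/1 encoding (1 = male), which is why its mean can be read as the share of male customers.

```python
import pandas as pd

# Toy denormalized data with cluster labels; values are illustrative.
df = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1],
    "CustGender_M": [1, 0, 1, 1, 0],  # hypothetical 0/1 encoding, 1 = male
    "Age": [30, 40, 45, 50, 55],
    "CustAccountBalance": [1000.0, 3000.0, 20000.0, 30000.0, 40000.0],
    "Frequency": [1, 1, 2, 1, 3],
})

# Mean of every feature within each cluster, rounded for readability.
cluster_summary = df.groupby("Cluster").mean().round(2)
print(cluster_summary)
```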
DISTRIBUTION OF EACH CUSTOMER SEGMENT USING KEY DEMOGRAPHIC AND TRANSACTIONAL FEATURES.
Average Age By Customer Segment
Average Account Balance By Segment
Average Transaction Amount By Segment
Average Customer Lifetime Value By Segment
The bar charts visualize the distribution of key demographic and transactional features across the different customer segments identified through K-Means clustering:
1. Average Age by Customer Segment: This chart shows the average age of customers in each cluster. It helps to understand the age distribution across different segments.
2. Average Account Balance by Customer Segment: This chart displays the average account balance for customers in each cluster. It provides insights into the financial status of different customer segments.
3. Average Transaction Amount by Customer Segment: This chart illustrates the average transaction amount for customers in each cluster. It helps to identify the spending behavior of different segments.
4. Average Customer Lifetime Value (CLV) by Customer Segment: This chart shows the average CLV for customers in each cluster. CLV is a key metric for understanding the long-term value of customers to the bank.

These visualizations help to highlight the key characteristics of each customer segment, providing valuable insights for targeted marketing and personalized services.
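The four charts can be approximated with plain Matplotlib. The sketch below uses the cluster averages reported in the summary statistics of this notebook; the layout, titles, and figure size are illustrative choices, and the original may have used Seaborn instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Cluster averages as reported in the summary statistics of this notebook.
cluster_means = pd.DataFrame(
    {
        "Average Age": [36.27, 43.32, 40.31, 37.40],
        "Average Account Balance (INR)": [18740.24, 32090.49, 27788.10, 20894.71],
        "Average Transaction Amount (INR)": [209.52, 1466.18, 712.11, 352.26],
        "Average CLV (INR)": [232.73, 1995.74, 960.24, 369.47],
    },
    index=pd.Index([0, 1, 2, 3], name="Cluster"),
)

# One bar chart per feature, arranged in a 2 x 2 grid.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, column in zip(axes.ravel(), cluster_means.columns):
    ax.bar(cluster_means.index.astype(str), cluster_means[column])
    ax.set_title(f"{column} by Customer Segment")
    ax.set_xlabel("Cluster")
    ax.set_ylabel(column)
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, `plt.show()` or `fig.savefig(...)` would be needed.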
PIE CHARTS TO REPRESENT THE PROPORTION OF EACH SEGMENT
The pie chart visualizes the distribution of customer segments identified through K-Means clustering. Each slice of the pie represents a different customer segment, and the size of each slice indicates the proportion of customers in that segment relative to the entire dataset. This visualization helps to understand the relative size of each customer segment, providing insights into the composition of the customer base.
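The segment proportions behind such a pie chart come from counting cluster labels. A minimal sketch with made-up labels (the real proportions are not stated in the text):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up cluster labels for ten customers.
clusters = pd.Series([0, 0, 0, 0, 1, 2, 2, 3, 3, 3], name="Cluster")

# Fraction of customers in each cluster, ordered by cluster ID.
proportions = clusters.value_counts(normalize=True).sort_index()

fig, ax = plt.subplots()
ax.pie(
    proportions,
    labels=[f"Cluster {c}" for c in proportions.index],
    autopct="%1.1f%%",  # print the percentage on each slice
)
ax.set_title("Proportion of Customers per Segment")
print(proportions)
```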
CLUSTERS AND THEIR RESPECTIVE CHARACTERISTICS
The clusters have been categorized as follows:

Cluster 0: Low Value Customers
- Description: Customers with low account balance, low transaction amount, and low CLV. They are relatively younger and have fewer transactions.
- Average Age: 36.27
- Average Account Balance: 18,740.24 INR
- Average Transaction Amount: 209.52 INR
- Average CLV: 232.73 INR
- Average Frequency: 1.13

Cluster 1: High Value Customers
- Description: Customers with high account balance, high transaction amount, and high CLV. They are older and have more frequent transactions.
- Average Age: 43.32
- Average Account Balance: 32,090.49 INR
- Average Transaction Amount: 1,466.18 INR
- Average CLV: 1,995.74 INR
- Average Frequency: 1.63

Cluster 2: Moderate Value Customers
- Description: Customers with moderate account balance, moderate transaction amount, and moderate CLV. They are middle-aged and have a moderate number of transactions.
- Average Age: 40.31
- Average Account Balance: 27,788.10 INR
- Average Transaction Amount: 712.11 INR
- Average CLV: 960.24 INR
- Average Frequency: 1.51

Cluster 3: Emerging Customers
- Description: Customers with emerging potential, moderate account balance, and moderate transaction amount. They are relatively younger and have fewer transactions but show potential for growth.
- Average Age: 37.40
- Average Account Balance: 20,894.71 INR
- Average Transaction Amount: 352.26 INR
- Average CLV: 369.47 INR
- Average Frequency: 1.06
Marketing Strategies
Here are the marketing strategies for each of the clusters:

Cluster 0: Low Value Customers
- Focus on engagement and retention.
- Offer personalized services and incentives to increase transaction frequency and account balance.
- Provide educational content on financial management.

Cluster 1: High Value Customers
- Provide premium services and exclusive offers.
- Focus on maintaining and enhancing customer satisfaction.
- Offer loyalty programs and personalized financial advice to retain these high-value customers.

Cluster 2: Moderate Value Customers
- Encourage higher engagement and spending.
- Offer targeted promotions and discounts.
- Provide personalized recommendations and financial products to meet their needs.

Cluster 3: Emerging Customers
- Identify growth potential and nurture these customers.
- Offer introductory offers and incentives to increase engagement.
- Provide educational resources and support to help them grow their financial portfolio.
Saving Clusters
The clusters have been separated into different files and saved to the storage.
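One way to write per-cluster files is to iterate over a `groupby`. This sketch writes to a temporary directory as a stand-in for the Google Drive folder; the file names and the toy columns are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy segmented dataset; values are illustrative.
df = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3", "C4"],
    "Cluster": [0, 1, 1, 3],
    "CLV": [232.0, 1995.0, 1800.0, 369.0],
})

# Temporary directory standing in for the mounted Drive folder.
output_dir = Path(tempfile.mkdtemp())

# Full segmented dataset in one file.
df.to_csv(output_dir / "segmented_customers.csv", index=False)

# One file per cluster.
saved_files = []
for cluster_id, group in df.groupby("Cluster"):
    path = output_dir / f"cluster_{cluster_id}.csv"
    group.to_csv(path, index=False)
    saved_files.append(path.name)

print(saved_files)
```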
Key Business Opportunities
Here are the key business opportunities for each cluster:

Cluster 0: Low Value Customers
- Develop financial literacy programs to educate customers on managing their finances.
- Introduce low-cost financial products and services to increase engagement.
- Implement loyalty programs to incentivize frequent transactions.

Cluster 1: High Value Customers
- Offer exclusive premium services and personalized financial advice.
- Develop loyalty programs with high-value rewards to retain these customers.
- Provide early access to new financial products and investment opportunities.

Cluster 2: Moderate Value Customers
- Introduce targeted promotions and discounts to encourage higher spending.
- Offer personalized financial products and services to meet their needs.
- Implement referral programs to leverage their moderate engagement for new customer acquisition.

Cluster 3: Emerging Customers
- Identify and nurture potential high-value customers through personalized support.
- Offer introductory offers and incentives to increase engagement.
- Provide educational resources and financial planning tools to help them grow their financial portfolio.
Follow-Up Analyses
Here are potential follow-up analyses or additional data that could enhance the customer segmentation insights:
- Behavioral Analysis: Analyze customer behavior patterns such as transaction frequency over time, preferred transaction channels (e.g., online, in-branch), and types of transactions (e.g., deposits, withdrawals, transfers). This can provide deeper insights into customer preferences and habits.
- Customer Feedback and Satisfaction: Incorporate customer feedback and satisfaction scores to understand the qualitative aspects of customer experience. Analyzing this data can help identify pain points and areas for improvement.
- Product Usage Analysis: Examine the usage patterns of different financial products and services (e.g., loans, credit cards, savings accounts) among different customer segments. This can help identify cross-selling and upselling opportunities.
- Churn Prediction: Develop a churn prediction model to identify customers at risk of leaving the bank. This can help in implementing proactive retention strategies for vulnerable segments.
- Income and Employment Data: Incorporate additional demographic data such as income levels, employment status, and occupation. This can provide a more comprehensive understanding of customers' financial situations and needs.
- Geospatial Analysis: Perform geospatial analysis to identify regional trends and opportunities. This can help in tailoring marketing strategies and product offerings based on geographic location.
- Social Media and Digital Footprint: Analyze customers' social media activity and digital footprint to gain insights into their interests, preferences, and online behavior. This can enhance personalized marketing efforts.
- Life Stage Segmentation: Segment customers based on their life stages (e.g., students, young professionals, families, retirees). This can help in offering relevant financial products and services tailored to their specific needs.
- Credit Score Analysis: Incorporate credit score data to assess customers' creditworthiness and risk profiles. This can aid in making informed lending decisions and offering appropriate credit products.
- Time Series Analysis: Conduct time series analysis to identify trends and seasonality in customer transactions. This can help in forecasting future behavior and planning marketing campaigns accordingly.
Summary
In this project, we aimed to segment the bank's customers into meaningful clusters using their demographic and transactional data. The key steps and achievements of the project are as follows:

Data Cleaning and Preprocessing:
- Handled missing values and outliers in the dataset.
- Converted relevant columns to appropriate data types.
- Created new features such as Age, DaysSinceLastTransaction, TransactionHour, and Customer Lifetime Value (CLV).

Dimensionality Reduction:
- Applied Principal Component Analysis (PCA) to reduce the dimensionality of the data, making it easier to visualize and perform clustering.

Customer Segmentation:
- Used K-Means clustering to segment customers into four distinct clusters: Low Value Customers, High Value Customers, Moderate Value Customers, and Emerging Customers.
- Generated detailed summaries and visualizations of each customer segment, highlighting key characteristics such as average age, account balance, and transaction behavior.

Marketing Strategies:
- Developed targeted marketing strategies for each customer segment to enhance engagement, retention, and growth.

Business Opportunities:
- Identified key business opportunities for each cluster to drive customer value and improve financial outcomes.

Data Export:
- Saved the segmented dataset and individual cluster files for further analysis and use by the bank.
Conclusion
The customer segmentation model successfully identified distinct customer groups based on their demographics and transaction behaviors. These insights enable the bank to tailor marketing strategies, optimize product offerings, and improve overall customer engagement. By understanding the unique characteristics and needs of each customer segment, the bank can implement targeted initiatives to enhance customer satisfaction and profitability. The project also identified potential follow-up analyses and additional data that could further enhance the customer segmentation insights, providing a roadmap for future improvements.