Telecom Customer Churn Prediction
Churn rate, also called attrition rate, measures the number of individuals or units leaving a group over a specified period. The term is used in many contexts, including business, human resources, and IT. Most commonly, it refers to the proportion of contractual (or subscribed) customers who terminate their contracts or subscriptions with a company in a given timeframe, so it is primarily associated with companies operating on a subscription basis. Predicting future churn rates helps the business gain a better understanding of its expected future revenue. In addition, when churn prediction is used to estimate how likely a particular customer is to churn, the company can target that individual and try to prevent them from discontinuing their subscription. And since research suggests that acquiring a new customer costs 5 to 6 times more than keeping an existing one, there is a strong revenue-based reason to do everything possible to retain existing customers.
The DTH company has collected customer purchase data across various account segments. The data includes the tenure of the selected plan, the city tier where the plan was selected, the payment method used, gender, marital status, revenue generated per month, the type of login device used for the account, and other factors.
1. The dataset initially had 11260 rows and 19 columns. We removed the Account ID column, since it is not useful for prediction.
2. The data has 5 float variables, 1 integer variable and 12 object variables.
3. We renamed values in columns with naming inconsistencies, namely the Gender and Account segment variables.
4. The Gender column previously contained the values Male, Female, F and M. We converted F and M to Female and Male respectively.
5. For the Account segment variable, we renamed Regular + to Regular_Plus and Super + to Super_Plus.
6. The dataset contained around 259 duplicate records, which we removed.
7. Every column also contains many special characters, and there are around 3616 missing values. We replaced the special characters with null values. To impute the null values, we will first look at the distribution of the numerical variables and then decide on an appropriate imputation method.
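The cleaning steps above can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the code used for the analysis: the column names (AccountID, Gender, Account_segment, Tenure), the set of placeholder characters, and the sample rows are all assumptions made for the example, and duplicates are dropped after the identifier is removed.

```python
# Value mappings taken from the renaming steps described above.
GENDER_MAP = {"M": "Male", "F": "Female"}
SEGMENT_MAP = {"Regular +": "Regular_Plus", "Super +": "Super_Plus"}
# Assumed placeholder tokens; the report does not list the actual characters.
SPECIAL_TOKENS = {"#", "$", "*", "@", "+"}


def clean_row(row):
    """Return a cleaned copy of one record (a dict of column -> value)."""
    cleaned = {}
    for column, value in row.items():
        if column == "AccountID":
            continue  # drop the identifier column
        if isinstance(value, str):
            value = value.strip()
            if value == "" or value in SPECIAL_TOKENS:
                value = None  # special characters become missing values
        cleaned[column] = value
    if "Gender" in cleaned:
        cleaned["Gender"] = GENDER_MAP.get(cleaned["Gender"], cleaned["Gender"])
    if "Account_segment" in cleaned:
        segment = cleaned["Account_segment"]
        cleaned["Account_segment"] = SEGMENT_MAP.get(segment, segment)
    return cleaned


def clean_dataset(rows):
    """Clean every row and drop exact duplicates, keeping the first copy."""
    seen = set()
    result = []
    for row in rows:
        cleaned = clean_row(row)
        key = tuple(sorted(cleaned.items(), key=lambda item: item[0]))
        if key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


def count_missing(rows):
    """Count missing (None) cells across all rows."""
    return sum(value is None for row in rows for value in row.values())


# Made-up sample rows, for illustration only.
raw = [
    {"AccountID": 1, "Gender": "M", "Account_segment": "Regular +", "Tenure": "#"},
    {"AccountID": 2, "Gender": "Male", "Account_segment": "Regular_Plus", "Tenure": None},
    {"AccountID": 3, "Gender": "F", "Account_segment": "Super +", "Tenure": 4},
]
cleaned_rows = clean_dataset(raw)
print(len(cleaned_rows), "rows,", count_missing(cleaned_rows), "missing cells")
```

Imputation is deliberately left out of the sketch, since the method is chosen only after inspecting the distributions.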
1) In the target variable, 0 means the customer is retained and 1 means the customer has churned.
2) The DTH company kept 83% of its users. The data is therefore imbalanced: instances in the 'Retained' class far outnumber instances in the 'Churned' class.
3) However, since the industry churn rate for DTH companies is 14 to 16%, this distribution does not need oversampling at this stage. If needed, we can revisit the decision after building models and analysing their performance metrics.
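As a small illustration of how the class balance can be checked, the sketch below computes the share of each class and the retained-to-churned ratio. The label list is synthetic, built only to mirror the 83% retained figure quoted above; it is not the real data.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Synthetic labels: 83 retained (0) and 17 churned (1) per 100 customers.
labels = [0] * 83 + [1] * 17
shares = class_shares(labels)
imbalance_ratio = shares[0] / shares[1]

print(f"retained: {shares[0]:.0%}, churned: {shares[1]:.0%}")
print(f"retained customers per churned customer: {imbalance_ratio:.1f}")
```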
Tenure: The Tenure variable does not seem to have a significant effect on churn rate. The average tenure is 11 years.
Payment method: Debit card is the most preferred payment method, with the highest number of users (around 3800), and it also accounts for the largest number of churned users. UPI is the least preferred method, with around 520 users, of which about 200 have churned. Proportionally, this is a higher churn share than among Debit card users, so customers using UPI are more likely to churn.
Login device: Mobile users are the largest group of DTH users, around 6000, of which around 1000 have churned. The pattern is similar for Computer users, so the type of login device does not seem to affect churn rate much.
Gender: Male users are the largest group, around 5000 or more, of which around 1000 have churned, so most churned users are male. However, this variable does not have a significant impact on churn rate, since the number of male users continuing the service is also higher.
Complain_ly: Customers who have contacted customer care the highest number of times are more likely to churn.
CC_Agent_score: Customers who have given the agents a low rating are the most likely to churn, since they may be dissatisfied with the agent's service and leave the DTH service as a result.
Marital_status: Married users are the largest group, around 5000 or more, followed by Single users, around 2500. Single users account for the highest number of churned customers, around 1000, so Marital_status appears to affect churn rate.
City_Tier: Users from tier 1 cities are the largest group, around 6000, of which around 1000 have churned. City tier does not seem to have a strong effect on churn rate.
Account_user_count: Accounts with 4 tagged users are the most common, around 3500 or more, of which around 550 have churned.
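Because the groups differ so much in size, raw churn counts are hard to compare directly. A minimal sketch of converting the counts quoted above into churn rates is shown below; the figures are the approximate readings from the text ("around 6000", "around 1000" and so on), so the resulting rates are indicative only.

```python
# Approximate (users, churned) counts taken from the observations above.
segments = {
    "Payment method: UPI": (520, 200),
    "Login device: Mobile": (6000, 1000),
    "Gender: Male": (5000, 1000),
    "Marital status: Single": (2500, 1000),
    "City tier 1": (6000, 1000),
    "Account user count 4": (3500, 550),
}


def churn_rate(users, churned):
    """Share of a group's users that have churned."""
    return churned / users


rates = {name: churn_rate(*counts) for name, counts in segments.items()}

# List the groups from highest to lowest churn rate.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} {rate:.1%}")
```

With these approximate counts, UPI users and Single customers churn at roughly 38% and 40% respectively, well above the other groups, which is consistent with the observations for Payment method and Marital_status.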
There is no significant intercorrelation between our features, so we do not have to worry about multicollinearity.
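One simple way to check for intercorrelation is to compute the pairwise Pearson correlation between numerical features and flag any pair above a cutoff. The sketch below does this in plain Python; the 0.8 cutoff, the feature values, and the helper names are assumptions for illustration, not figures from the analysis.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (zero variance would divide by zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Made-up feature values, for illustration only.
features = {
    "Tenure": [1, 5, 9, 12, 20, 24],
    "Cashback": [150, 210, 120, 190, 160, 250],
    "CC_Agent_score": [3, 1, 5, 2, 4, 3],
}
print(correlated_pairs(features) or "no strongly correlated pairs")
```

An empty result supports the conclusion that multicollinearity is not a concern. Pairwise correlation only detects linear relationships between two features at a time.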
Before applying machine learning models, we split the dataset into train and test sets in a 70:30 ratio. We scaled the data using the z-score, since the ranges of some numerical attributes are quite large and some machine learning algorithms, such as KNN, are biased towards variables with high magnitude. Initially we chose not to oversample the target variable, since the proportion of churn in this dataset is 0.16 and the standard industry churn rate for DTH companies is 14 to 16%. Therefore, we first build models on the data as it is.
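The split and scaling steps might look like the sketch below. Two choices here are assumptions rather than details from the report: the split is stratified so that both sets keep the same churn share, and the z-score parameters are estimated on the train set only and then reused for the test set, which avoids leaking test information into training. The function names and the seed are hypothetical.

```python
import random
from statistics import mean, pstdev


def stratified_split(labels, test_share=0.30, seed=42):
    """Return (train_indices, test_indices), preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_share)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_zscore(column):
    """Estimate the mean and standard deviation used for z-score scaling."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # guard against constant columns
    return mu, sigma


def apply_zscore(column, mu, sigma):
    """Scale values to z = (x - mean) / standard deviation."""
    return [(value - mu) / sigma for value in column]


# Synthetic example: 100 customers, 20 of them churned.
labels = [0] * 80 + [1] * 20
revenue = [float(i % 17) for i in range(100)]  # made-up feature values

train_idx, test_idx = stratified_split(labels)
mu, sigma = fit_zscore([revenue[i] for i in train_idx])
train_scaled = apply_zscore([revenue[i] for i in train_idx], mu, sigma)
test_scaled = apply_zscore([revenue[i] for i in test_idx], mu, sigma)
print(len(train_scaled), "train rows,", len(test_scaled), "test rows")
```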
We will focus on reducing the possibility of false negatives, that is, churned customers that the model predicts as retained. The primary evaluation criterion will therefore be Recall, followed by Accuracy. Single metrics are not the only way of comparing the predictive performance of classification models. The ROC curve (Receiver Operating Characteristic curve) is a graph showing the performance of a classifier at different classification thresholds. It plots the true positive rate (another name for recall) against the false positive rate.
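To make these metrics concrete, the sketch below computes Recall and Accuracy from predicted labels and traces ROC points by sweeping a threshold over predicted churn probabilities. The labels, scores, and thresholds are made up for illustration and do not come from the models in this report.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), with 1 = churned as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def recall(tp, fn):
    """Share of actual churners that the model caught."""
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(tp, fp, tn, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def roc_points(y_true, scores, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    points = []
    for threshold in thresholds:
        y_pred = [1 if score >= threshold else 0 for score in scores]
        tp, fp, tn, fn = confusion_counts(y_true, y_pred)
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, recall(tp, fn)))
    return points


# Made-up labels and predicted churn probabilities.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.2, 0.3, 0.7]

y_pred = [1 if score >= 0.5 else 0 for score in scores]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("recall:", round(recall(tp, fn), 3))
print("accuracy:", round(accuracy(tp, fp, tn, fn), 3))
print("ROC points:", roc_points(y_true, scores, [0.0, 0.5, 1.1]))
```

Lowering the threshold raises Recall at the cost of more false positives, which is the trade-off the ROC curve visualises.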
After oversampling the data, the scores of all the models increased considerably. KNN, Random Forest and Bagging have the best overall scores of all the models. However, KNN shows signs of overfitting on Recall on both the train and test data, and Bagging overfits the train data on both Recall and Accuracy. Random Forest shows better scores on Recall as well as Accuracy for both the train and test sets, so we choose Random Forest as the best model, since it has the best overall scores.
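The report does not state which oversampling technique was used. As a stand-in, the sketch below shows plain random oversampling, which duplicates randomly chosen minority-class rows until both classes are the same size. The function name, seed, and sample data are hypothetical. Oversampling is normally applied to the train set only, so that the test set still reflects the real class distribution.

```python
import random


def random_oversample(rows, labels, minority=1, seed=42):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    majority_count = len(labels) - len(minority_idx)
    if not minority_idx or majority_count <= len(minority_idx):
        return list(rows), list(labels)  # nothing to balance
    n_extra = majority_count - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Made-up train set: 8 retained customers and 2 churned customers.
train_rows = [{"id": i} for i in range(10)]
train_labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print("before:", train_labels.count(0), "retained,", train_labels.count(1), "churned")
print("after: ", balanced_labels.count(0), "retained,", balanced_labels.count(1), "churned")
```

Because the added rows are exact copies, models that memorise training points can score very well on the train set after oversampling, so train and test scores should be compared side by side when judging overfitting.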
Tenure, Account_segment, Days_since_cc_connect, Cashback, rev_growth_yoy, Login_device, Payment method and service score are negative predictors of churn. These are the attributes that keep customers from churning.
Tenure: Tenure showed no strong effect on churn rate in the exploratory analysis (the average tenure is 11 years), but in the model it is a negative predictor. A customer who has stayed with the DTH service for more than 11 years is less likely to churn than a customer with a shorter tenure.
Days_since_cc_connect: This is the number of days since the customer last contacted customer care, with an average of 4 days. It is a negative predictor of churn: a customer who has gone longer without calling customer care is more likely to be satisfied with the service and less likely to leave the DTH provider.
Cashback: Customers receiving more cashback (the average is 196) are less likely to churn.
Account_segment: Customers in the higher account segments, Regular_Plus and Super_Plus, spend more and are more involved with the services provided by the DTH company, and are hence less likely to churn.
Attributes that are positive predictors of churn: Complain_ly, CC_Agent_score, Rev_per_month, Marital_status, city_tier, account user count, coupon used for payment and cc contacted ly are the variables that give rise to customer churn.
Complain_ly: Customers who have contacted customer care the highest number of times are more likely to churn.
CC_Agent_score: Customers who have given the agents a low rating are the most likely to churn, since they may be dissatisfied with the agent's service and leave the DTH service as a result.
Marital_status: Married users are the largest group, around 5000 or more, followed by Single users, around 2500. Single users account for the highest number of churned customers, around 1000, so Marital_status appears to affect churn rate.
City_Tier: Users from tier 1 cities are the largest group, around 6000, of which around 1000 have churned. City tier does not seem to have a strong effect on churn rate.
Account_user_count: Accounts with 4 tagged users are the most common, around 3500 or more, of which around 550 have churned.