This guide walks you through predicting customer churn in the telecom industry using Python. We will use Deepnote, a collaborative data science platform, to build the model and outline how it can be deployed.
Customer churn refers to the loss of customers over a given period. Predicting churn is crucial for telecom companies as retaining customers is often more cost-effective than acquiring new ones. By using machine learning, we can analyze historical data to predict which customers are likely to churn, enabling targeted retention strategies.
Set up
Sign in to Deepnote or create an account if you don't have one.
Click on the "Create a new project" button and name your project, e.g., "Telecom Customer Churn Prediction".
Make sure your environment has the necessary packages installed. For this project, you'll need:
!pip install pandas numpy matplotlib seaborn scikit-learn xgboost
Importing and understanding the dataset
We'll use a sample telecom customer dataset, which typically contains features like customer demographics, account information, services subscribed, and usage patterns.
import pandas as pd
# Load the dataset
df = pd.read_csv("/work/yourdataset.csv")  # Drag and drop your dataset into the notebook
# Display the first few rows of the dataset
df.head()
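Before preprocessing, it can help to see each column's type, how many values are missing, and how many distinct values it holds. The following is a minimal sketch: `summarize_columns` is a hypothetical helper, and the toy frame with its column names is only a stand-in for your own dataset.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "unique": frame.nunique(),  # NaN is not counted as a distinct value
    })


# Toy stand-in for the real dataset; the column names are illustrative only
toy = pd.DataFrame({
    "tenure": [1, 34, None, 45],
    "Contract": ["Month-to-month", "One year", "Month-to-month", None],
    "Churn": ["Yes", "No", "No", "Yes"],
})

summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(df)` on the loaded dataset gives a quick view of which columns need imputation and which are categorical.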
Data preprocessing
Handle missing values
# Check for missing values
df.isnull().sum()
# Optionally, fill or drop missing values; here numeric columns are filled with their median
df.fillna(df.median(numeric_only=True), inplace=True)
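A median only exists for numeric columns, so categorical columns need a different fill value. One common option, sketched below, is to use the most frequent value in each categorical column. `impute_missing` is a hypothetical helper and the toy data is an assumption for illustration; whether the mode is a sensible fill depends on your dataset.

```python
import pandas as pd


def impute_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    filled = frame.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    other_cols = filled.columns.difference(numeric_cols)

    # Numeric columns: median is robust to outliers such as very high charges
    filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

    # Non-numeric columns: use the most frequent value, if the column has one
    for col in other_cols:
        modes = filled[col].mode(dropna=True)
        if not modes.empty:
            filled[col] = filled[col].fillna(modes.iloc[0])
    return filled


# Toy stand-in for the real dataset; the column names are illustrative only
toy = pd.DataFrame({
    "tenure": [1.0, 34.0, None, 45.0],
    "Contract": ["Month-to-month", "One year", "Month-to-month", None],
})

filled = impute_missing(toy)
print(filled)
```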
Convert categorical variables
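# Encode the target first so get_dummies does not rename it (assumes 'Churn' holds 'Yes'/'No' strings)
if not pd.api.types.is_numeric_dtype(df['Churn']):
    df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
df = pd.get_dummies(df, drop_first=True)  # Convert the remaining categorical columns to numeric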
Feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
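features = df.drop('Churn', axis=1)  # Assuming 'Churn' is the target variable
scaled_features = scaler.fit_transform(features)
# Reuse the feature names and row index so the target lines up regardless of column order
df_scaled = pd.DataFrame(scaled_features, columns=features.columns, index=df.index)
df_scaled['Churn'] = df['Churn']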
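Fitting the scaler on the full dataset lets statistics from the future test rows influence training. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the `demo_` names and the two feature columns are assumptions for illustration, not part of the dataset used in this guide.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=200),
    "MonthlyCharges": rng.uniform(20, 120, size=200),
})
demo["Churn"] = (rng.random(200) < 0.3).astype(int)

demo_X = demo.drop("Churn", axis=1)
demo_y = demo["Churn"]

# Split first, so the test rows never influence the scaling statistics
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=42, stratify=demo_y
)

demo_scaler = StandardScaler()
train_scaled = pd.DataFrame(
    demo_scaler.fit_transform(demo_X_train),  # learn mean and std from training rows
    columns=demo_X.columns,
    index=demo_X_train.index,
)
test_scaled = pd.DataFrame(
    demo_scaler.transform(demo_X_test),  # reuse the training statistics
    columns=demo_X.columns,
    index=demo_X_test.index,
)

print(train_scaled.describe().round(2))
```

Tree-based models such as random forests and XGBoost are largely insensitive to feature scale, so this step matters most for logistic regression.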
Exploratory data analysis (EDA)
Visualizing the churn distribution
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Churn', data=df)
plt.title('Distribution of churn')
plt.show()
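Alongside the plot, the class proportions are worth printing, because they set the baseline any model has to beat. A minimal sketch with a made-up `Churn` column follows; the values are illustrative, not real data.

```python
import pandas as pd

# Made-up target column: 8 customers stayed, 2 churned
churn = pd.Series(
    ["No", "No", "No", "Yes", "No", "Yes", "No", "No", "No", "No"],
    name="Churn",
)

# Share of each class in the data
churn_share = churn.value_counts(normalize=True)
print(churn_share)

# A model that always predicts the majority class reaches this accuracy
majority_baseline = churn_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

On your own data, `df['Churn'].value_counts(normalize=True)` gives the same summary. If one class dominates, a high accuracy score by itself says little about how well churners are identified.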
Correlation analysis
# Plotting correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f")
plt.title('Correlation matrix')
plt.show()
Feature engineering
Feature engineering involves creating new features or transforming existing ones to improve model performance.
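# Example: Creating a new feature for Total Services
# Assumes these columns exist as 0/1 indicators; after one-hot encoding, use the encoded column names instead
df['TotalServices'] = df[['PhoneService', 'InternetService', 'StreamingTV', 'StreamingMovies']].sum(axis=1)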
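If the service columns still hold text labels rather than 0/1 values, summing them directly will not count services. The sketch below assumes hypothetical `'Yes'`/`'No'` labels and counts the `'Yes'` entries per customer; it would need to run before one-hot encoding, and the labels in your dataset may differ.

```python
import pandas as pd

# Toy stand-in; the 'Yes'/'No' labels are an assumption about the raw data
toy = pd.DataFrame({
    "PhoneService": ["Yes", "No", "Yes"],
    "StreamingTV": ["No", "No", "Yes"],
    "StreamingMovies": ["Yes", "No", "Yes"],
})

service_cols = ["PhoneService", "StreamingTV", "StreamingMovies"]

# Compare each cell to 'Yes' (True/False), then count the True values per row
toy["TotalServices"] = (toy[service_cols] == "Yes").sum(axis=1)
print(toy)
```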
Model building
We'll explore different models to predict customer churn.
Logistic regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Split the data
X = df.drop('Churn', axis=1)
y = df['Churn']
# stratify=y keeps the churn ratio the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Train the model
lr_model = LogisticRegression(max_iter=1000)  # Unscaled features often need more than the default 100 iterations
lr_model.fit(X_train, y_train)
# Predict and evaluate
y_pred = lr_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))
Decision trees
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
y_pred = dt_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
Random forest
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
Gradient boosting
from xgboost import XGBClassifier
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
Model evaluation
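After training multiple models, compare their performance on accuracy, precision, recall, and F1-score. Churn data is usually imbalanced, so accuracy alone can be misleading; recall and F1 for the churn class show how many at-risk customers each model actually catches. Use these metrics to select the best model for your use case.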
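The comparison can be collected into a single table. The sketch below trains three scikit-learn classifiers on synthetic, imbalanced data generated with `make_classification` and reports the four metrics side by side. The data, the `demo_` names, and the choice to sort by F1 are assumptions for illustration; with the real dataset you would reuse `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: roughly 25% of rows belong to the positive (churn) class
demo_X, demo_y = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=42, stratify=demo_y
)

candidates = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(random_state=42),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, pred),
        # zero_division=0 avoids a warning if a model predicts no churners
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "f1": f1_score(y_te, pred, zero_division=0),
    })

# One row per model, best F1 first
comparison = pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)
print(comparison.round(3))
```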
Hyperparameter tuning
Use techniques like GridSearchCV or RandomizedSearchCV to optimize model parameters.
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5]
}
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
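RandomizedSearchCV, mentioned above, samples a fixed number of parameter combinations instead of trying all of them, which keeps the search affordable as the grid grows. A minimal sketch on synthetic data follows; the parameter ranges, the `f1` scoring choice, and the `demo_` names are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
demo_X, demo_y = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

# 3 * 4 * 3 = 36 possible combinations; only n_iter of them are evaluated
param_distributions = {
    "n_estimators": [50, 100, 150],
    "max_depth": [3, 4, 5, None],
    "min_samples_leaf": [1, 2, 5],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,              # 3-fold cross-validation per combination
    scoring="f1",      # focuses the search on the positive (churn) class
    random_state=42,   # makes the sampled combinations reproducible
)
random_search.fit(demo_X, demo_y)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best cross-validated F1: {random_search.best_score_:.3f}")
```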
Model deployment
Once the best model is selected and trained, you can deploy it to a production environment using various deployment tools. Deepnote can be integrated with APIs or platforms like AWS, Google Cloud, or Heroku for this purpose.
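Whatever platform you choose, deployment usually starts with saving the trained model to a file that a serving process can load. The sketch below uses `joblib`, which is installed alongside scikit-learn, to round-trip a model through disk. The synthetic data, the file name `churn_model.joblib`, and the temporary directory are assumptions for illustration; in practice you would save your selected model to persistent storage.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the selected, trained model
demo_X, demo_y = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(demo_X, demo_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "churn_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)       # serialize the fitted model
    restored = joblib.load(model_path)   # what a serving process would do

# Probability of the positive (churn) class for the first five customers
churn_probability = restored.predict_proba(demo_X[:5])[:, 1]
print(churn_probability.round(3))
```

A saved model should be loaded with the same library versions it was trained with, and only from files you trust, since loading executes pickled code.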
Conclusion and next steps
Predicting customer churn is a powerful tool in the telecom industry. By following this guide, you should now have a model capable of identifying at-risk customers. Future steps could include automating the process, integrating it with real-time data, or building a dashboard for business users.