Customer churn prediction in telecom using Python in Deepnote

This guide walks you through predicting customer churn in the telecom industry using Python. We use Deepnote, a collaborative data science notebook platform, to build the model and outline how to deploy it.

Customer churn refers to the loss of customers over a given period. Predicting churn is crucial for telecom companies as retaining customers is often more cost-effective than acquiring new ones. By using machine learning, we can analyze historical data to predict which customers are likely to churn, enabling targeted retention strategies.

Set up

In Deepnote, click the "Create a new project" button and name your project, e.g., "Telecom Customer Churn Prediction".

Ensure your environment has the necessary packages installed. For this project, you'll need:

!pip install pandas numpy matplotlib seaborn scikit-learn xgboost

Importing and understanding the dataset

We'll use a sample telecom customer dataset, which typically contains features like customer demographics, account information, services subscribed, and usage patterns.

import pandas as pd

# Load the dataset (drag and drop your dataset into the notebook first)
df = pd.read_csv("/work/yourdataset.csv")

# Display the first few rows of the dataset
df.head()
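Before preprocessing, it helps to check the column types and how the target is distributed. The following is a minimal sketch on a small made-up table; the column names (customerID, tenure, Contract, MonthlyCharges, Churn) and values are assumptions for illustration, not the contents of your dataset.

```python
import pandas as pd

# Hypothetical stand-in for a telecom dataset; column names and values are assumptions.
sample = pd.DataFrame({
    "customerID": ["0001", "0002", "0003", "0004"],
    "tenure": [1, 34, 2, 45],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Churn": ["No", "No", "Yes", "No"],
})

# Column types show which features will need encoding before modeling.
print(sample.dtypes)

# Share of each class in the target; churn data is often imbalanced.
churn_share = sample["Churn"].value_counts(normalize=True)
print(churn_share)
```

On your own data, replace sample with df and check that the target column has the name and labels you expect.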

Data preprocessing

Handle missing values

# Check for missing values
df.isnull().sum()

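# Optionally, fill or drop missing values
# numeric_only=True restricts the median to numeric columns; text columns would otherwise raise an error
df.fillna(df.median(numeric_only=True), inplace=True)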
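The median only makes sense for numeric columns, so categorical columns need a different fill rule. One common option, sketched here on made-up data, is to fill numeric columns with the median and categorical columns with the most frequent value. The column names are assumptions, and mode imputation is just one reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both numeric and categorical columns.
toy = pd.DataFrame({
    "tenure": [1.0, np.nan, 10.0, 20.0],
    "MonthlyCharges": [30.0, 50.0, np.nan, 70.0],
    "Contract": ["Month-to-month", None, "One year", "Month-to-month"],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

# Numeric columns: fill with the column median.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Categorical columns: fill with the most frequent value (the mode).
for col in categorical_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```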

Convert categorical variables

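# Convert categorical columns to numeric indicator columns
# Note: if 'Churn' holds text labels such as 'Yes'/'No', map it to 0/1 first;
# otherwise get_dummies renames it (e.g. to 'Churn_Yes') and later references to 'Churn' fail
df = pd.get_dummies(df, drop_first=True)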
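One-hot encoding tends to go wrong in two places: identifier columns, which turn into one column per customer, and a text target, which gets renamed. The sketch below handles both on a made-up table. The customerID column and the 'Yes'/'No' labels are assumptions about what your data looks like.

```python
import pandas as pd

# Hypothetical raw data; column names and labels are assumptions.
toy = pd.DataFrame({
    "customerID": ["0001", "0002", "0003"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 12, 48],
    "Churn": ["Yes", "No", "No"],
})

# Identifiers carry no predictive signal and would each become a dummy column.
toy = toy.drop(columns=["customerID"])

# Map the target to 0/1 so it keeps its name through get_dummies.
toy["Churn"] = toy["Churn"].map({"No": 0, "Yes": 1})

# drop_first=True removes one level per feature to avoid redundant columns.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```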

Feature scaling

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
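feature_cols = df.columns.drop('Churn')  # Assuming 'Churn' is the target variable
scaled_features = scaler.fit_transform(df[feature_cols])

# Rebuild a DataFrame with the original feature names and index, then re-attach the target
df_scaled = pd.DataFrame(scaled_features, columns=feature_cols, index=df.index)
df_scaled['Churn'] = df['Churn']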
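Fitting the scaler on the full dataset lets statistics from the test rows leak into training. A common alternative, sketched here on synthetic numbers rather than real churn data, is to split first, fit the scaler on the training portion only, and reuse it on the test portion.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, used only to illustrate the order of operations.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from training rows only
X_te_scaled = scaler.transform(X_te)      # apply the same statistics to the test rows

print(X_tr_scaled.mean(axis=0).round(3))
```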

Exploratory data analysis (EDA)

Visualizing the churn distribution

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Churn', data=df)
plt.title('Distribution of churn')
plt.show()
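Beyond the overall distribution, the churn rate per category often shows which segments are most at risk. This is a small sketch on made-up data; it assumes a Contract column and a 0/1 Churn column, which may be named differently in your dataset.

```python
import pandas as pd

# Hypothetical data; Churn is assumed to be encoded as 0/1.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "One year", "Two year", "Month-to-month"],
    "Churn": [1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of churned customers in each group.
churn_by_contract = (
    toy.groupby("Contract")["Churn"].mean().sort_values(ascending=False)
)
print(churn_by_contract)
```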

Correlation analysis

# Plotting correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f")  # numeric_only skips any remaining text columns
plt.title('Correlation matrix')
plt.show()
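A full heatmap gets hard to read once one-hot encoding has added many columns. If the target is the main interest, a sorted list of each feature's correlation with it can be easier to scan. The sketch below uses made-up numbers and assumes a 0/1 Churn column.

```python
import pandas as pd

# Hypothetical numeric data; values are chosen only for illustration.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 40, 50, 60],
    "MonthlyCharges": [90.0, 80.0, 85.0, 40.0, 50.0, 45.0],
    "Churn": [1, 1, 1, 0, 0, 0],
})

# Correlation of every feature with the target, most negative first.
target_corr = toy.corr(numeric_only=True)["Churn"].drop("Churn").sort_values()
print(target_corr)
```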

Feature engineering

Feature engineering involves creating new features or transforming existing ones to improve model performance.

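# Example: Creating a new feature for Total Services
# Assumes these columns exist and are encoded as 0/1; after one-hot encoding the names may differ (e.g. 'PhoneService_Yes')
df['TotalServices'] = df[['PhoneService', 'InternetService', 'StreamingTV', 'StreamingMovies']].sum(axis=1)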
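If the service columns still hold text labels, the count can be built from comparisons instead of a plain sum. The sketch below assumes 'Yes'/'No' style labels for the service columns and an InternetService column where 'No' means no subscription. These labels are assumptions and may not match your data.

```python
import pandas as pd

# Hypothetical raw service columns with text labels.
toy = pd.DataFrame({
    "PhoneService": ["Yes", "Yes", "No"],
    "InternetService": ["DSL", "Fiber optic", "No"],
    "StreamingTV": ["No", "Yes", "No"],
    "StreamingMovies": ["No", "Yes", "No internet service"],
})

yes_no_cols = ["PhoneService", "StreamingTV", "StreamingMovies"]

# Count 'Yes' answers, then add 1 when the customer has any internet plan.
toy["TotalServices"] = (
    toy[yes_no_cols].eq("Yes").sum(axis=1)
    + toy["InternetService"].ne("No").astype(int)
)
print(toy["TotalServices"].tolist())
```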

Model building

We'll explore different models to predict customer churn.

Logistic regression

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Split the data
X = df.drop('Churn', axis=1)
y = df['Churn']

# stratify=y keeps the churn ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train the model
lr_model = LogisticRegression(max_iter=1000)  # Higher iteration cap helps the solver converge on unscaled features
lr_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = lr_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))
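Logistic regression is sensitive to feature scale, so it is often wrapped in a pipeline that scales and fits in one step. The sketch below uses synthetic data from make_classification as a stand-in for churn data. The class weights and sample size are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary data standing in for a churn dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# The scaler is fit on the training data only, inside the pipeline.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# Probability of the positive class, usable for ranking customers by risk.
churn_proba = pipe.predict_proba(X_te)[:, 1]
print(f"Test accuracy: {pipe.score(X_te, y_te):.3f}")
```

Ranking customers by predicted probability, rather than using only the 0/1 prediction, lets a retention team choose how many customers to contact.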

Decision trees

from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(random_state=42)  # Fixed seed for reproducible results
dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Random forest

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=42)  # Fixed seed for reproducible results
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Gradient boosting

from xgboost import XGBClassifier

xgb_model = XGBClassifier(random_state=42)  # Expects numeric class labels such as 0/1
xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Model evaluation

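After training multiple models, compare them on accuracy, precision, recall, and F1-score rather than accuracy alone. Churners are usually a minority class, so a model can score high accuracy while missing most of them; recall and F1 on the churn class show how many at-risk customers the model actually catches. This helps in selecting the best model for your use case.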
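One way to run this comparison is to collect the metrics for each model into a single table. The sketch below does that on synthetic data with three scikit-learn models. The data, model settings, and names such as candidates and results are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for a churn dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

rows = {}
for name, model in candidates.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    # Precision, recall, and F1 are computed for the positive (churn) class.
    rows[name] = {
        "accuracy": accuracy_score(y_te, preds),
        "precision": precision_score(y_te, preds, zero_division=0),
        "recall": recall_score(y_te, preds, zero_division=0),
        "f1": f1_score(y_te, preds, zero_division=0),
    }

# One row per model, best F1 first.
results = pd.DataFrame.from_dict(rows, orient="index").sort_values("f1", ascending=False)
print(results.round(3))
```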

Hyperparameter tuning

Use techniques like GridSearchCV or RandomizedSearchCV to optimize model parameters.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5]
}

grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
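When the grid grows, RandomizedSearchCV samples a fixed number of combinations instead of trying all of them. The sketch below runs it on synthetic data and scores by F1 rather than accuracy. The parameter ranges, the number of iterations, and the choice of F1 are assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced binary data standing in for a churn training set.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=42
)

param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

# n_iter=5 samples five parameter combinations out of the 27 possible ones.
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(X_demo, y_demo)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```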

Model deployment

Once you have selected and trained the best model, you can deploy it to a production environment. Deepnote can be integrated with APIs or platforms like AWS, Google Cloud, or Heroku for this purpose.
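Whichever platform you choose, deployment usually starts by saving the trained model to a file that the serving code can load. The sketch below shows one common way to do this with joblib, using a synthetic model and a temporary directory. The file name churn_model.joblib is a placeholder, and this is not a Deepnote-specific API.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small model on synthetic data to have something to persist.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
model = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder file name; in practice, write to persistent storage.
    model_path = Path(tmp_dir) / "churn_model.joblib"
    joblib.dump(model, model_path)

    # The serving process would load the file and call predict on new rows.
    restored = joblib.load(model_path)

same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(f"Restored model matches original: {same_predictions}")
```

Load the file with the same library versions used for training, since saved scikit-learn models are not guaranteed to load across versions.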

Conclusion and next steps

Predicting customer churn is a powerful tool in the telecom industry. By following this guide, you should now have a model capable of identifying at-risk customers. Future steps could include automating the process, integrating it with real-time data, or building a dashboard for business users.