Lead scoring
Nobody wants a bad lead. Lead scoring is about finding properties of leads (for example, their job or how they were sourced) that correlate with closing, so you have a better idea of which leads to focus on.
Simple lead scoring models are made by hand. Give someone 10 points if they're a product manager, another 10 points if they're on the mailing list, subtract 10 points if they're less than 1 year into their career—that sort of thing.
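A hand-made model like that is just a few conditionals. Here's a minimal sketch, with hypothetical field names and the example weights from above:

```python
# A hand-made, rule-based lead scorer. The rules and weights here are
# illustrative, not from any real sales process.
def score_lead(lead):
    score = 0
    if lead.get("job_title") == "Product Manager":
        score += 10
    if lead.get("on_mailing_list"):
        score += 10
    if lead.get("years_experience", 0) < 1:
        score -= 10
    return score

print(score_lead({
    "job_title": "Product Manager",
    "on_mailing_list": True,
    "years_experience": 0.5,
}))  # 10 + 10 - 10 = 10
```

The problem, of course, is that someone has to guess the rules and the weights.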
We can do better than that by creating a machine learning model to classify leads as likely to close or not. All we need is a dataset to train on. This notebook uses a sample dataset and XGBoost to create a model to predict leads as closeable or not.
The data
This data comes from the Lead Scoring dataset on Kaggle. It appears to contain lead information for an educational institution. It has a number of categorical features (e.g. city, employment status) and quantitative features (e.g. time spent on website). Importantly, it has a converted column indicating whether the lead eventually led to a sale (1 for yes, 0 for no).
We can already have a look at the data for potential insights. For example, there's a "tag" column. Below, we show the number of conversions for each tag. These tags tell us a lot about when leads don't close. For example, "Interested in other courses" appears highly correlated with not converting.
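A per-tag conversion summary is one `groupby` away. The sketch below uses a tiny stand-in DataFrame (the tag strings are examples; in practice you'd `pd.read_csv` the Kaggle file, whose exact column names may differ):

```python
import pandas as pd

# Stand-in for the Kaggle data; in practice something like:
#   df = pd.read_csv("Lead Scoring.csv")
df = pd.DataFrame({
    "Tags": [
        "Interested in other courses",
        "Will revert after reading the email",
        "Interested in other courses",
        "Ringing",
    ],
    "Converted": [0, 1, 0, 0],
})

# For each tag: how many leads carry it, and what share of them converted
tag_stats = df.groupby("Tags")["Converted"].agg(["count", "mean"])
print(tag_stats)
```

Sorting `tag_stats` by `mean` surfaces the tags most and least associated with conversion.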
Make a model
Given all this data, we can make a model. First, we need to split the data into training and test sets.
We need the test set after training, to see how the model performs with unseen data.
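The split is one call to scikit-learn. The features below are toy stand-ins for the encoded Kaggle columns:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real lead features and the converted column
X = pd.DataFrame({
    "time_on_site": np.arange(100),
    "visits": np.arange(100) % 5,
})
y = (np.arange(100) % 3 == 0).astype(int)

# Hold out 20% of the rows as unseen data for evaluation;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 80 20
```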
We then train the model using XGBoost. We use their XGBClassifier, which does all of the work for us.
XGBoost is a popular Python library for making machine learning models.
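Training is a `fit` call on `XGBClassifier`, which follows the scikit-learn estimator API. A self-contained sketch on synthetic data (stand-ins for the real training set):

```python
import numpy as np
from xgboost import XGBClassifier

# Synthetic stand-in for the encoded lead features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# XGBClassifier handles tree construction, boosting, and regularization;
# these hyperparameters are illustrative defaults, not tuned values
model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(X, y)
print(model.score(X, y))  # training accuracy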
Make a prediction
Let's take two rows from the test dataset
And then predict whether they are likely to be converted or not.
The output array([0, 1]) means that the first row is not likely to convert (so we shouldn't focus on it as a lead), but the second one is, so we should focus our attention there.
Model performance
No model is perfect, so how good is this one? A simple way to assess performance is a confusion matrix, which shows how many leads the model predicted correctly and incorrectly, broken down by class.
In the confusion matrix below, the top left and bottom right are the true negatives and true positives: the leads the model predicted correctly. The vast majority of leads were predicted correctly, so we didn't do too badly!
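scikit-learn computes the matrix directly from labels and predictions. The labels below are made up for illustration; in the notebook you'd pass `y_test` and `model.predict(X_test)`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Note scikit-learn's convention: with labels 0 and 1, true negatives sit at the top left and true positives at the bottom right.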
Finally, can we find out how the model works? XGBoost gives us information about "feature importance": how much each column affects the model's output. This is useful for sales teams, since it tells them which data they should be collecting about leads.
Below is a chart of all the feature importances. The top 3 features are tags, lead profile, and current occupation.