Stripe ML Example
In this notebook, we walk through the workflow for building a model that predicts, from sample transaction data, whether or not a customer is likely to default.
Read in the data and display it:
We are using sample data created just for this example. This data only has 20 rows, which is extremely small. For a problem like this in the real world, we would probably be working with thousands of records.
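The read-and-display step might look like the following. Since the notebook's actual data file isn't shown, the column names (`email_domain`, `country`, `card_brand`, `defaulted`) are assumptions, and we build a small stand-in DataFrame instead of reading a CSV:

```python
import pandas as pd

# Hypothetical stand-in for the 20-row sample file; column names are assumed.
data = pd.DataFrame({
    "email_domain": ["gmail.com", "yahoo.com", "mail.ru", "gmail.com"],
    "country":      ["US", "India", "Russia", "US"],
    "card_brand":   ["Visa", "MasterCard", "American Express", "Visa"],
    "defaulted":    [0, 1, 1, 0],
})
print(data)
```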
As can be seen here, we have 3 features available to work with:
- Email Domain
- Country (presumably country of the ip address used)
- Credit Card Company
Our goal is to build a model that can predict whether or not a customer will default based on these 3 features.
Define the features and the target
Here we separate out our features (the variables used to predict) and the target (the variable we would like predicted).
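A sketch of the split, assuming a `defaulted` label column (a hypothetical name):

```python
import pandas as pd

df = pd.DataFrame({
    "email_domain": ["gmail.com", "yahoo.com"],
    "country": ["US", "India"],
    "card_brand": ["Visa", "MasterCard"],
    "defaulted": [0, 1],
})

# X holds the predictor columns; y is the label we want the model to predict.
X = df.drop(columns="defaulted")
y = df["defaulted"]
```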
Converting the text categorical data to dummy variables
Most machine learning algorithms can only work with numerical data. In our case, the data is all text based. The simple solution to this problem is to create binary "dummy" features for each unique value. So if the original column had 3 possible options, we would create 3 new columns that can each be either 1 or 0.
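With pandas this is one call to `get_dummies`; a sketch on a single hypothetical column:

```python
import pandas as pd

X = pd.DataFrame({"card_brand": ["Visa", "MasterCard", "American Express"]})

# get_dummies creates one binary column per unique value of each text column.
X_encoded = pd.get_dummies(X)
print(X_encoded.columns.tolist())
# Three unique values -> three binary columns.
```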
Create a decision tree model
Here we create a new instance of a Decision Tree model and train it on our data.
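A sketch using scikit-learn's `DecisionTreeClassifier`; the toy rows below stand in for the notebook's real 20-row dataset:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the dummy-encoded training data.
X = pd.get_dummies(pd.DataFrame({
    "country": ["US", "India", "Russia", "US"],
    "card_brand": ["Visa", "MasterCard", "American Express", "Visa"],
}))
y = [0, 1, 1, 0]

# Fit an unconstrained decision tree on the encoded features.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)
```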
View the model accuracy
Now that we have trained the model, let's make predictions and see how well those predictions compare to the actual data.
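Comparing predictions to the labels is one call to `accuracy_score`; a sketch reusing the toy data from above:

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X = pd.get_dummies(pd.DataFrame({
    "country": ["US", "India", "Russia", "US"],
    "card_brand": ["Visa", "MasterCard", "American Express", "Visa"],
}))
y = [0, 1, 1, 0]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Accuracy of the model on the same data it was trained on.
predictions = model.predict(X)
print(accuracy_score(y, predictions))  # 1.0 on this tiny, separable toy set
```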
Wow! Our model was 100% accurate! That seems almost too good to be true... Let's take a look at the decision tree this model produced.
Exporting a graph of the tree
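One way to export the tree is `export_graphviz`, which returns Graphviz `dot` source that can then be rendered to an image; the feature and class names here are the toy ones from above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = pd.get_dummies(pd.DataFrame({
    "country": ["US", "India", "Russia", "US"],
    "card_brand": ["Visa", "MasterCard", "American Express", "Visa"],
}))
y = [0, 1, 1, 0]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# out_file=None makes export_graphviz return the dot source as a string.
dot_source = export_graphviz(
    model,
    out_file=None,
    feature_names=list(X.columns),
    class_names=["paid", "defaulted"],
    filled=True,
)
print(dot_source[:60])
```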
Is this tree overfit?
100% accurate models are virtually impossible in the real world.
It seems suspiciously specific to say that a customer from India using a non-MasterCard brand is at risk of defaulting. More than likely such a specific prediction is just a result of our small training set. This could be considered "noise". Creating a simpler tree that isn't quite as sensitive to this noise in the data could help to solve this issue.
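One common way to simplify the tree is to cap its depth; the exact setting here, `max_depth=2`, is an assumption, as the notebook may have used a different constraint:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X = pd.get_dummies(pd.DataFrame({
    "country": ["US", "India", "Russia", "US"],
    "card_brand": ["Visa", "MasterCard", "American Express", "Visa"],
}))
y = [0, 1, 1, 0]

# A shallower tree is less able to memorize noise in a tiny training set.
simpler_model = DecisionTreeClassifier(max_depth=2, random_state=0)
simpler_model.fit(X, y)
print(simpler_model.get_depth())
```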
This second model only has 95% accuracy; however, let's look at the tree graph...
Looking at the graph for the second, altered model, we see slightly lower accuracy on the existing data, but we can expect it to generalize better to out-of-sample data in the future.
What are the most important features?
One of the last things we can do is take a look at the importances of the different features. We can see which features contributed the most to building the decision tree, and which were not that helpful in distinguishing customers who defaulted from customers who did not.
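Fitted scikit-learn trees expose this as `feature_importances_`, one value per input column, summing to 1; a sketch on the toy data:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X = pd.get_dummies(pd.DataFrame({
    "country": ["US", "India", "Russia", "US"],
    "card_brand": ["Visa", "MasterCard", "American Express", "Visa"],
}))
y = [0, 1, 1, 0]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pair each importance with its column name and sort, highest first.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```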
Here we can see that in this simple model, the most important feature was whether or not the card was an American Express. Just below that feature in importance is whether or not the customer is from Russia. Finally, whether or not the customer is from India plays a small role.
Predicting a new instance
Finally we are ready to predict whether or not a new customer is likely to default. Let's create a sample customer and pass it to our model.
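A sketch of scoring a new customer; the key step is re-encoding the new row into the same dummy columns the model was trained on (`reindex` fills any missing dummies with 0). The specific customer below, from Russia with an American Express card, is a made-up example:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X = pd.get_dummies(pd.DataFrame({
    "country": ["US", "India", "Russia", "US"],
    "card_brand": ["Visa", "MasterCard", "American Express", "Visa"],
}))
y = [0, 1, 1, 0]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# The new customer must be encoded into the exact same dummy columns as X.
new_customer = pd.DataFrame({"country": ["Russia"],
                             "card_brand": ["American Express"]})
new_encoded = pd.get_dummies(new_customer).reindex(columns=X.columns,
                                                   fill_value=0)
print(model.predict(new_encoded))  # 1 -> predicted to default
```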
It looks like we predict that this customer will default and is therefore high-risk. Looking back at the graph of our second model's decision tree and the feature importances, we can see that it is the customer's American Express card that causes the model to predict a default.