How to use Custom Sklearn Classes and Pipelines
Scikit-learn (sklearn) is the machine learning tool of choice for exploratory analysis by data scientists. It has over 45k stars on GitHub and was downloaded over 7 million times in the last month (March 2021). Its
predict API is now ubiquitous in the Python machine learning ecosystem, with many other open source projects choosing to be compatible with that API.
In order to leverage the deeper features of the
sklearn platform, it is useful to build custom data transformation pipelines using the provided classes. In this blog post, we will focus on using Custom Transformers and Pipelines which are essential to delivering replicable results.
When dealing with real-world data, it is often difficult to convert data manipulations into repeatable steps: this is where custom transformers come into play. Additionally, you might then want to apply these custom transformations in sequence. This is where
Pipelines become useful. Putting these two tools together, you have a powerful playbook to handle all the messy data the real world throws at you.
This post assumes some familiarity with the sklearn API and concepts such as
predict and so on. If some of that is new to you, please check out this fantastic page in the `sklearn` docs.
In this blog post we'll be using the motivating example of the Prescription Based Prediction (PBP) dataset. The PBP dataset combines the frequencies of prescribed medication under Medicare part D with practitioner data from the National Provider Identifier dataset. The authors of this aggregate dataset had a corresponding blog post explaining the dataset in a bit more detail. It is no longer live, but it is still available on The Wayback Machine.
To understand this data a bit better, let's explode both of the
json variables in the first row. We can use the
pandas builtin function
json_normalize which converts a
json dictionary into a flat table.
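As a rough illustration of that flattening step, here is a minimal sketch. The row below is a made-up stand-in for the first row of the PBP data: the drug names and the provider values are assumptions for the example, not values from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for one row of the dataset: two nested dictionaries.
row = {
    "cms_prescription_counts": {"DRUG_A": 12, "DRUG_B": 30},
    "provider_variables": {"settlement_type": "urban", "years_practicing": 7},
}

# json_normalize turns each flat dictionary into a one-row DataFrame
# with one column per key.
prescriptions = pd.json_normalize(row["cms_prescription_counts"])
provider = pd.json_normalize(row["provider_variables"])

print(prescriptions)
print(provider)
```

Each dictionary becomes a single-row table, which makes it easy to eyeball which keys are present.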
Cool, let's now see what the distribution of those variables looks like.
It looks like the
cms_prescription_counts column contains a
json dictionary with the names of each drug followed by the count. The
provider_variables column seems to have a
json dictionary that contains information about the provider who prescribed the drugs.
It might be interesting to build a few models that predict some of these variables given the prescribed drugs and counts. In this blog post, we'll focus on estimating the
settlement_type and the
Let's set the feature and response variables to their appropriate column and convert them to native Python lists. Let's also split the data into a train and validation set.
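A minimal sketch of that setup might look like the following. The records are fabricated placeholders, and the 80/20 split ratio and `random_state` are assumptions rather than the values used for the real data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical miniature of the dataset: each record holds the two JSON columns.
records = [
    {
        "cms_prescription_counts": {"DRUG_A": 10 + i},
        "provider_variables": {"settlement_type": "urban" if i % 2 else "non-urban"},
    }
    for i in range(10)
]

# Features are the prescription dictionaries, responses are the provider dictionaries.
X = [record["cms_prescription_counts"] for record in records]
y = [record["provider_variables"] for record in records]

# Plain Python lists go in, plain Python lists come out.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_val))
```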
sklearn API compatible transformers will allow us to make full use of other aspects of the library. In this example, we'll implement two custom transformers that leverage sklearn base classes to transform the response variable in the dataset.
First, let's build a way to quickly extract a desired response variable from
y. We start by defining a class that inherits from
TransformerMixin which gives us the
fit_transform method if we define the fit and transform methods.
Next up, we define the constructor which requires two variables that tell us the key to extract from each
json row and the dimensions of the data output. We also include an
is_fit_ attribute which lets
sklearn helper functions know that this transformer has already been "fitted" (i.e. it doesn't need to be fit).
Now, let's add the
fit method. It is empty because we already have all the information we need to transform input data from the constructor alone. We return
self (the object instance) as per convention.
We use the @patch decorator to add instance methods to the class so that we don't have to keep all of the code in a single Jupyter cell.
Now let's add the transform functionality. We break this up into the two methods. The
_extract_target method allows us to pull a particular key out of an input dictionary while also validating that the inputs are as we expect them. The
transform method does the meat of the data conversion. It turns a list of
json objects to a flat list of selected values.
Lastly, let's add the identity function as the inverse transform. There is no meaningful inverse since we lose data during the transform, but future sections will require this method to have an implementation.
We can test that the above implementation works with some fake data. Given that we select
"c" as the desired key, we should see
[2 4 8] in the output which is exactly what we observe.
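Putting the pieces together, the transformer could be sketched roughly as below. This is a reconstruction under assumptions, not the original code: the parameter names `key` and `shape` are hypothetical, the class is written in one block rather than patched across cells, and BaseEstimator is added so that sklearn utilities such as clone can read the parameters.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class JSONTargetSelector(TransformerMixin, BaseEstimator):
    """Pull one key out of every dictionary in y (sketch only)."""

    def __init__(self, key, shape=(-1,)):
        self.key = key        # key to extract from each row (hypothetical name)
        self.shape = shape    # shape of the returned array (hypothetical name)
        self.is_fit_ = True   # trailing underscore marks the instance as "fitted"

    def fit(self, y, *args, **kwargs):
        # Nothing to learn: the constructor already holds everything we need.
        return self

    def _extract_target(self, row):
        # Validate the input before pulling out the requested key.
        if not isinstance(row, dict):
            raise TypeError(f"expected a dict, got {type(row).__name__}")
        if self.key not in row:
            raise KeyError(f"key {self.key!r} is missing from row {row!r}")
        return row[self.key]

    def transform(self, y):
        # Turn a list of dictionaries into an array of the selected values.
        values = [self._extract_target(row) for row in y]
        return np.asarray(values).reshape(self.shape)

    def inverse_transform(self, y):
        # Identity: the dropped keys cannot be recovered.
        return y


fake_y = [{"a": 1, "c": 2}, {"a": 3, "c": 4}, {"a": 7, "c": 8}]
selected = JSONTargetSelector(key="c").fit_transform(fake_y)
print(selected)  # [2 4 8]
```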
Phew, alright now that we've got that class completed we need to build a wrapper class that will allow us to actually use it in downstream applications.
Why, you may ask? In our case, we want a simple way to both transform the response variable and estimate on the response variable. Building a composite meta estimator is the way to go here considering the broader API constraints. To be honest, it's a bit of a quirk in the way the
sklearn API was originally designed. By that, I mean that the original API design does not really take into account the possibility of transforming the response variable (specifically using
Pipelines which we'll see in just a bit).
sklearn has already addressed that flaw with a built-in meta estimator called
TransformedTargetRegressor, but unfortunately it only works for regression estimators (e.g. applying a log transform to your response variable). In our case, we are looking at a binary classification problem, so we'll need to roll our own.
First, let's define a constructor which takes in an instance of a classifier and a transformer.
Now when fitting this meta estimator, we need to fit both the classifier and the transformer. But first, we need to quickly validate the input y array. Stealing a bit of code from
TransformedTargetRegressor, we can validate both the datatypes and shape of the inputs.
We use check_array, a validation function from sklearn that converts inputs into np.ndarray form and validates a bunch of other things for us. We follow this with a bit of code to make sure that the dimensions of the response variable are correct. This is so that we can handle the user passing either a row vector or a column vector.
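The validation step could look something like this sketch. The helper name `validate_target` is hypothetical, and the exact check_array arguments are an assumption loosely modeled on what TransformedTargetRegressor does.

```python
from sklearn.utils import check_array


def validate_target(y):
    """Return y as an ndarray plus a 2D column version of it (hypothetical helper)."""
    # ensure_2d=False lets a flat list through; dtype=None keeps the input dtype.
    y = check_array(y, ensure_2d=False, dtype=None, allow_nd=True)
    # Promote a 1D target to a single column so both layouts look the same downstream.
    y_2d = y.reshape(-1, 1) if y.ndim == 1 else y
    return y, y_2d


flat, flat_2d = validate_target([0, 1, 1, 0])
column, column_2d = validate_target([[0], [1], [1], [0]])
print(flat.shape, flat_2d.shape)      # (4,) (4, 1)
print(column.shape, column_2d.shape)  # (4, 1) (4, 1)
```

Remembering the original number of dimensions makes it possible to squeeze predictions back to the layout the caller supplied.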
With the fit method in place, we turn our attention to
predict which is where the magic happens.
We use the
check_is_fitted function to make sure that the particular transformer is fit. sklearn has an internal convention: it checks for an attribute with a trailing underscore (_) to know whether an instance has been fit. Since we don't know the exact classifier or transformer, we cannot get any more specific than that. Unfortunately, Pipelines do not set the conventional attributes at fit time, so we have no way of knowing whether they have been fitted.
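To see the trailing-underscore convention in action, here is a small sketch using LabelEncoder as an arbitrary example estimator (an assumption for illustration; any estimator that sets fitted attributes would do).

```python
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import check_is_fitted

encoder = LabelEncoder()

# Before fitting there is no attribute ending in "_", so the check raises.
try:
    check_is_fitted(encoder)
    fitted_before = True
except NotFittedError:
    fitted_before = False

# Fitting sets classes_, which is what the check looks for.
encoder.fit(["urban", "non-urban"])
check_is_fitted(encoder)  # no exception this time
fitted_after = True

print(fitted_before, fitted_after)
```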
Next, we run the predict step of the classifier and run an inverse transformation on its output (running the fit steps in reverse). This is why we made sure to define the
inverse_transform method in our previous custom transformer. After that, we make sure to handle data dimension issues before returning the output.
Lastly, we have to add an updated
score method. We cannot simply just use the inherited scoring method since we need to transform the ground truth variable before scoring. We use accuracy just the same as the super class.
Finally, we're ready to see if this class works!
Let's generate some random data with 100 samples to test in a
Pipeline. We'll cover exactly how
Pipelines work in the next section so don't worry about the details here. We just want to make sure that we see strings instead of numbers as the output of a prediction, which we do indeed see.
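For reference, the whole meta estimator and a quick check could be sketched as follows. This is a simplified reconstruction under assumptions: the class name `TargetTransformClassifier` is hypothetical, the input validation and dimension handling described above are left out for brevity, cloning the inner objects is my own design choice, and a plain LabelEncoder stands in for the transformer pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import check_is_fitted


class TargetTransformClassifier(ClassifierMixin, BaseEstimator):
    """Fit a classifier on a transformed target (hypothetical name, sketch only)."""

    def __init__(self, classifier, transformer):
        self.classifier = classifier
        self.transformer = transformer

    def fit(self, X, y):
        # Clone so the objects passed to the constructor stay untouched.
        self.transformer_ = clone(self.transformer).fit(y)
        self.classifier_ = clone(self.classifier).fit(
            X, self.transformer_.transform(y)
        )
        return self

    def predict(self, X):
        check_is_fitted(self)
        encoded = self.classifier_.predict(X)
        # Undo the target transformation so callers see the original labels.
        return self.transformer_.inverse_transform(encoded)

    def score(self, X, y):
        # Transform the ground truth first, then use accuracy like the parent class.
        return accuracy_score(
            self.transformer_.transform(y), self.classifier_.predict(X)
        )


# 100 random samples with string labels derived from the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X[:, 0] > 0, "yes", "no")

model = TargetTransformClassifier(LogisticRegression(), LabelEncoder()).fit(X, y)
predictions = model.predict(X[:5])
print(predictions)
```

The predictions come back as the original strings rather than the integers the classifier was trained on.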
Pipelines allow you to run multiple operations on an input dataset in succession before an estimation is performed. They were originally designed to be a linear step-by-step transformation template, but there are now additional tools in the
Pipeline toolkit that allow for horizontal joining of "columns". This is out of scope for this blog post but you can check out the docs for that here.
The main use case for this approach becomes apparent when we look to perform experiments on a particular machine learning approach. Pipelines make it straightforward to perform cross validation and grid searching across both the feature engineering and estimator hyperparameters. Pipelines make it very hard to accidentally get your data into an improper state which often happens when you have the same input variable being modified across multiple jupyter notebook cells.
In the next two cells we define a
Pipeline which is applied to our PBP dataset. Since the last step in the pipeline inherits from
BaseEstimator, the pipeline behaves exactly like an estimator (enabling methods like
score). When we call fit, each of the steps fits the transformer and transforms the data before passing it to the next step.
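As a simplified sketch of such a pipeline, the example below wires up only the feature side (vectorizer, TF-IDF, classifier) on fabricated data. The drug names and labels are assumptions, and the response transformation is left out so that the block stays short.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical prescription counts and labels, not rows from the PBP dataset.
X = [
    {"DRUG_A": 20, "DRUG_B": 3},
    {"DRUG_A": 15},
    {"DRUG_C": 9, "DRUG_B": 1},
    {"DRUG_C": 12},
]
y = ["urban", "urban", "non-urban", "non-urban"]

pipe = Pipeline(
    steps=[
        ("vectorizer", DictVectorizer()),       # dicts -> sparse count matrix
        ("tfidf", TfidfTransformer()),          # counts -> TF-IDF weights
        ("classifier", LogisticRegression()),   # final step is the estimator
    ]
)

# fit runs fit_transform on each transformer in turn, then fits the classifier.
pipe.fit(X, y)
prediction = pipe.predict([{"DRUG_A": 8}])
print(prediction)
```

Because the whole chain is one object, the same transformations are guaranteed to run at fit time and at predict time.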
The following are descriptions of each of the steps across both estimators
- Feature Transformation
DictVectorizer: Converts the input column of dictionary values into a sparse nxm matrix where n is the number of rows and m is the number of unique prescriptions. The value in each cell is the frequency with which the drug was prescribed by a practitioner.
TfidfTransformer: We can think of the above dictionary almost like a "document" with frequency counts of terms. Hence, we can use term frequency, inverse document frequency (TFIDF) to process those terms into a form that would be more meaningful for a classification model
- Response Transformation (
- Transformer (also a
JSONTargetSelector: We supply this step a chosen response variable
OrdinalEncoder: This step converts a list of strings into numbers incrementing one by one.
FunctionTransformer: This step is a convenience method that allows us to apply a lambda function to the entire dataset. We make sure to reshape the data to a column vector as is required by most
LogisticRegression: We choose to use logistic regression because it is the simplest classification model
- Transformer (also a
Here we also clone
settlement_pipe, which duplicates its structure and parameters without keeping its state. We then swap out
JSONTargetSelector with an instance that has the right target for the second pipeline.
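The duplicate-then-swap pattern can be sketched with sklearn's clone and set_params. The pipeline below is a toy stand-in with hypothetical step names, not the settlement pipeline itself.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline standing in for the first, already fitted, pipeline.
original = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression(C=0.5))]
)
original.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# clone copies the structure and parameters but none of the fitted state.
copy = clone(original)

# Swap out a single step by name, leaving the original pipeline untouched.
copy.set_params(clf=LogisticRegression(C=2.0))

print(original.get_params()["clf__C"], copy.get_params()["clf__C"])
```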
Let's do some quick evaluations to see what the output looks like. Carefully selecting some specific rows which do not have too many prescriptions, we find the following output. Looks like 100% accuracy on this cherry-picked dataset... time to call it a day?
Anyways, when we actually score the models, we find the following training accuracies. Not too bad for no hyperparameter tuning.
- One interesting improvement here could be to add an extra custom transformer before the
DictVectorizer. One that removes frequent and infrequent medications may improve performance. You should have all the tools to add this step to the pipeline.
- Another improvement to deepen your understanding of pipelines would be to try and see how you could use
GridSearchCV to find a more optimal model. It would be especially useful to see how tuning interplays with your custom transformer from the previous step.
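As a starting point for the grid search suggestion, here is a hedged sketch on fabricated data. The step names, the parameter grid and the two-fold split are all assumptions chosen to keep the example tiny.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical data: 20 providers split evenly across two labels.
X = [{"DRUG_A": 5 + i, "DRUG_B": 1} for i in range(10)]
X += [{"DRUG_C": 5 + i, "DRUG_B": 1} for i in range(10)]
y = ["urban"] * 10 + ["non-urban"] * 10

pipe = Pipeline(
    [
        ("vectorizer", DictVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("classifier", LogisticRegression()),
    ]
)

# Parameters are addressed as <step name>__<parameter name>, so feature
# engineering and estimator settings are tuned in the same search.
param_grid = {
    "tfidf__use_idf": [True, False],
    "classifier__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

A custom transformer added to the pipeline would expose its constructor arguments to the grid in the same way, provided it inherits from BaseEstimator.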
The best place to learn more about these advanced
sklearn features is actually their documentation. They have a user guide which is quite good, and their API is well documented as well. Additionally, I picked up most of the transformer tricks by reading the source code of the feature extraction transformers. If you have access to the PyCharm debugger, it is also really useful to step through the code as
Pipeline.fit runs since a lot of functionality is abstracted away.