31 Jul 2017

Machine learning pipelines with Scikit-Learn

Pipelines in Scikit-Learn are far from being a new feature, but until recently I had never really used them in my day-to-day work with the package. I found the documentation sparse, mainly consisting of contrived examples rather than covering practical use cases.

I recently started reading Aurelien Geron's Hands on Machine Learning with Scikit-Learn and TensorFlow, and finally I encountered some simple examples that actually brought the concept of data pipelines to life.

I have a notebook available here that accompanies this post- hopefully it will bring everything to life with a worked example.

Why pipelines?

If you come from an R background using the caret package, you may find Scikit-Learn a little frustrating to use. The caret package incorporates many of the patterns you encounter in supervised machine learning into its caret::train function, making the process feel almost too easy.

Scikit-Learn, however, requires you to dig around creating and fitting objects for your preprocessing, and initially I found it very difficult to maintain a neat workflow of preprocessing, model training and evaluation.

Pipelines bring more than two advantages, but there are two powerful things they can do which convinced me of their worth:

  • You can treat preprocessing steps as hyperparameters. As a simple example, perhaps you can't decide whether you should scale your features before training your learning algorithm. With a pipeline, you can in principle treat this like any other hyperparameter, and use Scikit-Learn's GridSearchCV class to decide whether you should perform this processing or not. Using caret, you cannot easily treat the preprocessing like a hyperparameter, and would have to build multiple models and compare them manually.

  • Once you have fitted your pipeline, it is just a case of using its transform or predict method to run your test set through, either for processing data or for making predictions from an estimator, respectively. This is much cleaner than, say, spending an evening developing your model using your training set, then having to work backwards to recreate the same data preparation steps for your test data set.

No free lunch

Unfortunately, I encountered a bit of a learning barrier to get up and running with pipelines. The documentation only shows contrived examples, as opposed to practical use cases. To learn, you are pretty much at the mercy of blogs like this one :)

There are also some difficulties- some of these are the reasons why I avoided pipelines for so long.

  • Not all of Scikit-Learn's built-in preprocessing transformers are 'toggleable'. That is, if you want to try switching the preprocessing step on or off as you decide the best parameters of the pipeline, you may have to build your own wrapper around the class to give this functionality.

  • You will have to learn a few simple patterns to build your own custom transformers- chances are you will want to go beyond Scikit-Learn's suite of built-in transformers in sklearn.preprocessing.

  • Scikit-Learn runs on numpy arrays, and data scientists tend to think in terms of data frames. Keeping track of feature names as you pass them through a pipeline can be messy- especially if you have multiple steps where you may generate new features and remove others.

I will be demonstrating how I overcame these issues in this post, and hopefully give you some inspiration to get you up and running! But first, let's have a quick overview of how to use the Pipeline class and its best friend, FeatureUnion.

Pipeline and FeatureUnion

The two main building blocks are the Pipeline and FeatureUnion classes, which you can import from sklearn.pipeline.

Transformers have to have a fit and a transform method, and normally a fit_transform method that does both steps in one. They may have other useful methods or attributes, but it is the fit and transform methods that are required as data is sent through the pipeline.

The idea is that you fit on and then transform your training data, and then just transform the test data using the parameters learned from training. For example, if you perform a PCA transformation, you learn the loadings on the training set (fit), which you then apply to the test set (transform).
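To make that concrete, here is a minimal sketch of the pattern using PCA (the data is just random placeholder data standing in for a real train/test split):

import numpy as np
from sklearn.decomposition import PCA

# placeholder data standing in for a real train/test split
X_train = np.random.rand(100, 5)
X_test = np.random.rand(20, 5)

pca = PCA(n_components=2)

# fit: learn the loadings from the training data only
pca.fit(X_train)

# transform: apply those learned loadings to both data sets
X_train_reduced = pca.transform(X_train)
X_test_reduced = pca.transform(X_test)   # no refitting on the test data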

If your pipeline is for machine learning, the final step will be an estimator rather than a transformer. For example, your final step could be a random forest model that ingests the data prepared in the previous steps and outputs predictions. If your pipeline ends with an estimator, the pipeline will have a predict method, and the other associated behaviour of the model will be accessible through the attributes of the pipeline.

Pipeline objects allow you to build up transformers to be executed sequentially. FeatureUnion objects allow you to run transformations in parallel. For example, if you have preprocessing or transformation steps specific to categorical data, you would want to run these in parallel to the steps for your numerical features, as the numerical steps may not be appropriate for categorical data.

Now, Pipeline and FeatureUnion objects can be nested within each other, allowing the flexibility to build up fairly intricate pipelines. You may start to build up complexity (which is normally a bad thing), but the trade-off is that the resulting pipeline will be a self-contained object, ready to run new data through, or to pass to GridSearchCV for hyperparameter tuning.

Stages of the pipeline can be accessed as key-value pairs via the get_params method. I really encourage you to play around- reading blogs is no substitute for practice!
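To give a flavour of how these pieces fit together, here is a hedged sketch of a nested pipeline with a final estimator. The step names and branches here are made up for illustration- in practice each branch of the union would usually start with a column-selection step, such as the DataFrameExtractor described later.

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler

# two branches run in parallel; their outputs are concatenated column-wise
union = FeatureUnion([
    ('scaled', Pipeline([('scaler', StandardScaler())])),
    ('reduced', Pipeline([('scaler', StandardScaler()),
                          ('pca', PCA(n_components=2))])),
])

# the union is itself one step of an outer pipeline, which ends in an estimator
model = Pipeline([
    ('union', union),
    ('logistic_regression', LogisticRegression()),
])

# steps and their parameters are exposed as key-value pairs
print(sorted(model.get_params().keys()))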

Custom transformers

I found that Scikit-Learn contains many useful transformers, but chances are you will want to implement some custom behaviours. For example, I wanted a transformer that will remove features that have zero variance (i.e. the same value for all training instances), and have a 'switch' to optionally remove low variance features. Some learning algorithms will be adversely affected by low variance features, whereas others may be more resilient.

We can inherit from sklearn.base.BaseEstimator and sklearn.base.TransformerMixin to borrow useful behaviour. Now, the minimum we have to define is a fit() and a transform() method.

Below is an example of a transformer to remove zero variance predictors (you can see the complete code base here). The class gives analogous capabilities to the R caret::nearZeroVar function. The methods do the following:

  • __init__: initialises the object. We can specify if we want to remove low variance as well as zero variance features. For low variance, we specify thresholds for filtering.

  • fit: tests the columns of the feature matrix for zero or low variance. The boolean arrays zero_var and near_zero_var flag the columns we will filter out when we transform.

  • transform: filter those columns tagged for removal.

  • get_feature_names: as I have mentioned, it can be a pain to keep track of what the features in the numpy array are. This method takes an array of the initial feature names as an argument, and returns a filtered array of names corresponding to the remaining features in the correct order. You could use this to move from a numpy array back to a pandas dataframe.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ZeroVariance(BaseEstimator, TransformerMixin):
    '''
    Transformer to identify zero variance and optionally low variance features
    for removal
    This works similarly to the R caret::nearZeroVar function
    '''
    def __init__(self, near_zero=False, freq_cut=95/5, unique_cut=10):
        '''
        near_zero: boolean.
                   False: remove only zero variance features.
                   True: remove near zero variance features as well.
        freq_cut: cutoff for the ratio of the count of the most frequent
                  value to the count of the second most frequent value
        unique_cut: cutoff for the percentage of unique values
        '''
        self.near_zero = near_zero
        self.freq_cut = freq_cut
        self.unique_cut = unique_cut

    def fit(self, X, y=None):
        self.zero_var = np.zeros(X.shape[1], dtype=bool)
        self.near_zero_var = np.zeros(X.shape[1], dtype=bool)
        n_obs = X.shape[0]

        for i, col in enumerate(X.T):
            # obtain values, counts of values and sort counts from
            # most to least frequent
            val_counts = np.unique(col, return_counts=True)
            counts = val_counts[1]
            counts_len = counts.shape[0]
            counts_sort = np.sort(counts)[::-1]

            # if only one value, is ZV
            if counts_len == 1:
                self.zero_var[i] = True
                self.near_zero_var[i] = True
                continue

            # ratio of most frequent / second most frequent
            freq_ratio = counts_sort[0] / counts_sort[1]
            # percent unique values
            unique_pct = (counts_len / n_obs) * 100

            if (unique_pct < self.unique_cut) and (freq_ratio > self.freq_cut):
                self.near_zero_var[i] = True

        return self

    def transform(self, X, y=None):
        if self.near_zero:
            return X.T[~self.near_zero_var].T
        else:
            return X.T[~self.zero_var].T

    def get_feature_names(self, input_features=None):
        if self.near_zero:
            return input_features[~self.near_zero_var]
        else:
            return input_features[~self.zero_var]

One gotcha is to ensure fit returns self: you get a fit_transform method for free if you inherit from BaseEstimator and TransformerMixin, but it will fail if fit doesn't return self. This is because, unless you override it, fit_transform simply chains the two methods together.
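In other words, the mixin gives you something roughly like this (a simplified sketch, not the actual Scikit-Learn source):

# simplified sketch of the fit_transform method provided by TransformerMixin
def fit_transform(self, X, y=None, **fit_params):
    # if fit does not return self, the chained .transform() call below fails
    return self.fit(X, y, **fit_params).transform(X)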

I have defined several transformers that I hope will be of use: feel free to dig around the code in the example repo, and use them if they are helpful for your work. They are:

  • FindCorrelation: inspired by a helper function from the R caret package. The user sets a threshold, and the fit method will record the column indices where the pairwise correlation between features exceeds this threshold. The transform method then filters according to these indices.

  • ZeroVariance: inspired by a helper function from the R caret package. The user can optionally filter low variance features as well as zero variance features, and can specify the criteria for defining 'low' variance. The fit method records which columns fall foul of these criteria, and the transform method then filters them out.

  • DataFrameExtractor: taken from Hands on Machine Learning with Scikit-Learn and TensorFlow- this allows us to specify the columns from a pandas dataframe that we wish to run through the pipeline.

  • ManualDropper: this allows the user to specify column indices to drop. This could be useful, for example, where one-hot encoding has been performed on a categorical feature. If you are worried about multicollinearity affecting the model, you will want to drop one of the columns following the encoding.

  • OptionalStandardScaler: this is essentially a wrapper around sklearn.preprocessing.StandardScaler, simply adding the functionality to toggle the scaling on or off. This is so that we can treat the scaling as a hyperparameter during model selection (a minimal sketch of this idea follows the list).

  • PipelineChecker: at present this is very simple, essentially just checking that the same number of features is present in the training data set as in any other data set passed through the pipeline. I plan to extend this in future to warn the user when (previously unseen) extreme data points are observed- this would imply that any predictions from a learning algorithm would be extrapolating away from the space where the model was trained.
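As promised above, here is a minimal sketch of what a wrapper like OptionalStandardScaler could look like (the version in the example repo may differ in the details):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class OptionalStandardScaler(BaseEstimator, TransformerMixin):
    '''
    Wrapper around StandardScaler with a switch to toggle scaling on or off,
    so that scaling can be treated as a hyperparameter during model selection
    '''
    def __init__(self, scale=True):
        self.scale = scale

    def fit(self, X, y=None):
        if self.scale:
            self.scaler = StandardScaler().fit(X)
        return self

    def transform(self, X, y=None):
        if self.scale:
            return self.scaler.transform(X)
        return X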

You can check out the example to see the final pipeline I created for my churn model.

Tying it all together with a final estimator and GridSearchCV

With a data processing pipeline ready, we can use it as a machine learning pipeline by appending an estimator (i.e. learning algorithm) on the end.

It is possible to include several estimators in a single pipeline, and allow the choice of model to be a hyperparameter to select. To me, however, that is a little messy. For example, it may not just be accuracy that is important: perhaps you care about interpretability or computational complexity as well. I think it is best to treat each model separately, and then evaluate which one is best for your use case.

Luckily, we can just make copies of the preparation pipeline and append different estimators on the end. The following code demonstrates how to do this, then specifies the hyperparameters for both the data preprocessing and the model tuning. We return a GridSearchCV object, which is a wrapper around our Pipeline object.

import copy

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

'''
assume we have a pipeline that processes features
called prep_pipe (see example notebook)

logistic regression. take a copy of the pipeline,
and add an estimator to the end as a named step
'''
lr_est = copy.deepcopy(prep_pipe)
lr_est.steps.append(('logistic_regression',
                     LogisticRegression(random_state=1234)))

'''
set the hyperparameter grid.
Good news: we can treat preprocessing steps as we would
any other hyperparameter.
Be careful though, as you can easily blow up the number
of models you will be building (especially with CV as well)

We refer to the steps of the pipeline by using
double underscores
'''
lr_param_grid = dict(
      # should the data be centered and scaled?
      union__num_pipeline__opt_scaler__scale=[True, False],
      # try with no interactions, and degree 2 polynomial features
      union__num_pipeline__poly_features__degree=[1,2],
      # hold penalty type fixed at l1
      logistic_regression__penalty=['l1'],
      # specify values of penalty to test
      logistic_regression__C=[0.001, 0.01, 0.1, 1, 10]
      )

'''
initialise GridSearchCV object.
here, use 5-fold CV
'''
grid_search_lr = GridSearchCV(estimator=lr_est,
                              param_grid=lr_param_grid,
                              scoring='roc_auc',
                              n_jobs=1,
                              cv=5,
                              refit=True,
                              verbose=1)

'''
fit the GridSearchCV object, which will
obtain the best set of parameters by performing
an exhaustive search
'''
grid_search_lr.fit(features_train, outcome_train)

The GridSearchCV object has a predict method which we can use to run data through the best fitted pipeline. We can access the best model and final Pipeline object as attributes of the object to dissect the pipeline if needed. Once again, I suggest having a play and getting familiar with the class.
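For example (features_test here is a placeholder name for the held-out feature set, by analogy with features_train; the attributes are standard GridSearchCV ones):

# best combination of preprocessing and model hyperparameters found
print(grid_search_lr.best_params_)

# the winning pipeline itself, refitted on all the training data (refit=True)
best_pipeline = grid_search_lr.best_estimator_

# run new data end to end through the fitted pipeline in one call
# (features_test is assumed to be the held-out feature matrix)
test_predictions = grid_search_lr.predict(features_test)
test_probabilities = grid_search_lr.predict_proba(features_test)[:, 1]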

Note that you don't necessarily need to do a grid search with cross-validation- if you are sure you have a good choice of parameters, you could just use the pipeline's fit method.

Gotchas

Well, you now have some commentary and a worked example. I'll start to wrap up now with some lessons learned:

  • Use pandas as much as possible for preparing data. To be honest, the process of one-hot encoding categorical data with Scikit-Learn is an absolute pain. Use pandas, and don't include it in your pipeline. Why, you may ask? As Scikit-Learn models run from numerical data in numpy arrays, there is no conceivable scenario where you would not want to encode categorical data. Therefore, it's not really a hyperparameter you would want to tune; you have to do it. So, if there are steps you have to do, preprocess with pandas before you get to the pipeline (see the short sketch after this list).

  • Watch out for how quickly the number of models you have to train explodes when you have a lot of hyperparameters. The grid above already has 2 x 2 x 1 x 5 = 20 candidate combinations, and with 5-fold cross-validation that means 100 separate fits. Choose carefully, or you may be waiting a long time for your model to fit!

  • It's a pain to track feature names through the pipeline- pandas dataframes are so much easier to work with than numpy arrays for real-world data. It is possible to keep track of feature names, you just have to work for it. It pays off when you want to match up model coefficients to feature names, especially if your pipeline has been creating and dropping features.
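On the one-hot encoding point, this is the sort of pandas preprocessing I mean (the dataframe and column names here are made up):

import pandas as pd

# toy dataframe standing in for the real data
df = pd.DataFrame({
    'tenure': [1, 12, 24],
    'contract': ['monthly', 'annual', 'monthly'],
})

# one-hot encode the categorical column before the data reaches the pipeline;
# drop_first avoids the multicollinearity issue mentioned above
encoded = pd.get_dummies(df, columns=['contract'], drop_first=True)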

Other thoughts

So there we have it- I hope this is a fairly comprehensive example of how to get up and running with machine learning pipelines using Scikit-Learn.

I was really keen to get to grips with these, especially as Spark's MLlib uses a similar paradigm. It took a little bit of effort to get going, but now I really appreciate how powerful a technique they are. I will certainly be using them going forwards for my supervised machine learning work.

TL;DR- I finally had a go with Scikit-Learn pipelines. They are practical, although there is a learning curve that presents itself as a bit of an entry barrier. I wish I had spent some time investigating them a long time ago, as they will make my life much easier going forwards!