15 Aug 2016
My Predictive Modeling Workflow
Over recent years I have found the workflow for the majority of the data science projects I have worked on to be surprisingly counter-intuitive. Everyone gets excited and hung up over complex learning algorithms, but the reality of the situation is that by the time you reach that stage, you have either already won or lost. The anecdotal saying is that only 20-30% of the time required is spent building the model; the rest of the work is around obtaining, understanding and cleaning data.
The following post addresses the steps of preparing and fitting models to data that has already been extracted and cleaned. Even if this is only 20-30% in terms of the time required, you still need to understand how learning algorithms work and have a decent grasp of how to fit them, or else you will produce models that perform poorly and do not suit the context of the problem at hand. For the other 70-80% of the work, I suggest you check out the Guerrilla Data Analytics approach, which has been serving me well over recent months.
I will keep the following at a relatively high level, and generalise it for both regression and classification supervised learning problems.
The first step is to read in your raw data. I'll assume you are using Python or R, so you will be reading it into memory. I'll also assume that you have sufficient memory available to do this! I read the data into a data frame, either from a flat text file, or by directly connecting to a database (if that is where the data is located).
Next comes splitting the data- the first (of many) instances where one size fits all doesn't work! Some people will tell you that you need a train, validation and test set. I think the truth is: it depends. A train and test set is great if you have the luxury of enough data; however, if you have limited data, you need as much as possible to fit your model! A validation set is only really needed if you want to do post-processing on your model: perhaps you want to investigate the optimum probability cutoff for a classifier, or combine models together to create an ensemble and need to determine the relative weights of each model's predictions.
While an independent test set is always good for evaluating how well your model generalises, I will advocate that when data is limited (tens to hundreds of instances), a good cross validation setup is a reasonable substitute for a test set (try something like leave-one-out cross validation).
So to summarise, my recommendations are:
- Perform a 70 : 30 percent train : test split if you have the luxury of having a reasonably large data set
- Perform a 70 : 10 : 20 train : validation : test split if you plan on intermediate steps for calibrating probabilities, deciding on alternative class probability cutoffs, determining weights to combine models in an ensemble, etc.
- If data is very scarce, use a computationally expensive, but thorough, cross validation scheme such as leave-one-out cross validation when you train your model.
Following the split, every preprocessing step that ensues on the test or validation set (if you are using them) must mirror the action on the training set. For example, if you center and scale the columns in the training set, perform this action using the same mean and variance you obtained from the training set when centering and scaling the validation and test sets.
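To make this mirroring concrete, here is a minimal sketch using scikit-learn's StandardScaler on synthetic data (the data and shapes are invented for illustration): the scaler learns its mean and variance from the training set only, and those same statistics are reused on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(30, 3))

# Fit on the training set ONLY: the mean and variance come from here.
scaler = StandardScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # same training statistics reused

# The training columns are now centred at 0 with unit variance;
# the test columns will be close to that, but not exactly.
print(X_train_scaled.mean(axis=0).round(6))
```

The important detail is that `fit` is never called on the test (or validation) data.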
Also, remember in some circumstances that the context may drive the choice of data splitting. There is a great case study in Kuhn and Johnson's Applied Predictive Modeling where they build a model to predict the success of a grant application, and they choose to only include the most recent year's data in the test set, to ensure its relevancy, and all other data in the training set (they actually describe a couple of schemes- the chapter is well worth reading).
Following your split, you must assume that you can only 'see' your training set. Before I start exploring the data visually, I find it useful to obtain some basic summaries of the training set. What we are really looking for is:
- The type of predictors (continuous numeric, discrete numeric, categorical)
- Are there missing values (nulls / NAs)?
- Are there duplicate rows?
- For a classification exercise, is there class imbalance?
- For a regression exercise, what is the distribution of the outcome, and will it need transforming? (Some models will struggle if the outcome is highly skewed, for example.)
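The checks above can be obtained with a few one-liners in pandas. A minimal sketch, using a toy training frame (the column names are invented for illustration):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "age": [34, 51, np.nan, 34, 29],
    "colour": ["red", "blue", "red", "red", "green"],
    "outcome": [1, 0, 1, 1, 1],
})

print(train.dtypes)              # predictor types
print(train.isna().sum())        # missing values per column
print(train.duplicated().sum())  # number of duplicate rows
print(train["outcome"].value_counts(normalize=True))  # class balance
```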
When duplicates are observed, I do the following:
- Eliminate full duplicates (i.e. features and outcome are identical)
- Average the outcome of duplicate features (i.e. features are identical, but outcome is different).
These will ensure that each observation is unique.
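Both rules are easy to express in pandas; a sketch with invented feature columns `x1`/`x2` and outcome `y`:

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [1, 1, 2, 2, 3],
    "x2": [5, 5, 6, 6, 7],
    "y":  [10, 10, 4, 8, 1],
})

# 1. Drop full duplicates (features AND outcome identical).
df = df.drop_duplicates()

# 2. Average the outcome where features are identical but y differs.
df = df.groupby(["x1", "x2"], as_index=False)["y"].mean()

print(df)  # each (x1, x2) combination is now unique
```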
Where missing values are observed, I recommend investigating for any patterns: are they missing at random? If not, what are the patterns? It's always best to understand these things as much as possible. Dealing with missing values comes further down the line, but it's crucial to know they are there so that you don't get surprised later.
My preference is to then separate the features and outcomes into separate data frames (or a data frame and a vector). I can then manipulate the features, and join back to the outcome when I am ready to begin modelling.
Separate numeric and categorical features
Here, I suggest separating categorical and numeric inputs, as these will be processed differently. With the exception of some tree-based models in some R packages, categorical features will need to be processed into 'dummy' variables (also known as one-hot encoding). I like to separate here, and then join together once both sets have been processed.
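In pandas this split is a one-liner with `select_dtypes`; a small sketch (column names invented):

```python
import pandas as pd

features = pd.DataFrame({
    "age": [34, 51, 29],
    "income": [50.0, 72.5, 41.2],
    "colour": ["red", "blue", "green"],
})

numeric = features.select_dtypes(include="number")      # age, income
categorical = features.select_dtypes(exclude="number")  # colour

print(list(numeric.columns))
print(list(categorical.columns))
```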
Exploring the data visually is such a key step. It's crucial to gain some understanding of not only how the features relate to the outcome, but also how they relate to one another.
A scatterplot matrix is great to understand pairwise relations between predictors. When it's a classification exercise, it makes sense to color the points with the outcome class, and for a regression exercise you may consider the size/opacity of the points to correlate to the outcome.
Always consider plotting the features against the outcome. Different types of plots can be used for different applications. For classification, I might make histograms or density plots grouped by outcome class to identify which predictors display a class separation. For a regression exercise, I favor plotting features vs the outcome, perhaps with a loess smoother to guide the eye.
Now that we have intuition about the relationship between the predictors and the outcome (and one another), we can start by performing some filtering and preprocessing on the numeric predictors. This is known as unsupervised feature selection: we do not consider the relationship between the predictors and the outcome as we follow these steps, only the between-predictor relationships.
It's crucial to investigate correlations. Fully correlated features give the same information, and therefore all but one are redundant. If you include them, in the best case, when using a model that can perform feature selection intrinsically, the model will randomly choose one of the correlated predictors, and you will be unable to assess predictor importance with any confidence. In the worst case, you may be unable to fit the model at all: for example, if you are fitting a linear model, you will find you have a non-invertible matrix and your system cannot be solved for the model coefficients.
To resolve this, I start by calculating a correlation matrix. From there, you want to eliminate strongly correlated predictors using a sensible procedure. The R caret package has a great function for doing this, else you can devise your own strategy.
I like to choose the thresholds for filtering based on the magnitude of the correlations. Predictors with pairwise correlations below 0.85 - 0.9 will be included in a 'reduced set'. Predictors with correlations below 0.99 will be included in a 'full set'. All others will be discarded. The full set is destined for models that can perform feature selection intrinsically (trees, rules, MARS, elastic net, ...) and the reduced set is for models that cannot (linear models, neural networks, ...)
As a note, I prefer a strategy along the lines of 'tag, don't filter'. I create a character vector or list of feature names to be included in the full or reduced set, and use it later to subset the data frame of inputs.
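A rough Python sketch of the 'tag, don't filter' idea: build lists of column names for each threshold rather than dropping columns in place. The greedy elimination below is a simplification of what caret's findCorrelation does, and all the column names and thresholds are illustrative.

```python
import numpy as np
import pandas as pd

def tag_uncorrelated(df, threshold):
    """Greedily keep columns so that no kept pair exceeds the threshold."""
    corr = df.corr().abs()
    keep = []
    for col in df.columns:
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return keep

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
noise = rng.normal(size=1000)
df = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=1000),  # near-duplicate of a
    "c": rng.normal(size=1000),             # independent
    "d": 0.92 * a + 0.39 * noise,           # moderately correlated with a
})

full_set = tag_uncorrelated(df, 0.99)     # for models with built-in selection
reduced_set = tag_uncorrelated(df, 0.85)  # for models that need clean inputs
print(full_set)
print(reduced_set)
```

Here 'b' is discarded everywhere, while 'd' survives only into the full set.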
Another crucial check is to look for non-informative predictors. 'Zero variance' predictors have the same value for each sample, and therefore offer no information for your model. Even near-zero variance predictors may not be of use, so it's best to tag these near-zero variance predictors into your full set (for models that will filter them if necessary), and only keep the safe predictors for your reduced set.
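A simple sketch of zero- and near-zero-variance tagging with pandas. The frequency-ratio rule loosely follows the idea behind caret's nearZeroVar; the cutoff of 19 (roughly 95/5) is an arbitrary choice for illustration, as are the column names.

```python
import pandas as pd

def variance_tags(df, freq_cutoff=19.0):
    zero_var, near_zero_var = [], []
    for col in df.columns:
        counts = df[col].value_counts()
        if len(counts) == 1:
            zero_var.append(col)              # one value everywhere
        elif counts.iloc[0] / counts.iloc[1] > freq_cutoff:
            near_zero_var.append(col)         # dominated by one value
    return zero_var, near_zero_var

df = pd.DataFrame({
    "constant": [1] * 100,
    "rare_event": [0] * 99 + [1],             # 99:1 frequency ratio
    "healthy": list(range(100)),
})
zv, nzv = variance_tags(df)
print(zv, nzv)
```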
Also, if you have missing data, you may consider an imputation strategy, or perhaps to just remove the rows with missing values.
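If you go the imputation route, the same train-only rule applies as with scaling: the fill values must be learnt from the training set. A minimal sketch with scikit-learn's SimpleImputer (data invented):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 1.0]])

# Mean imputation, with the means computed on the training set only.
imputer = SimpleImputer(strategy="mean").fit(X_train)

print(imputer.transform(X_test))  # NaN replaced by the training column mean
```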
The next steps are very much model dependent. Some models are robust to highly skewed, multimodal, or poorly behaved distributions of the predictors and the outcome. Further, some models may perform poorly if the predictive features are on different magnitude scales. There exist various transformations that you can perform: centering and scaling, Box-Cox, Yeo-Johnson, spatial sign, or others performed manually (taking a log, a square root, etc.). This process can be a bit of an art.
As a note, some libraries (such as the R caret package) can perform these steps 'on the fly' when training models (as an alternative to manually performing these transformations). Other libraries, such as the Python Scikit-Learn library, need these transformation steps to be performed manually.
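As a sketch of one such transformation in scikit-learn: Yeo-Johnson via PowerTransformer (which also centres and scales by default), applied to a synthetic right-skewed predictor:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))  # right-skewed

pt = PowerTransformer(method="yeo-johnson")  # fit on training data only
transformed = pt.fit_transform(skewed)

# The transformed column should be far less skewed than the original.
print(round(skew(skewed.ravel()), 2), round(skew(transformed.ravel()), 2))
```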
Always remember: whatever transformations or filtering you perform on the training set, make sure you perform them on the test and validation sets as well!
If you are using linear models to describe a non-linear problem, you may benefit from calculating interaction terms. I find two general strategies work: either calculate all interaction terms (up to, say, quadratic terms) and then use a model that can perform feature selection (like elastic net for regression, or logistic regression with l1 and l2 penalties for classification), or carefully create interactions 'by hand'- typically by referring to your data visualisations.
With the exception of some tree-based models I have encountered in R packages, models require categorical data to be represented as numeric data. Before blindly doing this, I suggest performing some statistical tests, especially on binomial categorical features. Fisher's exact test can be a little computationally burdensome, so a Chi-squared test is probably your best bet. This will allow you to discard any categorical features that are non-informative straight away (tag them into your full set for models that are resistant to non-informative predictors).
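A sketch of the Chi-squared screen on a binary categorical feature vs a binary outcome, using scipy. The contingency table counts are invented; a large p-value would suggest the feature carries little information about the class.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: feature level (yes/no); columns: outcome class (0/1).
table = np.array([[80, 20],
                  [30, 70]])

stat, p_value, dof, expected = chi2_contingency(table)
print(p_value)  # very small here: feature and outcome are clearly associated
```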
If you have missing values, depending on the context you may decide that it makes sense to have a category for 'not known'. Otherwise, you may have to filter these rows.
Next, I encode the categories. The caret package in R has a function for this, and when using Python I find the Pandas implementation to be a little cleaner than the implementations in Scikit-learn.
Once you have encoded the categories, you need to remove redundant features. However, this can be model dependent:
- For binary categories (i.e. two levels), always discard one of the two encoded columns (the pair will be fully correlated)
- For linear and other non-tree models, always discard one of the levels (e.g. if a category has three levels, red, blue and green, you can remove one, as the result is implicit from the other two: if it is not red or blue, you can infer it is green)
- For trees and rules, remove one column if the category is binary, but I would argue you should keep all the encoded columns when there are more than two levels. A binary tree considers each feature in isolation when deciding on splits: there is no way a tree can infer the category is blue because it is not red or green.
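Both encodings fall out of pandas' get_dummies; a small sketch with an invented three-level category:

```python
import pandas as pd

colours = pd.DataFrame({"colour": ["red", "blue", "green", "red"]})

# One level left implicit: suitable for linear-style models.
for_linear = pd.get_dummies(colours, drop_first=True)

# Every level explicit: my preference for trees and rules.
for_trees = pd.get_dummies(colours)

print(list(for_linear.columns))  # two columns
print(list(for_trees.columns))   # three columns
```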
The final steps are to remove redundant and non-informative features, mirroring the steps for the numeric features. Remember: tag, don't filter!
It is a simple step to merge the data frame containing the numeric features with the transformed categorical features. A final, thorough check for correlations is useful at this step, just in case any of the binarised categorical and numerical features are strongly correlated.
Assess the need for further steps
This stage is pretty vague- I'm afraid it is pretty dependent upon the dataset you are working with. For example, if you are building a classification model, how balanced are your classes? Rebalancing the classes in the training data through up or down sampling is one method you could use to remedy class imbalance. Do you have so many predictive features that dimension reduction is going to be required? Principal component analysis could be of use if this is the case.
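As one sketch of the rebalancing option: up-sampling the minority class with scikit-learn's resample utility (down-sampling or class weights are alternatives; the data here is a toy 8:2 imbalance).

```python
import pandas as pd
from sklearn.utils import resample

train = pd.DataFrame({"x": range(10),
                      "y": [0] * 8 + [1] * 2})  # 8:2 class imbalance

minority = train[train["y"] == 1]
majority = train[train["y"] == 0]

# Sample the minority class with replacement up to the majority size.
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=0)

balanced = pd.concat([majority, minority_up])
print(balanced["y"].value_counts())  # now 8:8
```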
Overall, by now you should have a good understanding of your data and the problem you are trying to solve, so you should be able to make the decision on whether further steps are needed.
We are finally at the stage to begin modelling. A really important step is to decide on your resampling scheme. Some useful things to think about are:
- If you have plenty of data and have decided to split your data into a training and test set, then I think a fairly standard cross validation scheme is to use 3-5 fold cross validation on the training data.
- For a smaller training set, repeated k-fold cross validation may be computationally feasible and is a good choice to obtain confidence intervals for your performance metric.
- If you are comparing model performance via statistical tests, you may want to consider using bootstrap resampling, as you will obtain a comparatively low variance for your performance metric.
- If data is pretty scarce, and you don't have a separate test set, I would suggest leave-one-out cross validation.
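The schemes above can all be expressed through scikit-learn's model_selection module; a minimal sketch with a placeholder model and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, RepeatedKFold,
                                     cross_val_score)

X, y = make_classification(n_samples=60, n_features=5, random_state=0)
model = LogisticRegression()

scores_kfold = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
scores_repeated = cross_val_score(
    model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=3, random_state=0))
scores_loo = cross_val_score(model, X, y, cv=LeaveOneOut())

# One score per held-out fold: 5, 15 (5 folds x 3 repeats), and 60.
print(len(scores_kfold), len(scores_repeated), len(scores_loo))
```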
Next, it's time to begin modelling. As discussed above, there are models that are robust to non-informative features (i.e. can perform internal feature selection), and those that are not. I suggest reading up a bit so you know which should be trained using the full feature set, and which should be trained using the reduced feature set.
Your choice of model needs to take into account a few key points:
- how important is it that the model is interpretable?
- does the model need to be deployed as a service, or will it only be used offline?
- are there any constraints due to available computational resources for model fitting?
I suppose models fall roughly into three camps: 'black boxes', such as random forests and C50; 'opaque models', such as MARS and FDA; and 'transparent models', such as linear models or penalised linear models. If you need interpretable model coefficients, a linear model that performs acceptably is of more use to you than a highly performant random forest model.
If the model needs to be deployed as a service, think carefully about your choice. Good luck implementing an efficient C50 model from scratch! You can decompose 'transparent' to 'opaque' models, which have simple prediction functions, and hand them over to developers. However, some models are so complex that you will need to take other steps, such as writing or purchasing a framework that can wrap up and deploy Python or R models as a service.
It is worth thinking about how long it will take to fit your models. Fitting a random forest model to a large dataset with resampling can take a very long time (even if you can parallelise the resampling). When training you also need to take into account that most models have parameters that the model needs to be tuned over. For example, when training random forests, I usually fix the number of trees and the depth, and vary the number of features randomly selected at each split. If I have 5 candidate choices for the model parameter, and use 5-fold cross validation, I am fitting 25 models in total to obtain my final choice of parameter.
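The random forest tuning example above maps directly onto scikit-learn's GridSearchCV: fix the number of trees, vary the number of features tried at each split (`max_features`), and 5 candidates with 5-fold cross validation gives 25 fits before the final refit. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),  # trees fixed
    param_grid={"max_features": [1, 2, 3, 4, 5]},  # 5 candidate values
    cv=5,                                          # 5-fold cross validation
)
grid.fit(X, y)  # 5 x 5 = 25 fits, then one refit with the best parameter

print(grid.best_params_)
```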
Once I have chosen one (or more) of my trained models by assessing their performance on the training set, I may want to take some further steps. For example, I may:
- Calibrate the probabilities for a classification model
- Investigate alternative probability cutoffs to define the hard class assignment for a classification model
- Investigate creating a model ensemble (classification or regression)
For these actions, a validation set is necessary, so that you do not end up over-fitting to the test set.
This is a logical place for a 'final' step, but in reality model assessment may lead you to loop back to other steps in the model fitting process.
The step most people think of is simply assessing the model's performance on the test set (if there is one). I would give you a word of warning here: unless your test set is very large, you should be skeptical about valuing test set performance more than the cross validation performance obtained when training. The idea with a test set is to see how well your model generalises, and a small test set cannot really provide a good indication of this.
You may also be deploying your model, so assessment can include some feedback from developers (does your model perform well as a service, or is it complex to implement and slow to provide predictions?) and key stakeholders.
Further, if your model is being used as a service to make real time predictions, you may need to consider setting up A/B testing to verify its effectiveness. For a concrete example, imagine your model is being used for a recommendation engine. You may want to try two models: a control where the customer is recommended a random selection of products, and your predictive model. Remember that people can be easy to persuade, especially when the service has a good user interface. Even bad recommendations can influence people to buy items, so if you have fitted your model to historical transaction data (where the customers did not receive recommendations influencing their buying decisions) you need to ensure that your model performs well 'in the wild'. Your metric, in this case, might look at additions to basket through your recommendations.
Well, we have made it through to the end! I appreciate that I may have glossed over some of the deeper details, but my aim was to provide a succinct overview, in the hope that these steps will be of use. Always remember, however:
- There are no hard and fast rules in the model building process: each situation should be treated uniquely.
- Be an expert in your data! Understand what inputs are easy and cheap to collect, and what matters to stakeholders.
- Even models that can perform intrinsic feature selection will suffer if flooded with rubbish. Don't always fall back on random forests assuming that they will be robust to non-informative predictors, using this as an excuse to skip data exploration and preprocessing.
I find the steps I have outlined to be a useful starting point for model building, and they should be treated as a basic template, rather than a strict step-by-step guide.
TL;DR- a useful framework outlining the steps I use to build predictive models.