21 Jan 2015

Churning with Caret: Non-linear Models

To briefly put this in context, I have been working through some exercises in Max Kuhn and Kjell Johnson's Applied Predictive Modeling book. In a previous post, I looked at applying a set of linear models to a customer churn data set. The predictive inputs displayed several non-linear traits, so we can be fairly confident that more complex, non-linear models will be able to provide better results.

As per usual, I've put my script on my github for the interested reader. Let's get stuck in!

Overview

The basic strategy we will take is to:

  • revisit our predictive inputs

  • highlight where non-linear models may give us advantages for this data set

  • fit and evaluate some models

As we saw last time, the churn data set comes to us in a nice clean format, so thankfully no data munging or cleansing is needed. And, as we will soon see, for non-linear models, we will not have to perform feature engineering like we did for the linear model case (when we calculated interaction (polynomial) terms).

Churn data set revisited

Last time, we deduced that the charges and minutes were perfectly correlated with one another, so removing the charges is not going to be particularly controversial. We also saw that, because only a small fraction of customers comes from each state, the state predictors, once converted to dummy variables, had near zero variance. Once again, we will create a full and a reduced set of inputs so we can feed them appropriately to models which can and cannot accommodate non-informative predictors.

We also saw that the day minutes column displayed a degenerate distribution for churners. This is where we should be rubbing our palms together with glee- non-linear models will be able to lap up features like this, unlike their linear friends, who may struggle. Further, for the count data, we saw a degenerate distribution for number_vmail_messages: non-linear models should be able to accommodate inputs like this.

So what are we building our model on? Just to remind us, I have some quick visualisations of the continuous predictors we will be using:

pred-cont

The discrete numeric predictors:

pred-count

And the categorical:

pred-cat

Preprocessing

One advantage with non-linear models is that the preprocessing can (sometimes) be a little simpler. For example, we do not have to worry about creating interaction terms to accommodate non-linear decision boundaries (also known as feature engineering). Also, I'm not particularly worried about predictors with skewed distributions.

There are some bits of preprocessing we will do- but thankfully caret::train() can take care of this for us when we actually fit the models. Namely, we will center and scale the predictive inputs for all the models. For neural networks, we will additionally perform a spatial sign transformation on all the inputs. From experience (and examples in the APM book), this transformation can help to improve the quality of the model.

Filtering

This process is similar to the linear case. Once again, I will be doing unsupervised feature selection- that is, I won't be considering the relationship between the response and the predictors as I filter. I will build up a full and reduced set of predictors: the full set for models that can perform feature selection internally, and the reduced set for those that can be a little more sensitive.

For the reduced set, I will remove predictive factors with near zero variance (all the states in this case), and where pairwise correlations are greater than 0.9 I will remove one of the predictors. For the full set, I will only remove predictors with literally zero variance, and those that are fully correlated with one another (we already removed the charges, remember).

Have a look at my post on linear models for the code snippet to perform this filtering.
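For convenience, here is a minimal sketch of that filtering. It assumes trainInput is a data frame of numeric, dummy-encoded predictors, and takes 0.99 as the cut-off for "fully correlated"- both of those details are assumptions rather than my exact script:

library(caret)
#
# reduced set: drop near-zero-variance predictors (the state dummies),
# then drop one of each pair of predictors correlated above 0.9
#
nzv <- nearZeroVar(trainInput)
noNzv <- if (length(nzv) > 0) trainInput[, -nzv] else trainInput
tooHigh <- findCorrelation(cor(noNzv), cutoff = 0.9)
reducedSet <- setdiff(names(noNzv), names(noNzv)[tooHigh])
#
# full set: only drop predictors with literally zero variance, and one
# of any pair that is (almost) perfectly correlated
#
zeroVar <- nearZeroVar(trainInput, saveMetrics = TRUE)$zeroVar
noZv <- trainInput[, !zeroVar, drop = FALSE]
perfect <- findCorrelation(cor(noZv), cutoff = 0.99)
fullSet <- setdiff(names(noZv), names(noZv)[perfect])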

Models

Max and Kjell's book explains the models I'm going to be using and the training process better than I ever could- either check it out or the caret website.

As we discussed last time, the caret::train() function provides a consistent interface to masses of regression and classification algorithms, and amongst (many) other things it takes care of resampling and preprocessing, the feeding in of tuning parameters, and defining the summary statistic you wish to use to select your model.
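One object that appears in all my train() calls but is not reproduced below is ctrl. A plausible definition looks something like the following- the choice of 10-fold cross-validation is an assumption, not necessarily my exact resampling scheme:

library(caret)
#
# resampling scheme and summary statistics shared by all the models.
# classProbs = TRUE lets us extract class probabilities later for the
# calibration and lift curves; twoClassSummary reports ROC, Sens and Spec
#
ctrl <- trainControl(method = "cv",
                     number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = TRUE)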

The models I will use are: quadratic discriminant analysis, flexible discriminant analysis with MARS hinge functions, a neural network, support vector machines (with radial and polynomial kernels), and a k-nearest-neighbors model. As you may have gathered last time, the interface caret provides makes it almost trivial to fit these models with no knowledge of how they work. This isn't an ideal strategy: prior knowledge and understanding of the models will give you a better feel for their strengths and weaknesses, and for the situations where they may or may not be appropriate. For example, in reality I never would have even tried applying linear models to the churn set, as the classification boundaries are clearly non-linear (which we deduced from our exploratory data analysis, where we saw degenerate distributions in some of the important features).

I highly recommend anyone interested in going beyond memorising the caret::train() syntax to check out a proper machine learning course- Andrew Ng's Machine Learning course on Coursera is a great one to start with.

Anyway, as an example, here is the syntax I used to train my neural network model:

#
# set up a grid of training parameters:
# train with up to a maximum of 10 hidden units
# and various weight decay parameters
#
# choose number of weights- I have seen
# various suggestions of how many should be used,
# this is what APM suggests
#
nnetGrid <- expand.grid(size = 1:10,
                        decay = c(0.01, 0.03, 0.1, 0.3, 1))
maxSize <- max(nnetGrid$size)
numWts <- (maxSize * (length(reducedSet) + 1) + maxSize + 1)
#
# set seed and tune. Use the reduced set, and perform a spatial
# sign transformation on the predictors. trace, maxit and MaxNWts
# are parameters specific to nnet
#
set.seed(476)
nnetTune <- train(x = trainInput[, reducedSet],
                  y = trainOutcome,
                  method = "nnet",
                  metric = "Sens",
                  preProc = c("center", "scale", "spatialSign"),
                  tuneGrid = nnetGrid,
                  trace = FALSE,
                  maxit = 1000,
                  MaxNWts = numWts,
                  trControl = ctrl)

For neural networks, we are training over the number of hidden units and the weight decay (the regularisation parameter). We can use the plot generic, as nnetTune is of class train, to visualise how our choices of tuning parameters fared:
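For reference, since nnetTune is a train object, that plot is simply:

#
# lattice plot of resampled performance across the tuning grid
#
plot(nnetTune)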

nnetTune

After fitting all the models, we see the improvement compared to the linear models is staggering: the two best non-linear models achieve a sensitivity of 0.71 and specificity of 0.99 (nnet) and a sensitivity of 0.75 and a specificity of 0.99 (fda) on the test set. Note that neural networks are not entirely stable, so your results may differ slightly if you run the code with a different seed (you are not guaranteed to find the optimal weights when fitting a neural network). This compares to a sensitivity of 0.46 and a specificity of 0.97 for the best linear model, which was a penalised LDA model.
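For anyone following along, those test-set numbers can be produced with something along these lines- testInput and testOutcome are assumed names for the hold-out predictors and response, and "yes" is assumed to be the churn level:

#
# predict classes on the hold-out set, then tabulate sensitivity
# and specificity with caret::confusionMatrix
#
nnetPred <- predict(nnetTune, newdata = testInput[, reducedSet])
confusionMatrix(data = nnetPred,
                reference = testOutcome,
                positive = "yes")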

Variable Importance

The way neural networks and FDA interpret variable importance differs- see the documentation for caret::varImp() for a detailed discussion. It is interesting to compare the two models, as we relied on FDA to perform feature selection for us. For example, when filtering the inputs for the neural network we assumed that the state carried little information, so we would expect the state of the customer to be relatively unimportant in the FDA model.

We can compare the top 15 predictors of the two models using caret::varImp(). This is really useful for 'model dissection'. The predictors are rated on a relative scale here, so that the most important has a score of 100, and the least important 0.
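The plots below come from the plot method on the varImp objects, along these lines:

#
# ranked predictor importance for each tuned model
#
plot(varImp(nnetTune))            # the nnet only has 14 inputs after filtering
plot(varImp(fdaTune), top = 15)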

Firstly for the neural net (which in fact had only 14 predictors):

nnetvarimp

And for FDA (which had the full set of 65 predictors as inputs, but only seems to have used 11):

fda_varimp_philip_goddard

It is interesting to see how the different models rank predictive inputs differently: you cannot assume that the most important input for one model will also be the most important for another.

The FDA model confirmed our suspicions that the state gave mostly redundant information. Other than three states, NY, NJ and ME, the state gave no predictive input. Putting these inputs into the neural network would likely swamp the model and degrade its performance. However, we cannot guarantee this; for example, we see that the area codes are not used by the FDA model, whereas at least one is by the neural net. For all we know, adding in information from some states could improve the nnet, or could degrade it! Variable importance should be considered on a case by case basis for different models, and this discussion leads towards more robust methods of feature selection, which I will save for another time!

Model evaluation and comparisons

Well there we have it; we have fitted a selection of non-linear models to the same churn data set, as we did before for linear models. Time to do some simple evaluations of our models. As per last time, we can look at the distribution of the resamples:

resampplot

It's trivial to perform a paired t-test that confirms that FDA is our model of choice, based upon the distribution of the resampled sensitivities during training.

#
# put objects of class train in a list
#
models <- list(qda = qdaTune,
               nnet = nnetTune,
               fda = fdaTune,
               svmR = svmRTune,
               svmP = svmPTune,
               knn = knnTune)
#
# use caret::resamples and plot
#
resamp <- resamples(models)
bwplot(resamp)
splom(resamp)
#
# t-test between our top two models:
# reject the null - fda is better than nnet 
# for sensitivity
#
t.test(resamp$values$`fda~Sens`,
       resamp$values$`nnet~Sens`,
       paired = TRUE)

We see that other than the k-nn model (which is so simplistic I am always surprised when it gives reasonable results), all the non-linear models do a fairly good job, and the more powerful ones far outperform the linear models we saw last time.

I also always like to look at the calibration curves: these give us confidence that the predicted probabilities correspond to real probabilities. That is, do 90% of the samples for which we predict p(churn) to be in the region of 0.9 actually churn?
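These curves can be produced with caret::calibration() on a data frame of observed classes and predicted churn probabilities for the test set. The sketch below assumes "yes" is the churn level and that the radial SVM was fed the reduced predictor set:

#
# collect observed classes and predicted churn probabilities
#
testResults <- data.frame(
  churn = testOutcome,
  fda   = predict(fdaTune,  testInput[, fullSet],    type = "prob")[, "yes"],
  nnet  = predict(nnetTune, testInput[, reducedSet], type = "prob")[, "yes"],
  svmR  = predict(svmRTune, testInput[, reducedSet], type = "prob")[, "yes"])
#
# calibration curves: predicted probability bins vs observed event rate
#
calCurve <- calibration(churn ~ fda + nnet + svmR, data = testResults, class = "yes")
xyplot(calCurve, auto.key = list(columns = 3))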

calplot

We see from the above calibration curve that even though the FDA model has the best predictive power using a cutoff of p(churn) = 0.5, it is in fact not as well calibrated as the neural net or SVM models. This kind of observation should factor into which model you choose; if well-calibrated probabilities matter, nnet would be preferable, unless you wish to apply techniques to recalibrate the predictions from the FDA model.

As per last time when we looked at linear models, we will wrap up by looking at the lift curves. On the plot below, I have included the two best non-linear models we considered, and the best linear model from last time. If you don't know what a lift curve shows, look at my post on linear models, or Max and Kjell's APM book.
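The lift curves use caret::lift() in much the same way; plda below is an assumed column holding the penalised LDA churn probabilities carried over from the last post:

#
# lift curves for the two best non-linear models and the best linear model
#
testResults$plda <- pldaProbs          # assumed: penalised LDA probabilities
liftCurve <- lift(churn ~ fda + nnet + plda, data = testResults, class = "yes")
xyplot(liftCurve, auto.key = list(columns = 3))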

liftplot_philip_goddard

It is really interesting to see that the non-linear models absolutely spank the penalised LDA model up to around the 70% mark, where the most confident predictions of churning customers are in fact all correct! Beyond 80% of samples found, however, the ability of the non-linear models to find churning customers degrades, and in fact if you wanted to choose a model to find 90% of churning customers based on the predicted churn propensity, you would be better off selecting the linear model!

Summary

Well, let's wrap everything up now. To answer the same question as last time: if we wanted to reach 80% of the churning customers (assuming a distribution similar to the test set), we would need to reach, with our best model, 15.4% of the total customers. In our test set, this corresponds to 257 customers, 179 of whom would in fact be churners. That is a lot better than we achieved with linear models. This model could allow some powerful business decisions to be made as to which customers are at risk of churning, so that they can be reached in an effort to retain them.

TL;DR- I did some churn modelling, using non-linear models. If you want my script of how I did it, it is here