10 Feb 2015

Home-brewing nnePtR: a neural network classifier

Andrew Ng's Machine Learning course on Coursera is hands down the best MOOC I have done to date. I've been interested in machine learning for a while now, and I thought making my own R package to build neural networks looked like a pretty decent challenge to get in some practice.

The requirements were as follows:

  • The neural network will perform classification only at this stage (I may come back to it for regression).
  • It will be an S4 class, so that I can include a generic function for prediction.
  • The user may define the number of hidden layers, the number of weights (neurons) per hidden layer, and the penalty (weight-decay) term.
  • There must be some resiliance to optimisation issues.
  • The fitting process must be done with speeds comparable to other neural network packages available in R.
  • The name of the package must be a clever play on words.

To satisfy one requirement (possibly the most difficult), I'm going to call it nnePtR: the P stands for 'Phil' and the 'R' stands for R. Its a small homage to my favorite TV show (if you can work out what it is, you, like me, need to grow up).

As always, my code is here. You can download and install the package with devtools::install_github(philipmgoddard/nneptr), or just browse the code online.

Overview

I'm not going to go through detailed theory in this post. Pretty much everything I know about implementing neural networks I learned in two weeks from Andrew Ng's MOOC- that isn't me being smug and showing off, rather me saying if I can do it you can too!

But overall, the basic idea is that we need our input parameters (an (m x n) matrix, where m is the number of training instances, and n the number of features) for the input layer, and the outcome as the output layer. The output layer will be an (m x p) matrix, where p is the number of possible classes. Each row in the output layer will sum to 1, and each column represent the probability of the input being classified to the class p. For binary outcomes (e.g. yes/no, 0/1, ...) we only need one column in the output, and the information represents the probability of being the 'positive' class.

Between the input and output layer, there are hidden layers. In principle, you can include as many of these as you like, but including extra layers doesn't neccesarily imply better performance.

When we run samples through the network to make predictions, the information is propogated through the layers dictated by weights (the theta matrices, if you want to use the notation from Andrew's course). We start at the input layer (a0), and move to the first hidden layer (a1) by multiplying the inputs by the weights that connect the layers to calculate z1, and then performing the sigmoid transformation on z1 to obtain a1. This process is repeated for every layer until we get to the output layer, a(N), where N is the number of layers in our network.

If we don't have the fitted weights to hand, we obviously have to calculate them! This optimisation exercise is usually performed efficiently by providing an optimiser function the values of the parameters to optimise over, the function that we are optimising (which will return a cost associated with the current set of parameters), and the gradient of the parameters that we are optimising over. As we have a matrix of parameters connecting every layer, the usual trick is to 'unroll' all these matrices into a single vector which is passed to the optimiser. This is then 'rolled' back up into the matrices of parameters when we calculate the cost and gradients via forward propogation and backwards propogation.

I'm getting a little ahead of myself now! We have discussed the forward propogation- the cost given a set of parameters is determined using the same penalty function that is usually used for logistic regression, plus an extra penalty term for regularisation. How do we get the gradient of the parameters that we wish to optimise over, though? This is calculated during the backward propogation stage. The theory behind this is a little tricky, but thankfully the implemention is fairly straightforward once you know what you are doing.

Right- I am going to refer any further discussion of the theory behind my implementation to Andrew's course. My main intention here is to show how I built my package, not to teach the theory behind neural networks!

Approach

Right, time for some planning. As I am using S4, I will need at the very least a class definition, a constructor function, and a generic for making predictions. The constructor is where the magic will happen, and the parameters for the network fitted. Other than the methods package (which I need for S4 methods), I will be writing this using base R only.

What will go in the slots for the objects of class nnePtR? I always like to hold enough information to ensure the objects are as informative and as useful as possible. At the very least, we need a slot to hold the fitted parameters. The fitted_params slot is a list, which will hold a matrix of parameters in each element. I also want to store the number of units per hidden layer (which will be a constant for each hidden layer), the number of hidden layers between the input and output layer, and the regularisation parameter. Because you are never guarenteed to find the global minimum for the parameters with a neural network, we should retain some information about the optimisation; lets keep the final cost and some information from optim(). And finally, to be thorough, lets keep a copy of the training input data and outcome, as well as the possible categories of the outcome.

The class definition is set with the following code:

#
# define the slots for the S4 class
#
setClass(
  Class = nnePtR,
  slots = list(
    input = matrix,
    outcome = factor,
    levels = character,
    n_layers = numeric,
    n_units = numeric,
    penalty = numeric,
    cost = numeric,
    fitted_params = list,
    info = list)
)

The next step is to define the constructor function. How will we actually fit the parameters of the network? We will need to be able to define functions to perform forward and backward propogation to calculate the cost (a scalar) and the gradient (a list of matrices corresponding to the theta martices). We will use R's optim() function to perform the minimisation, using the limited memory version of the Broyden-Fletcher-Goldfarb-Shanno algorithm (don't panic- it's already coded up in optim(), ready to go!)

To keep this as speedy as possible (especially for large training sets) I will need to vectorise the crap out of this code. R, being a loosely typed interpreted language can have serious speed issues when performing demanding operations. However, vectorised operations make use of optimised and compiled (presumably) C code, so are nearly as efficient as using C (I guess there must be some overhead from R calling the code). I highly suggest that you have installed a decent version of BLAS and LAPACK- it turns out it makes a hell of a difference to the speed that R can perform vectorised operations (hmm... if it is BLAS or LAPACK being used, then I lie- they are written in Fortran, not C).

I should adopt some decent development practice as I move along. I will be using git (with a master and develop branch) for version control, and thanks to Hadley Wickham's fantastic devtools package, include some tests to make sure I don't break my code without realising as I add and modify features.

Initialisation

Well, here goes! Inside the constructor function, the first thing that gets called is the nnetTrainSetup() function. This is a little boring, so I won't present the code (have a look in functions.R). Essentially, it defines template lists of matrices for the matrices that I need to perform the forward and backwards propogation. This way we don't need to re-initialize the size of the matrices at every iteration. For my 'a' matrices, I include bias units where appropriate in the templates.

These templates are defined entirely from: the number of hidden layers, the number of units per hidden layer, the number of training instances, the number of features, and the number of possible outcomes. The function also dummies up the outcomes into a matrix, that is, it creates an (m x p) matrix with a 1 in the row if the training input was of class p, and a 0 otherwise.

The parameters that define the network (held in the theta matrices) are randomly intialised using a rule of thumb I found on stackOverflow. It seems to give acceptable performance, so I will stick with it for now! The 'a' template for the neurons is initialised with NA's, other than for the zeroth layer, which is initialised with the input.

With me so far?

Solving for the optimum parameter values

Right, so now can we just sling everything into the optim function? Not so fast! I have included some R trickery here.

To optimise, we will need to use the backwards propogation algorithm. This actually includes a forward propogation with the current set of parameters to calculate the z and a matrices at each layer, and then the backwards propogation to determine the gradient (which is needed to optimise efficiently). I decided to write the backpropogation as a closure; this allows me to pass the templates for the theta and a templates into the environment holding the function holding the actual backwards propogation, meaning that I dont have to pass them at every iteration of the optimisation. The backwards propogation function calls the forward propogation function to calculate the a matrices, the z matrices, and the cost, and then performs the backwards propogation to calculate the gradient.

The forward propogations is done with code like this:

#
# forward propogation function. Remember that 
# a and Thetas are lists of matrices!
#
forwardPropogate <- function(Thetas, a, nLayers) {
  # define size z
  z <- a[2:length(a)]
  #
  for(i in 1:nLayers) {
    z[[i]] <- a[[i]] %*% t(Thetas[[i]])
    a[[i + 1]][, 2:ncol(a[[i + 1]])] <- sigmoid_c(z[[i]])
  }
  #
  z[[nLayers + 1]] <- a[[nLayers + 1]] %*% t(Thetas[[nLayers + 1]])
  a[[nLayers + 2]] <- sigmoid_c(z[[nLayers + 1]])
  #
  return(list(a = a,
              z = z))
}

and the backward propogation uses the a and z matrices to compute the gradient like so:

#
# define number of hidden layers
# number of training instances m
#
nLayers <- length(a) - 2
m <- nrow(outcome)
#
# forward propogation
#
tmp <- forwardPropogate(Thetas, a, nLayers)
a <- tmp[[1]]
z <- tmp[[2]]
#
# define sizes of objects base on templates
#
delta <- z
gradient <- Thetas
#
# back propogation to determine errors
#
delta[[nLayers + 1]] <- a[[nLayers + 2]] - outcome
for (i in nLayers:1) {
  delta[[i]] <- (delta[[i + 1]] %*% Thetas[[i + 1]])[, 2:ncol(Thetas[[i + 1]])] *
    sigmoidGradient(z[[i]])
}
#
# gradient
#
for (i in 1:(nLayers + 1)) {
  gradient[[i]] <- (t(delta[[i]]) %*% a[[i]]) / m
  gradient[[i]][, 2:ncol(gradient[[i]])] <- gradient[[i]][, 2:ncol(gradient[[i]])] +
    (lambda / m) * Thetas[[i]][, 2:ncol(gradient[[i]]), ]
}

The above snippet isn't my exact backpropogation function, it contains the gist of it in a way that is easier to read. You can see the exact code on my github. As I mentioned, I defined a closure for the back propogation function, so that the template lists of matrices are held in the enclosing environment. So you can initialise the backward propogation function via:

#
# backProp is a closure, so we pass the templates
# to the enclosing environment. This way, optim()
# will not need to be passed the templates at 
# every iteration
#
bp <- backProp(Thetas = templates$thetas_temp,
               a = templates$a_temp,
               lambda = lambda,
               outcome = templates$outcome_Mat)

So now we can call bp() like any old function- if you look at my code on github, the only argument it requires is the vector made up of the unrolled theta matrices.

Just to complicate things further, I have included even more trickery. The R optimisation function optim() requires a minimum of three inputs to optimise; the parameter to optimise over, the function to optimise over (which we want to return the cost to gauge how good a job is done at each iteration), and the gradient of the parameters that we are optimising over. As mentioned above, we can't pass a list of matrices to optim(), which is why I unroll the list of matrices into a vector, and roll them back up inside the function. In the file functions.R, I define functions that can unroll and roll matrices to and from lists.

So, we have the parameter that we wish to optimise sorted into a form compatible with optim(). But what about the cost and the gradient? To calculate the cost, we just need a forward propogation step. However, to calculate the gradient, we need to forward propogate, and then backwards propogate. In other words, if we pass functions to calculate the cost and gradient to optim() seperately, the forward propogation step is going to be done twice per iteration of the optmiser!

What I have done to get around this is to define splitfunction(); this closure caches the cost and gradient when we call bp() the first time to obtain the cost. When optim() calls the function a second time to calculate the gradient, it sees that the unrolled thetas haven't changed, so it calls the cached value of the gradient rather than recalculating the entire thing. Neat, right?

Have a look in functions.R and class-definition.R if this seems a little odd to you... I won't lie, the concept took me a while to get my head around. You need to understand how environments work in R to really appreciate this. But none the less, we define f2() by splitting bp(), and when we call optim() we can pass f2$fn to fn, and f2$gr to gr. Our final call to optim() looks like this:

#
# user can specify method, max iterations
# and amount of output that is desired
#
optim(par = unrollThetas,
      fn = f2$fn,
      gr = f2$gr,
      method = optim_method,
      hessian = FALSE,
      control = list(maxit = iters,
                     trace = trace))

handling optimisation errors

Now, optimising the parameters for a neural network is not like linear regression. The cost function is not convex, and there is no guarentee that the global minimum will be found.

I initially defined a seed for the random initialisation of the theta parameters; this was to ensure the results were reproducible for my tests. However, there can be optimistion failures if we are unlucky. From experience, it seems that for some values of the penalty term (being too small) the optimisation can fail. However, other times it can just be because the random initialisation was unlucky, and the optimisation 'blew up'. If the optimisation fails, there is a functional operator failwith() (coutersy of Hadley Wickham's Advanced R book) which allows the exception to occur without 'breaking' the execution of the code (think a Python try-except clause). If the exception is encountered, the seed is changed, and the process tries again. However, if the optimisation fails 5 times in a row, the execution is aborted with a warning.

returning our object

Once the optimisation is complete, we can rejoice! The constructor passes the fitted values and various other input parameters to the slots in an object of class nnePtR, which is then returned.

I included a print and summary generic, so you can extract some key information about the neural networks once you have fitted them. As an example, the summary generic looks like this:

#
# set the summary generic for objects of
# class nnePtR
#
setMethod(
  f = summary,
  signature = nnePtR,
  function(object) {
    cat(\nObject of class nnePtR)
    cat(\n\nNumber of hidden layers:)
    cat(object@n_layers)
    cat(\nNumber of units per hidden layer:)
    cat(object@n_units)
    cat(\nPenalty term:)
    cat(object@penalty)
    cat(\nNumber of training instances:)
    cat(nrow(object@input))
    cat(\nNumber of input features:)
    cat(ncol(object@input))
    cat(\nOutcome classes:)
    cat(object@levels)
    cat(\nNumber of outcome classes:)
    cat(length(object@levels))
    cat(\n\nFinal cost:)
    cat(object@cost)
    cat(\nOptimisation method:)
    cat(object@info[[1]])
    cat(\nMax iterations:)
    cat(object@info[[2]])
    cat(\nConvergence code:)
    cat(object@info[[3]])
  }
)

Essentially, it just pulls values from the slots in the object and gives them some pretty formatting.

Making predictions

I suppose this is the next logical step- whats the point of a predictive model if you cannot make predictions with it? The predict() generic needs defining. This function takes the fitted parameters (the list of matrices holding the fitted parameters) and performs a forward propogation to obtain the class predictions. Essentially, it is a wrapper around the forwardProp() function, which then just returns nicely formatted output. Have a look in generics.R for the definition of this function. The user can specify whether they want the predicted class to be returned, or the class probabilities. The below snippet fits a model on the iris data set, and then calculates the training set class probabilities.

#
# fit a neural network with the nnePtR constructor function
#
model <- nnetBuild(iris[, 1:4], iris[, 5],
                   nLayers = 2, nUnits = 10, lambda = 0.03)
#
# calculate probabilities
#
predict(model, newdata = iris[, 1:4], type = prob)

The proof is in the pudding: Kaggle digits competition

Well lets cut to the chase: how good is this model? I found the perfect test- the Kaggle digits competition. A training set of 42,000 hand written digits, with a test set of 28,000. Each diget is a 28 x 28 pixel image. Therefore, we have 784 features per input. Each feature gets a value in the range 0-255. Our aim is to classify each hand written digit into the range 0-9.

We will do some common sense preprocessing initially- I removed zero variance predictors (i.e. those in the outer edges of the canvas which never get written on), and near zero variance predictors (i.e. those that have the same value in more than 95% of the training inputs). This leaves us with n = 252 features (I suppose I could have investigated dimension reduction with PCA instead). Also, I normalised the inputs by centering and scaling. All of these steps can be done easily using functions from the caret package.

I performed a split for cross validation, placing 20% of the training into a cross validation set to verify what were good choices of training parameters. I did this by hand rather than using caret- I will investigate soon how to integrate custom models into the caret framework (I think I have done enough work for this post!).

The parameter choices that I settled on were 2 hidden layers between the input and output, with 200 units per layer. I chose the penalty term, lambda, to be 0.03.

We should also remember that the cost function for a neural network is not convex- it is fairly likely that the model will not find the global minimum of the cost function when fitting parameters. Due to this, I will adopt an 'averaging' approach; I will run the model several times with different seeds for the initialisation, and take the majority prediction for each digit.

I chose to run the model 21 times. It took about 4-5 minutes to perform 100 iterations with optim() for each run. For each set of fitted parameters, I made predictions on the test set, and then took the majority 'vote' between the 21 sets of predictions. When I submitted to Kaggle, I got a final result of 0.96814 prediction accuracy; at the time of submission I ranked 372 out of 876. Not bad, considering I hand-carved this model! To be honest, if I was looking to get to the top of the leader board I would have pulled a few very complex, well established model libraries off the shelf- but that wasnt the point of this; it was to write my own neural network package, and to validate that it performs well in a classification problem (and as an aside, the best academic score for the data set is in the region of 0.9977... there are some fishy scores floating around at the top of the leaderboard!).

Other thoughts

So there we have it. A home brewed S4 neural network package for R. It works well, it isn't too slow, and stands up to the test in a Kaggle competition.

My next challenge is to integrate this function with the caret package; the authors allow the flexibility to incorparate custom functions into the caret framework. Doing this will allow me access to the fantastic caret::train() function, which will take care of my resampling, and give me all the benefits that objects of the class train possess. I will come back to this in my next post, as I think I have written enough for the time being!!

TL;DR- I wrote a neural network package in R. Have a look here!