13 Jan 2016

Packages, Linear Models, and S4

I suppose this blog is going to touch on a number of things; how to implement linear regression, how to build S4 classes, and how to implement them in packages. So in terms of my categorising system, I'm not sure if this counts as an 'R Snippet' or 'Modelling' post... it straddles both.

I actually wrote this before I started Andrew Ng's Machine Learning course, so my sources are from various statistics lecture notes, forums and textbooks. My aim was to build a class that replicates the behaviour of stats::lm() (an S3 class), including various generic functions as well (e.g. plot()). Of course, my implementation isnt a 100% clone, but it is enough to learn a lot about how R fits linear regression models. I'll be assuming you have basic familiarity with R packages and the S4 system, or this post will get a little long...

As always, check out the package on my github; feel free to use, abuse, criticise, copy, or ignore entirely!

Overview

The theory is fairly standard for linear regression, so I wont dwell on it for too long, and suggest you read up on it yourself. In terms of my implementation, I will be using the 'Normal' or 'beta-hat' equation to determine the coefficients.

My strategy with R packages containing S4 classes is to have three (or more) files in the /R directory: one for the class definition, initialiser, and constructor, one for generics (e.g. plot(), summary() etc), and one for accessors (e.g. getters/ setters). If all of this stuff is new to you, I recommend reading this: some bits are a bit outdated, but overall it gives a pretty robust introduction to the S4 system.

So here goes!

metadata

The NAMESPACE file is my first port of call with any R package. To fill in for a package using S4 classes, the most important difference to usual practice that you have to remember is to use the Collate keyword to ensure that seperate files containing methods and class definitions are collated together. Have a look at my NAMESPACE file on github, and see Hadley Wickham's R Packages if you want more insight than my example provides.

class definition

In class-definition.R, I firstly define my class. S4 classes are quite formal compared to S3, and you need to define all the slots. We use the methods::setClass() function to define the class:

#
# Define the LinRegP class. Remember your
# Roxygen headers!
#
setClass(
  Class = LinRegP,
  slots = list(
    model = character,
    intercept = logical,
    outcome = numeric,
    covariates = data.frame,
    modelCoef = matrix,
    fittedVals = numeric,
    residuals = numeric,
    summaryStats = list,
    leverage = numeric)
)

Here, I have defined the class name, LinRegP (the P stands for Phil, as I lack imagination), and all the slots I intend my class to have. This list needs to define the class of each of the slots. Have a look at the Roxygen headers (see my github- I've ommited them here for brevety) or the package documentation for a description of each of the slots.

Next, I have defined an initialize method. I have pretty much only used it here to define the column names for the matrix representation of the model coefficients: in different circumstances you may have a more complex initialiser.

#
# initialize method. when debugging, it can be useful
# to make the methods write out as they run
#
setMethod(
  f = initialize,
  signature = LinRegP,
  definition = function(.Object, model, intercept, outcome, covariates,
                        modelCoef, fittedVals, residuals, summaryStats,
                        leverage) {
    # cat(--- LinRegP: initialiser --- \n)
    colnames(modelCoef) <- c(betaHat, stdErr, t, Pr(>|t|))
    ifelse(intercept,
           rownames(modelCoef) <- c(Intercept, names(covariates)),
           rownames(modelCoef) <- names(covariates))
    #
    # alternatively select slots with @
    #
    slot(.Object, model) <- model
    slot(.Object, intercept) <- intercept
    slot(.Object, outcome) <- outcome
    slot(Object, covariates) <- covariates
    slot(.Object, modelCoef) <- modelCoef
    slot(.Object, fittedVals) <- fittedVals
    slot(.Object, residuals) <- residuals
    slot(.Object, summaryStats) <- summaryStats
    slot(.Object, leverage) <- leverage
    return(.Object)
  }
)

We use methods::setMethod() to override the initialize generic, tell the function that it is associated to objects of class LinRegP, and then define the function. Remember that we use @ to access slots.

The final, and critically important part here is the constructor. I use this to essentially do all my calculations (i.e. fit the model), and then create a new object of class LinRegP and populate the slots. As I am using the Normal equation, I need a means to invert a matrix, so remember the Roxygen header to import MASS, so that I can access MASS::ginv(). My constructor function looks like this:

#
# Constructor function. It requires MASS, so 
# include the Roxygen header
#
lmp <- function(outcome, covariates, int = TRUE, df) {
  y <- df[, names(df) == outcome]
  x <- df[, names(df) %in% covariates, drop = FALSE]
  if (int) x <- cbind(1, x)
  nVar <- ncol(x)
  X <- data.matrix(x)
  #
  # normal or 'betahat' equation
  #
  XT <- t(X)
  XTX <- XT %*% X
  XTXi <- MASS::ginv(XTX)
  tmp <- XTXi %*% XT
  betaHat <- tmp %*% y
  #
  # fitted values
  #
  yhat <- rowSums(simplify2array(Map(function(w, z) w * z,
                                     as.list(x), as.list(betaHat))))
  #
  # statistics and values for diagnosing quality of model
  #
  resids <-  y - yhat
  s2 <- sum((resids) ^ 2) / (length(y) - nVar )
  stdErr <- sqrt(diag(s2 * XTXi))
  tstats <- betaHat / stdErr
  pvals <- pt(abs(tstats), nrow(X) - nVar, lower.tail = FALSE) * 2
  rsq <- sum((yhat - mean(y)) ^ 2) / sum((y - mean(y)) ^ 2)
  rsqMod <- 1 - (1 - rsq) * ((nrow(X) - 1) / (nrow(X) - ncol(X)))
  rmse <- sqrt(mean((resids) ^ 2))
  #
  # 'Hat' matrix for calculating leverage
  #
  H <- X %*% tmp
  lever <- diag(H)
  #
  # return a new instance of class LinRegP
  # filling slots with the calculated values
  #
  return(new(Class = LinRegP,
             model = (paste0(paste0(outcome, ~),
                             paste(covariates, collapse = +))),
             intercept = int,
             outcome = y,
             covariates = df[, names(df) %in% covariates, drop = FALSE],
             modelCoef = cbind(betaHat, stdErr, tstats, pvals),
             fittedVals = yhat,
             residuals = resids,
             summaryStats = list(Rsquared = rsq,
                                 ModRsquared = rsqMod,
                                 RMSE = rmse),
             leverage = lever))
}

I wont go into the ins and outs of calculating the regression coefficients for the sake of brevety- I suggest reading any standard text, and hopefully then nothing I have done will seem extraordinary. The approach I take with my constructor function approach is slightly different to the S3 stats::lm() approach. For starters, it doesnt take an object of class 'formula' as input. For my function, the input is a data frame containing the predictive factors and outcome, and the formals include outcome and covariates, which are character vectors containing the column names for the outcome (y) and predictive factors (x). I include a logical flag, int, to specify if an intercept term is desired.

The key message to take home is that a constructer serves to calculate slot values, populate the slots in a new instance of the class, and returns that object. You don't necessarily need them, although they do make your life a lot easier.

Thats all I will include for now in terms of class initialisation- but if you are being thorough there are ways to do error checking to ensure not any old rubbish can get initialised. See here for some more details.

generics

Next, its time to consider some generic methods. I'm going to define a show(), a summary(), a plot(), and a predict() generic in generics.R.

The first step is to include the Roxygen header:

#' @include class-definition.R
NULL

at the top of the file- this is so that the LinRegP class is defined for these methods when the package is built.

A similar pattern emerges for all of these: we use methods::setMethod(), specifiy the generic we wish to override and the signiture of the class the method corresponds to, and the define the function that the method should apply.

I define show() and summary() so that they (mostly) replicate the behavior of the generics in the S3 stats::lm() function. My plot() generic differs slightly from plot.lm(); as I was just doing this out of interest, I did not dwell to much on replicating the plots and behaviour exactly. The predict() generic behaves identically to predict.lm(), so you can use it to make your predictions as you would normally. As an example, here is the predict generic:

#
# Set the predict generic method for class LinRegPG
#
setMethod(
  f = predict,
  signature = LinRegP,
  definition = function(object, newdata,  ...) {
    if(object@intercept) newdata <- cbind(1, newdata)
    rowSums(simplify2array(Map(function(x, y) x * y,
                               as.list(newdata),
                               as.list(object@modelCoef[, 1]))))
  }
)

The generic works in a very straightforward manner: it checks if an intercept is to be included, and then simply multiplies the predictors by the calculated coefficients, and sums the result to return the fitted values.

accessors

Now, S4 classes behave differently to S3 in many ways, and one noticeable one is that you cannot use $ to work your way through the slots in the class. It is bad practice to use @ to access slots directly, so it is usual to define accessors.

As this example is rather contrived (I would never use this over stats::lm()- this was, like many things I do, an intellectual excercise), I have included an accessor to get the fit coefficients from an object of class LinRegP (a getter).

The code is in accessors.R, and as before, we start with:

#' @include class-definition.R
NULL

The difference to the generic methods in generics.R is that I need to define new generics, and then describe their behaviour, rather than overwriting an existing generic. This code defines a new generic and the method:

#
# set the generic
#
setGeneric(getCoef,
           function(object){
             standardGeneric(getCoef)
           })
#
# define the method
#
setMethod(getCoef,
          signature = LinRegP,
          function(object){
            out <- object@modelCoef[, 1]
            return(out)
          })

Fairly straightforward stuff!

Example

Finally a test. I get identical coefficients, t-statistics, p-values etc. when running:

#
# test on mtcars data set
#
data(mtcars)
#
# fit a linear model with stats::lm()
#
model1 <- lm(mpg ~ wt + qsec + hp, data = mtcars)
summary(model1)
model1$coefficients
#
# load lmPG package
#
library(lmPG)
#
# fit a linear model with lmp()
#
model2 <- lmp(outcome = mpg,
              covariates = c(wt, qsec, hp),
              int = TRUE,
              df = mtcars)
summary(model2)
getCoef(model2)

I will consider that a success! I will leave you to enjoy plotting and predicting.

Other thoughts

Well there we have it. An illustrative example of both how to peform linear regression, and to use the S4 system. I hope you found this useful! I wrote this initially to make sure I could actually implement linear regression, and I found it a really useful excercise. I really recommend checking out Andrew Ng's Machine Learning course if you are interested in this kind of thing!

One comment- S4 objects don't seem to like living in the Global Environment. Unless they are defined in a package, I seem to get really annoying error messages when using RStudio's autocomplete when tabbing in the interactive console. Defining my classes in packages entirely obviates that issue for me- just in case you ever experience the same problem and want an option for a fix.

TL;DR- A home-brewed S4 class for linear regression. The code is here