10 OCT 2015

The Magic of Rcpp

Since having read Hadley Wickham's Advanced R and R Packages, I had been meaning to start playing around with integrating some C++ into R code as a useful package. Attending a seminar at EARL 2015 with Dirk Eddelbuettal (the author of the Rcpp package) impressed me enough to get started, both to get practice with package development and to brush up on my (exteremely rusty) C++ programming.

Overview

The C++ functions I am going to create are a couple of exercises from Advanced R- I'm not going to blow anyones imagination with some crazy, out-there functions. They are all implementations of base R functions (which are already written in C, so really fast already). This was a warm up excercise for me, both for package development and C++ integration with R.

In this post, I have used different syntax highlighting for C++ code, to try and make it obvious what is what. As usual, the code is on my github, should you want to pick it apart.

Setup

I fully, highly, and strongly suggest that you use RStudio for developing your R packages. It's such a nice IDE, and with the devtools package and git integration, it makes package development about as easy as it could possibly be.

To get started, hit Project (top right corner), New Project, New Directory, R Package, and select package w/RCpp. Select the directory for your project, and hit Create Project. Optionally you could have ticked the create git repository box, but I prefer to do this directly from the command line. I believe the option in RStudio will create a .gitignore file for you, remember to make your own if you do it from the command line (if you are doing it from the terminal I will assume you know what a gitignore file is, so wont elaborate).

Right, step one complete! We now have an empty project. You will notice that we have an extra /src directory (compared to a normal R package)- this is where C++ code lives.

Now I always start at the DESCRIPTION file, which contains the package metadata. I suggest you consult Hadley's R Packages book for an in-depth discussion of how to do this- they are fairly straightforward.

Now I never even touch NAMESPACE, as the Roxygen headers (once again, see Hadley's book if you dont know what these are) that I use as I go along magically take care of all of this. Make sure you select the option to generate documentation with Roxygen headers. As I'm only going to include C++ functions, I wont be putting anything in /R - so I can delete hello.R. Note that RcppExports.R will be automatically generated as we continue, so don't delete the directory though.

As an aside, you can also follow Hadley's advice and use the clang++ compiler. The GNU g++ compiler is absolutely fine though- clang++ just gives slightly more informative error messages.

Right, lets get started with some C++. In the /src directory lives C++ code. There is a file RcppExports.cpp that will be automatically generated. I've made a file called firstFunctions.cpp to put my stuff in.

The first thing that is needed is a header at the top, and to declare the namespace:

//
// include header file and namespace
//
#include <Rcpp.h>
using namespace Rcpp;

Easy so far! Now, in C++ comments look like this: // (not this: #). So the Roxygen headers that look like this in R #' now look like this //'

The Roxygen headers follow identical syntax to normal R functions (e.g. @description, @param etc. etc.) Just remember immediately before your functions to include the line:

// [[Rcpp::export]]

Now, this is where people like me who learned to code in a compiled language get to feel smug. All you people used to dynamic languages are in for a shock: you have to declare types implicitly now. C++ is a static language, so needs compiling before runtime (not at runtime, like R). This applies to input and output types, as well as any variable used in your function. I would refer to Hadley's chapter in Advanced R for a run through. Explicit type declaration may seem like a huge pain, but the speed up you get from compiled code is definitely worth it.

I wrote a bunch of functions, but lets look at varRcpp() as an example. Before we get stuck in, just a quick aside- I've avoided standard libraries where possible, or iterators here to keep things as clear as possible. At the expense of efficiency comes transparency.

A basic strategy to calculate the variance requires two passes of the input vector; firstly to calculate the mean, and then a second pass to calculate the variance. Heres my code:

//' varRcpp
//' @description c++ version of base var(). Uses a 'two-pass' approach
//' @param x a numeric vector
// [[Rcpp::export]]
double varRcpp(NumericVector x) {
  int n = x.size();
  double tmp = 0.0;
  double mean;
  //
  // do with two passes:
  // first calculate the mean
  //
  for(int i = 0; i < n; i++) {
    tmp += x[i];
  }
  mean = tmp / n;
  tmp = 0.0;
  //
  // now calculate the variance
  //
  for(int i = 0; i < n; i++){
    tmp += pow((x[i] - mean), 2);
  }
  return (tmp / (n - 1));
}

First, we declare the output to be of type double. We then declare the function name and it's formals: in this case a NumericVector called x (as a warning, I think NumericVector isnt a standard C++ type- I'm pretty sure it is syntactic sugar for Rcpp). We then declare variables, all with their type! We declare n, the number of elements in x to be an integer, and assign it by using the size() method. We also declare a tempory double, and a double to hold the mean.

The syntax for looping is different from R- you just have to get used to it. There are many online references for C++ programming, all written by people much more informed than me, so I wont try and lecture you at this time. Just one key thing to remember- C++ arrays are 0 based! I am fully aware that this code isnt vectorised (i.e. the looping could be done in parallel- standard libraries would allow for this). If I were trying to deploy a super efficient implementation, I would certainly rewrite it so that it is.

Now my function does not accept NA's in the input. I am in two minds about creating high peformance functions that accept NA's- the whole point is that they are supposed to be high peformance. If they are checking constantly for NA's, that will rack up a lot of extra operations (especially if the input is large), and thus peformance will suffer. I am more inclined to write high peformance functions that make assumptions about the input- that is, that there are no NA's. Have a look at the rangeRcpp(), minRcpp() and maxRcpp() functions I wrote that can deal with NA's- they are convoluted, and however you treat the NA's, it costs you in peformance (and needlessly so, if you know the input has no NA's).

I think a better strategy is to write seperate functions that check for NA's, and process accordingly. Then, if you are worried that NA's exist, run the input through these, and then and only then plug the input through the function. That way, the focus of the high peformance functions can be on the task at hand, and not dealing with NA's. However, this is just my opinion, so don't quote it as gospal.

Package compiliation

Well, once you have the code ready, its time to build your package. On my linux machine, hitting Ctrl+Shift+B and devtools will magically build the package, compile the documentation from your Roxygen headers, and load it into the current namespace (i.e. you can use your functions!). Hit Ctrl+Shift+E to run R CMD check- this will run various checks, and give you (sometimes) helpful error messages. Its quite anal about documentation, so you may have to tidy that up. But once that passes, there you go! An R package with some high peformance C++ code. Pretty straightforward, right?

Tests

This is a key part of software development which I have been so lax with in the past. During my PhD, I was never introduced to the concept, and when I stumbled across bugs I would manually and painstakingly have to try and work out what I did. Tests are great- modeify a function, run your tests, if they pass carry on, if not, work out what you broke!

devtools::testthat() is here to the rescue for testing. They are so easy to write. The basic idea is: put the answer to a known problem that your function should produce, and just check if your function produces that answer. In fact, in software development this is a common strategy to building code. It's much easier to design code to pass a series of tests, and then you have hit the key criteria of what you are aiming for, and more importantly, to define that your code is good enough.

To add tests, run:

# 
# run this in the R console
# while working on your project
#
devtools::use_testthat()

This creates the tests directory and files for you to add your tests to. The below code has a series of tests for the maxRcpp() and minRcpp function, verifying they produce the same output as the base R function. To run, hit Ctrl+Shit+T. I have been really lazy in not building tests for all my functions, but hey, this is just practice (I will still be using base R functions over mine!). The below code is the contents of /tests/testthat/test.R

library(cppStarter)
context(Simple tests)
#
# Some basic tests for maxRcpp
#
test_that(maxRcpp provides same results as base R max, {
  expect_equal(maxRcpp(c(1, 4, 6, 2, 5)),
               max(c(1, 4, 6, 2, 5)))
  expect_equal(maxRcpp(c(1, 4, 6, 2, 5), na_rm = TRUE),
               max(c(1, 4, 6, 2, 5), na.rm = TRUE))
  expect_equal(maxRcpp(c(1, 4, 6, 2, 5, NA)),
               max(c(1, 4, 6, 2, 5, NA)))
  expect_equal(maxRcpp(c(1, 4, 6, 2, 5, NA), na_rm = TRUE),
               max(c(1, 4, 6, 2, 5, NA), na.rm = TRUE))
  expect_equal(maxRcpp(c(NA, 1, 4, 6, 2, 5, NA)),
               max(c(NA, 1, 4, 6, 2, 5, NA)))
  expect_equal(maxRcpp(c(NA, 1, 4, 6, 2, 5, NA), na_rm = TRUE),
               max(c(NA, 1, 4, 6, 2, 5, NA), na.rm = TRUE))
  expect_equal(maxRcpp(c(NA)),
               max(c(NA)))
  expect_equal(maxRcpp(c(NA), na_rm = TRUE),
               max(c(NA), na.rm = TRUE))
})
#
# Some basic tests for minRcpp
#
test_that(minRcpp provides same results as base R min, {
  expect_equal(minRcpp(c(1, 4, 6, 2, 5)),
               min(c(1, 4, 6, 2, 5)))
  expect_equal(minRcpp(c(1, 4, 6, 2, 5), na_rm = TRUE),
               min(c(1, 4, 6, 2, 5), na.rm = TRUE))
  expect_equal(minRcpp(c(1, 4, 6, 2, 5, NA)),
               min(c(1, 4, 6, 2, 5, NA)))
  expect_equal(minRcpp(c(1, 4, 6, 2, 5, NA), na_rm = TRUE),
               min(c(1, 4, 6, 2, 5, NA), na.rm = TRUE))
  expect_equal(minRcpp(c(NA, 1, 4, 6, 2, 5, NA)),
               min(c(NA, 1, 4, 6, 2, 5, NA)))
  expect_equal(minRcpp(c(NA, 1, 4, 6, 2, 5, NA), na_rm = TRUE),
               min(c(NA, 1, 4, 6, 2, 5, NA), na.rm = TRUE))
  expect_equal(minRcpp(c(NA)), min(c(NA)))
  expect_equal(minRcpp(c(NA), na_rm = TRUE),
               min(c(NA), na.rm = TRUE))
})

Easy, right? Now every time you muck around with your code, you have tests to make sure that you didnt break anything!

Other Thoughts

Overall, these are simple, contrived examples of no real use, other than some practice (that in itself has some real use- to me at least!). I would highly recommend refering to Hadley's books to accompany this post, just in case I have glossed over some crucial part of the process.

The usage of C++ here is really simplistic as well. I plan to investigate useful features of the language, such as the Standard Template Library, and get to grip with features such as iterators. Dirk also seems to highly recommend Armidillo- a C++ linear algebra library. Another key feature is object oriented programming- C++ is widely known for this feature.

I have a copy of Dirk's book, which is yet another book on my ever-growing to-read list. I have a feeling it will be well worth it once I get around to it!

TL;DR- an R package with some basic C++ functions. Have a look here if you are interested.