14 Sep 2015

Max Kuhn and Kjell Johnson's Applied Predictive Modeling

There are very few (non-fiction) books that I read cover to cover while religiously working through the examples and exercises, but this is one I found to be well worth my time. I tracked this book down after completing the Johns Hopkins Practical Machine Learning course on Coursera. Since one of the authors is the creator of the caret package I had been introduced to in the course, it seemed well worth investigating.

The caret package itself provides a much-needed universal interface to the many predictive statistical models available in various R packages (too many to count!), allowing models to be trained across a range of tuning parameters through one consistent set of functions. Resampling and model evaluation are made straightforward as well.
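To make "training across tuning parameters with resampling" concrete, here is a minimal sketch of the underlying loop. It is plain Python rather than caret code, and everything in it is my own assumption for illustration: the toy 1-D k-nearest-neighbour model, the made-up data, and the helper names `knn_predict`, `cv_rmse` and `tune`. It is not how caret is implemented internally.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points (1-D inputs)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return mean(y for _, y in nearest)

def rmse(pairs):
    """Root mean squared error over (observed, predicted) pairs."""
    return mean((obs - pred) ** 2 for obs, pred in pairs) ** 0.5

def cv_rmse(data, k, folds=5):
    """Average held-out RMSE of a k-NN model across `folds` cross-validation folds."""
    scores = []
    for i in range(folds):
        held_out = data[i::folds]  # rows whose index j satisfies j % folds == i
        training = [row for j, row in enumerate(data) if j % folds != i]
        scores.append(rmse([(y, knn_predict(training, x, k)) for x, y in held_out]))
    return mean(scores)

def tune(data, grid, folds=5):
    """Return (best_k, results), where results maps each candidate k to its CV RMSE."""
    results = {k: cv_rmse(data, k, folds) for k in grid}
    return min(results, key=results.get), results

# Toy data, made up for illustration: a noisy quadratic.
rng = random.Random(42)
xs = [rng.uniform(-3, 3) for _ in range(100)]
data = [(x, x * x + rng.gauss(0, 0.5)) for x in xs]

grid = [1, 3, 5, 9, 15, 25]
best_k, results = tune(data, grid)
for k, score in results.items():
    print(f"k={k:>2}  CV RMSE={score:.3f}")
print("best k:", best_k)
```

Each candidate value of the tuning parameter is scored only on rows it was not trained on, and the value with the lowest average error wins. The appeal of a package like caret is that you do not have to write this loop by hand for every model type.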

Overview

The book is split into several key sections. It starts with exploratory data analysis, data preprocessing and resampling before diving into a mammoth section on regression models.

The authors start by discussing various metrics for model evaluation, then work through linear models, non-linear models, and finally tree- and rule-based models. Although the emphasis of the book is on the applied side of predictive modelling, the authors give fairly comprehensive and accessible descriptions of how the models work, without digressing into heavy statistical theory.
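As a small illustration of what regression metrics look like in practice, here is a sketch of two common ones, RMSE and R-squared. This is plain Python with made-up numbers, not code from the book. Note that R-squared has more than one definition in use; the version below (one minus the ratio of residual to total sum of squares) is just one of them, and the function names are my own.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error: the typical size of a prediction error."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(observed, predicted):
    """1 - SS_res / SS_tot: the share of variance in `observed` explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Made-up observed and predicted values, purely for illustration.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

print(f"RMSE      = {rmse(observed, predicted):.2f}")
print(f"R-squared = {r_squared(observed, predicted):.2f}")
```

RMSE is in the same units as the outcome, which makes it easy to interpret, while R-squared is unitless and so easier to compare across problems.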

I found the worked examples particularly useful. Building machine learning algorithms requires knowledge of the theory, experience, and some degree of art and intuition. Being able to use Max and Kjell's experience and well-presented worked examples as a stepping stone was invaluable.

The book also includes some fascinating case studies. The one I enjoyed the most was the designed experiment to predict which concrete mixture would be the strongest. The technique used was part model building, part optimisation problem. In fact, I found the example so interesting that I may write a separate blog post about it.

After the section on regression models, the book moves on to (no surprises here) classification models. The presentation is consistent with the section on regression models: first a discussion of the metrics used to evaluate models, then linear, non-linear and tree/rule-based techniques. A running example from an old Kaggle competition is used as a case study, and what I found quite useful is that they include their script for data preprocessing (warning: there is a tiny error in this script when processing dates, possibly due to me running a newer version of lubridate). This script really hits home that data preprocessing can be a more time-consuming part of the model building cycle than the actual model fitting! Further, it demonstrates that experience and intuition are needed in some places; in this case, to decide how to set up the resampling and create the train/test split.
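To show the kind of decision involved in creating a train/test split, here is a minimal sketch of one common option, a stratified random split that keeps the class proportions the same in both sets. It is plain Python with made-up data and hypothetical names (`stratified_split`, `label_of`), and it is not the split used in the book's case study. Other choices, such as splitting by date, may suit some data sets better.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into (train, test), sampling each class separately so that
    both sets keep roughly the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's data is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Made-up data: 200 rows, one in four labelled "positive".
rows = [(i, "positive" if i % 4 == 0 else "negative") for i in range(200)]
train, test = stratified_split(rows, label_of=lambda row: row[1])

def positive_share(subset):
    return sum(1 for _, label in subset if label == "positive") / len(subset)

print(len(train), len(test))
print(f"positive share: train={positive_share(train):.2f}, test={positive_share(test):.2f}")
```

A purely random split can leave a rare class under-represented in the test set by chance. Sampling within each class avoids that, at the cost of assuming that row order (time, for instance) does not matter.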

After working through regression and classification modelling, the authors come to discuss a few miscellaneous (but vitally important) bits and pieces: variable importance, how to tackle class imbalance, a brief introduction to unsupervised feature selection, and a discussion of factors which may affect model performance. These sections are the icing on the cake for this book, as they give some insight into those nagging questions which pop up when modelling, when you get stuck wondering what the most sensible way to tackle an unusual problem is.

Overall, this book is fantastic. It took me a while to work through (an hour or so on the train every day for several months), but it was well worth the effort, as it gives a huge amount of useful insight delivered in a very accessible way.

Other thoughts

The only criticism I have of this book is that there are more topics I would have loved to see included! Namely, time series models, feature engineering and model ensembles are areas where I think Max and Kjell could really deliver some useful insights. So I will cross my fingers and hope for a second edition!

This book has definitely fuelled my interest in machine learning. It gave me the confidence to enter my first Kaggle competition, and has spurred me to increase my knowledge of the subject. Andrew Ng's famous Machine Learning course on Coursera is definitely high on my to-do list for the new year.

TL;DR: if you are interested in using R for predictive modelling, you will not regret tracking down this book.