30 Sep 2015

Coursera Johns Hopkins Data Science Specialisation

This was the specialisation that started my interest in online courses. I was just wrapping up my PhD, and had a bit of a panic as I specialised in coding in Fortran- not too handy for most jobs out there. This series was recommended to me as a great way to learn R for data science, so I began it enthusiastically in January 2015.

Overall, it was brilliant. Roger Peng, Jeff Leek and Brian Caffo have all been promoted to the rank of hero. The course is well taught and accessible, and as well as teaching me a lot, it has nudged me in the right direction to expand my knowledge in the areas it introduces. So here we go- a quick rundown of all the courses!

The Data Scientist’s Toolbox

A little bit of appetite whetting and an introduction to the course. If you have never used git, it might be worth doing. If you don't know how to install R or RStudio, it will show you how to do so. For a four-week course, it took me two evenings to complete. If you plan on working through the specialisation, I suggest taking this alongside R Programming.

R Programming

This course was exceptional. Ok, to be fair coming from my background R should be fairly easy to pick up, but this course introduces the language, and rapidly moves through the basics including subsetting, functions, functionals, debugging and R packages. The swirl interactive programming assignments are a really nice touch.

One thing I particularly liked about the specialisation is the grading- a mix of quizzes, peer reviewed assignments, and some using an autograder. The peer reviewed assignments are particularly helpful (as long as someone helpful is marking them)- you get a good look at other ways to tackle the same problem, and learn a lot from that.

A few months after finishing this course I got hold of Hadley Wickham's Advanced R book- this was a great next step to continue developing what I learnt from this course.

Getting and Cleaning Data

The course does exactly what it says on the tin. I think one of the most useful parts for me was the introduction to Hadley Wickham's packages such as lubridate, dplyr, tidyr and reshape2. If I were to put on my critical hat I would lean towards the camp of thought that Python is much more flexible than R for web scraping, but nonetheless the course demonstrates a number of ways to 'get' data from the web and databases.

The 'cleaning' part was a great way to get stuck in with a real dataset- the peer reviewed assignment was my first attempt with the dplyr package. The hands-on assignment was really helpful, and I think I learned a lot from this course.

The one part I keep meaning to revisit is the brief introduction to the data.table package, so that I can get to grips with it- at the odd conference I have attended, people have suggested it is superior to dplyr for data munging. To be fair I have yet to encounter a situation that dplyr cannot tackle, but having an extra trick up my sleeve can't hurt.

Exploratory Data Analysis

This course felt like treason to my academic days- until I completed it I was a die-hard gnuplotter. R's three main graphics packages are well introduced- base R graphics, lattice and ggplot2. The course goes on to discuss singular value decomposition/principal component analysis, as well as clustering. Both PCA and clustering are revisited in the Practical Machine Learning course, but nonetheless the basic concepts and applications are well presented.

Statistical Inference

This (and the Regression Models course) is the one that seems to catch people off guard. To be honest, with a bit of dedication and access to Stack Overflow you can pretty much stumble through the first four parts of the specialisation. For this course, you need to put your maths hat on.

I LOVE the way Brian Caffo teaches. Opinion on the course forum seems to be mixed, but part of me thinks those with a negative view are only interested in statistics as far as blindly plugging data into statistical software, and don't really care to understand.

That's not to say I found this course easy. My statistics background is pieced together from quantum mechanics and statistical physics- hypothesis testing and t-tests were completely new to me. I learned a lot in this course, and I think it was because I found it challenging that I later decided to do Brian's Mathematical Biostatistics Boot Camp 1 and 2 (I REALLY recommend these courses).

Chances are if you didn't complete this course, you won't find Regression Models to be too much fun (unless you already knew it all. And in that case, I wonder why you took the course in the first place...)

Regression Models

Brian strikes again with another fantastic course. Obviously, the main focus is on linear regression, starting at univariate, then on to multivariate. Brian discusses ANOVA and ANCOVA, as well as dissecting fitted models. The course also briefly looks at logistic regression and GLMs.

I learned a hell of a lot from this course. I enjoyed the way it was presented- although I think it might have been better to have linear algebra as a prerequisite as the formalism is a little neater that way (I didn't know this before- I looked this up after having completed the course).
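To give a flavour of why the linear algebra formalism is neater, here is a minimal sketch (in Python rather than R, and not taken from the course). The data is made up, and the function names fit_summation and fit_normal_equations are hypothetical. The first function fits a straight line with explicit sums; the second solves the normal equations (X'X) b = X'y, which is the form that carries over unchanged to the multivariate case.

```python
# Hypothetical toy data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_summation(x, y):
    """Univariate least squares written with explicit sums."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = (centred sum of cross products) / (centred sum of squares)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_normal_equations(x, y):
    """Same fit via the normal equations (X'X) b = X'y, where X = [1, x]."""
    n = len(x)
    sx = sum(x)
    sxx = sum(xi * xi for xi in x)
    sy = sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]] and X'y = [sy, sxy].
    # Solve the 2x2 system using the explicit inverse of X'X.
    det = n * sxx - sx * sx
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope


print(fit_summation(x, y))
print(fit_normal_equations(x, y))
```

Both functions should agree up to floating point error. The difference is that the matrix version only needs more columns in X to handle more predictors, whereas the summation version has to be rederived.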

I really liked as well that Brian emphasised that the models are being built for interpretability rather than predictive power- he references ahead to the Machine Learning course for those people who are purely interested in predicting (and perhaps, black box models that they don't have to worry about understanding).

Practical Machine Learning

In this course, the difficulty level takes a step back. If you completed Regression Models and Statistical Inference, you will breeze through the rest of the specialisation. Jeff Leek (correctly) asserts that this course is the most fun in the series- but that's only because the hard work you have put in starts to pay off with tangible results.

Jeff emphasises use of Max Kuhn's (fantastic) caret package, and works through building classification and regression models using various algorithms. The emphasis is on supervised learning (as opposed to unsupervised learning and forecasting- although these areas do get touched upon). The course focuses on application rather than construction of models.

Since completing this course, I dug up Max Kuhn and Kjell Johnson's Applied Predictive Modeling book- this leads on very well from this course, and I can guarantee you will be a boss at using the caret package if you work through that book. I'm going to blog about it soon, so watch this space for my review!

I will emphasise that this course (and the APM book) are focused on PRACTICAL machine learning- there is some discussion in both about how the algorithms work, but it isn't the main focus. I plan to complete Andrew Ng's famous Machine Learning course in the new year to get hands-on experience at writing some algorithms.

Reproducible Research

This course should be compulsory for anyone beginning a Master's degree or PhD in any science. Reproducible research is so key in any analysis, and I learned so many good practices from this course. Roger Peng delivers the course really well- at no point was this boring or like a tedious list of practices you should adhere to.

It was fairly easy compared to some of the others in the specialisation, but the lessons it teaches are important nonetheless. The peer reviewed assignments give you some great practice with R Markdown, and some hands-on experience with a simple data analysis to allow you to flex your newly developed R skills.

Developing Data Products

I found this course to be the perfect wrap up to the specialisation- along the lines of: now you know how to do stuff, how should you deliver it?

The course starts off with Shiny- RStudio's neat little instant 'R-to-web-app' tool. The course gives a great introduction, and you build your very own app as a peer reviewed assignment.

My opinion of Shiny has always been mixed- I think it's fine for demonstrating the results of an investigation (as the JH team advocate), but I have met several people who use it for products. I personally am not convinced it is robust enough for usage as a product... best to use it for a prototype, then get a proper web developer to build you a 'real' app.

The course goes on to introduce Slidify, which I will definitely be using to replace Microsoft PowerPoint. To put it in context, I suppose it is essentially to PowerPoint what R Markdown is to Word.

The final part of the course goes on to discuss R Packages and object-oriented programming in R. Starting with packages, it delivers a great introduction and a worked example to build your own R package. I found the course was enough to get my teeth stuck in, then I moved on to Hadley Wickham's R Packages book for a more comprehensive guide.

Classes and objects in R are... weird. If you can program in C++ or Python, you probably won't like it. It's easy, but you won't like it. The course emphasises use of the S4 system (the RC system in R is more like conventional OO programming), and once again I found the example was enough to get me started, and then I turned to other sources for a more in-depth (if slightly outdated) guide.
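For anyone coming from Python, the sketch below is a rough analogy of what makes S4 feel odd. It is my own illustration, not R code and not from the course. In S4, methods belong to generic functions (created with setGeneric and filled in with setMethod) rather than to the classes themselves. Python's functools.singledispatch behaves in a loosely similar way. The Circle, Rectangle and area names are hypothetical, and unlike S4 this only dispatches on the first argument.

```python
import math
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Circle:
    radius: float


@dataclass
class Rectangle:
    width: float
    height: float


# Conventional OO would put area() inside each class. Here the generic
# function lives outside the classes, roughly as setGeneric() does in S4.
@singledispatch
def area(shape):
    raise NotImplementedError(f"no area method for {type(shape).__name__}")


# Each register() call attaches a method to the generic for one class,
# roughly as setMethod() does in S4.
@area.register
def _(shape: Circle):
    return math.pi * shape.radius ** 2


@area.register
def _(shape: Rectangle):
    return shape.width * shape.height


# Note the call style: area(obj), not obj.area().
print(area(Rectangle(2.0, 3.0)))
print(area(Circle(1.0)))
```

The RC system, by contrast, keeps methods inside the object, which is why it feels closer to the obj.method() style of C++ and Python.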

Other Thoughts

Overall, I thoroughly enjoyed this course. In the short space of nine months or so, I have transitioned from being clueless to being fairly competent (if I may say so myself). I have a full time job as a data scientist, dabble in Kaggle competitions, and can definitely hold my own in terms of knowledge at the various R conferences I have attended.

The team at JH University have done a fantastic job, and I would recommend this course to anyone who wanted to improve their quantitative data analysis skills, regardless of whether they were in academia or an aspiring data scientist. It is just a fantastic specialisation; I have learnt so much, and have no regrets at all about the number of hours I dedicated to completing it!

TL;DR- if you are interested in learning R and/or introductions to statistics and machine learning, I cannot recommend this series of courses strongly enough.