1 Dec 2015

Coursera Data Manipulation at Scale: Systems and Algorithms

This course, taught by Bill Howe from the University of Washington, had caught my eye quite some time ago, but thanks to my seemingly never-ending MOOC to-do list, it took me a while to get around to it.

All I can say is that I have no regrets at all! Bill introduces a HUGE area with decent breadth and what I thought was a suitable level of detail (realistically, it would take more than a MOOC for us to become Hadoop experts). The course makes a point of focusing on concepts rather than specific technologies (a point driven home by how Spark has massively overtaken Pig for distributed data analytics over the past few years).

Before you get stuck in, I suggest a bit of Python practice if you are rusty or if R is your primary tool. Charles Severance has a great series of introductory courses that I completed last year; I found them well taught and quite informative.

Week 1

The first week serves more as appetite whetting and course admin than anything else - I found the lectures quick and easy to digest. The real fun is in the Python assignment - you get unleashed performing sentiment analysis on the live Twitter stream. This seemed a little daunting at first, as it was a little out of my comfort zone, but I found Python really pleasant to program in, and the assignment served as a demonstration of how simple concepts can be applied (almost surprisingly) easily to 'real' data. In fact, it's almost scary how easy it is to do this... but that's what you get for Tweeting your hopes, dreams and fears for the world to see!
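To give a flavour of how simple the core idea is, here is a minimal sketch of word-level sentiment scoring. It is not the assignment's solution: the function name and the tiny lexicon are made up for illustration, and a real run would load a proper scored word list and read tweets from the stream rather than from a string.

```python
def tweet_sentiment(text, scores):
    """Sum the scores of the known words in a tweet; unknown words count as 0."""
    return sum(scores.get(word, 0) for word in text.lower().split())


# Hypothetical lexicon, purely for illustration.
example_scores = {"great": 3, "love": 3, "awful": -3, "boring": -2}

print(tweet_sentiment("What a GREAT day", example_scores))
print(tweet_sentiment("awful boring lecture", example_scores))
```

Splitting on whitespace is crude (punctuation stays attached to words), but it shows that the basic technique needs little more than a dictionary lookup and a sum.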

Week 2

This week could have been dreadfully boring... but it is saved by the fact that Bill focuses on the concepts of relational algebra rather than plugging SQL queries into a database. I have never been in a role where I plug in SQL queries day in, day out, so I wouldn't claim to be 'fluent' in database queries, but having a lot of experience with R, it is clear to me what I am trying to achieve. The emphasis on relational algebra therefore made this week quite interesting, and put in context many things that I think get taken for granted.

The assignment was a little more challenging this week due to my bumbling SQL, but it was really interesting and well worth the effort to complete. The application to sparse matrix multiplication in particular got me thinking about relational algebra in ways I had never considered before.
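For anyone curious about what sparse matrix multiplication looks like in relational terms, here is a rough sketch using Python's built-in sqlite3 module. The table layout (one row per non-zero entry, with columns row_num, col_num and value) and the function name are my own assumptions for illustration, not necessarily what the assignment uses. The idea is that a join on the shared index followed by a grouped sum is exactly a matrix product.

```python
import sqlite3


def sparse_matmul(a_entries, b_entries):
    """Multiply two sparse matrices given as (row, col, value) triples."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: one row per non-zero matrix entry.
    conn.execute("CREATE TABLE a (row_num INTEGER, col_num INTEGER, value REAL)")
    conn.execute("CREATE TABLE b (row_num INTEGER, col_num INTEGER, value REAL)")
    conn.executemany("INSERT INTO a VALUES (?, ?, ?)", a_entries)
    conn.executemany("INSERT INTO b VALUES (?, ?, ?)", b_entries)
    # Join pairs up A[i][k] with B[k][j]; GROUP BY sums the products over k.
    rows = conn.execute(
        """
        SELECT a.row_num, b.col_num, SUM(a.value * b.value)
        FROM a JOIN b ON a.col_num = b.row_num
        GROUP BY a.row_num, b.col_num
        ORDER BY a.row_num, b.col_num
        """
    ).fetchall()
    conn.close()
    return rows


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], non-zero entries only.
a = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
b = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]
product = sparse_matmul(a, b)
```

Output cells with no matching pair of non-zero entries simply never appear in the result, so the answer stays sparse without any extra work.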

Week 3

This was MapReduce week! Bill explains in great detail the process of mapping, shuffling and reducing. The need for these techniques is well motivated, and several implementations are discussed (MapReduce, Hadoop, Pig, Hive and Spark).

The programming assignment aims to make you 'think' MapReduce - it is actually a Python assignment, with no parallel computing involved (perhaps to shatter the belief that MapReduce is a magical process that happens on the Cloud - it is in fact a programming pattern that happens to scale up very nicely, and it is just as valid on a single machine, if not particularly efficient). It was really interesting, and it helped to reinforce the concepts taught.
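To illustrate the point about MapReduce being a pattern rather than a platform, here is a minimal single-machine sketch of the map, shuffle and reduce steps, applied to the classic word count. The names map_reduce, word_count_mapper and word_count_reducer are my own, and this is not the framework supplied with the assignment.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run the map, shuffle and reduce phases sequentially in one process."""
    grouped = defaultdict(list)
    for record in records:
        # Map: each record emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key.
            grouped[key].append(value)
    # Reduce: collapse each key's list of values into a single result.
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_count_mapper(line):
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    return sum(counts)


lines = ["the cat sat", "the cat ran", "a dog ran"]
counts = map_reduce(lines, word_count_mapper, word_count_reducer)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, each phase could be farmed out to many machines without changing the two small functions, which is where the scaling comes from.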

Week 4

This was the only part of the course that I felt dragged on a bit. Bill discusses many NoSQL systems at great length, but I found that after a while, without any practical programming examples to play with, it got a little dull. I suppose the key things to take home are no schemas, no transactions, eventual consistency and the CAP theorem.

The course did pick up a bit when it got to the subject of graphs - I have been looking for a while for a decent introduction to the basic concepts, and this was it. The short discussion on the PRISM system was fascinating, and I was somewhat astonished at how easy it can be to implement mass surveillance.

Final Thoughts

Overall, I really enjoyed this course. It gives an introductory insight into 'Big' data, rather than purely data science. The only part I found quite disappointing was the optional assignment using AWS and Pig to process some massive datasets. The instructions were really outdated, so it took quite a while to get everything up and running. I think more hands-on Big Data work would have been really useful, even if it has to be optional due to the cost of AWS - any experience processing large data sets on the Cloud will be of huge benefit to anyone looking to gain experience or seek employment in those areas.

I look forward to completing the remaining courses in the specialisation in 2016 - they are somewhere near the top of my never-ending to-do list...

TL;DR - an excellent introduction to large-scale data analytics. You need to be able to speak Python.