24 Mar 2017

pycaret: a Python framework for classification and regression training

Coming from an R background, I found that Max Kuhn's caret package and book, Applied Predictive Modeling, opened up supervised machine learning to me through a simple, intuitive and flexible API.

When I started using Python, I was initially impressed by the scikit-learn package; unlike in R, I suddenly had access to a single well-written machine learning library containing a wide range of well-maintained algorithms, accompanied by various functions and classes for preprocessing, cross-validation and model selection.

However, I have to confess that I struggled to develop a workflow that was as smooth and intuitive as the one I had in R. Firstly, I found too many functions tucked away in different parts of the namespace, requiring an encyclopedic knowledge of the API to access. And even with the nifty Pipeline objects, given the limited documentation and examples available, I never found the time to experiment with them for long enough to get the same kind of results as easily as I could with the caret::train() function.

The above isn't intended as a criticism of scikit-learn: I find it to be an incredibly powerful, well-written package. However, it just isn't as quick and easy as caret to pick up and experiment with a handful of models. My typical flow with caret is to use its helper functions to preprocess, train and evaluate a bunch of models, and then home in on the most promising candidates (and, if necessary at this stage, dig into the underlying package to fine-tune).
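To make that "try a handful of models, then home in" flow concrete, here is a minimal sketch in plain scikit-learn. It is not pycaret's API; the synthetic dataset, the three candidate models and the names `candidates`, `scores` and `ranked` are all my own assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A handful of candidate models (an arbitrary, assumed selection).
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

# Evaluate every candidate with the same 5-fold cross-validation.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank the candidates so the most promising ones come first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

This works, but each step (choosing the resampling, the metric, the loop over models) has to be wired up by hand, which is the part caret's helpers take care of.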

So with this in mind, and a long commute to fill, I started a little project of my own, aiming to replicate the parts of caret I found most useful in a Python framework. At worst, it would enable me to gain a practical understanding of how to build such a framework and to practise my Python; at best, I have hopefully produced something that other people will find useful.

Have a look at the project demo and repository and the latest release (currently 0.0.1.dev1), and see what you think!

Basic ideas

For those of you unfamiliar with caret, it is a library of wrapper functions providing a consistent interface to the many, many R libraries containing machine learning functions. Aside from the incredibly useful helper functions available for preprocessing and evaluating algorithms, the real stroke of genius lies in the trainControl() and train() functions. The former is used to set up, amongst other things, the cross-validation scheme and model selection criteria, and the latter fits the chosen model under that scheme, tuning its hyperparameters along the way.
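For readers who know scikit-learn better than caret, the closest analogue I can offer to the trainControl() and train() pairing is a cross-validation splitter handed to GridSearchCV. The sketch below is plain scikit-learn, not pycaret's API, and the dataset, model and tuning grid are assumptions chosen only to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary classification data, used purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Roughly the role of trainControl(): define the resampling scheme.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Roughly the role of train(): fit the model over a tuning grid and
# keep the setting that scores best on the chosen metric.
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_features": [2, 4, 6]},  # assumed grid
    scoring="roc_auc",                       # model selection criterion
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The mapping is only approximate: caret bundles the resampling scheme and the selection metric into one control object, whereas here they are separate arguments to the search.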

Scikit-learn contains most of the well-known models in a single package, but there are exceptions. For example, there is no xgboost model, MARS, or deep learning (neural networks were added only in the most recent release). Therefore, there is some value in bringing models from multiple packages together into a single framework.

An example

I was thinking about how best to showcase the functionality of pycaret, and I figured a little demonstration would do it. I have reworked an analysis using the churn dataset from the R package C50. This dataset has been the focus of a couple of older posts of mine, such as this and this.

It seems that GitHub can magically host Jupyter Notebooks: despite my reservations about using these for my day-to-day work (give me an IDE please!), they are highly useful for showcasing and teaching. Have a look here for a demo.

The demo partly assumes that you have some experience with caret, so a lot will look familiar. If you have never used caret before, hopefully you can still appreciate the convenience the framework adds.

In future I may do a piece drilling into the internals of the package, but for the time being any interested people can peruse the source code. If you like what you see, or spot an area to improve, why not consider contributing to the project?

Looking forward

As you may guess, this being the initial development release, I need a lot of feedback to prioritise additional features and to identify and fix any bugs that may be present. I also have my own backlog of functionality to add: for instance, preprocessing is currently not implemented (think center, scale, PCA etc.), and the hyperparameter selection is an obvious candidate for parallel processing. Additionally, for models that support it, multi-class classification is another obvious next step.
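For reference, this is roughly what those two backlog items look like in plain scikit-learn today: preprocessing steps chained in a Pipeline, and the hyperparameter search spread over worker processes with n_jobs. It is a sketch of the behaviour I would like to match, not pycaret code, and the dataset, step names and grid values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, used purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Center and scale, then PCA, then the model. Each step is refit inside
# every cross-validation fold, so no information leaks from held-out data.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Tune preprocessing and model parameters together (assumed grid values);
# n_jobs=2 evaluates the candidates on two worker processes.
search = GridSearchCV(
    pipeline,
    param_grid={
        "pca__n_components": [2, 5, 8],
        "model__C": [0.1, 1.0, 10.0],
    },
    cv=5,
    n_jobs=2,
)
search.fit(X, y)

print(search.best_params_)
```

Because each grid point is evaluated independently of the others, the search parallelises cleanly, which is why it is such an obvious candidate.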

I think the project is at a stage where it is usable, and the amount of effort I put into it will be dictated by how useful people find it. If users take it up, provide feedback and suggest improvements, I will be more than happy to dedicate time to maintaining the project, heading towards the milestone of version 0.0.1 being released on PyPI. So please, clone, fork, report issues, suggest functionality, and if you think this is a good idea and want to get involved, do so!

TL;DR: I wrote my first Python package, pycaret. It's a machine learning framework inspired by the R caret package. Please try it out! I've made a demo here and the repository is here.