
Friday, 19 February 2016

Unit testing RapidMiner processes using R

I've used R a lot over the last few years, partly in my day-to-day work but also as co-editor, with Dr. Markus Hofmann, of our recent book "Text Mining and Visualization: Case Studies Using Open-Source Tools", available at all good bookstores ;)

I have also made some R packages and these even contain unit tests implemented with the splendid testthat package. Any minute now, I might even become an R expert.

Unit tests are vital to the health and sanity of the creator of any solution. Once something gets big and is subject to a lot of change, it is extremely important to know whether a change has accidentally broken something. RapidMiner processes are no different, and it struck me that it would be worth implementing some basic unit testing for these. It could be done using RapidMiner itself but, for fun, I thought I would use R.

Here is a small example process that illustrates this. It starts by estimating the performance of a classifier on the Iris data set. The estimated performance is converted to an example set using the "Performance to Data" operator and is then passed to the "Execute R" operator.

This is the performance vector.


The R script confirms that all the parts of the performance vector are as expected. It uses a base R function called "all.equal()" to do this (I decided not to use testthat to avoid having to install the library so others can get going more quickly).
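To make the behaviour of "all.equal()" concrete, here is a minimal sketch in base R. The expected values are the ones quoted in this post; the observed values are invented purely for illustration. Note that all.equal() returns TRUE on a match and a character description of the difference otherwise, so it is usually wrapped in isTRUE(), and that for numeric vectors the tolerance is applied to the mean relative difference by default.

```r
# Expected values as quoted in the post.
expected <- c(accuracy = 0.94, kappa = 0.91)

# Invented observations for illustration only.
close_enough <- c(accuracy = 0.943, kappa = 0.915)
drifted      <- c(accuracy = 0.96,  kappa = 0.94)

# all.equal() returns TRUE, or a character vector describing the mismatch.
ok  <- all.equal(expected, close_enough, tolerance = 0.01)
bad <- all.equal(expected, drifted, tolerance = 0.01)

# isTRUE() turns either outcome into a plain logical.
passed_ok  <- isTRUE(ok)
passed_bad <- isTRUE(bad)

print(passed_ok)
print(passed_bad)
print(bad)
```

Using isTRUE() matters because a failed comparison is a string, not FALSE, so testing the raw return value in an if() would raise an error.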

Here is the R script.


As you can see, the script checks that all the parts of the example set are as expected. For example, one line of the script confirms that the values for accuracy and kappa are 0.94 and 0.91 respectively, within a tolerance of 0.01. The R script then outputs a data frame with the results. When all is well the result looks like this.
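As a rough sketch of the kind of check described above (not the original script), the code below assumes the "Execute R" convention of a function named rm_main that receives the example set as a data frame and returns a data frame. The column names Criterion and Value, and the layout of the result, are assumptions made for illustration; the real output of "Performance to Data" may differ.

```r
# Sketch only: column names "Criterion" and "Value" are assumed, not confirmed.
rm_main <- function(data) {
  # Expected values and tolerance as quoted in the post.
  expected <- c(accuracy = 0.94, kappa = 0.91)

  # Turn the two assumed columns into a named vector, in the expected order.
  observed <- setNames(data$Value, as.character(data$Criterion))[names(expected)]

  check <- all.equal(expected, observed, tolerance = 0.01)

  # Report the outcome as a one-row data frame (hypothetical layout).
  data.frame(
    test   = "accuracy and kappa",
    passed = isTRUE(check),
    detail = if (isTRUE(check)) "OK" else paste(check, collapse = "; "),
    stringsAsFactors = FALSE
  )
}

# Stand-in for the example set, with invented values, so the sketch runs outside RapidMiner.
perf <- data.frame(
  Criterion = c("accuracy", "kappa"),
  Value     = c(0.943, 0.915),
  stringsAsFactors = FALSE
)

result <- rm_main(perf)
print(result)
```

Returning a data frame rather than stopping with an error keeps the process running, so the outcome can be inspected like any other example set.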


Changing the parameters of the earlier operators changes the results, and this can be picked up by examining the output of the R script. Here is an example when the number of cross-validation folds is set to 3.


In other words, the performance has changed and this gets detected automatically. Obviously, this is a simple example but you can see how it could be extended.
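One possible extension, sketched here under my own assumptions, is to report each criterion on its own row so that it is obvious which one has drifted. The helper name check_criteria and the observed values are hypothetical; only the expected values and the tolerance come from this post.

```r
# Hypothetical helper: compares each named criterion separately.
check_criteria <- function(observed, expected, tolerance = 0.01) {
  rows <- lapply(names(expected), function(name) {
    res <- all.equal(expected[[name]], observed[[name]], tolerance = tolerance)
    data.frame(
      criterion = name,
      expected  = expected[[name]],
      observed  = observed[[name]],
      passed    = isTRUE(res),
      stringsAsFactors = FALSE
    )
  })
  # Stack the one-row data frames into a single report.
  do.call(rbind, rows)
}

expected <- c(accuracy = 0.94, kappa = 0.91)  # values quoted in the post
observed <- c(accuracy = 0.943, kappa = 0.94) # invented: kappa has drifted

report <- check_criteria(observed, expected)
print(report)
```

With these invented numbers the accuracy row passes and the kappa row fails, which is the kind of per-criterion detail a single combined check would hide.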

As I mentioned above, you could do this in pure RapidMiner but it would need quite a number of operators to realise it. The R integration in RapidMiner is relatively easy to use and so it deserves the time of day when considering how to tackle problems.

R also has a tremendous range of packages and a future post will touch cautiously on how to access a database directly.
