Data Science With RapidMiner: February 2016

Sunday, 21 February 2016

Reading from a SQL Server database

Reading from a SQL Server database is easy using R.

Here's a process that shows how to do this from within a RapidMiner process.

Of course, you can use the built-in "Read Database" operator to read from a database, but there are restrictions in the community version. By using R you can partially get round those restrictions, but you should always be aware of your license agreement. Just because you can get round the license does not mean that the terms no longer apply. If you do something that would normally trigger the purchase of an additional license, you still need to buy it. I'm not a lawyer, thankfully, but you have been warned.

Having said that, there are situations where you have to try things to prove viability and get political buy-in before committing to a more serious plan where money is to be spent. Political buy-in, as everyone knows, can sometimes take a very long time and even the most trivial objection can completely derail progress. Removing the ability to make a full prototype is just such a potential trivial objection.

Having said all of that, the method the process uses here will have some subtle differences in the way it interacts with the database when compared to the "Read Database" operator. This means it might fail for reasons as yet unknown. Simple advice: don't rely on it.

Enough words, on with the process.

The process has two parts: the first sets some macros that are used within the second. It's a little-known fact that you can use macros in this way, but it's extremely powerful and allows the code to work in lots of places. The macros themselves are shown in the following table.

Change these to match what you have in your environment. Note that I am using SQL Server authentication so this means you have to set up your environment like this. I am led to believe that built-in authentication is possible but I have not tried it.

The R code itself is shown here.

Additional points:

Install the RJDBC package into your environment; rJava may also be required.
Download the Microsoft JDBC drivers from here (note that care is always needed with downloads such as these because the vendors keep changing their Web sites).
If you are running on Ubuntu, the process will still work but there are some changes to make, as shown in the R code.
I have not tried it on a Mac.
Change the query to whatever you want. The example here queries the system table.
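For readers who cannot see the original script, here is a minimal sketch of the general RJDBC approach. It is not the author's code. The host, port, database name, credentials, driver jar path and query are all placeholders, and in the RapidMiner process they would come from the macros described above rather than being typed in directly.

```r
# Sketch only: every value below is a placeholder, not the author's setting.

# Build a SQL Server JDBC URL from its parts.
build_sqlserver_url <- function(host, port, database) {
  sprintf("jdbc:sqlserver://%s:%d;databaseName=%s", host, port, database)
}

# Run a query and return the result as a data frame.
# Needs the RJDBC package (and rJava) plus the Microsoft JDBC driver jar.
read_sqlserver <- function(url, user, password, driver_jar, query) {
  drv <- RJDBC::JDBC(
    driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    classPath = driver_jar
  )
  conn <- DBI::dbConnect(drv, url, user, password)
  on.exit(DBI::dbDisconnect(conn), add = TRUE)  # always close the connection
  DBI::dbGetQuery(conn, query)
}

url <- build_sqlserver_url("localhost", 1433L, "master")

# Example use inside "Execute R" (not run here; it needs a reachable server).
# The jar path and the query are hypothetical.
# rm_main <- function() {
#   read_sqlserver(url, "my_user", "my_password",
#                  "/path/to/sqljdbc4.jar",
#                  "SELECT name, create_date FROM sys.databases")
# }
```

Keeping the connection details as function arguments mirrors the macro idea: the same script can move between environments with only the values changing.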

The end result is an example set. The query shown in the example yields this.

You will see that the attribute names have been created automatically and a basic mapping to types has been done. The following shows part of the statistics for this example set.

One mapping that would need additional downstream work is the create_date attribute. It looks like it has been transformed into a polynominal. Closer inspection would, no doubt, reveal other foibles.
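One possible way to deal with this is to convert the column inside the R script before the data frame is handed back. The sketch below assumes the driver delivers the timestamps as text in a "year-month-day hour:minute:second" layout; the sample values are made up, and it is worth checking how the converted column actually arrives in RapidMiner.

```r
# Sketch: convert a text timestamp column to a proper date-time in R.
# The sample values and their format are assumptions for illustration.
df <- data.frame(
  name = c("master", "tempdb"),
  create_date = c("2003-04-08 09:13:36.390", "2016-02-21 10:15:30.123"),
  stringsAsFactors = FALSE
)

# %OS accepts fractional seconds; fixing the time zone avoids surprises
# when the script runs on machines with different settings.
df$create_date <- as.POSIXct(df$create_date,
                             format = "%Y-%m-%d %H:%M:%OS",
                             tz = "UTC")

str(df$create_date)
```

The alternative is to leave the R script alone and convert downstream in RapidMiner, for example with the "Nominal to Date" operator.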

The example set can then be used in the normal way.

In summary, you can see that it is very easy to access SQL Server using R. It is therefore easy to do it from within RapidMiner.

Friday, 19 February 2016

Unit testing RapidMiner processes using R

I've used R a lot over the last few years, partly in my day-to-day activity but also as part of my efforts as co-editor with Dr. Markus Hofmann of our recent book "Text Mining and Visualization: Case Studies Using Open-Source Tools", available at all good bookstores ;)

I have also made some R packages and these even contain unit tests implemented with the splendid testthat package. Any minute now, I might even become an R expert.

Unit tests are vital to the health and sanity of the creator of any solution. Once something gets big and is subject to a lot of change, it is extremely important to know whether changes have accidentally broken something. RapidMiner processes are no different and it struck me that it would be worth implementing some basic unit testing for these. It could be done using RapidMiner itself but, for fun, I thought I would use R.

Here is a small example process that illustrates this. It starts by estimating the performance of a classifier on the Iris data set. The estimated performance is converted to an example set using the "Performance to Data" operator and is then passed to the "Execute R" operator.

This is the performance vector.

The R script confirms that all the parts of the performance vector are as expected. It uses a base R function called "all.equal()" to do this (I decided not to use testthat to avoid having to install the library so others can get going more quickly).

Here is the R script.

As you can see, the script checks that all the parts of the example set are as expected. For example, this line

confirms that the values for accuracy and kappa are 0.94 and 0.91 respectively with a tolerance of 0.01. The R script then outputs a data frame with the results. When all is well the result looks like this.

By changing the parameters of the earlier operators, the results will change. This can be picked up by examination of the result. Here is an example when the number of cross validations is set to 3.

In other words, the performance has changed and this gets detected automatically. Obviously, this is a simple example but you can see how it could be extended.
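The idea can be sketched in base R as follows. This is not the original script. The column names "Criterion" and "Value" are assumptions about what "Performance to Data" produces, and the helper name check_performance is made up for illustration. Note that all.equal() measures relative difference by default for numbers, so a tolerance of 0.01 means roughly one percent of the expected value.

```r
# Sketch of a performance check using base R only.
# Column names ("Criterion", "Value") are assumed, not taken from the post.
check_performance <- function(perf, expected, tolerance = 0.01) {
  # Look up the measured value for each expected criterion (NA if absent).
  actual <- perf$Value[match(names(expected), perf$Criterion)]
  passed <- mapply(
    function(e, a) isTRUE(all.equal(e, a, tolerance = tolerance)),
    expected, actual
  )
  data.frame(
    test = names(expected),
    expected = unname(expected),
    actual = actual,
    passed = unname(passed),
    stringsAsFactors = FALSE
  )
}

# Stand-in for the example set produced by "Performance to Data".
perf <- data.frame(
  Criterion = c("accuracy", "kappa"),
  Value = c(0.94, 0.91),
  stringsAsFactors = FALSE
)

result <- check_performance(perf, c(accuracy = 0.94, kappa = 0.91))
print(result)
```

Inside "Execute R" the same helper could be called from rm_main with the incoming example set, returning the result data frame so that a failed check shows up as a row with passed equal to FALSE.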

As I mentioned above, you could do this in pure RapidMiner but it would need quite a number of operators to realise it. The R integration in RapidMiner is relatively easy to use and so it deserves the time of day when considering how to tackle problems.

R also has a tremendous range of packages and a future post will touch cautiously on how to access a database directly.

Sunday, 14 February 2016

Making processes more robust: Confirming attribute types.

If you want a process to be robust and handle inputs gracefully, it makes sense to get it to check that example sets passed to it contain attributes of the correct name and the correct type. This is particularly true for processes exposed using the Server web interface.

Here's a process that takes an example set, determines the types and roles of each attribute, and outputs an example set for further processing. From there, it is a simple matter to work out whether the input contains the required attribute names and types, and then take appropriate action.

The key part is some Groovy scripting which works out the Role and Type of each attribute. The following table shows the output of the process.

Examination of the table reveals some interesting things. You can see that a regular string (also known as a polynominal) has a role of "regular" with a Type of 1. Text attributes have Type equal to 5; they are not the same as polynominal. Integers are Type 3. There is no Type 2 in the table. Of course, I could go on, and examination of the code would reveal what the Types corresponding to 2, 7 and 8 are.

Once the output is an example set, it is easy to check that it contains attributes with the correct name and type to allow subsequent processing to occur. By using the "Filter Examples" operator it would, for example, be possible to confirm that a label of type integer is present in the input data and report an error (perhaps using the "Throw Exception" operator). This allows processes to be more robust to the vagaries of wacky input data.
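The same kind of check can be illustrated outside Groovy. The sketch below uses R, not the Groovy script from the post, and works on R classes rather than RapidMiner's numeric type codes. The required schema and the helper name check_schema are hypothetical.

```r
# Sketch: verify that a data frame has the required columns and classes.
# The required schema used below is hypothetical.
check_schema <- function(data, required) {
  # First class of each required column, or NA when the column is absent.
  actual <- unname(vapply(
    names(required),
    function(n) if (n %in% names(data)) class(data[[n]])[1] else NA_character_,
    character(1)
  ))
  wanted <- unname(required)
  data.frame(
    attribute = names(required),
    required_type = wanted,
    actual_type = actual,
    ok = !is.na(actual) & actual == wanted,
    stringsAsFactors = FALSE
  )
}

input <- data.frame(id = 1:3, label = c(1L, 0L, 1L), text = c("a", "b", "c"),
                    stringsAsFactors = FALSE)
report <- check_schema(input, c(label = "integer", text = "character"))

# Fail loudly, much as "Throw Exception" would inside RapidMiner.
if (!all(report$ok)) stop("Input does not match the required schema")
```

Returning a report rather than stopping inside the helper keeps the decision about what to do with a mismatch in the calling process.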

Of course, this is one step along the road of making processes more robust. Another important step is unit testing to ensure that processes don't get damaged by well-intentioned edits. That will be the subject of a future post.

Sunday, 7 February 2016

Outputting two example sets from a Groovy script

Here is a toy process using Groovy that takes two input example sets and simply outputs them again after swapping them.

The key point is that the Groovy code shows how two outputs can be created. This is easy for a developer to do but not immediately obvious to someone with no development expertise.
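For comparison, and not as a replacement for the Groovy code, the same swap can be sketched as an "Execute R" script. This relies on the convention that rm_main takes one argument per connected input port and returns a list with one element per output port; the argument names are arbitrary.

```r
# Sketch: the swap expressed as an "Execute R" script rather than Groovy.
# One argument per input port, one list element per output port.
rm_main <- function(first, second) {
  list(second, first)
}

# Stand-alone check outside RapidMiner with two toy data frames.
a <- data.frame(x = 1:2)
b <- data.frame(y = c("p", "q"), stringsAsFactors = FALSE)
out <- rm_main(a, b)
```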

Manipulating times: reference information

This is a post I saved as draft as a work in progress. It's pre version 7 but should still be applicable.

Examples showing attributes being generated

Results

First post of 2016

I haven't posted for a while. This is partly because I am very busy and partly because I haven't created anything with RapidMiner that was new and interesting enough to share. The community license puts certain things off limits, which is a pity since it would be nice to try them out and share the results.

With the release of version 7, I will spend some time reviewing how it is and post things of interest as I find them.

In the meantime, there are a number of posts that I saved as draft that I will revisit and publish. These are all pre version 7 but they should still be applicable. The first is about times and gives some reference information about how to get these behaving as you want. I'll publish it after this one.
