Data Science With RapidMiner: 2016

Friday, 10 June 2016

Using Genetic Algorithms to find global extremes in arbitrary functions

Genetic algorithms can search for the global maxima and minima of arbitrary functions by varying the inputs to those functions. RapidMiner has a number of evolutionary operators. One in particular, "Optimize Parameters (Evolutionary)", varies the parameters of the operators contained within it and calculates a performance vector based on how those varied parameters cause the inner operators to behave. There is a slight difficulty: not all parameters can be exposed for control by the Optimize operator, so a workaround is needed. We'll come back to that shortly.

Firstly, let's choose a function to optimize. A good candidate is the Rastrigin function that, in two dimensions, has the following form.

f(i1, i2) = 20 + i1*i1 + i2*i2 - 10*(cos(2*pi*i1) + cos(2*pi*i2))

Within the range -5.12 to 5.12 for both i1 and i2, the function has many local minima, which makes finding the lowest a challenging problem for some techniques. The following graph shows the function.

By inspection and exploration we can find that the global minimum is 0 when i1 and i2 are both 0. The process to generate this data is here.
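The same inspection can be sketched outside RapidMiner. The following Python snippet is an illustration only and is not the process linked above: it evaluates the function on a regular grid over the range -5.12 to 5.12 and reports the lowest value found. The names rastrigin and grid_minimum, and the grid resolution, are assumptions made for the example.

```python
import math


def rastrigin(i1, i2):
    """Two-dimensional Rastrigin function, as given in the formula above."""
    return (20 + i1 * i1 + i2 * i2
            - 10 * (math.cos(2 * math.pi * i1) + math.cos(2 * math.pi * i2)))


def grid_minimum(steps=128, limit=5.12):
    """Evaluate the function on a (2*steps+1) x (2*steps+1) grid.

    Returns (value, i1, i2) for the lowest value found. The grid size is an
    arbitrary choice for this sketch; it includes the point (0, 0).
    """
    best = None
    for a in range(-steps, steps + 1):
        for b in range(-steps, steps + 1):
            i1 = limit * a / steps
            i2 = limit * b / steps
            value = rastrigin(i1, i2)
            if best is None or value < best[0]:
                best = (value, i1, i2)
    return best


print(grid_minimum())  # prints (0.0, 0.0, 0.0)
```

A grid search like this only works because the function has two inputs and a small range. It is the exhaustive approach that a genetic algorithm tries to avoid.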

Now let's see if we can find this minimum using a RapidMiner process. The process is here.

The first part of the process shown above uses a "Generate Data by User Specification" operator to generate a small example set. The example set needs a label and a regular attribute in order for the "Optimize Parameters (Evolutionary)" operator to work.

The inner operators inside the Optimize operator are shown below.

The two operators labelled i1 and i2 are there purely to allow parameters to be passed from the control part of the Optimize operator. This is the workaround I mentioned earlier. Basically, operators i1 and i2 expose parameters that can be seen from the Optimize operator and can be varied. The parameter settings for the Optimize operator are shown in the following.

This shows that the i1.constant value is allowed to vary between -5.12 and 5.12; the Optimize operator chooses the values as it proceeds.

Returning to the inner operators, the values of the parameters are accessed using a "Generate Attributes" operator. The first two attributes show how to get the values from the i1 and i2 operators. The third attribute calculates the Rastrigin value and the fourth attribute copies this into a result attribute (not strictly necessary - this process used to have other functions which I deleted to make the cut down process for this blog post).

From this point, the "Optimize" operator needs a performance vector to work on. The simplest thing to use is the "Extract Performance" operator to extract a specific value from the example set containing the Rastrigin result. This is shown below and the optimization direction is set to minimize since we are looking for a minimum.

The final "Log" operator records what happens.

If we run this process the Log result is shown in the following plot.

This shows the performance tending to 0, the global minimum, as the process proceeds and this corresponds to i1 and i2 both being 0. Plotting the log result as a scatter plot shows clusters of blue points which show how the genetic algorithm successfully keeps to areas where there are minima.

Genetic algorithms are not guaranteed to find the global minimum and, in fact, even the process in this post occasionally does not find it. If you set the parameter "specify population size" in "Optimize Parameters (Evolutionary)" to a small number like 5, you should observe this: fewer individuals are used to explore the solution space, so the chance of one of them being near the global minimum is lower. Note that a little-known feature of the "Process" operator is that setting "random seed" to -1 causes a new random seed to be selected for each run, so repeated runs generally produce different results.
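To make the moving parts concrete, here is a minimal genetic algorithm in Python that minimizes the same function. It is a sketch under stated assumptions and does not describe how "Optimize Parameters (Evolutionary)" is implemented: the tournament selection, coordinate-swap crossover, Gaussian mutation, elitism, and all function and parameter names are chosen purely for illustration. Passing seed=None picks a fresh seed on each run, which mimics the effect of setting "random seed" to -1.

```python
import math
import random


def rastrigin(i1, i2):
    """Two-dimensional Rastrigin function from the start of this post."""
    return (20 + i1 * i1 + i2 * i2
            - 10 * (math.cos(2 * math.pi * i1) + math.cos(2 * math.pi * i2)))


def evolve(population_size=20, generations=50, crossover_prob=0.9,
           tournament_fraction=0.25, mutation_sigma=0.1, seed=None,
           limit=5.12):
    """Minimize rastrigin with a toy genetic algorithm (illustrative only).

    Returns (best_individual, best_value, history), where history holds the
    best value of the initial population and of every generation after it.
    """
    rng = random.Random(seed)  # seed=None gives a different run each time

    def fitness(individual):
        return rastrigin(*individual)

    def clamp(x):
        return max(-limit, min(limit, x))

    def tournament(population):
        # Pick a random subset and keep its fittest (lowest) member.
        size = max(1, min(len(population),
                          round(tournament_fraction * len(population))))
        return min(rng.sample(population, size), key=fitness)

    population = [(rng.uniform(-limit, limit), rng.uniform(-limit, limit))
                  for _ in range(population_size)]
    history = [min(map(fitness, population))]

    for _ in range(generations):
        # Elitism: the current best survives unchanged, so the best value
        # can never get worse from one generation to the next.
        children = [min(population, key=fitness)]
        while len(children) < population_size:
            a, b = tournament(population), tournament(population)
            # Crossover: take i1 from one parent and i2 from the other.
            child = (a[0], b[1]) if rng.random() < crossover_prob else a
            # Mutation: small Gaussian nudge, kept inside the allowed range.
            child = tuple(clamp(x + rng.gauss(0.0, mutation_sigma))
                          for x in child)
            children.append(child)
        population = children
        history.append(min(map(fitness, population)))

    best = min(population, key=fitness)
    return best, fitness(best), history


if __name__ == "__main__":
    best, value, _ = evolve(seed=None)
    print(best, value)
```

Running evolve(population_size=5) a few times and comparing it with evolve(population_size=50) is a quick way to explore how a small population tends to get stuck in a local minimum more often.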

Of course, there is no way to detect when the global extreme is not found in the more normal case when the answer is not known beforehand. As usual, be sceptical and alive to the possibility that the result is not the best and try different parameter settings to see how results change. As mentioned, the population size affects this as do the "crossover prob" and "tournament fraction" parameters. There is no free lunch of course and an exhaustive search will always take longer.

In summary, we can see that RapidMiner can, with a bit of a workaround, be used to optimize arbitrary functions. My experiments show that it seems to do a good job in most cases. A future post will show the same thing using R.

Sunday, 21 February 2016

Reading from a SQL Server database

Reading from a SQL Server database is easy using R.

Here's a process that shows how to do this from within a RapidMiner process.

Of course, you can use the built-in "Read Database" operator to read from a database, but there are restrictions in the community version. By using R you can partially get round the restrictions, but you should always be aware of your license agreement. Just because you can get round the license does not mean that the terms no longer apply. If you do something that would normally trigger the purchase of an additional license then you still need to buy it. I'm not a lawyer, thankfully, but you have been warned.

Having said that, there are situations where you have to try things to prove viability and get political buy-in before committing to a more serious plan where money is to be spent. Political buy-in, as everyone knows, can sometimes take a very long time and even the most trivial objection can completely derail progress. Being unable to make a full prototype is just such a potential trivial objection.

Having said all of that, the method the process uses here will have some subtle differences in the way it interacts with the database when compared to the "Read Database" operator. This means it might not work for some reason as yet unknown. Simple advice: don't rely on it.

Enough words, on with the process.

The process has two parts; the first sets some macros that are used within the second. It's a little-known fact that you can use macros in this way but it's extremely powerful and allows the code to work in lots of places. The macros themselves are shown in the following table.

Change these to match what you have in your environment. Note that I am using SQL Server authentication so this means you have to set up your environment like this. I am led to believe that built-in authentication is possible but I have not tried it.
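To illustrate the idea of macros parameterising a script, the following Python sketch substitutes values into a template written with RapidMiner's %{name} macro notation and builds a connection URL in the format used by the Microsoft JDBC driver. The macro names (server, port, database) and their values are hypothetical stand-ins and are not necessarily the ones used in the process. The function expand_macros is invented for the example; in the real process RapidMiner does the substitution itself.

```python
import re

# Matches RapidMiner-style macro references such as %{server}.
MACRO_PATTERN = re.compile(r"%\{([^}]+)\}")


def expand_macros(template, macros):
    """Replace each %{name} in template with the value of macros[name]."""
    def lookup(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"undefined macro: {name}")
        return str(macros[name])
    return MACRO_PATTERN.sub(lookup, template)


# Hypothetical macro values; change these to match your environment.
macros = {"server": "localhost", "port": 1433, "database": "master"}
template = "jdbc:sqlserver://%{server}:%{port};databaseName=%{database}"

url = expand_macros(template, macros)
print(url)  # prints jdbc:sqlserver://localhost:1433;databaseName=master
```

Keeping every environment-specific value in one place like this is what lets the same script run unchanged on different machines.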

The R code itself is shown here.

Additional points:

Install the RJDBC package into your environment; rJava may also be required.
Download the Microsoft JDBC drivers from here (note that care is always needed with downloads such as these because the vendors keep changing their Web sites).
If you are running on Ubuntu, the process will still work but some changes are needed, as shown in the R code.
I have not tried it on a Mac.
Change the query to whatever you want. The example here queries the system table.

The end result is an example set. The query shown in the example yields this.

You will see that the attribute names have been created automatically and a basic mapping to types has been done. The following shows part of the statistics for this example set.

One mapping that would need additional downstream work is the create_date attribute. It looks like it has been transformed into a polynominal. Closer inspection would, no doubt, reveal other foibles.

The example set can then be used in the normal way.

In summary, you can see that it is very easy to access SQL Server using R. It is therefore easy to do it from within RapidMiner.

Friday, 19 February 2016

Unit testing RapidMiner processes using R

I've used R a lot over the last few years, partly in my day-to-day activity but also as part of my efforts as co-editor, with Dr. Markus Hofmann, of our recent book "Text Mining and Visualization: Case Studies Using Open-Source Tools", available at all good bookstores ;)

I have also made some R packages and these even contain unit tests implemented with the splendid testthat package. Any minute now, I might even become an R expert.

Unit tests are vital to the health and sanity of the creator of any solution. Once something gets big and is subject to a lot of change, it is extremely important to know whether changes have accidentally broken something. RapidMiner processes are no different and it struck me that it would be worth implementing some basic unit testing for these. It could be done using RapidMiner itself but, for fun, I thought I would use R.

Here is a small example process that illustrates this. It starts by estimating the performance of a classifier on the Iris data set. The estimated performance is converted to an example set using the "Performance to Data" operator and is then passed to the "Execute R" operator.

This is the performance vector.

The R script confirms that all the parts of the performance vector are as expected. It uses a base R function called "all.equal()" to do this (I decided not to use testthat to avoid having to install the library so others can get going more quickly).

Here is the R script.

As you can see, the script checks that all the parts of the example set are as expected. For example, this line

confirms that the values for accuracy and kappa are 0.94 and 0.91 respectively with a tolerance of 0.01. The R script then outputs a data frame with the results. When all is well the result looks like this.

By changing the parameters of the earlier operators, the results will change. This can be picked up by examination of the result. Here is an example when the number of cross validations is set to 3.

In other words, the performance has changed and this gets detected automatically. Obviously, this is a simple example but you can see how it could be extended.
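The shape of such a check can be sketched in a few lines. The following is a Python illustration and not the R script from the process: it compares each performance criterion against an expected value and returns one result row per criterion, much as the R script outputs a data frame. The expected values 0.94 and 0.91 and the tolerance of 0.01 come from the text above; the function name, the row layout, and the "actual" numbers are assumptions made for the example. It uses an absolute tolerance, whereas R's all.equal() works with a relative difference by default, so the two are not exactly equivalent.

```python
import math


def check_performance(actual, expected, tolerance=0.01):
    """Compare actual performance values with expected ones.

    Returns a list of rows, one per expected criterion, each recording
    whether the actual value lies within the absolute tolerance.
    """
    rows = []
    for name, want in expected.items():
        got = actual.get(name)
        passed = got is not None and math.isclose(
            got, want, rel_tol=0.0, abs_tol=tolerance)
        rows.append({"criterion": name, "expected": want,
                     "actual": got, "passed": passed})
    return rows


expected = {"accuracy": 0.94, "kappa": 0.91}

# Hypothetical values standing in for the performance vector.
unchanged = {"accuracy": 0.943, "kappa": 0.912}
changed = {"accuracy": 0.96, "kappa": 0.94}

for row in check_performance(unchanged, expected):
    print(row)
for row in check_performance(changed, expected):
    print(row)
```

Returning a row per criterion, rather than stopping at the first failure, means a single run shows everything that has drifted.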

As I mentioned above, you could do this in pure RapidMiner but it would need quite a number of operators to realise it. The R integration in RapidMiner is relatively easy to use and so it deserves the time of day when considering how to tackle problems.

R also has a tremendous range of packages and a future post will touch cautiously on how to access a database directly.

Sunday, 14 February 2016

Making processes more robust: Confirming attribute types.

If you want a process to be robust and handle inputs gracefully, it makes sense to get it to check that example sets passed to it contain attributes of the correct name and the correct type. This is particularly true for processes exposed using the Server web interface.

Here's a process that takes an example set, determines the types and roles of each attribute, and outputs an example set for further processing. It is then a simple matter to work out if the input contains the required attribute names and types and to take appropriate action.

The key part is some Groovy scripting which works out the Role and Type of each attribute. The following table shows the output of the process.

Examination of the table reveals some interesting things. You can see that a regular string (also known as a polynominal) has a role of "regular" with a Type of 1. Text attributes have Type equal to 5; they are not the same as polynominal. Integers are Type 3. There is no Type 2 in the table. Of course, I could go on and examination of the code would reveal what the Types corresponding to 2, 7 and 8 are.

Once the output is an example set, it is easy to check that it contains attributes with the correct name and type to allow subsequent processing to occur. By using the "Filter Examples" operator it would, for example, be possible to confirm that a label of type integer is present in the input data and report an error if it is not (perhaps using the "Throw Exception" operator). This allows processes to be more robust to the vagaries of wacky input data.
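The check itself is simple once the names, roles and types are available as data. The following Python sketch shows the logic only; it is not the Groovy script and does not use RapidMiner's API. The row layout and the function name validate_attributes are assumptions, and the numeric Type codes (1 for polynominal, 3 for integer, 5 for text) are the ones read from the table above.

```python
def validate_attributes(described, required):
    """Check that every required attribute is present with the right role
    and type. Both arguments are lists of {"name", "role", "type"} rows.

    Returns a list of problem descriptions; an empty list means all is well.
    """
    by_name = {row["name"]: row for row in described}
    problems = []
    for need in required:
        row = by_name.get(need["name"])
        if row is None:
            problems.append(f"missing attribute: {need['name']}")
        elif (row["role"], row["type"]) != (need["role"], need["type"]):
            problems.append(
                f"attribute {need['name']}: expected role {need['role']} "
                f"type {need['type']}, found role {row['role']} "
                f"type {row['type']}")
    return problems


# Hypothetical description of an input example set.
described = [
    {"name": "label", "role": "label", "type": 3},      # integer
    {"name": "comment", "role": "regular", "type": 5},  # text
]
required = [{"name": "label", "role": "label", "type": 3}]

problems = validate_attributes(described, required)
if problems:
    # The counterpart of "Throw Exception" in this sketch.
    raise ValueError("; ".join(problems))
print("input looks fine")
```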

Of course, this is one step along the road of making processes more robust. Another important step is unit testing to ensure that processes don't get damaged by well-intentioned edits. That will be the subject of a future post.

Sunday, 7 February 2016

Outputting two example sets from a Groovy script

Here is a toy process using Groovy that takes two input example sets and simply outputs them again after swapping them.

The key point is that the Groovy code shows how two outputs can be created. This is easy for a developer to do but not immediately obvious to someone with no development expertise.

Manipulating times: reference information

This is a post I saved as draft as a work in progress. It's pre version 7 but should still be applicable.

Examples showing attributes being generated

Results

First post of 2016

I haven't posted for a while. This is partly because I am very busy and partly because I haven't created anything using RapidMiner that was new and interesting enough to share. The community license puts certain things off limits, which is a pity since it would be nice to try them out and share the results.

With the release of version 7, I will spend some time reviewing it and post things of interest as I find them.

In the meantime, there are a number of posts that I saved as draft that I will revisit and publish. These are all pre version 7 but they should still be applicable. The first is about times and gives some reference information about how to get these behaving as you want. I'll publish it after this one.
