
Saturday, 5 October 2013

Bulk export of processes

I am doing something at the moment that requires me to export a load of processes, contained in folders within a single repository, to a single disk location. Rather than do it one by one, which is error-prone and time-consuming, I decided to make a RapidMiner process to do the export.

It turns out that the files in the repository with the extension .rmp are valid XML files that can be imported, so all I did was point a Loop Files operator at the folder where my repository was and look for files ending in .rmp. Inside this loop, I used the Generate Macro operator to generate the new location and the Copy File operator to copy the file from the repository to the new location.

The location of the repository and the location to which the files are copied are defined in macros in the process context. Set these to the values you want. Note that on Windows machines it is necessary to use double backslashes to delimit folders.
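For anyone who wants the same effect outside RapidMiner, here is a minimal Python sketch of the idea: walk the repository folder, pick out the .rmp files and copy them to one place. The function name export_processes and the example paths are my own inventions for illustration, and I am assuming that all files land in one flat target folder, so two processes with the same file name in different folders would overwrite each other.

```python
import shutil
from pathlib import Path


def export_processes(repository_dir, target_dir):
    """Copy every .rmp file found under repository_dir into target_dir.

    Assumption: the export is flat, so only the file name is kept and
    files with the same name in different folders overwrite each other.
    """
    source = Path(repository_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    # rglob searches the folder and all of its sub-folders.
    for process_file in sorted(source.rglob("*.rmp")):
        destination = target / process_file.name
        shutil.copy2(process_file, destination)
        copied.append(destination)
    return copied


# Hypothetical usage; a raw string avoids having to double the backslashes:
# export_processes(r"c:\myRepository", r"c:\myExport")
```

In Python the raw string prefix does the job that the double backslashes do in the process context macros.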

The process is here.

Parsing macros with values containing backslashes

Let's imagine we are looping through some files and we happen to know that the files are buried somewhere in folders called chapterNN where NN is a number. How can we extract information about the location for each file individually?

Within a Loop Files operator, the parent_path macro would take values like this within the loop (I'm assuming Windows obviously).

c:\myGiantFolderStructure\mybook\chapter01
c:\myGiantFolderStructure\mybook\chapter02
...
c:\myGiantFolderStructure\mybook\chapter11

If we wanted to extract the last part of the folder name including the number to use in the loop we could try the Generate Macro operator with the replaceAll function.

The basic idea would be to configure the operator like so.

chapter = replaceAll("%{parent_path}", ".*(chapter.*)", "$1")

This matches the entire macro value with a regular expression but places the word "chapter" and whatever follows it inside a capturing group. The replacement "$1" then substitutes the contents of that group for the entire value of the parent_path macro. The result should be a new macro called chapter with values from chapter01 upwards.

Unfortunately, this doesn't work because the backslashes cause trouble. I'm guessing, but I think the replaceAll function (and its siblings, replace and concat) tries to parse the string within the macro and gets confused by treating the backslash as an escape character.

Fortunately, there is a solution: a little-known function called macro. This function simply returns the string representation of a named macro.

The expression would then look like this.

chapter = replaceAll(macro("parent_path"), ".*(chapter.*)", "$1")

Knowing that backslashes are processed enables us to work out how to pass them: simply escape them. If we felt we needed a more sophisticated match inside the capturing group to ensure we picked up numbers, we would do the following.

chapter = replaceAll(macro("parent_path"), ".*(chapter\\d+)", "$1")

This matches only if there is at least one digit after the word "chapter". The double backslash becomes a single backslash when the regular expression is evaluated, and \d+ means one or more digits.
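The same regular expression can be tried outside RapidMiner. The sketch below uses Python's re module rather than the RapidMiner expression parser, so treat it as an illustration of the pattern only: Python writes the group reference as \1 where replaceAll uses $1, and the function name extract_chapter is made up for this example.

```python
import re

# Raw strings (the r prefix) keep the backslashes in the Windows paths as they are.
paths = [
    r"c:\myGiantFolderStructure\mybook\chapter01",
    r"c:\myGiantFolderStructure\mybook\chapter02",
    r"c:\myGiantFolderStructure\mybook\chapter11",
]

# Two spellings of the same pattern: an ordinary string needs the doubled
# backslash, just like the expression above; a raw string does not.
escaped_pattern = ".*(chapter\\d+)"
raw_pattern = r".*(chapter\d+)"


def extract_chapter(parent_path):
    """Replace the whole path with the captured 'chapter' plus digits."""
    return re.sub(raw_pattern, r"\1", parent_path)


chapters = [extract_chapter(path) for path in paths]
print(chapters)  # ['chapter01', 'chapter02', 'chapter11']
print(escaped_pattern == raw_pattern)  # True
```

A path with no digits after "chapter" does not match at all, so it comes back unchanged rather than shortened.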



Friday, 30 August 2013

Pivoting and De-Pivoting

In response to a comment on this post, I made the following process that creates a simple example set, pivots it and then de-pivots it. The end result: the de-pivot result and the original are the same.

The input example set to the de-pivot operation is shown here.


The key parameters for the de-pivot operator are as follows. The first shows that the de-pivot operation will produce examples with an attribute called "name" with values that are nominal and which will not include any missing values from the input example set.


Each example in the result will also contain another attribute and this is dictated by the following parameters.


The regular expression finds all attributes that match. In this case, there are 4.

The de-pivot operation considers each example in the input example set in turn and combines it with each attribute matched by the regular expression. The intersection of the example and the matched attribute produces a new attribute value whose name is "value" in this case. For the example here, there are 12 possibilities, so the full result would contain 12 examples, but this is normally reduced by clearing the "keep missings" check box. One final point, as mentioned above: the "create nominal index" check box must be set in order to get nominal values in the results.
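Since I always forget how this works, here is a rough Python sketch of the logic as I have described it. It is not the De-Pivot operator itself: the function de_pivot, the attribute names id and att1 to att4, and the sample values are all invented for illustration, and I am assuming the regular expression has to match the whole attribute name.

```python
import re


def de_pivot(examples, attribute_regex, index_name="name",
             value_name="value", keep_missings=False):
    """Turn each (example, matched attribute) pair into one output example."""
    pattern = re.compile(attribute_regex)
    result = []
    for example in examples:
        # Assumption: an attribute is selected only if its whole name matches.
        matched = [a for a in example if pattern.fullmatch(a)]
        # Attributes that do not match are carried over unchanged.
        carried = {a: v for a, v in example.items() if a not in matched}
        for attribute in matched:
            value = example[attribute]
            if value is None and not keep_missings:
                continue  # drop missing values, as with "keep missings" cleared
            row = dict(carried)
            row[index_name] = attribute  # nominal index: the source attribute name
            row[value_name] = value
            result.append(row)
    return result


# Hypothetical input: 3 examples and 4 matching attributes, 2 values missing.
examples = [
    {"id": "a", "att1": 1, "att2": 2, "att3": None, "att4": 4},
    {"id": "b", "att1": 5, "att2": None, "att3": 7, "att4": 8},
    {"id": "c", "att1": 9, "att2": 10, "att3": 11, "att4": 12},
]

all_rows = de_pivot(examples, r"att\d", keep_missings=True)
kept_rows = de_pivot(examples, r"att\d")
print(len(all_rows), len(kept_rows))  # 12 10
```

With the missing values kept there is one output example per combination of input example and matched attribute; clearing the option drops the two combinations that have no value.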

To make the result match the original, there are a couple of sundry operators that rename the values and re-order the attributes.

If I'm honest, I always forget the details of how de-pivoting works so I just adopt the trial and error approach until it looks right.