
Saturday, 5 October 2013

Bulk export of processes

I am doing something at the moment which requires me to export a load of processes, held in folders within a single repository, to a single disk location. Rather than export them one by one, which is error-prone and time-consuming, I decided to make a RapidMiner process to do the export.

It turns out that the files in the repository with the extension .rmp are valid XML files that can be imported, so all I did was point a Loop Files operator at the folder where my repository lives and filter for files ending in .rmp. Inside the loop, I used the Generate Macro operator to build the new location and the Copy File operator to copy each file from the repository to that location.

The location of the repository and the location the files are copied to are defined as macros in the process context. Set these to the values you want. Note that on Windows machines it is necessary to use double backslashes to separate folders.
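For anyone who wants the same effect outside RapidMiner, here is a minimal Python sketch of the loop. It is not the process linked below. The function name export_processes and its two parameters are made up for illustration, and joining the sub-folder names into the file name with underscores is my assumption about how to avoid name clashes in a single target folder.

```python
import shutil
from pathlib import Path


def export_processes(repository_dir: str, export_dir: str) -> list:
    """Copy every .rmp file found under repository_dir into export_dir."""
    source_root = Path(repository_dir)
    target_root = Path(export_dir)
    target_root.mkdir(parents=True, exist_ok=True)

    copied = []
    # sorted() collects the matches up front, before any file is copied.
    for source in sorted(source_root.rglob("*.rmp")):
        # Assumption: fold the sub-folder names into the file name so that
        # two processes with the same name in different folders do not clash.
        relative = source.relative_to(source_root)
        target = target_root / "_".join(relative.parts)
        shutil.copy2(source, target)
        copied.append(target)
    return copied
```

Calling export_processes with the repository folder and the target folder plays the role of the two macros in the process context. In Python, a raw string such as r"c:\some\folder" avoids having to double the backslashes.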

The process is here.

Parsing macros with values containing backslashes

Let's imagine we are looping through some files and we happen to know that the files are buried somewhere in folders called chapterNN where NN is a number. How can we extract information about the location for each file individually?

Within a Loop Files operator, the parent_path macro would take values like this within the loop (I'm assuming Windows obviously).

c:\myGiantFolderStructure\mybook\chapter01
c:\myGiantFolderStructure\mybook\chapter02
...
c:\myGiantFolderStructure\mybook\chapter11

If we wanted to extract the last part of the folder name, including the number, to use in the loop, we could try the Generate Macro operator with the replaceAll function.

The basic idea would be to configure the operator like so.

chapter = replaceAll("%{parent_path}", ".*(chapter.*)", "$1")

This matches the entire macro value with a regular expression and places the word "chapter", and whatever follows it, inside a capturing group. The replacement $1 then substitutes that group for the entire value of the parent_path macro. The result should be a new macro called chapter with values from chapter01 upwards.
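The regular expression itself can be checked outside RapidMiner. The following is a small Python sketch of the same match and replace. The function name last_chapter_folder is made up, and note that Python writes the group reference as \1 where the expression above uses $1.

```python
import re


def last_chapter_folder(parent_path: str) -> str:
    """Replace the whole path with the part starting at 'chapter'."""
    # .* swallows everything up to the word, (chapter.*) captures the rest,
    # and \1 puts only the captured group back.
    return re.sub(r".*(chapter.*)", r"\1", parent_path)


# Raw strings (r"...") keep the backslashes exactly as typed.
paths = [
    r"c:\myGiantFolderStructure\mybook\chapter01",
    r"c:\myGiantFolderStructure\mybook\chapter02",
    r"c:\myGiantFolderStructure\mybook\chapter11",
]
chapters = [last_chapter_folder(p) for p in paths]
print(chapters)
```

This only confirms that the pattern does what is intended when the path arrives intact. It says nothing about how RapidMiner handles the macro text before the pattern is applied.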

Unfortunately, this doesn't work because the backslashes cause trouble. I'm guessing, but I think the replaceAll function (and its siblings, replace and concat) tries to parse the string within the macro and gets confused by treating the backslash as an escape character.

Fortunately, there is a solution: a little-known function called macro. This function simply returns the string representation of a named macro.

The expression would then look like this.

chapter = replaceAll(macro("parent_path"), ".*(chapter.*)", "$1")
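To see why handing the value over through a function can behave differently from pasting it in with %{parent_path}, here is a rough analogy in Python. It is not RapidMiner's parser, and it only illustrates my guess about the cause. The path c:\temp\new is a made-up example, chosen because \t and \n happen to be escape sequences in Python string literals, and both function names are hypothetical.

```python
import ast


def splice_into_literal(value: str) -> str:
    """Paste raw text between quotes, then parse it as a string literal.

    Loosely mimics textual substitution such as "%{parent_path}".
    """
    return ast.literal_eval('"' + value + '"')


def pass_as_value(value: str) -> str:
    """Hand the string over untouched, loosely like macro("parent_path")."""
    return value


path = r"c:\temp\new"  # hypothetical path; the raw string keeps the backslashes

print(repr(splice_into_literal(path)))  # backslash sequences get reinterpreted
print(repr(pass_as_value(path)))        # backslashes survive
```

In the spliced version the \t and \n are read as a tab and a newline, so the folder separators are gone before any regular expression sees them.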

Knowing that backslashes are processed enables us to work out how to pass them: simply escape them. If we felt we needed a more sophisticated match inside the capturing group to ensure we picked up numbers, we would do the following.

chapter = replaceAll(macro("parent_path"), ".*(chapter\\d+)", "$1")

This matches only if there is at least one digit after the word "chapter". The double backslash becomes a single backslash when the regular expression is evaluated, and \d+ means one or more digits.
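The same escaping rule can be seen in Python, which makes for a quick sanity check of the stricter pattern. This is a sketch only: the function name chapter_with_digits is made up, and the second test path (a folder called "chapter" with no number) is hypothetical.

```python
import re

# In an ordinary string literal the doubled backslash collapses to one,
# so both spellings give the regular expression engine the same pattern.
assert ".*(chapter\\d+)" == r".*(chapter\d+)"


def chapter_with_digits(parent_path: str) -> str:
    """Keep only 'chapter' plus its digits; leave non-matching input alone."""
    return re.sub(r".*(chapter\d+)", r"\1", parent_path)


with_number = chapter_with_digits(r"c:\myGiantFolderStructure\mybook\chapter11")
without_number = chapter_with_digits(r"c:\myGiantFolderStructure\mybook\chapter")
print(with_number)
print(without_number)
```

When there is no digit after "chapter" nothing matches, so the path comes back unchanged rather than being shortened.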