It's often required to compare an attribute's value in one example with a value in another example either in the past or in the future.
The Lag Series is the one to use to get previous values brought forward to now. Looking into the future is somewhat harder (and of course you can argue that in a Data Mining context, it's cheating to use future values when these are the things we are trying to predict).
One approach is to go forward in time and use the lag operator to bring forward the values that are now in the past. Then go back in time and use the values brought forward as the new now.
Another approach is to use the Generate Id operator combined with the Join operator. There is a little known parameter called "offset" that allows numerical ids to be generated from a given starting value. If this operator is applied twice with different offsets (with some accompanying gymnastics to make the ids regular attributes with different names) followed by Join using these ids, the result is an example where past and future values can be brought to the present (although some more gymnastics are needed to get the names right).
Here's an example process showing this with the Lag Series operator for comparison (make sure you install the Series extension for the Lag Series operator).
Maybe those nice RapidMiner R&D chaps will add a negative lag :)
Search this blog
Monday, 18 November 2013
Saturday, 5 October 2013
Bulk export of processes
I am doing something at the moment which requires me to to export a load of processes contained in folders within a single repository to a single disk location. Rather than do it one by one which is error prone and time consuming, I decided to make a RapidMiner process to do the export.
It turns out that the files in the repository with the extension .rmp are valid xml files that can be imported so all I did was point a Loop Files operator at the folder where my repository was and looked for files ending in .rmp. Inside this loop, I used the Generate Macro operator to generate the new location and the Copy File operator to copy the file from the repository to the new location.
The location of the repository and the location where the files to be copied are defined in macros in the process context. Set these to the values you want. Note that on Windows machines it is necessary to use double backslashes to delimit folders.
The process is here.
It turns out that the files in the repository with the extension .rmp are valid xml files that can be imported so all I did was point a Loop Files operator at the folder where my repository was and looked for files ending in .rmp. Inside this loop, I used the Generate Macro operator to generate the new location and the Copy File operator to copy the file from the repository to the new location.
The location of the repository and the location where the files to be copied are defined in macros in the process context. Set these to the values you want. Note that on Windows machines it is necessary to use double backslashes to delimit folders.
The process is here.
Parsing macros with values containing backslashes
Let's imagine we are looping through some files and we happen to know that the files are buried somewhere in folders called chapterNN where NN is a number. How can we extract information about the location for each file individually?
Within a Loop Files operator, the parent_path macro would take values like this within the loop (I'm assuming Windows obviously).
c:\myGiantFolderStructure\mybook\chapter01
c:\myGiantFolderStructure\mybook\chapter02
...
c:\myGiantFolderStructure\mybook\chapter11
If we wanted to extract the last part of the folder name including the number to use in the loop we could try the Generate Macro operator with the replaceAll function.
The basic idea would be to configure the operator like so.
chapter = replaceAll("%{parent_path}", ".*(chapter.*)", "$1")
This matches the entire macro with a regular expression but locates the word "chapter" and whatever follows it inside a capturing group. This replaces the entire value of the parent_path macro. The result should be a new macro called chapter with values from chapter01 upwards.
Unfortunately, this doesn't work because the backslashes cause trouble. I'm guessing but I think the replaceAll function (and its siblings, replace and concat) try to parse the string within the macro and get confused by treating the backslash as an escape character.
Fortunately, there is a solution: a little known function called macro. This function simply returns the string representation of a named macro.
The expression would then look like this.
chapter = replaceAll(macro("parent_path"), ".*(chapter.*)", "$1")
Knowing that backslashes are processed enables to us work out how to pass them simply by escaping them. If we felt we needed a more sophisticated match inside the capturing group to ensure we picked up numbers, we would do the following.
chapter = replaceAll(macro("parent_path"), ".*(chapter\\d+)", "$1")
This matches only if there is at least one number after the word "chapter". The double backslash becomes a single backslash when the regular expression is evaluated and \d+ means one or more numbers.
Within a Loop Files operator, the parent_path macro would take values like this within the loop (I'm assuming Windows obviously).
c:\myGiantFolderStructure\mybook\chapter01
c:\myGiantFolderStructure\mybook\chapter02
...
c:\myGiantFolderStructure\mybook\chapter11
The basic idea would be to configure the operator like so.
chapter = replaceAll("%{parent_path}", ".*(chapter.*)", "$1")
This matches the entire macro with a regular expression but locates the word "chapter" and whatever follows it inside a capturing group. This replaces the entire value of the parent_path macro. The result should be a new macro called chapter with values from chapter01 upwards.
Unfortunately, this doesn't work because the backslashes cause trouble. I'm guessing but I think the replaceAll function (and its siblings, replace and concat) try to parse the string within the macro and get confused by treating the backslash as an escape character.
Fortunately, there is a solution: a little known function called macro. This function simply returns the string representation of a named macro.
The expression would then look like this.
chapter = replaceAll(macro("parent_path"), ".*(chapter.*)", "$1")
Knowing that backslashes are processed enables to us work out how to pass them simply by escaping them. If we felt we needed a more sophisticated match inside the capturing group to ensure we picked up numbers, we would do the following.
chapter = replaceAll(macro("parent_path"), ".*(chapter\\d+)", "$1")
This matches only if there is at least one number after the word "chapter". The double backslash becomes a single backslash when the regular expression is evaluated and \d+ means one or more numbers.
Subscribe to:
Posts (Atom)