
Sunday, 12 May 2013

Saving an example set with the details of the process that created it

Often when there is a lot of data to process, it helps to store intermediate results in the repository.

This allows long multi-step processes to proceed through a series of checkpoints so that if an error occurs, you are not forced to go back to the beginning.

Of course, it does require a certain discipline to be clear about what each example set is and where it came from. I often fall into the lazy trap of calling example sets "temp1", "temp2" and so on, which makes it difficult to know what I am dealing with later.

To get round this, I created a Groovy script that writes the entire process XML into a macro. I then use the macro as an annotation on the example set and store the example set in the repository. If I later want to check how I generated the data, I can load it, extract the XML and use it as the basis for recreating the original process.

The Groovy script is only 3 lines long and is shown below.

import com.rapidminer.*;
operator.getProcess().getMacroHandler().addMacro("processXML", operator.getProcess().toString());
return input;

The macro that gets created in this case is called "processXML" and can be referenced like any other macro, for example as %{processXML} in an operator parameter.
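
Once the XML has been pulled back out of the annotation, one way to reuse it is to check that it is well-formed and write it to a file. The following is a minimal sketch in plain Java, not part of the original script. The class name ProcessXmlSaver, the method saveProcessXml and the placeholder XML are all hypothetical, and the sketch assumes the XML is already available as a string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ProcessXmlSaver {

    // Hypothetical helper: checks that the text parses as XML, then writes it to the target file.
    static Path saveProcessXml(String processXml, Path target)
            throws IOException, SAXException, ParserConfigurationException {
        // parse() throws a SAXException if the text is not well-formed XML,
        // so nothing is written in that case.
        DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(processXml)));
        return Files.write(target, processXml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder only; a real value would come from the "processXML" annotation.
        String xml = "<process><operator name=\"Process\"/></process>";
        Path saved = saveProcessXml(xml, Files.createTempFile("recovered", ".xml"));
        System.out.println("Saved to " + saved);
    }
}
```

The well-formedness check is only a guard against a truncated or corrupted annotation; it does not confirm that the XML is a valid process.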


  1. Hi Andrew,

    Would you accept a RapidMiner challenge?
    How would you generate a week-ending date from any given date?

    Assume the week ends on a Sunday.


  2. You would use the week number represented by "w" in the various date parsing functions.

    date_parse_custom(yearWeek,"yyyy w")

    Although there are some gymnastics before then.

    I'll make a separate post with an example.

  3. Hi, how do I save the output or results from my process so that I can access them again later? For example, I might want to see my decision tree again later without running the whole process again.

    1. The Groovy script creates a macro that contains the XML of the process that is currently running, not anything else such as the decision tree model. The process XML can be added as an annotation (with some gymnastics) to data stored in the repository. Later, when retrieving the data, it is possible (again with gymnastics) to get the annotation back. The XML could then be stored as a file and, I suppose, in principle be run.

      To store things in the repository, you don't need to do anything as complicated. The Store operator is your friend.
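
On the week-ending question raised in the comments above: outside RapidMiner's own date functions, the calculation can be sketched in plain Java using the standard java.time classes. This is an illustration only, not the approach promised for the separate post, and the class and method names are hypothetical. As in the question, it assumes the week ends on a Sunday.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

class WeekEnding {

    // Returns the Sunday that ends the week containing the given date.
    // A date that is already a Sunday is returned unchanged.
    static LocalDate weekEnding(LocalDate date) {
        return date.with(TemporalAdjusters.nextOrSame(DayOfWeek.SUNDAY));
    }

    public static void main(String[] args) {
        System.out.println(weekEnding(LocalDate.of(2013, 5, 8)));  // a Wednesday, prints 2013-05-12
        System.out.println(weekEnding(LocalDate.of(2013, 5, 12))); // already a Sunday, prints 2013-05-12
        System.out.println(weekEnding(LocalDate.of(2013, 5, 13))); // a Monday, prints 2013-05-19
    }
}
```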