Tuesday, 25 March 2014

XSLT

I had to transform some XML from one format to another recently for a cool thing that I am doing at the moment - more details soon with luck ;).

The XML in question was spread across a lot of files and I couldn't face editing them by hand, so I decided to embrace the power of XSLT and get the transformation done automatically using RapidMiner.

It turned out to be really quite easy and, in the interests of giving something back, I made a simple version of the process that shows it working.

Here is the process.

This is the input XML (a copy of the XML for a RapidMiner process as it happens)


Close examination shows it contains 5 operators.

The XSLT document input to the "Process XSLT" operator is shown below.


This finds every operator with a name attribute within the input XML. For each one it writes out the name, then finds and writes out the corresponding class attribute, all within a new XML document.
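For anyone who wants to see the selection logic spelled out, here is a rough sketch of the same idea in Python rather than XSLT. It is not the stylesheet used in the process: the sample input is a made-up, cut-down process file, and the output element names (`operators`, `operator`, `name`, `class`) are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down input in the style of a RapidMiner process file.
SAMPLE = """
<process version="5.3">
  <operator class="process" name="Process">
    <process expanded="true">
      <operator class="retrieve" name="Retrieve"/>
      <operator class="select_attributes" name="Select Attributes"/>
    </process>
  </operator>
</process>
"""


def summarise_operators(xml_text):
    """Build a new XML tree holding the name and class of each named operator."""
    source = ET.fromstring(xml_text)
    result = ET.Element("operators")  # output element names are assumptions
    # ".//operator[@name]" selects every descendant operator that has a name attribute.
    for op in source.findall(".//operator[@name]"):
        entry = ET.SubElement(result, "operator")
        ET.SubElement(entry, "name").text = op.get("name")
        ET.SubElement(entry, "class").text = op.get("class")
    return result


summary = summarise_operators(SAMPLE)
print(ET.tostring(summary, encoding="unicode"))
```

The path expression does the same job as an XSLT `for-each` over operators with a name attribute; the loop body plays the part of the template that writes out each name and class.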

The result of this is shown here.


It has correctly found the 5 operators in the original XML document and has written the name and class for each in a new XML document.

Very cool and powerful and it turned out to be quite easy. The hard bit is knowing how to create the XSLT transform.

Saturday, 8 March 2014

The Write Special Format operator

Here is a complicated process that does quite a lot...
Deep breath, here we go...
  1. Loop through all the process files in a repository (make sure you point it to your location) and read each one in as a document.
  2. Convert the documents to normal examples within a single example set.
  3. Create an attribute called `description` that contains the text within the top-level comment for each process. This uses Ninja XPath (actually it doesn't but I wanted to use the word Ninja).
  4. Do gymnastics to reformat the contents of the attribute. This uses Ninja regular expressions (actually it doesn't but the usual rule is that all regular expressions require Ninja-like skills). Newlines and linefeeds are included - this is perhaps the interesting part.
  5. Rename to make the name of the attribute easier to understand.
  6. Select only the attributes of interest.
  7. Filter out all where there is no description.
  8. Write the example set to a file (make sure you set this to somewhere you want to write files). This uses the `Write Special Format` operator which was the only way I could find to get this to work.
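Outside RapidMiner, steps 3, 4, 7 and 8 might look something like the Python sketch below. Treat it as an illustration only: it assumes the top-level comment lives in a `description` element directly under the root operator, the sample process files are invented, and the regular expressions are my guess at a reasonable clean-up rather than the ones used in the process.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical process file whose top-level comment contains escaped html.
WITH_COMMENT = """<process version="5.3">
  <operator class="process" name="Process">
    <description>&lt;h1&gt;Demo&lt;/h1&gt;
      &lt;p&gt;Reads   data.&lt;/p&gt;</description>
    <process expanded="true"/>
  </operator>
</process>"""

# Hypothetical process file with no comment at all.
WITHOUT_COMMENT = """<process version="5.3">
  <operator class="process" name="Process"/>
</process>"""


def top_level_description(xml_text):
    """Step 3: pull out the top-level comment (assumed element name: description)."""
    node = ET.fromstring(xml_text).find("./operator/description")
    return node.text if node is not None and node.text else None


def tidy(text):
    """Step 4: collapse runs of whitespace, then put each block of markup on its own line."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r">\s*<", ">\n<", text)


def build_report(processes):
    """Steps 7 and 8: skip processes with no description and format the rest."""
    chunks = []
    for name, xml_text in processes:
        description = top_level_description(xml_text)
        if description is None:
            continue  # filter out examples without a description
        chunks.append(f"{name}\n{tidy(description)}\n")
    return "\n".join(chunks)


report = build_report([("demo", WITH_COMMENT), ("empty", WITHOUT_COMMENT)])
print(report)
```

The report is built as a string here so the sketch stays self-contained; writing it to a file is one more call to `open(...).write(report)` with a path of your choosing.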
Why?

The comment view can contain HTML. This allows basic formatting and structure to be defined so that when a process is opened, the text in the comment view is displayed with this formatting and structure applied. The process above allows the HTML, tags included, to be extracted so it can be re-used if you want the same formatting and structure somewhere else, like a document or index. Time is precious; anything that allows re-use and avoids typing is good.

The `Write Special Format` operator is especially nice because it allows precise control to be exerted over what is to be written. Dare I say Ninja-like control?

Wednesday, 12 February 2014

A "feature" of the Performance (Regression) operator

I noticed a feature of the Performance (Regression) operator whereby it never reports a value for correlation less than 0. A negative value is what you would expect when a label and prediction are negatively correlated with one another, but the operator reports 0 instead. Furthermore, the squared correlation is also reported as zero in this case.

This example process shows this.

I'm using the Correlation Matrix operator as a way of checking the answer I get. If the sign of the calculation is changed in the Generate Attributes operator you will find that in the negative case the correlation should be -0.763 whilst in the positive it should be 0.706.

The output from the Performance (Regression) operator is 0 for correlation and squared correlation in the negative case. How odd, since there are other criteria that can take negative values, such as Kendall Tau and Spearman Rho.
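As a sanity check away from RapidMiner, here is a small Python sketch of the Pearson correlation on made-up data (not the data from the process above). It shows what a straightforward calculation gives when the prediction is the label with its sign flipped: the correlation is negative and its square is positive, so neither comes out as zero.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented values: the prediction is the label with the sign flipped,
# which mimics changing the sign in a Generate Attributes expression.
label = [1.0, 2.0, 3.0, 4.0, 5.0]
prediction = [-value for value in label]

r = pearson(label, prediction)
r_squared = r * r
print(f"correlation = {r:.3f}, squared correlation = {r_squared:.3f}")
```

Whatever the operator is doing internally, it is evidently not just this calculation followed by a square.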

I can't work out if this is a deliberate feature or some other thing ;)

No matter, at least I know.