Sunday, 1 May 2011

How to read a PMML file to determine the attributes shown on a decision tree

Here is a process to read a PMML file created by RapidMiner.

The PMML file in this case is a model created by running a decision tree algorithm on the iris data set. Here's a snapshot; the highlighted parts help with the explanation later.

The process writes the model to c:\temp\Iris.xml.pmml and then reads it back in (the subprocess operator makes sure the write has finished before the read starts). RapidMiner handles the file as a document.

Next, the process uses the following XPath to split the document into chunks (the leading /xmlns:PMML step is probably not required, because // already searches every descendant).

/xmlns:PMML//xmlns:SimplePredicate

Here xmlns is used as a namespace prefix, and it is bound to the PMML namespace by the following name-value pair in the "cut document" operator.


This value comes from the raw XML (typically the xmlns attribute on the PMML element) and it must match exactly, otherwise the XPath will not find any nodes. The "assume html" checkbox is unchecked.
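Outside RapidMiner, a quick way to check which namespace value a PMML file declares is to look at the root element. Here is a minimal Python sketch of that idea. The trimmed PMML text and the namespace URI in it are assumptions for illustration, and `default_namespace` is a hypothetical helper name; use whatever value your own file declares.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PMML. The namespace URI is an assumption;
# copy the real one from the root element of your own file.
PMML_TEXT = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header/>
</PMML>"""


def default_namespace(xml_text):
    """Return the namespace URI of the root element, or '' if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree stores namespaced tags as '{uri}localname'.
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


namespace_uri = default_namespace(PMML_TEXT)
print(namespace_uri)
```

The printed string is the value that would go on the right-hand side of the name-value pair.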

The XPath itself simply finds all XML nodes somewhere beneath the PMML node that correspond to "SimplePredicate". By inspection of the PMML, this looked to be the correct way of determining the fields used on the decision tree.

The result of this XPath is to create multiple documents each containing a small section of the original document. Each section corresponds to the XML for a specific SimplePredicate entry.
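To make the cutting step concrete, here is a rough Python equivalent using the standard library. It is only a sketch and not what RapidMiner does internally. The sample PMML, its thresholds and class labels, and the namespace URI are invented for illustration. The prefix `pmml` is arbitrary; it plays the same role as the `xmlns` prefix in the XPath above.

```python
import xml.etree.ElementTree as ET

# Hypothetical tree model; structure and values are made up for the example.
PMML_TEXT = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <TreeModel functionName="classification">
    <Node>
      <True/>
      <Node score="Iris-setosa">
        <SimplePredicate field="a3" operator="lessOrEqual" value="2.45"/>
      </Node>
      <Node>
        <SimplePredicate field="a3" operator="greaterThan" value="2.45"/>
        <Node score="Iris-versicolor">
          <SimplePredicate field="a4" operator="lessOrEqual" value="1.75"/>
        </Node>
        <Node score="Iris-virginica">
          <SimplePredicate field="a4" operator="greaterThan" value="1.75"/>
        </Node>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

# Prefix-to-URI binding, the counterpart of the operator's name-value pair.
# The URI is an assumption and must match the one declared in the file.
NAMESPACES = {"pmml": "http://www.dmg.org/PMML-4_0"}

root = ET.fromstring(PMML_TEXT)

# './/' searches every descendant of the root, like '//' in the XPath above.
fragments = root.findall(".//pmml:SimplePredicate", NAMESPACES)

# With the wrong URI nothing matches, which is why the value has to be exact.
wrong = root.findall(".//pmml:SimplePredicate", {"pmml": "http://example.com/wrong"})

print(len(fragments), len(wrong))
```

Each element in `fragments` corresponds to one of the small documents produced by the cut.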

Within the "cut document" operator, the inner operator extracts information using more XPath. This time, the XPath is looking for an attribute named "field" and the XPath to do this is as follows.


A namespace is not required here because it looks like the document fragments don't refer to one.

Next, the process turns the documents into an example set. The number of rows in it equals the number of smaller documents produced by the SimplePredicate XPath query. After some further processing, the final output is a list of the attributes used in the decision tree.

In the example, attributes a3 and a4 can be seen on the decision tree and these are also output as an example set.
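The last steps (read the field attribute from each fragment, build one row per fragment, then reduce to the distinct attributes) can be sketched in Python as well. The fragment strings below are hypothetical stand-ins for the cut documents, with invented operators and values, and `field_of` is a made-up helper name. No namespace is needed for the attribute lookup because an unprefixed XML attribute is not in any namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments standing in for the documents produced by the cut.
FRAGMENTS = [
    '<SimplePredicate field="a3" operator="lessOrEqual" value="2.45"/>',
    '<SimplePredicate field="a3" operator="greaterThan" value="2.45"/>',
    '<SimplePredicate field="a4" operator="lessOrEqual" value="1.75"/>',
    '<SimplePredicate field="a4" operator="greaterThan" value="1.75"/>',
]


def field_of(fragment_text):
    """Return the value of the 'field' attribute, or None if it is absent."""
    return ET.fromstring(fragment_text).get("field")


# One row per fragment, like the example set built from the documents.
rows = [{"field": field_of(text)} for text in FRAGMENTS]

# Keep each attribute once, in order of first appearance.
attributes = list(dict.fromkeys(row["field"] for row in rows))

print(len(rows))
print(attributes)
```

With these four fragments the sketch prints 4 and then ['a3', 'a4'], matching the two attributes seen on the tree.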
