Search this blog

Saturday 25 October 2014

Windowing and Processing Documents

The text mining extension contains an operator called "Window Document". It takes a document that has been split into tokens (typically words) and creates a collection of new documents from it. Each new document contains a fixed number of tokens corresponding to a "window length" parameter and the movement of the window that moves through the document is dictated by a "step size" parameter. A meta data attribute called "window" is created for each new document; this corresponds to the window within the original document.

So for example, this text

"The cat sat on the mat"

could be split into three windows each of size two if window length is set to two and step size is set to two.

window: 0 - "The cat"
window: 2 - "sat on"
window: 4 - "the mat"

Here's a simple process that illustrates windowing and processing. It's worth noting that the "Process Documents" operator is able to take a collection of documents as input. Note that the process uses version 6.1 of RapidMiner studio so some manual version number editing would be needed to run it in older versions. Note too that you must have the Text Processing extension installed.

The process illustrates a tiny pitfall for the unwary. If one of the tokens is "window" and if the parameter "add meta information" is set to true for the "Process Documents" operator, the resulting example set contains an attribute with the name "window_0". This is because the meta data for the window creates a special attribute in the final example set with name "window" and this would clash with the attribute corresponding to the token. If the parameter "add meta information" is set to false, the attribute corresponding to the token is called "window". In other words, the example set changes in a subtle way depending on the setting of a parameter which can lead to problems.

It's a very small point but I happened to stumble over it recently as I was preparing my contribution to an upcoming text mining book. Here's a teaser because it looks nice :). It is comparing three novels by Jane Austen and how the shape of word frequencies varies for consecutive windows through the books.

The red line is for Mansfield Park, the blue is for Sense and Sensibility and the green is for Pride and Prejudice.


No comments:

Post a Comment