
Monday, 15 April 2013

Counting words in lots of documents

In response to a request in a comment on this post, I've modified the process to count total words and unique words across multiple files.

It does this by using the Loop Files operator to iterate over all the files in a folder.


The Loop Files operator outputs a collection of example sets, one per file, and the Append operator joins them into a single example set.

Inside the Loop operator, the Read Document operator reads the current file and converts it into a document.


Words and unique words are counted as before. The final operator then adds an attribute based on the file name, which is held in the macro provided by the outer Loop Files operator.
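For readers without RapidMiner to hand, the same idea can be sketched in Python. This is only an illustration and not the process itself: the function names, the `*.txt` filter, the lower-casing and the letters-only tokenization are all my assumptions.

```python
import re
from pathlib import Path


def count_words(text):
    """Return (total_words, unique_words) for a string.

    Assumption: a word is a run of letters, compared case-insensitively.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return len(tokens), len(set(tokens))


def count_words_in_folder(directory, pattern="*.txt"):
    """Loop over matching files and build one row per file,
    tagging each row with the file name (as the macro does in the process)."""
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        total, unique = count_words(text)
        rows.append({"file": path.name, "total_words": total, "unique_words": unique})
    return rows
```

The list of rows plays the part of the appended example set: one row per file with the file name alongside the two counts.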

An example result looks like this.


Download the process here and set the directory and filter parameters of the Loop Files operator to match the location and files you want.


Friday, 5 April 2013

Finding text needles in document haystacks

I recently had to find how many times a sentence occurred within a large set of documents and, rather than use a search tool or write some software, I used RapidMiner.

Here is the bare-bones XML of the process to do this, with pictures to help explain (the numbers are shown by clicking on the operator execution order within the RapidMiner GUI).


The basic elements are:
  1. A document is created to contain the text-to-look-for - the text needles.
  2. A word list is created from this document using the Process Documents operator.
  3. A document containing text to search through is created - the document haystack.
  4. The haystack document is processed and only the items in the provided word list are included in the resulting document vector. This is set to output term occurrences, so the end result is a count of the number of times the text-to-look-for appeared in the document.

There are some points to note.

The text-to-look-for is given as a parameter of the first Create Document operator (labelled 1 above), shown here.


The document to look in (labelled 3 above) contains a fragment of text copied from page 391 of the RapidMiner manual.


The first process documents operator (labelled 2) itself contains the following operators.


The Tokenize operator treats any character other than an alphanumeric or a space as a token boundary, so each of the provided phrases becomes a single token. The Replace Tokens operator then replaces every space with an underscore to match what the n-gram generation operator will produce later.
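To make the needle handling concrete, here is a rough Python equivalent of those two steps. It is a sketch only: `needle_tokens` is a hypothetical name, and trimming and collapsing stray spaces is my assumption rather than something I have checked against the operators.

```python
import re


def needle_tokens(needle_text):
    """Turn a string of phrases into underscore-joined word-list entries."""
    # Any character that is not a letter, digit or space ends a token,
    # so "text mining, word list" splits into two multi-word phrases.
    parts = re.split(r"[^A-Za-z0-9 ]+", needle_text)
    tokens = []
    for part in parts:
        # Assumption: trim the phrase and collapse repeated spaces.
        phrase = " ".join(part.split())
        if phrase:
            # Spaces become underscores, the form n-gram generation produces.
            tokens.append(phrase.replace(" ", "_"))
    return tokens
```

For example, `needle_tokens("text mining, process documents; word list")` gives `['text_mining', 'process_documents', 'word_list']`.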

The final process documents operator (labelled 4) contains the following operators.


This tokenizes the haystack document but, because it uses the word list from the earlier Process Documents operator, only the items in that word list appear in the final output example set once the n-gram generation operator has combined tokens together.
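The counting step can be sketched in Python along the same lines. Again this is an approximation under my own assumptions: `count_needles` is a hypothetical name, the word list is passed in already joined with underscores, matching is case-sensitive, and n-grams are allowed to span sentence boundaries.

```python
import re


def count_needles(haystack, word_list, max_length=None):
    """Count how often each word-list entry occurs as an n-gram in haystack."""
    wanted = set(word_list)
    if max_length is None:
        # Longest needle, measured in words (underscores + 1).
        max_length = max((w.count("_") + 1 for w in wanted), default=1)

    # Tokenize the haystack on anything that is not a letter or digit.
    tokens = re.findall(r"[A-Za-z0-9]+", haystack)

    counts = {w: 0 for w in wanted}
    for n in range(1, max_length + 1):
        for i in range(len(tokens) - n + 1):
            # Join n neighbouring tokens with underscores.
            gram = "_".join(tokens[i:i + n])
            # Only n-grams present in the word list are kept and counted.
            if gram in wanted:
                counts[gram] += 1
    return counts
```

Called with the haystack `"text mining is fun and text mining is fast, unlike data mining"` and the word list `["text_mining", "data_mining", "word_list"]`, it returns a count of 2 for `text_mining`, 1 for `data_mining` and 0 for `word_list`.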

The end result is shown below.


It shows how many times each text needle appears in the document.

One advantage of this approach is that it seems to execute very quickly.


Wednesday, 6 March 2013

Append converts text attributes to polynominal

I stumbled on a foible of the "Append" operator the other day.

Attributes of type text in the input example sets are converted to type polynominal in the output example set. As a result, subsequent document processing ignores those attributes.

It took me a few minutes to work out why, so I hope this saves others the time. It's easy to fix: simply use the operator "Convert Nominal to Text" on the output example set.

Edit: this has been fixed in the latest release. I didn't think it was dreadfully serious, but thanks to the RM developer chaps anyway.