Search this blog

Tuesday, 25 October 2011

Counting words in a document

Here's an example that counts the total number of words in a document followed by the total number of unique words.

It does this by using the "cut document" operator with the following regular expression.

(\w+)

This splits the document into words and each word is returned as a document based on the result of the capturing group; the brackets define the capturing group; everything inside these is returned as a value. The "Documents To Data" operator converts all the documents, one for each word, into examples in an example set. The text field name is set to "word" and this is used in the later operators.

An "Extract Macro" operator obtains the number of examples. This is the same as the count of all words. An aggregation is performed to count words and another "Extract Macro" operator determines the number of unique words from the resulting example set. These macros are reported as log values using the "Provide Macro As Log Value" operators and the log file is converted to an example set using "Log To Data".

Other regular expressions could be used if you want to ignore numbers and inside the "Cut Document" operator it is possible to have other filtering operators such as stemming.

1 comment:

  1. "a picture is worth a thousand words".
    much better with the xml process.
    but thx anyway

    ReplyDelete