Search this blog

Tuesday 25 October 2011

Counting words in a document

Here's an example that counts the total number of words in a document followed by the total number of unique words.

It does this by using the "cut document" operator with the following regular expression.

(\w+)

This splits the document into words and each word is returned as a document based on the result of the capturing group; the brackets define the capturing group; everything inside these is returned as a value. The "Documents To Data" operator converts all the documents, one for each word, into examples in an example set. The text field name is set to "word" and this is used in the later operators.

An "Extract Macro" operator obtains the number of examples. This is the same as the count of all words. An aggregation is performed to count words and another "Extract Macro" operator determines the number of unique words from the resulting example set. These macros are reported as log values using the "Provide Macro As Log Value" operators and the log file is converted to an example set using "Log To Data".

Other regular expressions could be used if you want to ignore numbers and inside the "Cut Document" operator it is possible to have other filtering operators such as stemming.

Wednesday 12 October 2011

Generate Macro bonus features

I was pleased to discover that some of the functions available in the operator "Generate Attributes" are also available in the "Generate Macro" operator.

For example the following functions work.

concat
contains
matches
index
str
upper
lower
escape_html
replace

and I imagine that similar text processing functions will also work.

I tried date_now() and some other date functions and I got a result that looked like an error (actually quite an interesting error). So I assume date functions cannot be used.