Tuesday, 9 October 2012

Counting words and sentences in documents

I discovered a neat trick for counting the tokens in a document in several different ways within a single "Process Documents" operation.

An example process is here. This calculates details of words and sentences within a small document.

It works by applying the "Tokenize" operator to fresh copies of the document and then using the "Aggregate Token Length" operator to extract various items of meta data relating to the tokens that have been created.
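For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch. It is an illustration only, not how the operators are implemented: the function name `aggregate_token_length`, the regular expressions used for splitting and the sample text are all assumptions of mine.

```python
import re
from statistics import mean


def aggregate_token_length(text, split_pattern):
    """Tokenize a fresh copy of the text and summarise the token lengths.

    Hypothetical helper: a rough stand-in for "Tokenize" followed by
    "Aggregate Token Length", not the RapidMiner implementation.
    """
    # Split on the pattern and drop empty or whitespace-only pieces.
    tokens = [t.strip() for t in re.split(split_pattern, text) if t.strip()]
    lengths = [len(t) for t in tokens]
    return {
        "count": len(tokens),
        "average_length": mean(lengths) if lengths else 0.0,
    }


text = "The cat sat. The dog ran off."

# Each call works on the original text, so one tokenization
# does not affect the other.
sentence_stats = aggregate_token_length(text, r"[.!?]+")
word_stats = aggregate_token_length(text, r"[^A-Za-z]+")
```

Because each call starts again from the original text, the sentence figures and the word figures are calculated independently of each other.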

The following graphic shows the detail within the "Process Documents" operator with the execution order shown.

The execution order is important. The first chain of operators, labelled 2 to 5, extracts information relating to sentences. The resulting sentence tokens are then thrown away, but the meta data is retained. The second chain, labelled 6 to 9, extracts information relating to words. All of the meta data is added to the example set returned by the "Process Documents" operator, but the word vector in that example set is built from whichever tokenization ran last, which is the word tokenization at the end of step 9. This means that meta data relating to sentences can be included alongside word vectors built from words. The same approach can be extended to tokenize in arbitrary ways.
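The ordering can be sketched in Python as follows. This is a rough analogy under my own assumptions, not the RapidMiner process itself: `process_document`, the `meta_` prefix and the regular expressions are hypothetical names and choices made for illustration.

```python
import re
from collections import Counter


def process_document(text):
    """Return one row: word counts plus sentence and word meta data."""
    meta = {}

    # Chain 1: tokenize a fresh copy into sentences and keep only the
    # meta data. The sentence tokens are not used after this point.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    meta["meta_sentence_count"] = len(sentences)
    meta["meta_sentence_avg_length"] = (
        sum(len(s) for s in sentences) / len(sentences) if sentences else 0.0
    )

    # Chain 2: tokenize another fresh copy into words. This runs last,
    # so these tokens are the ones that form the word vector.
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    meta["meta_word_count"] = len(words)
    meta["meta_word_avg_length"] = (
        sum(len(w) for w in words) / len(words) if words else 0.0
    )

    # The row is based on the word tokens, with the meta data from
    # both chains attached alongside.
    row = dict(Counter(words))
    row.update(meta)
    return row


row = process_document("The cat sat. The dog ran off.")
```

Here `row` holds a count for each word together with the sentence-level figures, even though the sentence tokens themselves were discarded. A third chain with a different split pattern could be added before the final one in the same way.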

The result for the example looks like this.

This shows that the example document contains 9 sentences with an average length of 137 characters. In addition, there are 184 words with an average length of 5.527 characters.
