
Monday, 15 April 2013

Counting words in lots of documents

In response to a request in a comment on this post, I've modified the process to count total words and unique words in each of several files.

It does this by using the Loop Files operator to iterate over all the files in a folder.


The Loop Files operator outputs a collection of example sets, one per file, and the Append operator joins them into a single example set.

Inside the Loop operator, the Read Document operator reads the current file and converts it into a document.


Words and unique words are counted as before, and the final operator adds an attribute based on the file name, which it reads from the macro provided by the outer Loop Files operator.

An example result looks like this.


Download the process here and set the directory and filter parameters of the Loop Files operator to the location you want.
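For readers who want to see the same steps outside RapidMiner, here is a rough Python sketch of the idea: loop over the files in a folder, read each one, count total and unique words, and tag each row with the file name. It is not the downloadable process. The function names, the column names, and the rule that a word is a case-insensitive run of letters are all assumptions made for illustration, and the glob pattern only stands in for the filter parameter.

```python
import re
from pathlib import Path


def count_words(text):
    """Return (total_words, unique_words) for one document."""
    # Assumption: a word is a run of ASCII letters, compared case-insensitively.
    words = re.findall(r"[A-Za-z]+", text.lower())
    return len(words), len(set(words))


def count_folder(directory, pattern="*.txt"):
    """Loop over the matching files and build one row per file."""
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="ignore")
        total, unique = count_words(text)
        # The file name plays the role of the attribute added from the macro.
        rows.append({"file": path.name, "total_words": total, "unique_words": unique})
    return rows
```

Calling `count_folder` on a folder returns a list with one dictionary per file, which corresponds loosely to the single example set produced after the Append step.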


4 comments:

  1. Thanks Andrew for this new post.
    How can we do arithmetic operations on the results?
    For example, how do we compute (uniquewords/totalwords) and get the resulting value for each document?

    Dev

    Replies
    1. Hello Dev

      You can use the Generate Attributes operator to calculate these. This operator will automatically calculate the value for each example in the example set.

      regards

      Andrew

  2. Exactly what I was looking for, thanks Andrew.
