Search this blog

Saturday, 26 November 2011

Counting words in a document: a better way

A much easier way to count words in a document compared to this previous post is to use the word list that is generated by text processing operators.

This example shows this. It uses the "Wordlist to Data" operator and then does some light gymnastics to calculate the sum and count of words to produce the desired results.

Tuesday, 22 November 2011

Normalizing Rows

The "Normalize" operator normalizes each numerical column (called an attribute in RapidMiner's terminology) within an example set to the desired range. Sometimes, you might want to normalize all the numerical values within a row (called an example within an example set in RapidMiner's terminology). I had to do this while understanding how term frequencies are calculated during document vector creation.

A nifty trick is to transpose the example set, normalize and then transpose it back again and the process here shows a very simple example.

This also shows the result of a document processing step which results in term frequencies for comparison. I recently found that RapidMiner performs a cosine normalization when producing term frequencies (it divides by the square root of the sum of the squares of the frequencies within a row which is equivalent to a document) and I wanted to see what differences show up when the sum of the frequencies is used instead (answer: not much with the data I was playing with).

Edit: added RapidMiner equivalent terminology for rows and columns