Search this blog

Thursday, 25 October 2012

Applying the same processing to multiple example sets

The picture below shows a process that computes multiple different aggregations of an example set.


When doing exploratory data analysis it is often useful to see what the data looks like from many different angles and part of this can involve generating new attributes based on existing attributes.

This graphic shows one of the results where two range attributes are calculated based on the difference between the minimum and maximum for the age and earnings attributes for each grouping.


With aggregation, the names of the generated attributes contain parentheses that indicate how the attribute was generated. In the example, the attribute name for the average of the ages for a particular group is "average(age)". This is fine until you want to calculate something from this attribute at which point, the parentheses make using of the "Generate Attributes" operator difficult. This is still OK because you have to rename the attribute to remove these parentheses but doing this many times on different example sets soon becomes onerous and error prone especially if you have to make changes later on.

To help with this, I created this process. This contains three aggregations being performed for different attribute groupings. The attributes "age" and "earnings" are aggregated so that minimum and maximum values are calculated. For these aggregated values, the differences are calculated to produce ranges. In order to avoid having to cut and paste the operator chain to perform these calculations, the process allows them to be defined once so that each example set is applied to this. By doing this, the overhead of maintaining multiple copies is reduced as well as possibility of making mistakes.

This picture shows what is inside the subprocess.


This picture shows what is inside the "Loop Collection" operator



The process works as follows.

  • The three aggregated example sets are connected to a subprocess 
  • Inside this subprocess, the "Collect" operator creates a collection from them
  • The "Loop Collection" operator iterates over all the members: in this case three times
  • The inner operators to the "Loop Collection" operator perform the renaming and attribute generation
  • The "Multiply" operator creates three copies of the collection
  • The "Select" operators choose one of the examples to pass to the output ports of the subprocess
The end result is new attributes in the example sets all generated in the same way. Changes to this calculation can be done in one place thereby making life a bit easier.

Tuesday, 9 October 2012

Counting words and sentences in documents

I discovered a neat trick to allow the number of tokens within a document to be calculated and extracted in many different ways in one "Process Documents" operation. 

An example process is here. This calculates details of words and sentences within a small document.

It works by applying the "Tokenize" operator to fresh copies of the document and then using the "Aggregate Token Length" operator to extract various items of meta data relating to the tokens that have been created.

The following graphic shows the detail within the "Process Documents" operator with the execution order shown.

The execution order is important. The first chain of operators labelled from 2 to 5 extracts information relating to sentences. The resulting tokens are then thrown away but the meta data is retained. The second chain from 6 to 9 extracts information relating to words. The meta data is added to the example set returned by the "Process Documents" operator but the example set will be based on the word tokenization at the end of step 9. This means that meta data relating to sentences can be included in word vectors based on words. This can be extended to tokenize in arbitrary ways.

The result for the example looks like this.


This shows there are 9 sentences each of length 137 characters in the example document. In addition there are 184 words of average length 5.527 characters.