Data Science With RapidMiner: Rename

Sunday, 21 April 2013

Rename by generic names and creating a model: which attributes are used?

The Rename by Generic Names operator renames attributes so that they follow a simple naming convention: a generic stem followed by an incrementing counter. This is useful, for example, to get rid of punctuation or mathematical symbols in attribute names that would otherwise prevent the Generate Attributes operator from working.

As an example, suppose you had regular attributes like these:

Average(height)
Minimum(height)
Variance(height)

The rename would yield something like this:

att1
att2
att3

These names are easier to use in expressions but harder to understand.
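For readers who think in code, the naming convention can be sketched in a few lines of Python. This is only an illustration of the idea, not how RapidMiner implements the operator; the function name `rename_by_generic_names` and the default stem `att` are assumptions chosen to match the example above.

```python
def rename_by_generic_names(attribute_names, stem="att"):
    """Map each original name to the stem plus a counter that starts at 1."""
    return {
        name: f"{stem}{index}"
        for index, name in enumerate(attribute_names, start=1)
    }


original = ["Average(height)", "Minimum(height)", "Variance(height)"]
mapping = rename_by_generic_names(original)

# Apply the mapping in the original attribute order.
renamed = [mapping[name] for name in original]
print(renamed)  # ['att1', 'att2', 'att3']
```

Keeping the mapping around is one way to recover the understandable names later on.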

The other day, I stumbled on an odd side effect of this when building linear regression models on renamed attributes. Fortunately, I don't think it's a problem but there's a workaround anyway.

Firstly then, here is an ultra simple process that builds a linear regression model on some fake data which has had its attributes renamed generically. The example set it produces looks like this.

The regression model looks like this.

How odd; the names of the attributes before the rename have been used to describe the model. This causes confusion but as far as I can tell the models and weights seem to be fine when they are used in a process. The names are in fact still in the example set and can be seen from the Meta Data View by showing the Constructions column. This points to using the Materialize Data operator as a workaround. By adding this operator just after the rename, the model comes out as follows.

Less confusing for a human.

Thursday, 25 October 2012

Applying the same processing to multiple example sets

The picture below shows a process that computes multiple different aggregations of an example set.

When doing exploratory data analysis it is often useful to see what the data looks like from many different angles and part of this can involve generating new attributes based on existing attributes.

This graphic shows one of the results where two range attributes are calculated based on the difference between the minimum and maximum for the age and earnings attributes for each grouping.

With aggregation, the names of the generated attributes contain parentheses that indicate how the attribute was generated. In the example, the attribute name for the average of the ages for a particular group is "average(age)". This is fine until you want to calculate something from this attribute, at which point the parentheses make using the "Generate Attributes" operator difficult. This is still manageable because you can rename the attribute to remove the parentheses, but doing this many times on different example sets soon becomes onerous and error-prone, especially if you have to make changes later on.

To help with this, I created this process. It contains three aggregations, each performed for a different attribute grouping. The attributes "age" and "earnings" are aggregated so that minimum and maximum values are calculated. For these aggregated values, the differences are calculated to produce ranges. To avoid having to cut and paste the operator chain that performs these calculations, the process defines the chain once and passes each example set through it. This reduces the overhead of maintaining multiple copies as well as the possibility of making mistakes.

This picture shows what is inside the subprocess.

This picture shows what is inside the "Loop Collection" operator.

The process works as follows.

The three aggregated example sets are connected to a subprocess
Inside this subprocess, the "Collect" operator creates a collection from them
The "Loop Collection" operator iterates over all the members: in this case three times
The operators inside the "Loop Collection" operator perform the renaming and attribute generation
The "Multiply" operator creates three copies of the collection
The "Select" operators each choose one of the example sets to pass to the output ports of the subprocess

The end result is new attributes in the example sets all generated in the same way. Changes to this calculation can be done in one place thereby making life a bit easier.
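The same "define once, apply to every example set" idea can be sketched outside RapidMiner in plain Python. This is a rough analogue under stated assumptions, not a translation of the process: each example set is modelled as a list of dictionaries, the aggregated attribute names "minimum(age)", "maximum(age)" and so on are assumed by analogy with "average(age)", and the three groupings and their values are made up for illustration.

```python
import re


def clean_name(name):
    """Turn 'minimum(age)' into 'minimum_age' by replacing parentheses."""
    return re.sub(r"[()]", "_", name).strip("_")


def add_ranges(example_set, attributes=("age", "earnings")):
    """Rename aggregated attributes and add a range_<attribute> for each one."""
    processed = []
    for row in example_set:
        new_row = {clean_name(key): value for key, value in row.items()}
        for attribute in attributes:
            new_row[f"range_{attribute}"] = (
                new_row[f"maximum_{attribute}"] - new_row[f"minimum_{attribute}"]
            )
        processed.append(new_row)
    return processed


# Three hypothetical aggregated example sets, one row per group.
by_gender = [{"gender": "f", "minimum(age)": 20, "maximum(age)": 60,
              "minimum(earnings)": 100, "maximum(earnings)": 900}]
by_region = [{"region": "north", "minimum(age)": 25, "maximum(age)": 55,
              "minimum(earnings)": 200, "maximum(earnings)": 700}]
by_job = [{"job": "clerk", "minimum(age)": 18, "maximum(age)": 65,
           "minimum(earnings)": 150, "maximum(earnings)": 400}]

# Collect the example sets, loop over them, then pick each result out again.
collection = [by_gender, by_region, by_job]
processed_sets = [add_ranges(example_set) for example_set in collection]
by_gender_out, by_region_out, by_job_out = processed_sets
```

The list plays the role of the collection, the list comprehension stands in for "Loop Collection", and the final unpacking corresponds to the "Select" operators. A change to `add_ranges` affects all three results at once.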

Wednesday, 1 August 2012

Deleting attributes with 2 valid values and the rest missing

For some reason the other day, I can't remember why, I had to delete attributes with two valid values and all the rest missing. It followed on from this process.

So I made this process that finds all attributes with a specific number of valid values and removes them (the attributes).

This example uses sample data, so it deletes attributes with 7 valid values, but I'm sure you'll get the idea.
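As a rough illustration of the logic only, here is a minimal Python sketch. It assumes the data is a list of dictionaries with `None` standing in for a missing value; the function name `drop_attributes_with_valid_count` and the tiny data set are hypothetical and are not taken from the RapidMiner process.

```python
def drop_attributes_with_valid_count(rows, valid_count):
    """Remove every attribute that has exactly `valid_count` non-missing values."""
    names = list(rows[0].keys()) if rows else []
    # Count the non-missing (not None) values of each attribute.
    counts = {
        name: sum(1 for row in rows if row.get(name) is not None)
        for name in names
    }
    keep = [name for name in names if counts[name] != valid_count]
    return [{name: row.get(name) for name in keep} for row in rows]


rows = [
    {"id": 1, "a": 5.0, "b": None},
    {"id": 2, "a": None, "b": None},
    {"id": 3, "a": 7.0, "b": 1.0},
    {"id": 4, "a": None, "b": None},
]

# "a" has exactly two valid values, so it is removed; "id" and "b" stay.
cleaned = drop_attributes_with_valid_count(rows, valid_count=2)
```

Changing `valid_count` is all it takes to target a different number of valid values.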
