Showing posts with label Weights. Show all posts

Tuesday, 24 February 2015

Finding those useless attributes and making sure they are really useless

The "Remove Useless Attributes" operator does what it says and removes attributes that are useless. The default for numbers is to remove those that have zero deviation. This is fair enough since it means these attributes are the same for all examples; there's nothing they are bringing to the party. For nominal values, the default is to remove an attribute where all its values are the same. Again, fair enough.
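Those two defaults can be sketched in plain Python (pandas standing in for a RapidMiner example set; the data is invented for illustration):

```python
import pandas as pd

def find_useless(df):
    """Flag numeric columns with zero deviation and nominal columns
    with a single distinct value, mirroring the operator's defaults."""
    useless = []
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            if s.std() == 0:        # same number in every example
                useless.append(col)
        elif s.nunique() == 1:      # same nominal value in every example
            useless.append(col)
    return useless

df = pd.DataFrame({
    "att1": [1.0, 2.0, 3.0],   # varies: kept
    "att2": [5.0, 5.0, 5.0],   # zero deviation: useless
    "att3": ["a", "a", "a"],   # single nominal value: useless
    "att4": ["a", "b", "a"],   # varies: kept
})
print(find_useless(df))  # ['att2', 'att3']
```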

What if you want to know which attributes were removed? You might ask why, and that's a good question. All I can say is that there are situations where no one will believe you. The conversation goes like this.

"Where are those attributes that I lovingly made?"
"They don't add any value"
"What?! Noooo"

Anyway, you get the picture.

Here's a process that finds the useless attributes and outputs an example set so that you can confirm that they really should be allowed to leave.

It uses the "Data to Weights" operator on the example set after the useless attributes have been sent home. The "Select by Weights" operator is then applied to the original example set containing all the attributes, but with the "weight relation" parameter set to less than 1.0 and, crucially, "deselect unknown" unchecked. This has the nice effect that the returned example set contains exactly the attributes that were marked as useless.
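The trick can be sketched in plain Python (a stand-in for the operators, with invented attribute names): surviving attributes get weight 1.0, so selecting "weight less than 1.0" while keeping unknown-weight attributes returns exactly the removed ones.

```python
def select_by_weights(all_attrs, weights, threshold=1.0, deselect_unknown=False):
    """Keep attributes whose weight is below the threshold; attributes
    with no weight at all are kept when deselect_unknown is False."""
    selected = []
    for attr in all_attrs:
        w = weights.get(attr)
        if w is None:
            if not deselect_unknown:   # "deselect unknown" unchecked
                selected.append(attr)
        elif w < threshold:
            selected.append(attr)
    return selected

all_attrs = ["att1", "att2", "att3", "att4"]
# "Data to Weights" on the cleaned set: each survivor weighs 1.0
weights = {"att1": 1.0, "att4": 1.0}
print(select_by_weights(all_attrs, weights))  # ['att2', 'att3']
```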

Sunday, 21 July 2013

Scaling attribute values using weights

Here's a process that multiplies each value of an attribute in one example set by a constant held in another example set. The constants are specific to each attribute and are derived from attribute weights. In effect, a multiplication by a diagonal matrix is happening.

At a high level, the process works as follows.

  1. The Iris data set is used with weights being produced using "Weight By Information Gain"
  2. These weights are transformed into an example set and stored for later use inside a Loop operator
  3. A subprocess is used to make sure everything works in the right order (this technique is also used inside the Loop).
  4. A "Loop Attributes" operator iterates over all attributes and generates a new attribute based on multiplying the existing value by a weight. The attribute name is required to be contained in the weights example set. 
  5. The weight for each example is calculated with a combination of filtering and macro extraction.
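The steps above can be sketched compactly in Python (pandas standing in for the operators; the attribute names and weight values are invented rather than real "Weight by Information Gain" output):

```python
import pandas as pd

data = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width":  [3.5, 3.0],
})
# Step 2: the weights as an example set, one row per attribute
weights = pd.DataFrame({
    "attribute": ["sepal_length", "sepal_width"],
    "weight":    [0.5, 2.0],
})

# Steps 4 and 5: loop over attributes, look up each attribute's
# weight by name (the filter + macro extraction), and scale the column
scaled = data.copy()
for attr in scaled.columns:
    w = weights.loc[weights["attribute"] == attr, "weight"].iloc[0]
    scaled[attr] = scaled[attr] * w

print(scaled)
```

The loop body is where the "attribute name must appear in the weights example set" requirement shows up: a name with no matching row would make the lookup fail.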

Saturday, 29 January 2011

Creating test data with attributes that match the training data

When applying a model to a test set, it is (usually) important that the number and names of attributes match those used to create the model.

When extracting features from data where the attribute names depend on the data, it can often be the case that the test data lacks some of the attributes the model expects and also contains additional ones.

The example here shows the following
  1. Training data with attributes att2 to att10 and a label
  2. Test data with attributes att1 and att2
  3. Attribute att1 is removed from the test data by using weights from the training data
  4. Attributes att2 to att10 and the label are added to the test data by using a join operator
  5. The resulting test data contains attributes att2 to att10 and a label. Only att2 has values; all others are missing
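The steps above can be sketched in pandas (a stand-in for the weight-based selection and the join, using the same invented attribute names):

```python
import pandas as pd

train_cols = [f"att{i}" for i in range(2, 11)] + ["label"]
test = pd.DataFrame({"att1": [1.0], "att2": [2.0]})

# Step 3: drop test attributes unknown to the training data
# (the "select by weights from training" part)
test = test[[c for c in test.columns if c in train_cols]]

# Steps 4 and 5: reindex plays the role of the join, adding
# att3 to att10 and the label as all-missing columns
test = test.reindex(columns=train_cols)

print(list(test.columns))       # att2..att10 plus label
print(test.isna().sum().sum())  # 9 missing cells: everything except att2
```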