Tuesday, 24 February 2015

Finding those useless attributes and making sure they are really useless

The "Remove Useless Attributes" operator does what it says and removes attributes that are useless. The default for numbers is to remove those that have zero deviation. This is fair enough since it means these attributes are the same for all examples; there's nothing they are bringing to the party. For nominal values, the default is to remove an attribute where all its values are the same. Again, fair enough.

What happens if you remove some attributes and you want to know which ones? You might ask why and that's a good question. All I can say is that it turns out that there are situations where no one will believe you. The conversation goes like this.

"Where are those attributes that I lovingly made?"
"They don't add any value"
"What?! Noooo"

Anyway, you get the picture.

Here's a process that finds the useless attributes and outputs an example set so that you can confirm that they really should be allowed to leave.

It uses the "Data to Weights" operator on the example set after the useless attributes have been sent home, which gives every surviving attribute a weight of 1.0. The "Select by Weights" operator is then applied to the original example set containing all the attributes, but with the "Weight Relation" set to be less than 1.0 and, crucially, with "deselect unknown" unchecked. The surviving attributes fail the "less than 1.0" test, while the removed ones have no weight at all and so are kept. This has the nice effect that the returned example set contains exactly the attributes that were marked as useless.
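For readers who think better in code, here is a minimal Python sketch of the same logic. It is not RapidMiner code; the function names `remove_useless` and `select_removed` and the sample data are made up for illustration, and the "useless" test only mirrors the defaults described above (zero deviation for numbers, a single value for nominals).

```python
from statistics import pstdev


def remove_useless(rows):
    """Drop attributes whose values never vary across the examples."""
    keep = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            varies = pstdev(values) > 0      # numerical: non-zero deviation
        else:
            varies = len(set(values)) > 1    # nominal: more than one distinct value
        if varies:
            keep.append(name)
    return [{name: row[name] for name in keep} for row in rows]


def select_removed(original, cleaned):
    """Return the original rows restricted to the attributes that were removed."""
    # "Data to Weights" analogue: every surviving attribute gets weight 1.0.
    weights = {name: 1.0 for name in cleaned[0]}
    # "Select by Weights" analogue: keep weights below 1.0, and also keep
    # attributes with no weight at all ("deselect unknown" unchecked).
    selected = [n for n in original[0] if n not in weights or weights[n] < 1.0]
    return [{n: row[n] for n in selected} for row in original]


# Hypothetical example set: "constant" and "country" never vary.
examples = [
    {"id": 1, "constant": 5.0, "colour": "red", "country": "UK"},
    {"id": 2, "constant": 5.0, "colour": "blue", "country": "UK"},
    {"id": 3, "constant": 5.0, "colour": "red", "country": "UK"},
]
removed_view = select_removed(examples, remove_useless(examples))
print(removed_view[0])
```

If the attributes with no weight were deselected instead, `selected` would come back empty, which is why the unchecked box matters.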

Thursday, 8 January 2015

RapidMiner Server and Elasticsearch with Lucene

Elasticsearch, Logstash and Kibana: a set of most excellent open source tools that are very good at consolidating log files and other data into a central location (the Logstash part), storing and indexing them to make a scalable search platform (Elasticsearch and Lucene) and providing a neat Web front end (Kibana).

RapidMiner Server produces log files and, when processes are running, errors can sometimes be hard to find. I decided to import the server log files into Elasticsearch to see if the resulting Lucene search capability could speed me up a bit.

I am not able to share the exact technical details but it is relatively easy and involves using Logstash Forwarder on the RapidMiner Server host machine. Logstash Forwarder is set up to monitor the RapidMiner Server log file and will forward any new lines to the Logstash server.

Logstash is set up to receive all events and applies filtering to combine some log entries into single events. The Multiline rule I used is that any line that does not begin with a valid timestamp should be included with the previous line (hints: the regular expression pattern is "^%{TIME}", "negate" is true and "what" is set to "previous"). This step alone has a tremendous benefit as it neatens up the log file so that each event now has a timestamp and sorting by time will not miss any straggler or blank lines.
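To make the rule concrete, here is a small Python sketch of the same idea, not the Logstash implementation. The regular expression is only a rough stand-in for the Grok %{TIME} pattern, and the sample log lines are invented rather than taken from a real RapidMiner Server log.

```python
import re

# Assumption: an event starts with HH:MM:SS, a simplified stand-in for %{TIME}.
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}")


def merge_multiline(lines):
    """Attach every line that lacks a leading timestamp to the previous event."""
    events = []
    for line in lines:
        if TIMESTAMP.match(line) or not events:
            events.append(line)           # a new event begins here
        else:
            events[-1] += "\n" + line     # negate=true, what=previous
    return events


# Invented sample: a stack trace and a blank line follow the second entry.
log_lines = [
    "10:15:01,123 INFO Starting process",
    "10:15:02,456 ERROR Process failed",
    "java.lang.RuntimeException: something broke",
    "    at com.example.Hypothetical.run(Hypothetical.java:42)",
    "",
    "10:15:03,789 INFO Marking process as completed with exception",
]
events = merge_multiline(log_lines)
print(len(events))
```

The six input lines collapse into three events, each beginning with a timestamp, and the stack trace travels with the error it belongs to.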

Once in Elasticsearch, the events have become documents and the Kibana Web interface can be used to search them. One of the cool things is that the Lucene information retrieval library is built into Elasticsearch. This means it is possible to do queries like this.

"Marking successfully" ~ 3

This matches if the two words are within three words of each other.

It turns out that RapidMiner Server reports this if a process succeeds...

"Marking process as successfully completed"

and reports this if a process failed...

"Marking process as completed with exception"

This allows two queries to be defined, one per message, to filter out everything except the log lines that report how a process finished.
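As a rough illustration of what the proximity operator is doing, here is a simplified Python sketch. It is not Lucene's algorithm: it only handles two words appearing in the given order and counts the words sitting between them, whereas Lucene's slop also covers reordering. The function name `within_slop` is made up, and the second query, for the failure message, is my own assumption rather than one quoted above.

```python
import re


def within_slop(text, first, second, slop):
    """True if `second` follows `first` with at most `slop` words in between."""
    words = re.findall(r"\w+", text.lower())
    first_at = [i for i, w in enumerate(words) if w == first.lower()]
    second_at = [i for i, w in enumerate(words) if w == second.lower()]
    # j - i - 1 is the number of intervening words; 0 means an exact phrase.
    return any(0 <= j - i - 1 <= slop for i in first_at for j in second_at)


success = "Marking process as successfully completed"
failure = "Marking process as completed with exception"

# The query from the text: "Marking successfully" ~ 3
print(within_slop(success, "Marking", "successfully", 3))
print(within_slop(failure, "Marking", "successfully", 3))

# Hypothetical counterpart for failures: "Marking exception" ~ 5
print(within_slop(failure, "Marking", "exception", 5))
```

In the success message, two words separate "Marking" from "successfully", so a slop of three is enough. In the failure message, four words separate "Marking" from "exception", hence the larger value in the assumed second query.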

In fact, here's a screenshot showing the results of me running some RapidMiner Server tests where some fail and some succeed.


The histogram view shows the count of the matched events aggregated by minute and the table view gives the log details.

I can safely say that no developer was harmed in the making of this dashboard.

All in all, it makes it a bit easier to spot when something has gone wrong but of course, I am only scratching the surface of what is possible. Elasticsearch has an impressive array of text indexing capabilities and it has a completely open JSON interface. I could imagine connecting the output of Elasticsearch to RapidMiner and making models to prescribe some corrective medicine when a problem is detected in RapidMiner Server. As time and motivation permit I will attempt this, although it might become too cool to share for free ;)


Thursday, 1 January 2015

English stop words

I wondered recently what exact word list is used by the Filter Stopwords (English) operator.

I consulted the code for version 5.3 and I made the following file. It contains 395 words in total, including some interesting ones like "wert" (an archaic second-person singular past form of "to be", as in "thou wert", found in Shakespearean English) and "summat" (Yorkshire dialect for "something"). I'm not sure these would always be stop words.

I assume the operator hasn't changed in the latest version but if it has, the list can be used with the Filter Stopwords (Dictionary) operator to make a facsimile of version 5.3. It would also be possible to use the list as the basis for your own stop word filtering operator, and publishing the exact list used would make research more reproducible.
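If you wanted to roll your own filter outside RapidMiner, a minimal Python sketch might look like the following. The word set here is a tiny illustrative sample, not the 395-word list; only "wert" and "summat" are confirmed above, the other entries are assumed. Matching case-insensitively is also my assumption, not something I checked in the operator.

```python
def filter_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stop word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]


# Illustrative sample only; a real run would load the full list, one word per line.
stopwords = {"the", "is", "wert", "summat"}

tokens = "There is summat wrong with the server".split()
filtered = filter_stopwords(tokens, stopwords)
print(filtered)
```

With this sample set, "is", "summat" and "the" are dropped and the remaining four tokens are kept in their original order.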