Thursday, 8 January 2015

RapidMiner Server and Elasticsearch with Lucene

Elasticsearch, Logstash and Kibana are a set of most excellent open source tools. Logstash consolidates log files and other data into a central location, Elasticsearch (built on Lucene) stores and indexes them to make a scalable search platform, and Kibana provides a neat Web front end.

RapidMiner Server produces log files, and errors from running processes can be hard to find in them, so I decided to import the server log files into Elasticsearch to see whether the resulting Lucene search capability could speed me up a bit.

I am not able to share the exact technical details but it is relatively easy and involves using Logstash Forwarder on the RapidMiner Server host machine. Logstash Forwarder is set up to monitor the RapidMiner Server log file and will forward any new lines to the Logstash server.

Logstash is set up to receive all events and to apply filtering that combines some log entries into single events. The Multiline rule I used is that any line that does not begin with a valid timestamp should be included with the previous line (hint: the regular expression pattern is "^%{TIME}", "negate" is true and "what" is set to "previous"). This step alone has a tremendous benefit as it neatens up the log file so that each event now has a timestamp and sorting by time will not miss any straggler or blank lines.
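To make the rule concrete, here is a minimal Python sketch of the same merging logic. It is not the Logstash configuration itself: the regular expression is only a stand-in for the "%{TIME}" pattern, and the function name and the assumed HH:MM:SS timestamp shape are illustrative assumptions rather than details of the real RapidMiner Server log format.

```python
import re

# Assumption: events start with an HH:MM:SS style time. This is a simplified
# stand-in for Logstash's %{TIME} pattern, not an exact equivalent.
TIMESTAMP = re.compile(r"^\d{1,2}:\d{2}:\d{2}")


def merge_multiline(lines):
    """Append every line that lacks a leading timestamp to the previous event."""
    events = []
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP.match(line) or not events:
            # A timestamped line (or the very first line) starts a new event.
            events.append(line)
        else:
            # Stragglers such as stack trace lines or blank lines join the
            # previous event, like "negate" = true with "what" = "previous".
            events[-1] += "\n" + line
    return events
```

Feeding it a timestamped line followed by a stack trace line and a blank line yields a single event, so nothing is left without a timestamp to sort by.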

Once in Elasticsearch, the events have become documents and the Kibana Web interface can be used to search them. One of the cool things is that the Lucene information retrieval library is built into Elasticsearch. This means it is possible to do queries like this.

"Marking successfully"~3

This matches if the two words are within three words of each other.

It turns out that RapidMiner Server reports this if a process succeeds...

"Marking process as successfully completed"

and reports this if a process failed...

"Marking process as completed with exception"

This allows two queries to be defined, one per outcome, that filter out everything except the log lines containing these status messages.
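As a rough sketch of what those two queries could look like outside Kibana, the Python below builds Lucene proximity query strings and wraps them in an Elasticsearch "query_string" request body. The helper names are hypothetical, and the slop of 4 for the failure message is my own count of the words between "Marking" and "exception", not a value taken from the dashboard.

```python
import json


def proximity_query(phrase, slop):
    """Build a Lucene proximity query string such as "a b"~3."""
    return '"{}"~{}'.format(phrase, slop)


def query_body(query_string):
    """Wrap a Lucene query string in an Elasticsearch query_string query."""
    return {"query": {"query_string": {"query": query_string}}}


# Successful runs: "Marking process as successfully completed"
succeeded = proximity_query("Marking successfully", 3)

# Failed runs: "Marking process as completed with exception"
# (slop of 4 is an assumption based on the word positions in that message)
failed = proximity_query("Marking exception", 4)

if __name__ == "__main__":
    print(json.dumps(query_body(succeeded), indent=2))
    print(json.dumps(query_body(failed), indent=2))
```

The printed JSON is the kind of body that could be sent to a search endpoint; the index name and host are left out because they depend on the local setup.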

In fact, here's a screenshot showing the results of me running some RapidMiner Server tests where some fail and some succeed.

The histogram view shows the count of the matched events aggregated by minute and the table view gives the log details.

I can safely say that no developer was harmed in the making of this dashboard.

All in all, it makes it a bit easier to spot when something has gone wrong but, of course, I am only scratching the surface of what is possible. Elasticsearch has an impressive array of text indexing capabilities and it has a completely open JSON interface. I could imagine connecting the output of Elasticsearch to RapidMiner and making models to prescribe some corrective medicine when a problem is detected in RapidMiner Server. As time and motivation permit I will attempt this, although it might become too cool to share for free ;)

Thursday, 1 January 2015

English stop words

I wondered recently what is the exact word list used by the Filter Stopwords (English) operator.

I consulted the code for version 5.3 and made the following file. It contains 395 words in total, including some interesting ones like "wert", an archaic form of "were" (as in "thou wert") found in Shakespearean English, and "summat", Yorkshire dialect for "something". I'm not sure these would always be stop words.

I assume the operator hasn't changed in the latest version but if it has, the list can be used with the Filter Stopwords (Dictionary) operator to make a facsimile of version 5.3. It would also be possible to use the list as the basis for your own stop word filtering operator and publishing the list would make research more reproducible.
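For anyone who wants to reuse the list outside RapidMiner, here is a minimal Python sketch of dictionary-based stop word filtering. It assumes the list is stored with one word per line and that matching is case-insensitive; both are assumptions for illustration and not a statement of how the RapidMiner operators behave.

```python
def load_stop_words(text):
    """Parse a stop word list (assumed one word per line) into a lowercase set."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def filter_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]
```

For example, with a list containing "the", "wert" and "summat", the tokens "Thou wert the king" reduce to "Thou" and "king".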

Thursday, 27 November 2014

Sentiment Analysis: British politicians compared with a Happiness Histogram.

I'm currently making some text mining videos, one of which is about sentiment analysis. For fun, I thought I would analyse the sentiment of speeches given at their respective party conferences by three current British politicians (David Cameron, Nick Clegg and Ed Miliband) to see what we can learn. Of course, and I stress, this is by no means an exhaustive and thorough analysis; it's just a bit of fun.

I used RapidMiner with the Text Mining and WordNet extensions, specifically the WordNet 3.0 and SentiWordNet 3.0.0 databases. I divided each text into tokens (i.e. words) and then split the text into consecutive equal-sized parts with 100 words in each. I then used the Extract Sentiment (English) operator to score each of the parts with a sentiment. This ranges between +1 for positive and -1 for negative. I used a dash of R to draw some of the graphs below, with the advanced charts of RapidMiner being used for the last one.
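The chunk-and-score step can be sketched in a few lines of Python. This is not the RapidMiner process or the Extract Sentiment (English) operator: the tiny lexicon passed in is a hypothetical stand-in for SentiWordNet, scoring a part as the plain average of its word scores is a simplification, and dropping a final part shorter than 100 words is my assumption about how "equal sized" was achieved.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def split_into_parts(tokens, size=100):
    """Cut tokens into consecutive parts of `size` words.

    Assumption: an incomplete final part is dropped so all parts are equal.
    """
    full = len(tokens) - len(tokens) % size
    return [tokens[i:i + size] for i in range(0, full, size)]


def score_part(tokens, lexicon):
    """Average the per-word scores; words missing from the lexicon count as 0."""
    if not tokens:
        return 0.0
    return sum(lexicon.get(token, 0.0) for token in tokens) / len(tokens)


def score_text(text, lexicon, size=100):
    """Return one sentiment score per consecutive part of the text."""
    return [score_part(part, lexicon) for part in split_into_parts(tokenize(text), size)]
```

Because most words in a real speech carry little or no sentiment, an average over 100 words stays close to zero, far from the +1 and -1 extremes.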

Let's compare the three speeches using a histogram of the sentiments - the Happiness Histogram. The colours represent the parties (Ed Miliband: Red, Nick Clegg: Orange, David Cameron: Blue). The graphs show the sentiment distribution for each 100 word part of the document and you can see that the values range between +0.1 and -0.04. With 100 words you would not expect very high scores because the sentiment calculation simply applies a sentiment value to each word and averages for all words. Nonetheless, the variations are slightly more than would be expected from random sampling; I did some brief checking to confirm this.

This next graph compares them directly.
We notice that the speeches are resolutely perky in that they are always more positive than negative on average. The Miliband speech has an outlying region of happiness (ironically to the right) whereas the other two are more middle of the road.

Now let's see how sentiment varies as we move through the speeches.

This graph is a moving average of 10 data points (i.e. 1000 words) for each of the 3 speeches with the colours as before. The minutes axis assumes a speaking rate of 125 words a minute, which is roughly what I observed the speeches to average. This means the first moving average point falls at 1000 words, or about 8 minutes in.
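The arithmetic behind the graph can be sketched as follows. The function names are illustrative, and the sketch assumes each moving average point is plotted at the end of its 10-part window, which is consistent with the first point landing at about 8 minutes.

```python
def moving_average(values, window=10):
    """Simple trailing moving average over `window` consecutive values."""
    return [
        sum(values[end - window:end]) / window
        for end in range(window, len(values) + 1)
    ]


def minutes_axis(n_points, window=10, part_size=100, words_per_minute=125):
    """Minute at which each moving average window ends.

    Each data point covers `part_size` words, so a window ending at part
    number `end` has reached `end * part_size` words of the speech.
    """
    return [
        end * part_size / words_per_minute
        for end in range(window, n_points + 1)
    ]
```

With the defaults, the first window ends at 10 x 100 = 1000 words, and 1000 / 125 gives 8 minutes.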

It's quite interesting to see how the different politicians vary sentiment. Ed Miliband approaches the end of the speech in a series of steps gradually getting happier with mini-spells of relative gloom. Nick Clegg seems to get more and more positive but perhaps peaks too early and ends on a down. David Cameron starts happy, gets gloomy then quickly recovers but again maybe too early and ends on a down. Perhaps Messrs Clegg and Cameron have to temper what they say with the realism of being in government.

It is also possible to correlate the extremes of the sentiment with the words being used. There is a wealth of detail and interesting things to note but time prevents me from detailing this today and so I will save that for another post.