Search this blog


Monday, 4 May 2015

Which UK political party is happiest? Update

There is a general election this coming Thursday in the UK and I thought it would be most interesting to compare the manifesto sentiment of 6 of the parties involved.

Firstly, I downloaded the manifestos and chopped them into sequential 50 word chunks and calculated an average sentiment of each chunk.

For each party, I also created a random manifesto by shuffling the original and again chopping into 50 word chunks to calculate an average sentiment for each chunk. I repeated this on 50 different random manifestos for each party for statistical reasons coming later.

I then placed the sentiments into bins of width 0.04 to create the "histogram of happiness". By plotting the result we can see how the manifestos vary from random as illustrated with this plot for one of the parties.

The random points, shown in red, represent the average of the 50 random manifestos with one standard deviation shown as the red bar while the blue bars are the sentiment as measured by the intended word order. The variations are more than can be explained by random chance and we can see for the sentiment between -0.16 and -0.12 there are more 50 word chunks and between 0 and 0.04 there are fewer.

We can calculate a z score for each bin by subtracting the manifesto score from the random result and dividing by the standard deviation of the random result. For the graph above this results in the following graph.

Generally speaking, anything with an absolute z-score greater than 2 has a 1 in 20 chance of happening so the graph above shows that the variations are definitely not random. This is just as well since I'm sure the political parties want to persuade with something that is not just random.

It's quite tricky to compare the 6 parties in a neat way because the graph gets a bit messy. So I decided to focus only on the negative z scores. These represent chunks that happen more often than random and are likely to get noticed more. In other words, uttering something negative or positive is more noticeable than not uttering something.

With this in mind, I combined all the 6 parties to see how they compare.

This graphic is showing only those parts of the manifesto distribution which are more represented than random sampling by two standard deviations. Note that the x axis is not continuous and the smallest circle represents a z score of -2.04 (for the SNP).

What can we see from this? The Green Party has sections of chirpiness but offsets this with sections of negativity. The SNP is both positive and negative but to a lesser extent than the Greens. The Liberal and Labour parties are mostly negative while the Conservatives are showing slight positivity.  By a process of elimination, UKIP has the most positive manifesto. The likelihood of finding a 50 word chunk in their manifesto with a sentiment between 0.24 and 0.28 is significantly greater than random. I declare them the happiest.

Is this going to predict the election? I doubt it but it's likely there are teams of policy wonks drafting these manifestos so it would be funny to make sentiment another thing for them to worry about.

Update: it turns out the Conservatives unexpectedly won. I refined the picture above to bring out the differences between positive and negative: green means more positive than random, red means more negative.  It shows that the Conservative manifesto is resolutely the most middle of the road. Given that elections in Britian are fought on the middle ground I really should have predicted this.

Tuesday, 24 February 2015

Finding those useless attributes and making sure they are really useless

The "Remove Useless Attributes" operator does what it says and removes attributes that are useless. The default for numbers is to remove those that have zero deviation. This is fair enough since it means these attributes are the same for all examples; there's nothing they are bringing to the party. For nominal values, the default is to remove an attribute where all its values are the same. Again, fair enough.

What happens if you remove some attributes and you want to know which ones? You might ask why and that's a good question. All I can say is that it turns out that there are situations where no one will believe you. The conversation goes like this.

"Where are those attributes that I lovingly made?"
"They don't add any value"
"What?! Noooo"

Anyway, you get the picture.

Here's a process that finds the useless attributes and outputs an example set so that you can confirm that they really should be allowed to leave.

It uses the "Data to Weights" operator on the example set after the useless attributes have been sent home. The "Select by Weights" operator is then applied to the original example set containing all the attributes but with the "Weight Relation" set to be less than 1.0 and crucially "deselect unknown" is unchecked. This has the nice effect that the returned example set contains the attributes that were marked as useless.

Thursday, 8 January 2015

RapidMiner Server and Elasticsearch with Lucene

Elasticsearch, Logstash and Kibana: a set of most excellent open source tools that are very good at consolidating log files and other data into a central location (the Logstash part), storing and indexing them to make a scalable search platform (Elasticsearch and Lucene) and providing a neat Web front end (Kibana).

RapidMiner Server produces log files and sometimes when running processes, errors can be hard to find so I decided to import the server log files into Elasticsearch to see if the Lucene search engine capability that would result could speed me up a bit.

I am not able to share the exact technical details but it is relatively easy and involves using Logstash Forwarder on the RapidMiner Server host machine. Logstash Forwarder is set up to monitor the RapidMiner Server log file and will forward any new lines to the Logstash server.

Logstash is set up to receive all events and applies filtering to these to combine some log entries into single events. The Multiline rule I used is that any line that does not begin with a valid timestamp should be included with the previous line (hints; the regular expression pattern is "^%{TIME}", "negate" is true and "what" is set to "previous"). This step alone has a tremendous benefit as it neatens up the log file so that each line now has a timestamp and sorting by time will not miss any straggler or blank lines.

Once in Elasticsearch, the events have become documents and the Kibana Web interface can be used to search them. One of the cool things is that the Lucene information retrieval library is built into Elasticsearch. This means it is possible to do queries like this.

"Marking successfully" ~ 3

This matches if the two words are within three words of each other.

It turns out that RapidMiner Server reports this if a process succeeds...

"Marking process as successfully completed"

and reports this if a process failed...

"Marking process as completed with exception"

This allows two queries to be defined to filter out everything except the log lines containing these error messages.

In fact, here's a screenshot showing the results of me running some RapidMiner Server tests where some fail and some succeed.

The histogram view shows the count of the matched events aggregated by minute and the table view gives the log details.

I can safely say that no developer was harmed in the making of this dashboard.

All in all, it makes it a bit easier to spot when something has gone wrong but of course, I am only scratching the surface of what is possible. Elasticsearch has an impressive array of text indexing capabilities and it has a completely open JSON interface. I could imagine connecting the output of Elasticsearch to RapidMiner and making models to prescribe some corrective medicine when a problem is detected in RapidMiner Server. As time and motivation permits I will attempt this although it might become too cool to share for free ;)