Search this blog

Loading...

Thursday, 27 November 2014

Sentiment Analysis: British politicians compared with a Happiness Histogram.

I'm currently making some text mining videos one of which is about sentiment analysis. For fun, I thought I would analyse the sentiment of speeches given at their respective party conferences by three current British politicians, David Cameron, Nick Clegg and Ed Miliband to see what we can learn. Of course, and I stress, this is by no means an exhaustive and thorough analysis; it's just a bit of fun.

I used RapidMiner and the Text Mining and WordNet extensions. Specifically, the WordNet 3.0 and the SentiWordNet 3.0.0 database. I divided each text into tokens (i.e. words) and then split the text into consecutive equal sized parts with 100 words in each. I then used the Extract Sentiment (English) operator to score each of the parts with a sentiment. This ranges between +1 for positive and -1 for negative. I used a dash of R to draw some of the graphs below with the advanced charts of RapidMiner being used for the last one.

Let's compare the three speeches using a histogram of the sentiments - the Happiness Histogram. The colours represent the parties (Ed Miliband: Red, Nick Clegg: Orange, David Cameron: Blue). The graphs show the sentiment distribution for each 100 word part of the document and you can see that the values range between +0.1 and -0.04. With 100 words you would not expect very high scores because the sentiment calculation simply applies a sentiment value to each word and averages for all words. Nonetheless, the variations are slightly more than would be expected from random sampling; I did some brief checking to confirm this.



This next graph compares them directly.
We notice that the speeches are resolutely perky in that they are always more positive than negative on average. The Miliband speech has an outlying region of happiness (ironically to the right) whereas the other two are more middle of the road,

Now let's see how sentiment varies as we move through the speeches.


This graph is a moving average of 10 data points (i.e. 1000 words) for each of the 3 speeches with the colours as before. The minutes axis corresponds to a speaking rate of 125 words a minute which is what I observed the speeches averaged to. This means the first moving average starts at 1000 words or at about 8 minutes in.

It's quite interesting to see how the different politicians vary sentiment. Ed Miliband approaches the end of the speech in a series of steps gradually getting happier with mini-spells of relative gloom. Nick Clegg seems to get more and more positive but perhaps peaks too early and ends on a down. David Cameron starts happy, gets gloomy then quickly recovers but again maybe too early and ends on a down. Perhaps Messrs Clegg and Cameron have to temper what they say with the realism of being in government.

It is also possible to correlate the extremes of the sentiment with the words being used. There is a wealth of detail and interesting things to note but time prevents me from detailing this today and so I will save that for another post.

Wednesday, 29 October 2014

Using Groovy to extract the last part of a folder structure

Imagine you are using "Loop Files" to find files one by one and import them perhaps using the "Read CSV" operator. The "Loop Files" operator provides macros such as file_path, file_name and so on to allow you to create meta data with the example set.

So if you have a folder name like this...
c:\users\andrew\bigdata\lotsofdata\subregion\
where each subregion contains many files and there are many different subregions. It makes sense to label all the files for a subregion. This can be done by using the folder name which is contained in the parent_path macro provided by the "Loop Files" operator. There is a lot of redundant information that it would be sensible to get rid of and I suppose it would be possible using some heavy combination of macro and attribute manipulation operators but I decided to write some Groovy to do it. The resulting script is simple.
String filePath = operator.getProcess().macroHandler.getMacro("parent_path")
String lastPart = filePath.tokenize('\\').last()
operator.getProcess().getMacroHandler().addMacro("subregion", lastPart);
It assumes a macro called parent_path which contains the folder name. The tokenize function splits this into tokens separated by "\" and the last one is returned using the last function. A macro called subregion is then created. This can be used as a normal macro.

Saturday, 25 October 2014

Windowing and Processing Documents

The text mining extension contains an operator called "Window Document". It takes a document that has been split into tokens (typically words) and creates a collection of new documents from it. Each new document contains a fixed number of tokens corresponding to a "window length" parameter and the movement of the window that moves through the document is dictated by a "step size" parameter. A meta data attribute called "window" is created for each new document; this corresponds to the window within the original document.

So for example, this text

"The cat sat on the mat"

could be split into three windows each of size two if window length is set to two and step size is set to two.

window: 0 - "The cat"
window: 2 - "sat on"
window: 4 - "the mat"

Here's a simple process that illustrates windowing and processing. It's worth noting that the "Process Documents" operator is able to take a collection of documents as input. Note that the process uses version 6.1 of RapidMiner studio so some manual version number editing would be needed to run it in older versions. Note too that you must have the Text Processing extension installed.

The process illustrates a tiny pitfall for the unwary. If one of the tokens is "window" and if the parameter "add meta information" is set to true for the "Process Documents" operator, the resulting example set contains an attribute with the name "window_0". This is because the meta data for the window creates a special attribute in the final example set with name "window" and this would clash with the attribute corresponding to the token. If the parameter "add meta information" is set to false, the attribute corresponding to the token is called "window". In other words, the example set changes in a subtle way depending on the setting of a parameter which can lead to problems.

It's a very small point but I happened to stumble over it recently as I was preparing my contribution to an upcoming text mining book. Here's a teaser because it looks nice :). It is comparing three novels by Jane Austen and how the shape of word frequencies varies for consecutive windows through the books.

The red line is for Mansfield Park, the blue is for Sense and Sensibility and the green is for Pride and Prejudice.