Thursday, 9 July 2015

Finding quartiles

Here's a process that finds the upper, middle and lower quartiles of a real-valued special attribute within an example set and discretizes the real values into the corresponding bins. It assumes there is exactly one special attribute; any additional special attributes would need to be de-selected as an extra step before processing.

The process works as follows. After sorting the example set, it uses various macro extraction and manipulation operators to work out how many examples there are, determine the indexes corresponding to the quartile locations and, from those, the values of the attribute at these locations. These values are set as macros that the "Discretize by User Specification" operator uses as boundaries between the quartile ranges in order to place each example into the correct bin.
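The sort-then-index logic can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the RapidMiner process itself, and it assumes the example set is just a list of numbers:

```python
def quartile_bins(values):
    """Bin each value by the quartile range it falls into."""
    ordered = sorted(values)
    n = len(ordered)
    # Values at the lower, middle and upper quartile locations,
    # found by indexing into the sorted list
    q1, q2, q3 = ordered[n // 4], ordered[n // 2], ordered[3 * n // 4]

    def bin_of(v):
        if v < q1:
            return "lower"
        elif v < q2:
            return "second"
        elif v < q3:
            return "third"
        return "upper"

    return [bin_of(v) for v in values]

print(quartile_bins([5, 1, 9, 3, 7, 2, 8, 4]))
```

Note that real quartile calculations interpolate between neighbouring values; the simple indexing above matches the macro-based approach of picking the value at the computed location.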

The main work happens in a subprocess, which makes the process easier to read and lets the operators be moved to other processes more easily. The very useful "Rename by Generic Names" operator lets the macro manipulation operators work without needing to know the name of the special attribute, which again makes them more portable when used in other processes.

Monday, 4 May 2015

Which UK political party is happiest? Update

There is a general election this coming Thursday in the UK and I thought it would be most interesting to compare the manifesto sentiment of 6 of the parties involved.

Firstly, I downloaded the manifestos, chopped them into sequential 50-word chunks and calculated an average sentiment for each chunk.

For each party, I also created a random manifesto by shuffling the original and again chopping it into 50-word chunks to calculate an average sentiment for each chunk. I repeated this on 50 different random manifestos for each party, for statistical reasons that will become clear later.
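The chunk-and-shuffle baseline can be sketched as follows. The `sentiment` function here is a hypothetical stand-in for whatever scorer was actually used; it just counts words from tiny illustrative positive and negative lists:

```python
import random

# Illustrative word lists only; the real scorer used a proper
# sentiment lexicon.
POSITIVE = {"good", "fair", "strong", "secure", "better"}
NEGATIVE = {"bad", "unfair", "weak", "cuts", "worse"}

def sentiment(words):
    """Average sentiment of a list of words: (+1 per positive, -1 per negative) / length."""
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return score / len(words)

def chunk_sentiments(words, size=50):
    """Sentiment of each sequential non-overlapping chunk of `size` words."""
    return [sentiment(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]

def shuffled_baseline(words, size=50, repeats=50, seed=42):
    """Chunk sentiments for `repeats` shuffled copies of the text."""
    rng = random.Random(seed)
    runs = []
    for _ in range(repeats):
        shuffled = words[:]
        rng.shuffle(shuffled)
        runs.append(chunk_sentiments(shuffled, size))
    return runs
```

Shuffling destroys the intended word order while keeping the vocabulary fixed, so the baseline answers the question "how would the sentiment distribution look if the same words were arranged at random?"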

I then placed the sentiments into bins of width 0.04 to create the "histogram of happiness". By plotting the result we can see how the manifestos vary from random as illustrated with this plot for one of the parties.



The random points, shown in red, represent the average of the 50 random manifestos, with one standard deviation shown as the red bar; the blue bars show the sentiment of the manifesto in its intended word order. The variations are more than random chance can explain: in the bins between -0.16 and -0.12 there are more 50-word chunks than expected, and between 0 and 0.04 there are fewer.

We can calculate a z-score for each bin by subtracting the manifesto score from the random result and dividing by the standard deviation of the random result. For the graph above, this gives the following graph.
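The per-bin calculation can be sketched like this, keeping the sign convention from the text (random mean minus manifesto score), so bins the manifesto over-represents come out negative:

```python
import statistics

def bin_z_scores(manifesto_counts, random_counts_per_run):
    """Z-score per histogram bin: (mean of random runs - observed) / stdev of random runs.

    manifesto_counts: count per bin for the real manifesto.
    random_counts_per_run: list of per-bin counts, one list per shuffled run.
    """
    z = []
    for i, observed in enumerate(manifesto_counts):
        baseline = [run[i] for run in random_counts_per_run]
        mean = statistics.mean(baseline)
        sd = statistics.stdev(baseline)
        z.append((mean - observed) / sd)
    return z
```

A bin the manifesto fills far more often than the shuffled baselines do will therefore have a large negative z-score, which is why the later comparison focuses on the negative side.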



Generally speaking, anything with an absolute z-score greater than 2 has roughly a 1 in 20 chance of occurring by chance, so the graph above shows that the variations are very unlikely to be random. This is just as well, since I'm sure the political parties want to persuade with something that is not just random.

It's quite tricky to compare the 6 parties in a neat way because the graph gets a bit messy, so I decided to focus only on the negative z-scores. These represent chunks that occur more often than random and are likely to get noticed more. In other words, uttering something negative or positive is more noticeable than not uttering it.

With this in mind, I combined all the 6 parties to see how they compare.



This graphic shows only those parts of the manifesto distribution that are over-represented relative to random sampling by more than two standard deviations. Note that the x-axis is not continuous and the smallest circle represents a z-score of -2.04 (for the SNP).

What can we see from this? The Green Party has sections of chirpiness but offsets this with sections of negativity. The SNP is both positive and negative, but to a lesser extent than the Greens. The Liberal Democrats and Labour are mostly negative, while the Conservatives show slight positivity. By a process of elimination, UKIP has the most positive manifesto: the likelihood of finding a 50-word chunk in their manifesto with a sentiment between 0.24 and 0.28 is significantly greater than random. I declare them the happiest.

Is this going to predict the election? I doubt it but it's likely there are teams of policy wonks drafting these manifestos so it would be funny to make sentiment another thing for them to worry about.

Update: it turns out the Conservatives unexpectedly won. I refined the picture above to bring out the differences between positive and negative: green means more positive than random, red means more negative. It shows that the Conservative manifesto is resolutely the most middle of the road. Given that elections in Britain are fought on the middle ground, I really should have predicted this.


Tuesday, 24 February 2015

Finding those useless attributes and making sure they are really useless

The "Remove Useless Attributes" operator does what it says and removes attributes that are useless. The default for numbers is to remove those that have zero deviation. This is fair enough since it means these attributes are the same for all examples; there's nothing they are bringing to the party. For nominal values, the default is to remove an attribute where all its values are the same. Again, fair enough.

What happens if you remove some attributes and want to know which ones went? You might ask why, and that's a good question. All I can say is that there are situations where no one will believe you. The conversation goes like this.

"Where are those attributes that I lovingly made?"
"They don't add any value"
"What?! Noooo"

Anyway, you get the picture.

Here's a process that finds the useless attributes and outputs an example set so that you can confirm that they really should be allowed to leave.

It uses the "Data to Weights" operator on the example set after the useless attributes have been sent home. The "Select by Weights" operator is then applied to the original example set containing all the attributes but with the "Weight Relation" set to be less than 1.0 and crucially "deselect unknown" is unchecked. This has the nice effect that the returned example set contains the attributes that were marked as useless.