Search this blog


Sunday, 30 August 2015

Using RapidMiner to read data from HBase

HBase is a database within the Hadoop ecosystem. Here's a very simple example RapidMiner process that connects to an HBase server and reads a value.

The process uses the RapidMiner Python operator and a package called 'happybase'.

As always when integrating systems together, there is a lot of leg-work to do to get things working. This starts with a running Hadoop cluster with HBase as well as some data. For this toy example, I created the world's simplest table called 'test' containing two rows. For example, from the HBase shell, the 'scan' command yields the following.

hbase(main):002:0> scan 'test'
ROW                   COLUMN+CELL                                               
 row1                 column=cf:a, timestamp=1440837877452, value=value1        
 row2                 column=cf:b, timestamp=1440837887539, value=value2        
2 row(s) in 0.0290 seconds

To allow remote access, Thrift must be started to allow remote connections to get to HBase. This is typically done by running the following command within the HBase installation on the machine running HBase.

./bin/hbase thrift start

The final step is to ensure that remote requests to the default Thrift port (9090 by default) are not blocked by the firewall on the HBase machine.

The RapidMiner process can now be run. The Python code within the RapidMiner process is shown below. Change the script to match the values in your environment as you need.

import pandas as pd
import happybase

def rm_main():

def dict_to_dataframe(d):
df.set_index(0, inplace=True)
return df

# use the name or IP address where HBase is running
connection = happybase.Connection('')
# use a table name in the database
# this scans the database and prints to the log
for key, data in table.scan():
print key, data
# this selects a row containing row1
row1 = table.row('row1')
return dict_to_dataframe(row1)

I'm by no means a Python expert so I don't expect this is the world's best example. Nonetheless, it shows the possibilities.

When run in my environment, the returned example set is as follows.

I've only scratched the surface of what could be done using the 'happybase' package but I hope this gives you some ideas about what you might be able to do.

Thursday, 9 July 2015

Finding quartiles

Here's a process that finds the upper, middle and lower quartiles of a real valued special attribute within an example set and discretizes all the real values into the corresponding bins. It assumes there is one special attribute only. Additional special attributes would need to be de-selected as an extra step before being processed.

The process works as follows. After sorting the example set, it uses various macro extraction and manipulation operators to work out how many examples there are, determine the index corresponding to the quartile locations and from there the values of the attributes at these locations. These values are set as macros that are used in the "Discretize by User Specification" operator as boundaries between the quartile ranges in order to place each example into the correct bin.

The main work happens in a subprocess which makes the process easier to read and allows the operators to be moved to other processes more easily. The very useful operator "Rename by Generic Names" is used. This allows the macro manipulation operators to work without having to be concerned about the name of the special attribute which again allows the operators to be more portable when used in other processes.

Monday, 4 May 2015

Which UK political party is happiest? Update

There is a general election this coming Thursday in the UK and I thought it would be most interesting to compare the manifesto sentiment of 6 of the parties involved.

Firstly, I downloaded the manifestos and chopped them into sequential 50 word chunks and calculated an average sentiment of each chunk.

For each party, I also created a random manifesto by shuffling the original and again chopping into 50 word chunks to calculate an average sentiment for each chunk. I repeated this on 50 different random manifestos for each party for statistical reasons coming later.

I then placed the sentiments into bins of width 0.04 to create the "histogram of happiness". By plotting the result we can see how the manifestos vary from random as illustrated with this plot for one of the parties.

The random points, shown in red, represent the average of the 50 random manifestos with one standard deviation shown as the red bar while the blue bars are the sentiment as measured by the intended word order. The variations are more than can be explained by random chance and we can see for the sentiment between -0.16 and -0.12 there are more 50 word chunks and between 0 and 0.04 there are fewer.

We can calculate a z score for each bin by subtracting the manifesto score from the random result and dividing by the standard deviation of the random result. For the graph above this results in the following graph.

Generally speaking, anything with an absolute z-score greater than 2 has a 1 in 20 chance of happening so the graph above shows that the variations are definitely not random. This is just as well since I'm sure the political parties want to persuade with something that is not just random.

It's quite tricky to compare the 6 parties in a neat way because the graph gets a bit messy. So I decided to focus only on the negative z scores. These represent chunks that happen more often than random and are likely to get noticed more. In other words, uttering something negative or positive is more noticeable than not uttering something.

With this in mind, I combined all the 6 parties to see how they compare.

This graphic is showing only those parts of the manifesto distribution which are more represented than random sampling by two standard deviations. Note that the x axis is not continuous and the smallest circle represents a z score of -2.04 (for the SNP).

What can we see from this? The Green Party has sections of chirpiness but offsets this with sections of negativity. The SNP is both positive and negative but to a lesser extent than the Greens. The Liberal and Labour parties are mostly negative while the Conservatives are showing slight positivity.  By a process of elimination, UKIP has the most positive manifesto. The likelihood of finding a 50 word chunk in their manifesto with a sentiment between 0.24 and 0.28 is significantly greater than random. I declare them the happiest.

Is this going to predict the election? I doubt it but it's likely there are teams of policy wonks drafting these manifestos so it would be funny to make sentiment another thing for them to worry about.

Update: it turns out the Conservatives unexpectedly won. I refined the picture above to bring out the differences between positive and negative: green means more positive than random, red means more negative.  It shows that the Conservative manifesto is resolutely the most middle of the road. Given that elections in Britian are fought on the middle ground I really should have predicted this.