Search this blog

Sunday, 30 August 2015

Using RapidMiner to read data from HBase

HBase is a database within the Hadoop ecosystem. Here's a very simple example RapidMiner process that connects to an HBase server and reads a value.

The process uses the RapidMiner Python operator and a package called 'happybase'.

As always when integrating systems together, there is a lot of leg-work to do to get things working. This starts with a running Hadoop cluster with HBase as well as some data. For this toy example, I created the world's simplest table called 'test' containing two rows. For example, from the HBase shell, the 'scan' command yields the following.

hbase(main):002:0> scan 'test'
ROW                   COLUMN+CELL                                               
 row1                 column=cf:a, timestamp=1440837877452, value=value1        
 row2                 column=cf:b, timestamp=1440837887539, value=value2        
2 row(s) in 0.0290 seconds

To allow remote access, Thrift must be started to allow remote connections to get to HBase. This is typically done by running the following command within the HBase installation on the machine running HBase.

./bin/hbase thrift start

The final step is to ensure that remote requests to the default Thrift port (9090 by default) are not blocked by the firewall on the HBase machine.

The RapidMiner process can now be run. The Python code within the RapidMiner process is shown below. Change the script to match the values in your environment as you need.

import pandas as pd
import happybase

def rm_main():

def dict_to_dataframe(d):
df=pd.DataFrame(d.items())
df.set_index(0, inplace=True)
return df

# use the name or IP address where HBase is running
connection = happybase.Connection('192.168.1.76')
# use a table name in the database
table=connection.table('test')
# this scans the database and prints to the log
for key, data in table.scan():
print key, data
# this selects a row containing row1
row1 = table.row('row1')
return dict_to_dataframe(row1)


I'm by no means a Python expert so I don't expect this is the world's best example. Nonetheless, it shows the possibilities.

When run in my environment, the returned example set is as follows.


I've only scratched the surface of what could be done using the 'happybase' package but I hope this gives you some ideas about what you might be able to do.