The process uses the RapidMiner Python operator and a package called 'happybase'.
As always when integrating systems, there is a fair amount of leg-work to get things working. This starts with a running Hadoop cluster with HBase installed and some data loaded. For this toy example, I created the world's simplest table, called 'test', containing two rows. From the HBase shell, the 'scan' command yields the following.
hbase(main):002:0> scan 'test'
ROW                 COLUMN+CELL
 row1               column=cf:a, timestamp=1440837877452, value=value1
 row2               column=cf:b, timestamp=1440837887539, value=value2
2 row(s) in 0.0290 seconds
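For reference, happybase represents this same data entirely as byte strings: table.scan() yields (row_key, column_dict) pairs. The values below are transcribed from the scan output above rather than fetched live, just to show the shapes involved:

```python
# What a happybase scan of the 'test' table above would yield:
# (row_key, {column: value}) pairs, with keys and values as bytes.
rows = [
    (b'row1', {b'cf:a': b'value1'}),
    (b'row2', {b'cf:b': b'value2'}),
]
for key, data in rows:
    print(key, data)
```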
To allow remote access, the Thrift server must be started so that remote clients can reach HBase. This is typically done by running the following command from the HBase installation directory on the machine running HBase.
./bin/hbase thrift start
The final step is to ensure that remote requests to the Thrift port (9090 by default) are not blocked by the firewall on the HBase machine.
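Before running the RapidMiner process, it can save some head-scratching to confirm the port is actually reachable. Here is a small standard-library sketch; the host address is the one from my environment, so substitute your own:

```python
import socket

def thrift_reachable(host, port=9090, timeout=3):
    """Return True if a TCP connection to the Thrift port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the HBase machine from my setup (substitute your host)
if __name__ == '__main__':
    print(thrift_reachable('192.168.1.76'))
```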
The RapidMiner process can now be run. The Python code within the RapidMiner process is shown below. Change the script to match your environment as needed.
import pandas as pd
import happybase

def rm_main():
    def dict_to_dataframe(d):
        df = pd.DataFrame(list(d.items()))
        df.set_index(0, inplace=True)
        return df

    # use the name or IP address of the machine where HBase is running
    connection = happybase.Connection('192.168.1.76')
    # use a table name from the database
    table = connection.table('test')
    # scan the table and print each row to the log
    for key, data in table.scan():
        print(key, data)
    # select the single row with key 'row1' (happybase uses byte strings)
    row1 = table.row(b'row1')
    return dict_to_dataframe(row1)
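The dict_to_dataframe helper can be tried offline, outside RapidMiner and without an HBase connection, by feeding it a row dict shaped like the one table.row(b'row1') returns for the toy table above:

```python
import pandas as pd

def dict_to_dataframe(d):
    # Turn a {column: value} dict into a one-column DataFrame
    # indexed by column name, as in the rm_main script above.
    df = pd.DataFrame(list(d.items()))
    df.set_index(0, inplace=True)
    return df

# A row dict shaped like happybase's result for row1
row1 = {b'cf:a': b'value1'}
print(dict_to_dataframe(row1))
```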
I'm by no means a Python expert, so I don't expect this is the world's best example. Nonetheless, it shows the possibilities.
When run in my environment, the returned example set is as follows.
I've only scratched the surface of what could be done using the 'happybase' package but I hope this gives you some ideas about what you might be able to do.