Data Science With RapidMiner

Sunday, 7 February 2016

First post of 2016

I haven't posted for a while, partly because I have been very busy and partly because I haven't built anything with RapidMiner that was new and interesting enough to share. The community license puts certain features off limits, which is a pity since it would be nice to try them out and share the results.

With the release of version 7, I will spend some time reviewing it and will post things of interest as I find them.

In the meantime, there are a number of posts that I saved as drafts that I will revisit and publish. These all predate version 7 but they should still be applicable. The first is about times and gives some reference information about how to get them to behave as you want. I'll publish it after this one.

Sunday, 30 August 2015

Using RapidMiner to read data from HBase

HBase is a database within the Hadoop ecosystem. Here's a very simple example RapidMiner process that connects to an HBase server and reads a value.

The process uses the RapidMiner Python operator and a package called 'happybase'.

As always when integrating systems, there is a lot of leg-work to do to get things working. This starts with a running Hadoop cluster with HBase as well as some data. For this toy example, I created the world's simplest table called 'test' containing two rows. From the HBase shell, the 'scan' command yields the following.

hbase(main):002:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1440837877452, value=value1
row2 column=cf:b, timestamp=1440837887539, value=value2
2 row(s) in 0.0290 seconds

To allow remote access, the HBase Thrift server must be running. This is typically done by running the following command from within the HBase installation on the machine running HBase.

./bin/hbase thrift start

The final step is to ensure that remote requests to the Thrift port (9090 by default) are not blocked by the firewall on the HBase machine.
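Before involving RapidMiner at all, it can save time to check that the Thrift port is reachable from the machine that will run the process. Here's a minimal sketch using only the Python standard library. The function name `thrift_port_reachable` and the timeout value are my own choices for illustration, and a successful connection only shows that something is listening on the port, not that it is definitely Thrift.

```python
import socket


def thrift_port_reachable(host, port=9090, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises an OSError subclass when the connection
        # is refused, times out or the host cannot be resolved
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `thrift_port_reachable('192.168.1.76')` with your own host name or IP address should return True once Thrift is running and the firewall allows the connection.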

The RapidMiner process can now be run. The Python code within the RapidMiner process is shown below. Change the values in the script to match your environment.

import pandas as pd
import happybase


def rm_main():

    def dict_to_dataframe(d):
        # one row per dictionary entry, indexed by the dictionary key
        df = pd.DataFrame(list(d.items()))
        df.set_index(0, inplace=True)
        return df

    # use the name or IP address where HBase is running
    connection = happybase.Connection('192.168.1.76')

    # use a table name in the database
    table = connection.table('test')

    # this scans the table and prints each row to the log
    for key, data in table.scan():
        print(key, data)

    # this selects the row with key 'row1'
    row1 = table.row('row1')

    return dict_to_dataframe(row1)

I'm by no means a Python expert so I don't expect this is the world's best example. Nonetheless, it shows the possibilities.

When run in my environment, the returned example set is as follows.
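If you want to see the shape of the result without a cluster to hand, here's a small sketch that applies the same dictionary-to-DataFrame conversion to a hand-made dictionary. The dictionary is a stand-in I typed in to resemble 'row1' from the scan output above; it is not fetched from HBase, and the real keys and values may come back as byte strings depending on your Python and happybase versions.

```python
import pandas as pd


def dict_to_dataframe(d):
    # same conversion as in the script: one row per entry, indexed by the key
    df = pd.DataFrame(list(d.items()))
    df.set_index(0, inplace=True)
    return df


# hand-made stand-in for table.row('row1'); an assumption, not real HBase output
sample_row = {'cf:a': 'value1'}

result = dict_to_dataframe(sample_row)
print(result)
```

The resulting DataFrame has one row per column in the HBase row, with the column name as the index and the cell value in the single remaining column.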

I've only scratched the surface of what could be done using the 'happybase' package but I hope this gives you some ideas about what you might be able to do.

Thursday, 9 July 2015

Finding quartiles

Here's a process that finds the lower, middle and upper quartiles of a real-valued special attribute within an example set and discretizes all the real values into the corresponding bins. It assumes there is only one special attribute. Any additional special attributes would need to be de-selected in an extra step before processing.

The process works as follows. After sorting the example set, it uses various macro extraction and manipulation operators to work out how many examples there are, determine the indexes corresponding to the quartile locations and, from there, the values of the attribute at these locations. These values are set as macros that are used in the "Discretize by User Specification" operator as boundaries between the quartile ranges in order to place each example into the correct bin.

The main work happens in a subprocess which makes the process easier to read and allows the operators to be moved to other processes more easily. The very useful operator "Rename by Generic Names" is used. This allows the macro manipulation operators to work without having to be concerned about the name of the special attribute which again allows the operators to be more portable when used in other processes.
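For anyone who finds code easier to follow than a description of operators, the steps above can be sketched in Python with pandas. This is not the RapidMiner process itself, just an illustration of the same idea. The function name, the bin labels, the rule used to pick the quartile index and the choice to make each boundary the inclusive upper limit of its bin are all assumptions of mine.

```python
import pandas as pd


def discretize_by_quartiles(values):
    """Place each value of a pandas Series into one of four quartile bins."""
    # sort the values, as the process sorts the example set
    ordered = values.sort_values().reset_index(drop=True)
    n = len(ordered)

    # work out the index of each quartile location and read off the value
    # there (truncating the index is an assumption made for this sketch)
    boundaries = [ordered.iloc[int(q * (n - 1))] for q in (0.25, 0.5, 0.75)]

    # use the three values as boundaries between the four ranges; pd.cut
    # raises an error if two boundaries are equal, so distinct values are assumed
    edges = [float('-inf')] + boundaries + [float('inf')]
    return pd.cut(values, bins=edges, labels=['q1', 'q2', 'q3', 'q4'])


data = pd.Series([8.0, 1.0, 5.0, 3.0, 7.0, 2.0, 6.0, 4.0])
binned = discretize_by_quartiles(data)
print(binned.tolist())
```

With the eight values shown, the boundaries come out as 2, 4 and 6, so each bin receives two values.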
