Data Science With RapidMiner: 2014

Thursday, 27 November 2014

Sentiment Analysis: British politicians compared with a Happiness Histogram.

I'm currently making some text mining videos, one of which is about sentiment analysis. For fun, I thought I would analyse the sentiment of speeches given at their respective party conferences by three current British politicians, David Cameron, Nick Clegg and Ed Miliband, to see what we can learn. Of course, and I stress, this is by no means an exhaustive and thorough analysis; it's just a bit of fun.

I used RapidMiner with the Text Mining and WordNet extensions, specifically WordNet 3.0 and the SentiWordNet 3.0.0 database. I divided each text into tokens (i.e. words) and then split the text into consecutive equal-sized parts of 100 words each. I then used the Extract Sentiment (English) operator to score each part with a sentiment, which ranges between +1 for positive and -1 for negative. I used a dash of R to draw some of the graphs below, with the advanced charts of RapidMiner being used for the last one.

Let's compare the three speeches using a histogram of the sentiments - the Happiness Histogram. The colours represent the parties (Ed Miliband: Red, Nick Clegg: Orange, David Cameron: Blue). The graphs show the sentiment distribution for each 100-word part of the document and you can see that the values range between -0.04 and +0.1. With 100 words you would not expect very high scores because the sentiment calculation simply applies a sentiment value to each word and averages over all the words in the part. Nonetheless, the variations are slightly more than would be expected from random sampling; I did some brief checking to confirm this.
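To make the averaging concrete, here is a minimal Python sketch of the same idea. It is not the Extract Sentiment (English) operator: the tiny lexicon, the function names and the choice to score unknown words as 0 are all assumptions for illustration, and the real per-word values would come from SentiWordNet.

```python
from statistics import mean

# Hypothetical per-word scores in [-1, +1]; real values would come from SentiWordNet.
TOY_LEXICON = {"good": 0.6, "happy": 0.8, "bad": -0.6, "gloom": -0.7}


def chunk_tokens(tokens, size=100):
    """Split tokens into consecutive parts of `size` tokens (the last may be shorter)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def chunk_sentiment(chunk, lexicon=TOY_LEXICON):
    """Average the per-word scores; words missing from the lexicon count as 0 (assumption)."""
    return mean(lexicon.get(word.lower(), 0.0) for word in chunk)


def score_text(text, size=100):
    """Return one averaged sentiment score per part of `size` words."""
    tokens = text.split()
    return [chunk_sentiment(chunk) for chunk in chunk_tokens(tokens, size)]


if __name__ == "__main__":
    # Two 100-word parts: five positive words in the first, two negative in the second.
    speech = " ".join(["good"] * 5 + ["the"] * 95 + ["bad"] * 2 + ["the"] * 98)
    print(score_text(speech))
```

Because most words carry no sentiment at all, even a part with several strongly scored words averages out to a value close to zero, which is consistent with the narrow range seen in the histograms.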

This next graph compares them directly.

We notice that the speeches are resolutely perky in that they are always more positive than negative on average. The Miliband speech has an outlying region of happiness (ironically to the right) whereas the other two are more middle of the road.

Now let's see how sentiment varies as we move through the speeches.

This graph is a moving average of 10 data points (i.e. 1000 words) for each of the 3 speeches with the colours as before. The minutes axis corresponds to a speaking rate of 125 words a minute which is what I observed the speeches averaged to. This means the first moving average starts at 1000 words or at about 8 minutes in.
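The arithmetic behind the smoothing and the minutes axis can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes a simple trailing average over the last 10 parts.

```python
def moving_average(values, window=10):
    """Trailing moving average; the first value appears once `window` points exist."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def minutes_axis(n_chunks, window=10, words_per_chunk=100, words_per_minute=125):
    """Elapsed speaking time in minutes at the end of each moving-average window."""
    return [
        (i + 1) * words_per_chunk / words_per_minute
        for i in range(window - 1, n_chunks)
    ]


if __name__ == "__main__":
    scores = [0.01 * n for n in range(12)]  # made-up sentiment scores for 12 parts
    print(moving_average(scores))
    print(minutes_axis(len(scores)))
```

With 10 parts of 100 words and 125 words a minute, the first averaged point lands at 1000 / 125 = 8 minutes.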

It's quite interesting to see how the different politicians vary sentiment. Ed Miliband approaches the end of the speech in a series of steps gradually getting happier with mini-spells of relative gloom. Nick Clegg seems to get more and more positive but perhaps peaks too early and ends on a down. David Cameron starts happy, gets gloomy then quickly recovers but again maybe too early and ends on a down. Perhaps Messrs Clegg and Cameron have to temper what they say with the realism of being in government.

It is also possible to correlate the extremes of the sentiment with the words being used. There is a wealth of detail and interesting things to note but time prevents me from detailing this today and so I will save that for another post.

Wednesday, 29 October 2014

Using Groovy to extract the last part of a folder structure

Imagine you are using "Loop Files" to find files one by one and import them perhaps using the "Read CSV" operator. The "Loop Files" operator provides macros such as file_path, file_name and so on to allow you to create meta data with the example set.

So if you have a folder name like this...

c:\users\andrew\bigdata\lotsofdata\subregion\

where each subregion contains many files and there are many different subregions, it makes sense to label all the files for a subregion. This can be done using the folder name, which is contained in the parent_path macro provided by the "Loop Files" operator. The full path holds a lot of redundant information that it would be sensible to get rid of. I suppose it would be possible using some heavy combination of macro and attribute manipulation operators, but I decided to write some Groovy to do it. The resulting script is simple.

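// Read the parent_path macro set by "Loop Files"
String filePath = operator.getProcess().getMacroHandler().getMacro("parent_path")
// Split on backslashes and keep the last token
String lastPart = filePath.tokenize('\\').last()
// Store the result in a new macro called subregion
operator.getProcess().getMacroHandler().addMacro("subregion", lastPart)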

It assumes a macro called parent_path which contains the folder name. The tokenize function splits this into tokens separated by "\" and the last one is returned using the last function. A macro called subregion is then created. This can be used as a normal macro.
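For comparison, the same string handling can be tried outside RapidMiner. This is only a Python sketch of the splitting logic with hypothetical function names; it does not touch macros at all.

```python
from pathlib import PureWindowsPath


def last_folder(parent_path):
    """Return the last component of a Windows-style folder path."""
    return PureWindowsPath(parent_path).name


def last_folder_tokenized(parent_path):
    """Same idea as the Groovy script: split on backslashes, drop empties, take the last."""
    return [part for part in parent_path.split("\\") if part][-1]


if __name__ == "__main__":
    folder = "c:\\users\\andrew\\bigdata\\lotsofdata\\subregion\\"
    print(last_folder(folder))
    print(last_folder_tokenized(folder))
```

Both versions ignore the trailing backslash, which matters because the example folder name ends with one.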

Saturday, 25 October 2014

Windowing and Processing Documents

The text mining extension contains an operator called "Window Document". It takes a document that has been split into tokens (typically words) and creates a collection of new documents from it. Each new document contains a fixed number of tokens corresponding to a "window length" parameter and the movement of the window that moves through the document is dictated by a "step size" parameter. A meta data attribute called "window" is created for each new document; this corresponds to the window within the original document.

So for example, this text

"The cat sat on the mat"

could be split into three windows each of size two if window length is set to two and step size is set to two.

window: 0 - "The cat"
window: 2 - "sat on"
window: 4 - "the mat"
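The windowing logic itself is easy to sketch in Python. This is an illustration rather than the operator's implementation: the function name is hypothetical, and it assumes that only full-length windows are kept.

```python
def window_document(tokens, window_length, step_size):
    """Return (start_index, window_tokens) pairs, keeping full-length windows only."""
    return [
        (start, tokens[start:start + window_length])
        for start in range(0, len(tokens) - window_length + 1, step_size)
    ]


if __name__ == "__main__":
    tokens = "The cat sat on the mat".split()
    for start, window in window_document(tokens, window_length=2, step_size=2):
        print("window:", start, "-", " ".join(window))
```

Setting the step size smaller than the window length would give overlapping windows, for example a step size of 1 here gives five windows starting at 0, 1, 2, 3 and 4.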

Here's a simple process that illustrates windowing and processing. It's worth noting that the "Process Documents" operator is able to take a collection of documents as input. Note that the process uses version 6.1 of RapidMiner Studio, so some manual version number editing would be needed to run it in older versions. Note too that you must have the Text Processing extension installed.

The process illustrates a tiny pitfall for the unwary. If one of the tokens is "window" and if the parameter "add meta information" is set to true for the "Process Documents" operator, the resulting example set contains an attribute with the name "window_0". This is because the meta data for the window creates a special attribute in the final example set with name "window" and this would clash with the attribute corresponding to the token. If the parameter "add meta information" is set to false, the attribute corresponding to the token is called "window". In other words, the example set changes in a subtle way depending on the setting of a parameter which can lead to problems.

It's a very small point but I happened to stumble over it recently as I was preparing my contribution to an upcoming text mining book. Here's a teaser because it looks nice :). It is comparing three novels by Jane Austen and how the shape of word frequencies varies for consecutive windows through the books.

The red line is for Mansfield Park, the blue is for Sense and Sensibility and the green is for Pride and Prejudice.

Monday, 6 October 2014

RapidMiner Resources advanced videos

After a bit of work, I'm pleased to say I've completed the RapidMiner Resources advanced videos and they'll be available on the RapidMiner Resources site soon.

I maintain meta data about the videos and operators and for fun, I've made a process using this data and a new operator I've discovered called "Transition Graph". This is a candidate for "operators that deserve to be better known" because it allows pretty graphs to be drawn.

The meta data I keep records the main operators each video uses as well as the overall running time of the video and the course which it is classified as. Here's a process that takes this data and allows different graphs to be drawn to show which operators are used in which video as well as which video uses which operator.

A brief note on the names - I've prepended "o" for operator and "v" for video to make things clear.

Here is a graph showing that the "Generate Macro" operator is important in 5 videos.

Here is another graph that shows the most important operators used by the video called "Macros".

Here's another that shows which operators are covered by each course and what overlap there is.

The process reads a CSV file (here) to generate these graphics. Of course, as time goes on, I will add new videos so the data in the process is a snapshot as at early October 2014. Nonetheless, please feel free to download the process and data and play around with the results to see the videos I have created and the operators that are covered.

The next videos to do are about text mining...

Monday, 25 August 2014

Mandelbrot

Here is a process to plot the Mandelbrot set. It's based on the one that was successful at the recent RapidMiner World conference.

It makes pretty pictures like this.

Various macros control the execution of the process. With the following settings,

yPoints: 80
xPoints: 120
iterations: 200
xmin: -0.95
xmax: -0.855
ymin: 0.2375
ymax: 0.3275

a zoomed in view like this is produced - how cool.

I noticed a feature of the advanced plotter that limits the number of points that get plotted. This is a configuration setting found at Tools->Preferences->Gui->rapidminer.gui.plotter.rows.maximum, and it is 5,000 by default. The settings above produce 80 x 120 = 9,600 points, so set this to at least 9,600 if you want to see them all.

The process itself is in 2 main parts.

Firstly, the sub-process creates the x and y axes which I called x0 and y0. This is done using the operators "Generate Data", "Generate ID", and "Normalize" for the x and y axes. These are then joined using the "Cartesian Product" operator to produce all possible combinations of the x and y axes. The resulting example set is stored in the process context using the "Remember" operator.

Secondly, the "Loop" operator uses the "Recall" operator to get the latest example set to work on and performs the necessary calculations to generate the Mandelbrot set. The result of each iteration is remembered in the process context so the next loop iteration can carry on. There is some cunning filtering to reduce the amount of effort in each loop. Note the "Materialize Data" operator. This is often needed and does no harm if it is included.

At the end of the loop operation, nothing is output from the "Loop" operator itself. The output from the main process is simply a "Recall" operator which uses the last example set that was worked on inside the loop operation.

By having nothing output from the loop operation, the memory impact of this process is reduced.
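For readers who prefer code to operators, here is a minimal Python sketch of the same two-part idea: build every (x0, y0) combination, then iterate while filtering out points that have already escaped. It is not the RapidMiner process itself; the function names, the escape radius of 2 and the exact form of the filtering are assumptions made for illustration.

```python
from itertools import product


def axis(n_points, lo, hi):
    """n_points evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n_points - 1)
    return [lo + i * step for i in range(n_points)]


def mandelbrot(x_points=120, y_points=80, iterations=200,
               xmin=-0.95, xmax=-0.855, ymin=0.2375, ymax=0.3275):
    """Return {(x0, y0): iteration at which the point escaped, or None if it never did}."""
    # Part 1: Cartesian product of the two axes gives every (x0, y0) combination.
    grid = list(product(axis(x_points, xmin, xmax), axis(y_points, ymin, ymax)))
    z = {c: 0j for c in grid}              # current z for points still in play
    escaped_at = {c: None for c in grid}
    # Part 2: loop, carrying the surviving points from one iteration to the next.
    for n in range(1, iterations + 1):
        still_active = {}
        for (x0, y0), value in z.items():
            value = value * value + complex(x0, y0)
            if abs(value) > 2.0:
                escaped_at[(x0, y0)] = n   # escaped: no further work needed
            else:
                still_active[(x0, y0)] = value
        z = still_active                   # filter: only iterate the survivors
    return escaped_at


if __name__ == "__main__":
    result = mandelbrot()
    print(len(result), "points calculated")
```

Dropping escaped points from the working set is what keeps the later iterations cheap, since only the points that might still belong to the set are recalculated.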

Sunday, 3 August 2014

New videos coming soon

I've created another set of videos. These are slightly more advanced and tend to combine more operators together to tell a story.

Here's a graphic using RapidMiner's advanced plotting capabilities that shows the video names and the main operators explained during the video.

They'll be available on the RapidMinerResources site very soon.

I plan to do some new ones over the next few months and the question is what do I choose?

My current candidate list is:

Groovy Dark Arts
Text Processing
Web Mining
Time Series in more detail
RapidMiner Server

Each would translate to between 10 and 20 videos.

To help me decide which one I will do next, I'd be happy to get feedback. So please leave a comment and it will certainly help me.

Edit: I took the liberty of doing a mini survey at the RapidMiner World conference. The results are shown here

I'll take notice of this and give some focus to Text Mining and RapidMiner Server.

Saturday, 5 July 2014

Copy your license before doing a RapidMiner Studio upgrade

Edit: just successfully downloaded 6.0.008 without a problem so I'll delete this in a while.

Before you install the new version - 6.0.006, be sure to copy your license key.

This is accessed by going to Help->Manage Licenses->Enter License.

Copy the text there into a suitable safe place.

When you install the latest RapidMiner Studio version, a nag screen will appear. You can escape this by entering the license text you carefully saved.

It turns out that if you run a stupid version of Internet Explorer (for me 8.* - I have no choice in this) then the license key does not show up on the RapidMiner site. This had the brief and tiresome side effect of locking me into a loop of despair.

With luck, this is a "swivel eyed mad feature" aka a "bug". If it is, I'll delete this post.

Sunday, 29 June 2014

Installing RapidMinerServer

As promised - how to get started with RapidMiner Server. Other operating systems are available :)

Tuesday, 24 June 2014

Coming soon: Installing RapidMiner Server

Very soon, I'll be posting a video that shows how to install RapidMiner Server. It gets you from 0 to Server in not many minutes at all.

A bonus feature is the steps you need to follow to install all your favourite extensions.

Monday, 16 June 2014

RapidMiner Master Certificate and other news

I'm pleased to announce that I've successfully passed the RapidMiner Master Certificate examination.

It took a fair amount of revising and I'm very pleased with the outcome.

I will now be able to devote more time to the next set of RapidMinerResources videos. The following are currently under construction...

Installing RapidMiner Server
Loop Parameters Grid
Pivot
De-Pivot
Loop Collection
Nested Loops

Check back often to get more information about these and what others I am planning.

Sunday, 18 May 2014

Random walk in 3D

Here's a process that draws a pretty picture of a random walk in 3 dimensions in 3D.

OK - you have to work at it. But by squinting at the screen and going cross-eyed so the right side image appears in your left eye and the other in your right, you should see a 3 dimensional view of a random walk.

You may have to adjust the width of your window and who knows what else to make the images appear side by side. Persistence is valuable.

The colour of the points is set by the id. Blue is near 0 and red is at the end of the series. In this case, 2000 data points.

The process uses macros and the Loop Example operator to calculate a random amount to add to each data point. Use of the Integrate operator builds a cumulative example set where each example depends on the ones before it.

The process has the random seed set to -1. This means that it is very unlikely that the picture shown here will ever be recreated.
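The cumulative idea is simple to sketch in Python. This is an illustration only: the function name and the uniform step between -1 and 1 are assumptions, and the running sum stands in for what the Integrate operator does in the process.

```python
import random
from itertools import accumulate


def random_walk_3d(n_points=2000, seed=None):
    """Return a list of (id, x, y, z); each coordinate is a running sum of random steps."""
    rng = random.Random(seed)  # seed=None gives a different walk on every run
    steps = [[rng.uniform(-1.0, 1.0) for _ in range(n_points)] for _ in range(3)]
    xs, ys, zs = (list(accumulate(axis_steps)) for axis_steps in steps)
    return [(i, xs[i], ys[i], zs[i]) for i in range(n_points)]


if __name__ == "__main__":
    walk = random_walk_3d()
    print(walk[0])
    print(walk[-1])
```

Passing a fixed seed makes the walk repeatable, whereas leaving it unset behaves like the process described here and produces a new picture each time.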

Thursday, 1 May 2014

New video content

I've been busy working with Dr Markus Hofmann to create some brand new RapidMiner videos.

In total there are 43 videos totalling 10 hours. Each one is viewable on demand.

The videos give a detailed worked example for the most important RapidMiner operators as well as some more general ones. We'll be adding more over the coming months.

Visit the rapidminerresources.com site to get all the information.

Monday, 21 April 2014

How to read the contents of a file into a macro

I won't bore you with the "why", but suffice to say there are certain situations where it is useful to have the contents of a file contained in a macro. Obviously don't read a multi-GB file into a macro; it might struggle.

Here's a process to do it.

It's very simple in fact. The trick is to use the "Read Document" operator followed by "Documents to Data". This has the effect of reading the entire contents of a file into a single named attribute in a one row example set. This by itself is useful but from there, it's a simple matter to use "Extract Macro" to make a macro equal to the value of the attribute for the single row.
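As a rough analogy outside RapidMiner, the end result is the same as reading a whole file into one named string. In this Python sketch a plain dictionary stands in for the macro store; the function name and the default macro name are hypothetical.

```python
from pathlib import Path


def file_to_macro(path, macros, name="file_contents"):
    """Read the whole file as text and store it under `name` in the `macros` dict."""
    macros[name] = Path(path).read_text(encoding="utf-8")
    return macros
```

As with the process, the whole file ends up in memory as a single value, so the same warning about very large files applies.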

Tuesday, 25 March 2014

XSLT

I had to transform some XML from one format to another recently for a cool thing that I am doing at the moment - more details soon with luck ;).

The XML in question was contained in a lot of files and I couldn't face editing them so I decided to embrace the power of XSLT and get the transformation done automatically using RapidMiner.

It turned out to be really quite easy and in the interests of giving something back I made a simple version of the process that shows it working.

Here is the process.

This is the input XML (a copy of the XML for a RapidMiner process as it happens)

Close examination shows it contains 5 operators.

The XSLT document input to the "Process XSLT" operator is shown below.

This finds all the operators with name attributes within the input XML and for each writes out the name and also finds and writes out the corresponding class attribute all within a new XML document.

The result of this is shown here.

It has correctly found the 5 operators in the original XML document and has written the name and class for each in a new XML document.

Very cool and powerful and it turned out to be quite easy. The hard bit is knowing how to create the XSLT transform.
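If XSLT is unfamiliar, the shape of the transformation can be seen in a few lines of Python. To be clear, this swaps XSLT for Python's xml.etree.ElementTree and is not what the "Process XSLT" operator runs; the sample XML and the output element names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for a process file; not the XML from the post.
SAMPLE = """
<process version="6.0">
  <operator class="process" name="Process">
    <process>
      <operator class="retrieve" name="Retrieve"/>
      <operator class="select_attributes" name="Select Attributes"/>
    </process>
  </operator>
</process>
"""


def summarise_operators(xml_text):
    """Build a new XML document listing the name and class of each named operator."""
    source = ET.fromstring(xml_text)
    summary = ET.Element("operators")
    for op in source.iter("operator"):
        if "name" in op.attrib:
            attributes = {"name": op.get("name"), "class": op.get("class", "")}
            ET.SubElement(summary, "operator", attributes)
    return summary


if __name__ == "__main__":
    print(ET.tostring(summarise_operators(SAMPLE), encoding="unicode"))
```

The loop plays the role of the XSLT template match: visit every operator element that has a name attribute and copy its name and class into the new document.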

Saturday, 8 March 2014

The Write Special Format operator

Here is a complicated process that does quite a lot...

Deep breath, here we go...

Loop through all the process files in a repository (make sure you point it to your location) and read each one in as a document.
Convert the documents to normal examples within a single example set.
Create an attribute called `description` that contains the text within the top-level comment for each process. This uses Ninja XPath (actually it doesn't but I wanted to use the word Ninja).
Do gymnastics to reformat the contents of the attribute. This uses Ninja regular expressions (actually it doesn't but the usual rule is that all regular expressions require Ninja-like skills). Newlines and linefeeds are included - this is perhaps the interesting part.
Rename to make the name of the attribute easier to understand.
Select only the attributes of interest.
Filter out all where there is no description.
Write the example set to a file (make sure you set this to somewhere you want to write files). This uses the `Write Special Format` operator which was the only way I could find to get this to work.

Why?

The comment view can contain HTML. This allows basic formatting and structure to be defined so that when a process is opened, the text in the comment view is displayed with this formatting and structure applied. The process above allows the HTML tags to be extracted so they can be re-used if you want to use the formatting and structure somewhere else like a document or index. Time is precious; anything that allows re-use and avoids typing is good.

The `Write Special Format` operator is especially nice because it allows precise control to be exerted over what is to be written. Dare I say Ninja-like control?

Wednesday, 12 February 2014

A "feature" of the Performance (Regression) operator

I noticed a feature of the Performance (Regression) operator whereby it never reports a value for correlation less than 0. This would happen if a label and prediction are negatively correlated with one another. Furthermore, the squared correlation is also reported as zero in this case.

This example process shows this.

I'm using the Correlation Matrix operator as a way of checking the answer I get. If the sign of the calculation is changed in the Generate Attributes operator you will find that in the negative case the correlation should be -0.763 whilst in the positive it should be 0.706.

The output from the Performance (Regression) operator is 0 for correlation and squared correlation in the negative case. How odd since there are other criteria that can take negative values such as Kendall Tau and Spearman Rho.
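A quick independent check of what to expect is easy to write. The Python sketch below computes the ordinary Pearson correlation on made-up numbers (not the data from the process) and shows that flipping the sign of the prediction flips the sign of the correlation while leaving its square unchanged.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; ranges from -1 to +1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


label = [1.0, 2.0, 3.0, 4.0, 5.0]        # made-up values
prediction = [1.2, 1.9, 3.4, 3.8, 5.1]   # made-up values
flipped = [-p for p in prediction]       # same predictions with the sign changed

r_pos = pearson(label, prediction)
r_neg = pearson(label, flipped)
print(r_pos, r_neg, r_neg ** 2)
```

On the usual definition, then, a negatively correlated label and prediction should give a negative correlation and a non-zero squared correlation, which is why a reported value of 0 for both looks odd.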

I can't work out if this is a deliberate feature or some other thing ;)

No matter, at least I know.

Friday, 3 January 2014

Complex numbers

I stumbled upon an unadvertised feature in the Generate Attributes and Generate Macro operators in RapidMiner Studio 6. It's possible to use the square root of -1, denoted by the reserved symbol i, in the function expressions when creating attributes and macros. There are also some new functions:

re()
im()
conj()

Use of a complex number to generate an attribute results in only the real part being used, whereas generating a complex macro like this

CN = -3+2*i

results in a macro value that looks like this:

(-3.0, 2.0)

This would be -3+2i.

Can a macro, once created, be used as an argument to create other macros or attributes?

The answer is no but there is some fiddling around that can make some things work.

If my macro is generated like this

CN = -3+2*i

I can create another macro like this

CN_1 = replace("%{CN}", ", ", "+i*")

I then get this value for CN_1

(-3+i*2)

I can then put this result in an attribute as a string. I can also use functions like re() and abs() directly on this so for example I can create an attribute like this (note that I am passing a string to the abs function that happens to be a valid complex number)

CNLen = abs(%{CN_1})

Sure enough, the attribute contains the expected value of 3.606.
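The expected value is just the modulus of -3+2i, which is the square root of (-3)^2 + 2^2 = 13. As a cross-check outside RapidMiner, here is a small Python sketch that parses the "(re, im)" text form of the macro and applies the Python equivalents of re(), im(), conj() and abs(). The parsing approach is an assumption for illustration and is not how RapidMiner handles the value.

```python
cn = complex(-3, 2)            # -3+2i

macro_text = "(-3.0, 2.0)"     # how the post shows the macro value
# Turn the "(re, im)" text back into a number, in the spirit of the replace() trick.
re_text, im_text = macro_text.strip("()").split(", ")
parsed = complex(float(re_text), float(im_text))

print(parsed.real)             # real part, like re()
print(parsed.imag)             # imaginary part, like im()
print(parsed.conjugate())      # complex conjugate, like conj()
print(round(abs(parsed), 3))   # modulus, sqrt(13) rounded to 3 places
```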

I'm not sure if this is a taste of things to come or whether it's unintended. I suppose it would be cool to have the ability to manipulate complex numbers.
