Data Science With RapidMiner

Saturday, 6 May 2023

Reading more examples than your licence allows

Recently, I found a way to read more examples than your license allows.

With the free version of RapidMiner Studio, example sets are limited to 10,000 rows. Using the Python or R scripting operators, it is of course possible to read more than this but as soon as the example sets are returned to RapidMiner, the license limit is imposed.

However, if the data is processed into 10,000 row batches, it is possible to place these batches into a collection. Common processing can be applied to each batch by using a loop collections operator.

Of course, if you append the collection entries and the result is greater than your license limit, the limit will be imposed again.

The Python code looks a bit like this.

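import pandas

def rm_main():
    # Read the full file; this happens inside Python, before any row limit applies
    df = pandas.read_csv('mybigdata.csv')
    # Slice into batches of at most 10,000 rows each
    batch1 = df[0:10000]
    batch2 = df[10000:20000]
    # Each returned value is delivered to its own output port
    return batch1, batch2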

Make sure you connect two outputs from the Python operator to a Collect operator and you will have 20,000 rows in your collection consisting of 2 x 10,000 rows.
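For files of arbitrary size, the slicing can be generalised. This is a minimal sketch, not part of the original process: the helper name split_into_batches and the synthetic data frame are made up for illustration, and each batch would still need its own output port on the Python operator.

```python
import pandas as pd

def split_into_batches(df, batch_size=10000):
    """Return a list of DataFrames, each with at most batch_size rows."""
    return [df.iloc[start:start + batch_size]
            for start in range(0, len(df), batch_size)]

# Synthetic frame standing in for the big CSV file (an assumption for the example)
df = pd.DataFrame({"value": range(25000)})
batches = split_into_batches(df)
print([len(batch) for batch in batches])
```

The last batch simply holds whatever rows are left over, so no batch ever exceeds the limit.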

I could have written the whole thing in Python of course.

Needless to say, RapidMiner might get upset with such breaches of their licensing, so you should not use this unless you are willing to accept the consequences.

Wednesday, 8 June 2022

Fetching stock data using a parameterised Execute R operator

I'm currently delivering data science lectures at the University of Chichester and RapidMiner is part of what I use to teach. And very good it is too. I recently found myself helping my students to get some up-to-date stock market data. Rather than manually downloading this, I thought I would use RapidMiner with the tidyquant R package and do it automatically. The Finance and Economics Extension seems to be out of date so isn't an option.

My idea was to define a list of stock symbols such as "AAPL", "BTC-USD" and so on and run the Execute R operator in a loop with each symbol individually.

It turns out there isn't a way to parameterise the Execute R operator so I had to invent one.

Basically, I use the Loop Parameters operator to set multiple values for a macro located inside it. This macro is used to create a one row example set with the value of the macro. This example set is then passed to the Execute R operator where the R script uses it as a parameter to drive the rest of the script. It's clunky but it works.

This approach could be adapted to allow R scripts to be run as part of a more complex modelling process. Relatively tough to do but feasible.

Here's a link to the process.

You'll need the R Scripting extension and you will also need to ensure that R is running on your machine with the data.table, tidyverse and tidyquant R packages all installed.

If RapidMiner enhances the Execute R operator to take parameters (which would be a good enhancement), then this workaround will no longer be needed.

Wednesday, 25 May 2022

Parties at 10 Downing Street in 33 words

The Sue Gray report was published today. I made a word cloud of some of the more frequent words to try and summarise what it's about.

This one uses 33 words and seems to do a reasonable job.

Sunday, 17 April 2022

Ministerial Directions - improving a misleading graphic.

In the UK, the Civil Service has the job of implementing Government policy. They do this in a non-political way. They do have a duty to advise and if they conclude that a policy is unworkable from a number of perspectives, they can request a ministerial direction that transfers the liability from the civil service to the government minister. This recently happened with the UK's new proposal to ship to Rwanda asylum seekers who arrive illegally by boat across the English Channel. There was seemingly some doubt that the policy would save money.

This has led to some media coverage about how frequently these ministerial directions happen. One graphic in particular shows these since the time of John Major.

The original for this is here.

This graphic is confusing because the eye is drawn to a trend which could give the impression that the number of these directions is decreasing.

I spent time counting the number of interventions (yes, I did it manually because I couldn't find the source data). These are the raw numbers for each prime minister.

Major 13

Blair 20

Brown 17

Cameron 10

May 10

Johnson 18

The actual numbers are not interesting by themselves. It's vital to normalise the number of directions by the length of time each prime minister spent in office. If we do this, we get the following interesting graphic.
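The normalisation can be sketched in a few lines of Python. The counts are the ones listed above; the term dates are my own addition and are not part of the original data, and Johnson's term is cut off at the date of this post. Treat it as an illustration of the arithmetic rather than the code behind the graphic.

```python
from datetime import date

# Manually counted directions per prime minister (from the list above)
directions = {"Major": 13, "Blair": 20, "Brown": 17,
              "Cameron": 10, "May": 10, "Johnson": 18}

# Assumption: term start and end dates; Johnson's end is the date of this post
terms = {
    "Major": (date(1990, 11, 28), date(1997, 5, 2)),
    "Blair": (date(1997, 5, 2), date(2007, 6, 27)),
    "Brown": (date(2007, 6, 27), date(2010, 5, 11)),
    "Cameron": (date(2010, 5, 11), date(2016, 7, 13)),
    "May": (date(2016, 7, 13), date(2019, 7, 24)),
    "Johnson": (date(2019, 7, 24), date(2022, 4, 17)),
}

def directions_per_year(name):
    """Number of directions divided by years in office."""
    start, end = terms[name]
    years = (end - start).days / 365.25
    return directions[name] / years

rates = {name: directions_per_year(name) for name in directions}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} directions per year")
```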

This is interesting because it shows a recent increasing trend and shows that the Johnson premiership has the largest number of directions per year. Prime Minister Brown would no doubt argue that the global financial crash during his time contributed to the large number for him. Prime Minister May would no doubt point to Brexit and Johnson would point both to Brexit and Covid. It is somewhat concerning however that the UK is implementing policies that may not represent value for money.

It's ironic that the ministers who make the directions will be out of office and will not face any sanction if the policy does indeed fail at some distant point in the future. There is no check or balance to deter ministers from ignoring advice and taking such risky decisions.

Reproducible research is important and you can find all the data and code on Github here.

Thursday, 3 September 2020

How well are different countries doing testing and tracing?

It's hard to trust raw Corona Virus data. It keeps being adjusted. This is normal for anyone doing data analysis.

Inspired by this article from The Centre for Evidence-Based Medicine, I thought it would be interesting to work out the infection fatality rate (IFR) for recent data as a way to understand how different countries are doing with their testing and tracing. The IFR measures the proportion of people infected with the virus who die. It's different from the case fatality rate (CFR), which counts only identified cases (typically those showing symptoms) and will therefore tend to produce a higher number than the IFR, which also includes asymptomatic cases. My assumption is that as countries get better at testing and identifying all cases, including asymptomatic ones, the ratio of deaths to cases will eventually level off to a constant value. It doesn't matter how many cases and deaths there are, the ratio must surely trend to the IFR for the population. A country can never do better than the IFR (unless health care outcomes improve of course). If cases and deaths are measured in the same way everywhere, the estimate for all countries should converge to the same value. Any country with a high estimated IFR may therefore not yet be identifying all cases.

My method is to use the usual European Centre for Disease Prevention and Control source and look at cases and deaths. I'm led to believe that deaths follow cases after around 14 days. So, if I work out a rolling average for both deaths and cases and then use cases from 14 days ago as the denominator with deaths as the numerator, I should get a crude estimate for the IFR.
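As a rough illustration of the method, here is a minimal pandas sketch. The column names (date, cases, deaths) and the 7 day window are assumptions made for the example; the ECDC file uses its own column names and the window used for the graphs is not stated above.

```python
import pandas as pd

def estimate_ifr(df, lag_days=14, window=7):
    """Crude IFR estimate: smoothed deaths divided by smoothed cases from lag_days earlier.

    df holds one row per day for one country with columns
    'date', 'cases' and 'deaths' (hypothetical names).
    """
    df = df.sort_values("date").reset_index(drop=True)
    cases_avg = df["cases"].rolling(window).mean()
    deaths_avg = df["deaths"].rolling(window).mean()
    out = df[["date"]].copy()
    # shift() moves the case series forward so today's deaths meet the cases of 14 days ago
    out["ifr_estimate"] = deaths_avg / cases_avg.shift(lag_days)
    return out

# Synthetic example: constant 1,000 cases and 5 deaths per day
example = pd.DataFrame({
    "date": pd.date_range("2020-08-01", periods=30),
    "cases": [1000] * 30,
    "deaths": [5] * 30,
})
result = estimate_ifr(example)
print(result.tail())
```

The first rows are empty because both the rolling window and the 14 day lag need enough history before a ratio can be formed. Days with zero cases 14 days earlier would give an infinite ratio and would need filtering in real data.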

Let's assume some other things. Firstly, the virus affects people in the same way regardless of race or gender. Secondly, all countries are recording deaths and cases in the same way. Thirdly, all countries have the same distribution of ages in their population. Fourthly, all countries have similar health care systems. Finally, there has been no improvement in outcomes as a result of improved care. All of these are poor assumptions but in the interests of getting something done, I will live with them.

This shows the estimated IFR for some countries. This has been further smoothed with a rolling average for the last 7 days to make it a bit easier to see. Note the y axis is logarithmic.

How interesting. As countries gear up in the pandemic, they test more comprehensively. This has the effect of reducing the estimated IFR. For example, France, Germany, Austria and Switzerland now have IFRs between 0.5% and 0.2% and they look level which is evidence that they are finding all cases. The UK is currently around 0.9% and sharply trending down. All things being equal, it surely must be true that the IFR in the UK will trend to the lower values in other countries. Given where the UK is, I take this as evidence that the UK has not fully worked out its test and trace protocols and so is missing a number of cases.

How interesting also to see how the US, Belgium and Sweden are not converging to a level number. Indeed, there is evidence they are rising. This may suggest that their tracing efforts are focusing resources where there are case increases.

In general, if we think countries like Germany and Austria have a thorough understanding of the numbers of cases, we might conclude that an IFR below 0.5% is where we will end up.

Friday, 3 July 2020

Corona virus: traffic lights

The UK is implementing a traffic light system to enable holiday makers to travel to other countries. Red means you can't go, and I believe green and amber mean you can and there is no 14 day quarantine on return.

I thought I would compare the relative averaged death rates per million between pairs of countries to see if I could make my own list. I decided not to use the current case counts because these numbers are difficult owing to the different numbers of tests being done. The death rate is the least bad number to use.

So for the UK at the moment, the number of deaths per million for the last seven days is 1.64. The death rates for some other countries are shown below.

Brazil - 4.68
USA - 1.88
Sweden - 2.53
Italy - 0.33
France - 0.26
Spain - 0.12

So for travel between the UK and France, the difference is 1.64 - 0.26 = 1.38 which means there is more chance of infection being brought into France from the UK. If I were the French government, I would set this to red. Travel in the reverse direction would yield -1.38 which means there is much less chance of new infection being brought back into the UK. I would set this to green if I were the UK government. Where the numbers are about the same, I set the colour to amber.
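The rule can be sketched in Python using the rates listed above. The tolerance of 0.5 deaths per million that decides when two rates are "about the same" is a made-up value for illustration; the threshold behind the graphic is not stated.

```python
# Deaths per million over the last seven days (figures from the list above)
rates = {"UK": 1.64, "Brazil": 4.68, "USA": 1.88, "Sweden": 2.53,
         "Italy": 0.33, "France": 0.26, "Spain": 0.12}

def traffic_light(origin, destination, tolerance=0.5):
    """Colour the destination would assign to travellers arriving from origin.

    tolerance is a hypothetical threshold for 'about the same'.
    """
    difference = rates[origin] - rates[destination]
    if difference > tolerance:
        return "red"
    if difference < -tolerance:
        return "green"
    return "amber"

print(traffic_light("UK", "France"))
print(traffic_light("France", "UK"))
print(traffic_light("UK", "USA"))
```

Looping over every ordered pair of countries gives the full grid, read from origin on the left to destination along the top.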

Here's the graphic for some interesting countries.

Read it by choosing a country from the left side and reading along to see whether travel is OK to arrive at another country. It's interesting to see that travel from the UK should really be mostly red. The basic problem is that the UK's death rate is relatively very high compared to other countries.

On the face of it, it seems quite confusing that travel is allowed to other countries from the UK and there is obviously something else driving the decisions.

Monday, 1 June 2020

Corona virus: what is a peak?

It can be difficult to know when a peak happens but I'll have a go for Sweden.

Here's a graph showing the 7 day rolling average of Covid deaths, per day, per million. This time not using a logarithmic axis.

Before we get to peaks, it's clear that Sweden now has the largest number of deaths per million, now surpassing the UK - something I mentioned a couple of weeks ago. It's interesting to see that France and the UK have different trajectories. I've seen and heard that lockdown in France was very severe - you had to have a certificate to prove you were allowed to go outside. I happen to know in the UK, because I look out of the window, some neighbours have been having impromptu barbecues for some weeks now. Some of my fellow citizens are clearly above the rules.

Anyway, to Sweden. My untrained eye sees 4 peaks around the following dates...

11-April

25-April

10-May

30-May

So for Sweden, peaks are happening.

The UK has just relaxed its lockdown even more, so it will be interesting to see how the graph changes. Will it go up? My guess is it will, and there will be a period of shouting and distraction to make it go away.

I still can't get away from the simple fact that before a vaccine or cure happens, the virus still has to inflict its Infection Fatality Rate on us. An optimistic 0.36% still means we have more than 150,000 deaths to go. We have more than a year to run at the current daily death rate.

Saturday, 16 May 2020

Corona virus: how long will this go on for?

This graph shows the 7 day rolling average of reported Covid-19 deaths for various countries. It's quite interesting to see Belgium and the UK slowly reducing with Sweden now showing a quite clear plateau since the beginning of April. These three countries are suffering around 6 deaths per million per day.

It's important that I say the data I am using is from the European Centre for Disease Prevention and Control and there are many caveats about whether the data are showing the same thing for different countries and whether all deaths are being included.

If Sweden carries on with its current policy on lockdown, I would expect its death rate to remain higher than that of other countries. Two more weeks should allow enough time to get a sense of this.

The UK and Sweden have so far suffered respectively 511 and 358 deaths per million cumulatively. We do not know what the infection fatality rate (IFR) for the virus is but if I look at San Marino, they have had 1213 deaths per million. This is 0.12%. The back of my envelope says it is going to be at least this number everywhere. On this basis, and continuing with my envelope, if we carry on with the current death rate of 6 per million per day, the UK will require another 120 days and Sweden another 140 days to clear the infection.

In reality, the IFR is going to be higher than this and will vary by the demographics of each country. A recent German study concluded that it is 0.36% in a particular part of that country. If that number is correct, that means we have at least another year to run. If it's higher than this then it's longer.
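The envelope calculation is simple enough to write down. This sketch uses only the figures quoted above; the function name is my own.

```python
def days_remaining(final_per_million, current_per_million, daily_per_million):
    """Days needed to reach the final cumulative death toll at a constant daily rate."""
    return (final_per_million - current_per_million) / daily_per_million

# San Marino's 1213 deaths per million as a lower bound for the IFR
print(round(days_remaining(1213, 511, 6)))   # UK
print(round(days_remaining(1213, 358, 6)))   # Sweden

# An IFR of 0.36% is 3600 deaths per million
print(round(days_remaining(3600, 511, 6)))   # UK
```

The first two results round to the 120 and 140 days quoted above, and the third comes out at well over a year.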

Friday, 17 April 2020

How long will this go on for?

I heard news reports that experts tell us we will need to keep some form of lockdown in place for a year. Let's see if I can use the back of an envelope to understand where this comes from.

Before a vaccine, we will all get this disease and it will kill a certain number of us.

Let's assume a death rate of 0.83% on top of the background 1% (I'm using the numbers from the Imperial paper by Neil Ferguson). There is a lot of discussion about whether, in the long run, this 0.83% will simply front-load the 1% and "get them anyway". There isn't enough data yet to know, so I will continue with the 0.83% death rate assumption.

In the UK, a country of 66 million, this means an extra 550,000 will die. (Incidentally, currently, around 15,000 or about 2.7% of the total expected death toll has occurred in the UK.)

If we continue to suffer around 900 deaths per day as we are at the moment, it will take 611 days for the infection to run its course.

If we allow the death rate to increase to 5,000 per day, 110 days would be needed.
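For anyone who wants to try other daily rates, the same arithmetic in Python looks like this. Note that the figures above use the total rounded up to 550,000, so the unrounded result for 900 deaths per day comes out a couple of days lower than 611.

```python
population = 66_000_000
excess_death_rate = 0.0083  # the 0.83% assumed above

expected_deaths = population * excess_death_rate

def days_to_run(deaths_per_day):
    """Days for the expected death toll to occur at a constant daily rate."""
    return expected_deaths / deaths_per_day

print(round(expected_deaths))
print(round(days_to_run(900)))
print(round(days_to_run(5000)))
```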

How it goes will hinge on how the various lockdown measures are relaxed. If we lock down strongly and remain focused on minimising numbers of Corona Virus deaths, it will take more than a year to get free of it. If we relax the rules and allow the deaths to go up significantly more than we are experiencing at the moment, we have a chance to get free of this much sooner.

Of course, this is all appalling and difficult to talk about. Most of the public is still fixated on absolute numbers and it will take a lot of communication to move them away from this.

The harsh truth is that the numbers have to go up if we are to get through this quickly.

Wednesday, 15 April 2020

Corona virus: evidence that the UK has reached a peak

The graph of daily death rates per million of country population averaged over the last 7 days looks like this.

The UK, US and Belgium look like they are approaching a peak. The data here lags by about a day and I have seen tomorrow's UK data which will continue this plateau. If the shapes of the peaks are anything to go by, the UK should expect to be at this death rate for another week. This equates to around 6,000 deaths in the next 7 days.

In pure numerical terms, the US has the highest rate but once corrected for the country size, it does not look as bad.

Saturday, 11 April 2020

Corona virus: which countries have reached a peak?

If I average the reported deaths for the last 7 days and normalise by the population of each country (data as at 11/04/2020) I get the following graph. Note the y axis is a logarithmic scale. A figure of 10 per million per day for the UK with a population of 67 million means 670 deaths per day.
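The normalisation is a one-liner in pandas. This is a sketch with a made-up function name and synthetic data, not the code behind the graph.

```python
import pandas as pd

def deaths_per_million_rolling(daily_deaths, population, window=7):
    """Rolling mean of daily deaths, expressed per million of population.

    daily_deaths is a pandas Series of reported deaths per day in date order.
    """
    return daily_deaths.rolling(window).mean() / population * 1_000_000

# Synthetic example: 670 deaths per day in a country of 67 million
uk = pd.Series([670] * 10)
rate = deaths_per_million_rolling(uk, population=67_000_000)
print(rate.iloc[-1])
```

The first six values are empty because a full 7 day window is not yet available.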

We can see Italy and Spain have both reached a peak and are beginning to show declines in daily deaths. In contrast, the UK and Belgium are not at their peaks yet. In fact, recent figures from the UK are in the 900 per day range which means we might see the UK plateau above Spain. Belgium looks as though its plateau may be significantly above the others. It's a shame no one notices because of the focus on absolute un-normalised numbers.

Austria is also showing some evidence of a plateau; Germany less so. It's interesting for these countries to see how their rates per million are significantly below the others.

Monday, 6 April 2020

Corona virus death rates per million by country 06-04-2020

The latest data shows Spain now has the largest number of deaths per million.

The table below gives the numbers.

Is there any evidence of the peak coming soon in the UK?

This graph shows the average for the last 3 days of the deaths per million.

The y axis in this graph is the number of deaths per day normalised by the population on a logarithmic scale. These show both Italy and Spain have reached a peak but the UK hasn't yet. The peak in Italy started about 10 days ago.

Using per capita numbers allows a better comparison to be drawn between countries and gives us a bit of an idea where we might end up.

Thursday, 2 April 2020

Normalised data: Data Science 101

Normalising data is what data scientists do. I see a lot of Corona Virus data and graphs reporting basic numbers of tests, infections or deaths per country. For example, I saw a graph presented on the BBC web site on April 1st.

It uses cumulative deaths to compare the UK, Spain, Italy, and the US using a logarithmic scale overlaid with exponential growth lines showing doubling times. The impression is that these countries are all about the same. This is not true if the size of the country is included. The US is 5 to 7 times bigger than the other countries and the deaths should be scaled appropriately to allow a fair comparison. It is concerning because use of the wrong number to inform decisions may increase the risk of the decision being wrong.

As an example this table summarises the numbers for today (2/4/20 12:20 UK time).

We can see that Italy has both the highest number of cumulative deaths and the largest number for deaths per million. In contrast, Belgium actually has the 3rd highest number of deaths per million whilst its actual cumulative number places it in 9th. There is very little coverage of the situation in this country but it's really quite serious.

Here is also a graph showing cumulative deaths per million per country for those countries where this has exceeded 1 death per million. I have also added exponential growth lines for doubling every 2, 3 and 4 days.
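The reference lines are plain exponential growth. A minimal sketch, with a function name of my own choosing and the starting value of 1 death per million used for the graph:

```python
def doubling_line(start_value, doubling_days, num_days):
    """Values of an exponential growth line that doubles every doubling_days."""
    return [start_value * 2 ** (day / doubling_days)
            for day in range(num_days + 1)]

# Lines starting at 1 death per million, doubling every 2, 3 and 4 days
lines = {d: doubling_line(1.0, d, 12) for d in (2, 3, 4)}
for d, values in lines.items():
    print(d, round(values[-1], 1))
```

On a logarithmic y axis each of these is a straight line, which is what makes them useful as a visual reference.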

It shows where each country actually is. The trends show the trajectories of countries eventually leaving the doubling in two to three day zone. I can see the UK has had a bad few days and is not leaving the zone as it should. I predict an increase in lock-down severity.

Tuesday, 31 March 2020

Corona virus death rates per million by country 31-03-2020

The virus continues apace. Here's the latest as of 31-03-2020 at 13:15 UK time.

Italy has suffered 192 deaths per million people, with Spain next on 157.

It's approaching 2 weeks since the UK introduced social distancing measures but there is no obvious sign that the death rate has been affected.

If what I said on the 23rd March about the UK being 2 weeks behind Italy is true, we should expect the UK to reach 67 deaths per million on the 6th April. Currently the UK is at 21 per million with doubling happening every 3 days or so. Anything significantly below 67 would be evidence that social distancing and lockdown are working.

Friday, 27 March 2020

Corona virus death rates per million by country 27-03-2020

Some updated graphs showing how the per capita death rate is still rising. I added South Korea and China.

This graph shows all the data.

This shows how China got on top of their outbreak almost before it had started in many countries.

This graph zooms in to the dates after March 1st.

It looks like it won't be long before France and the Netherlands will exceed the per capita death rate in Iran. It also looks like Spain will exceed Italy relatively soon.

The data sources for these are the European Centre for Disease Prevention and Control for the mortality data and the World Bank for population data.

Of course, I don't know how many of the deaths would have happened anyway but a crude estimate based on data from the World Population Review web site gives 9.1 and 10.6 deaths per 1,000 per year for Spain and Italy respectively. The populations of Spain and Italy are approximately 47 million and 60 million respectively so we would expect 428,000 and 636,000 deaths per year in these countries, or 1,172 and 1,742 per day assuming a constant rate (which is a gross simplification). Both these countries are currently experiencing around 600 to 700 Corona virus deaths each per day, so we can say that Corona virus deaths are not yet exceeding the normal background mortality. This underlines the need to be certain that the Corona deaths are classified correctly so my homework is to read into the source Corona mortality data to understand how it is being gathered.
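The background mortality arithmetic, for anyone who wants to repeat it for another country. The function name is mine; the inputs are the figures quoted above.

```python
def background_deaths_per_day(population, deaths_per_thousand_per_year):
    """Expected daily deaths from all causes, assuming a constant rate over the year."""
    per_year = population * deaths_per_thousand_per_year / 1000
    return per_year / 365

print(round(background_deaths_per_day(47_000_000, 9.1)))   # Spain
print(round(background_deaths_per_day(60_000_000, 10.6)))  # Italy
```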

Monday, 23 March 2020

Corona virus death rates per million by country 23-03-2020

I've been exploring the latest figures for Corona virus deaths and I've combined this with population data to get an interesting graph.

This shows the deaths per million of the population of a country. The Y axis is a logarithmic scale. The X axis is date from the 1st March. I chose to plot this number for each country because death rates and population are numbers that are easy to understand and are less susceptible to misinterpretation.

As of today, 23/3/20, Italy is suffering 91 deaths per million in the country. On 04/03/20, the rate for Italy was 1.3 per million. Spain is suffering 36.8 deaths per million at the moment with 1 death per million back on 12/03/20.

What is interesting is the relative shapes of these graphs and their steepness between countries. For example, the UK, where I live, currently stands at 4.2 deaths per million. Italy was at that rate on 09/03/20. If we assume the growth continues because Italy and the UK look similar, we might conclude that we will be at the same place as Italy in 14 days' time. This may be behind the UK prime minister's statement that we are 2 weeks behind Italy.

Of note too is Iran, where the graph is diverging from Italy perhaps indicating that the measures in the former country are having an effect. It's also interesting to see that Spain has a very steep graph and will soon overtake Italy as the most impacted country in terms of death rates per million of population. It's also interesting to see that Belgium and the Netherlands are rising very quickly and the slope of their lines looks similar to the US.

Friday, 8 March 2019

I'm giving a talk about RapidMiner

I'll be giving a talk at the Data Science Reading Meetup group on the 26th March entitled "Introduction to RapidMiner". It's intended to be a brief introduction that should help people decide if RapidMiner is right for them.

I just discovered that it's possible to refer someone else to RapidMiner, and if that person installs the product, you get 10,000 extra rows in your license up to a maximum of 50,000.

There are 28 people going to the talk. How I wish I could have 10,000 rows for each referral I plan to send!

Saturday, 8 December 2018

Seeing how generated attributes are constructed

Sometimes, a "brute force feature generation and selection-athon" is irresistible.

I had a feeling that some data I was looking at contained hidden relationships between attributes that could have yielded an improved prediction performance. I had a gut feel that dividing one attribute by another or perhaps taking the log of one and adding it to the reciprocal of another might give a new attribute with more predictive power. How to do this without a tiresome manual intervention that would have been boring, could have missed some permutation, and could have made mistakes?

There are a number of ways of doing this in RapidMiner. One approach uses one of the iterating operators, collectively known as YAGGA, to perform an evolutionary search. Each iteration generates new attributes by combining existing attributes using simple functions. The performance is assessed and attributes that don't lead to an improvement are eliminated whilst those that do are retained to allow them to generate yet more attributes. This process repeats until the desired stopping conditions have been reached.
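To make the idea concrete, here is a small Python sketch of a single generation step that also records how each new attribute was built. It is not RapidMiner's algorithm: the choice of functions (ratios and logs), the helper name and the way names are numbered are all assumptions for illustration, and it assumes strictly positive numeric attributes.

```python
import itertools

import numpy as np
import pandas as pd

def generate_candidates(df):
    """Build new attributes from existing ones and record each construction."""
    generated = {}
    constructions = {}
    counter = itertools.count(1)
    # Ratio of every ordered pair of attributes
    for a, b in itertools.permutations(df.columns, 2):
        name = f"gensym{next(counter)}"
        generated[name] = df[a] / df[b]
        constructions[name] = f"{a} / {b}"
    # Natural log of every attribute
    for a in df.columns:
        name = f"gensym{next(counter)}"
        generated[name] = np.log(df[a])
        constructions[name] = f"log({a})"
    return pd.DataFrame(generated), constructions

data = pd.DataFrame({"a": [1.0, 2.0], "b": [2.0, 4.0]})
new_attributes, constructions = generate_candidates(data)
print(constructions)
```

A search would then score each candidate, keep the ones that improve performance and feed them back in as inputs for the next round.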

For the masochist, there is a lower level operator called "Generate Function Set" that allows control to be exerted over the operation. I adopted this because I wanted to look in detail at the attributes that were leading to improvements and equally see those that led nowhere.

So I made a process. But then I got stuck because I found that there was no way in the RapidMiner Studio GUI to see what construction had been applied to generate new attributes. A bit of background: when RapidMiner generates new attributes, they show up with names of the form "gensymxxx". In the old days, there was a way of seeing the attribute construction from one of the viewing panes. Alas, it's not there anymore.

Luckily, there is an operator called "Write Constructions". This takes an example set and writes it to a file which contains details of the construction. A bit laborious but workable.

Did I find a new attribute that made an improvement? Yes I did. It was a small improvement but enough to be interesting. The improvement is the sort of thing that would get you from the middle of the leaderboard to be a contender in a Kaggle competition.

Thursday, 25 January 2018

Keras + RapidMiner + digit recognition = 97% accuracy

I've successfully created a process using RapidMiner and Keras to recognise the MNIST handwritten digits with a headline accuracy of 97% on unseen data.

You can download the process here.

It requires R and Keras to be installed - an exercise for the reader.

The main features of the process are

R is used to get the MNIST data and create a training set and a test set. I use R to add the label to the data to make a single example set for the two cases rather than have separate structures for the data and the labels. This is a big strength of RapidMiner because all the data and labels are in one place.
The data is restructured in R to change 3d tensors of shape (60000, 28, 28) to 2d tensors of shape (60000, 784). The 3d tensor represents the images each of size 28 by 28 pixels. RapidMiner example sets are 2d tensors but these are OK to feed into the Keras part of the process.
The Keras part of the model has the following characteristics

The input shape is (784,), which matches the number of columns in the 2d tensor.
The loss parameter is set to "categorical_crossentropy" and the optimizer is set to "RMSprop".
There are 2 layers in the Keras model. The first is "dense" with 512 units and activation set to "relu". The second is "dense" with 10 units and activation set to "softmax". The 10 in this case is the number of different possible values the label can be.

The "validation_split" parameter is set to 0.1 so that a loss is calculated on a small part of the training data. This produces validation loss results in the output, which are used to see when over-fitting is happening.
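For readers more at home in Python than R, the reshaping and the model described above might look roughly like this. It is a sketch, not the process itself: the function names are mine, the accuracy metric is my addition, and build_model needs TensorFlow with Keras installed.

```python
import numpy as np

def flatten_images(images):
    """Reshape (n, 28, 28) image tensors into (n, 784) rows."""
    n = images.shape[0]
    return images.reshape((n, 28 * 28))

def build_model():
    """Two dense layers as described above; requires TensorFlow/Keras."""
    # Imported here so that flatten_images works without Keras installed
    from tensorflow import keras
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(512, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    # The accuracy metric is an addition for convenience
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Stand-in for a handful of MNIST images
images = np.zeros((5, 28, 28))
flat = flatten_images(images)
print(flat.shape)
```

Training would then be a call along the lines of model.fit(x, y, epochs=3, validation_split=0.1), with y holding one-hot encoded labels to match the categorical cross-entropy loss.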

Here is a screenshot of the history from a large run (this is output from the Keras model as an example set).

The training loss (in blue) decreases systematically as the model learns the training data more and more. The loss against the validation data (in red) shows worse performance as the number of epochs increases and the variation between epochs is evidence that perhaps I should use a larger training fraction. Nonetheless, only a small number of epochs would be enough to get a model that would perform well on unseen data.

The Keras model does not use convolution layers (an exercise for a later post) but despite this, it performs very well. Here is the confusion matrix using 3 epochs.

This is a very good result and shows the power of deep learning. It's gratifying that RapidMiner supports it.

As time permits, a future post will look at using convolution layers to see what improvements could be achieved. I may also do some systematic experiments to check how validation loss measured during training maps to actual loss on unseen data.

Wednesday, 24 January 2018

Visualising the MNIST numbers data

Keras comes with some built in functions to obtain the MNIST dataset created by the National Institute of Standards and Technology. As far as I can tell, it's not possible to get access to these from within RapidMiner but never fear, here is a process that can do it.

It uses R and obviously requires Keras to have been installed. I'll leave that to the reader to get right.

The process also chooses one of the digits and casts it into a form that allows it to be displayed. It does this using the "Windowing" operator followed by "De-Pivot" to transform the matrix-like data into x,y,z tuples.
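The end result of the transformation is easy to picture in Python. This sketch shows the target form only and does not reproduce the two operators; the function name and the choice of x for the column and y for the row are assumptions.

```python
import numpy as np

def matrix_to_xyz(image):
    """Turn a 2d pixel matrix into (x, y, z) tuples, where z is the pixel value."""
    rows, cols = image.shape
    return [(x, y, int(image[y, x]))
            for y in range(rows)
            for x in range(cols)]

# Tiny 2 by 2 stand-in for a 28 by 28 digit
image = np.array([[0, 1],
                  [2, 3]])
tuples = matrix_to_xyz(image)
print(tuples)
```

Each tuple becomes one row in an example set, which is the shape a block chart needs.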

Here's the 6th digit displayed using a block chart. This looks like a 2.

I've already used R with Keras to create a classifier that can recognise these digits. This is my first step towards using Keras in RapidMiner to build a classifier to do the same job.
