
Wednesday 22 November 2017

Enriching flowfiles in Apache NiFi using Mongo

I am using NiFi a lot for scalable processing of large data flows. NiFi is powerful, but there are frustrations with what should be the simplest of activities. These foibles betray a slight lack of maturity in the product.

Anyway, to help the World, I will describe what I had to do to solve the common problem of enriching JSON flow files containing some form of id with a looked-up text value.

For example, if there is a small fragment of JSON like so...

{
    "Id": 7
}

and we want to enrich it by looking up the value 7 and adding a new field Name with the value Me

{
    "Id": 7, "Name": "Me"
}
   
then we can do this in many ways using various lookup services. The way I had to choose involved reading from a Mongo database containing the Id and Name value pairs, and I found most of what I needed here

The issue I found was that the type of the Id field in the Mongo database must be a string for everything to work correctly. Naively using an integer causes the matching to fail (not mismatch which is another issue) and there seems to be no way to work around this in the NiFi operator parameter settings; it always assumes a string when querying.
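To make the point concrete, here is a minimal sketch in R using the mongolite package (the collection and database names are invented for illustration, and this is not part of the NiFi flow itself). It shows the lookup behaving when Id is stored as a string, and hints at why an integer Id makes the string query miss.

    # A small sketch of the lookup collection, using the mongolite package.
    # Collection and database names here are made up for illustration.
    library(mongolite)

    lookup <- mongo(collection = "names", db = "enrichment", url = "mongodb://localhost")

    # Store the lookup pairs with Id as a string - this matches what the NiFi
    # lookup appears to send when querying
    lookup$insert(data.frame(Id = "7", Name = "Me", stringsAsFactors = FALSE))

    # A string query finds the document...
    lookup$find('{"Id": "7"}')

    # ...but if the documents had been inserted with Id as an integer, the same
    # string query would return nothing, which is the failure described above.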

So, it's a bit of a rough edge although, in NiFi's defence, the feature I am using is quite new. Hopefully, this mini blog entry will help.


Wednesday 8 November 2017

Evolutionary process that helped me to win $1000

As the recent winner of the RapidMiner farming on Mars competition, I thought I would post the evolutionary process that helped lead me to the best solution.

I've tidied it up a bit and in fact there are four processes. The main one calls the others using "Execute Process", which keeps common processing in one location to avoid errors.

The four processes are:

  1. EvoExample.rmp
  2. ReadAllData.rmp
  3. FilterHourAndSelectAttributes.rmp
  4. ImportData.rmp

When saving these, ensure the names are as above and that they are all saved in the same repository. It's also important to point the processes at the locations of the training and test files. Download the training data from here and the test data from here. Unzip in the normal way and enter the locations into the "ReadAllData" process by changing the macros associated with the "ExecuteProcess" operator that runs the "ImportData" process.

When all the dust has settled, run the "EvoExample" process and observe the log output that writes a row each time a test has been performed with the specific settings of hour and misclassification cost.

These two parameters are contained in the depths of the process; the evolutionary process chooses values for them and determines how they affect performance.
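For a flavour of what that evolutionary search is doing, here is a toy sketch in base R. It is not the competition process: the fitness function is invented purely for illustration, where the real process trains and scores a model for each candidate pair of parameters.

    # Toy evolutionary search over two parameters: hour (0-23) and misclassification
    # cost (0.1-10). The fitness function is an invented stand-in.
    set.seed(42)

    fitness <- function(hour, cost) {
      -((hour - 14)^2) / 50 - (log(cost) - log(2))^2   # made-up peak near hour 14, cost 2
    }

    pop <- data.frame(hour = runif(20, 0, 23), cost = runif(20, 0.1, 10))

    for (gen in 1:30) {
      scores <- mapply(fitness, pop$hour, pop$cost)
      keep   <- pop[order(scores, decreasing = TRUE)[1:5], ]        # select the fittest
      child  <- keep[sample(1:5, 15, replace = TRUE), ]             # clone survivors
      child$hour <- pmin(23, pmax(0, child$hour + rnorm(15, 0, 1)))      # mutate
      child$cost <- pmin(10, pmax(0.1, child$cost + rnorm(15, 0, 0.3)))
      pop <- rbind(keep, child)
      cat(sprintf("generation %d best score %.3f\n", gen, max(scores)))  # like the log output
    }

    pop[which.max(mapply(fitness, pop$hour, pop$cost)), ]   # best hour and cost found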

The process has a couple of interesting features. Firstly, the performance is extracted from a calculation to match the scoring used in the competition. The operator "Extract Performance" is used to do this. Secondly, the process shows a way of varying the value of a macro inside the evolutionary operator. This is a bit of a hack and involves using the varying parameter of the evolutionary operator to control the execution of the example set made by a "Generate Data" operator and then using other operators to extract the details of this dummy example set to place into a macro. Maybe there's an easier way; I couldn't find one.

The contest did reveal an issue with the way nominal values are read and assigned, which shows up when running in Windows and Linux environments. I hope this will lead to an enhancement to allow more control to be exerted. In the meantime, the process tries to get around this by sorting and being explicit in assigning positive and negative labels. Despite this, there is a chance that different environments will yield different results. The interested reader is referred to this process.

Have fun!...

Monday 6 November 2017

R packages and Shiny

Despite this blog's title containing RapidMiner, most of what I have been doing recently involves R. I maintain a GitHub repository and at the last count there were more than 50 R packages stored there. Most are private but a few are public.

I can't reveal the private ones but there are a couple of play repositories that I have published as Shiny applications.

The first one is the POTUS Progress Pie - originally posted as an idea on the HalfBakery - a site I visit a lot - see the original idea here and the R Shiny application here.

The second one shows a genetic algorithm finding the maximum of a complex function and again uses Shiny. Here's the application. Move the "audio 1" through "audio 4" sliders to try and maximise the score. Brute force is usually not an option, so selecting the "Find the Best" option will show you the slider settings. Selecting the "Random Choice" button chooses a new function to maximise.
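If you're curious what such an application looks like under the hood, here is a cut-down Shiny sketch. It is not the published app: the scoring function is invented, and the real application also runs the genetic algorithm to find the best settings.

    # A cut-down Shiny sketch: four sliders feed a made-up scoring function.
    library(shiny)

    score_function <- function(x) {
      # invented stand-in for the "complex function" being maximised
      sum(sin(x) * cos(2 * x)) - sum((x - 5)^2) / 100
    }

    ui <- fluidPage(
      sliderInput("a1", "audio 1", min = 0, max = 10, value = 5, step = 0.1),
      sliderInput("a2", "audio 2", min = 0, max = 10, value = 5, step = 0.1),
      sliderInput("a3", "audio 3", min = 0, max = 10, value = 5, step = 0.1),
      sliderInput("a4", "audio 4", min = 0, max = 10, value = 5, step = 0.1),
      textOutput("score")
    )

    server <- function(input, output) {
      output$score <- renderText({
        x <- c(input$a1, input$a2, input$a3, input$a4)
        paste("Score:", round(score_function(x), 3))
      })
    }

    shinyApp(ui, server)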

Coming up and in the spirit of mentioning RapidMiner, I will post the evolutionary process I used to help me win RapidMiner's recent contest...


Wednesday 12 April 2017

Zipf's law for text

I haven't posted for a while; I've been busy with work-related data science topics using R. However, I'm now returning to text mining for a work-related topic and I thought I would revisit some of the things I used to do.

One fascinating topic (and the subject of my Master's dissertation) is Zipf's law. It basically says that for a text corpus there is a simple relation between the rank of a word and its frequency of occurrence. The most common word is given rank 1, the second most common rank 2 and so on. Zipf's law says that if you multiply the rank of a word by the number of times it appears, you will get a constant. In concrete terms, this means if the most common word appears 100 times, the second most common will appear 50 times, the third most common about 33 times, and so on.
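A two-line check of that arithmetic in R (assuming the top word appears 100 times):

    # If rank * frequency is roughly constant and the top word appears 100 times...
    rank <- 1:5
    round(100 / rank)   # expected counts: 100 50 33 25 20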

Of course, it's not precise and this is where it's interesting to see how different texts by different authors vary. It's also possible to calculate an expected probability to see how close to the law a real text corpus actually is. To remind myself how to do this, I made a process here that calculates the observed and expected probabilities of a document corpus.

Here's the picture showing log(rank) against log(observed probability).



It's a log-log plot because the formula relating rank to probability is of the form

rank = K/probability

and taking the log of both sides leads to

log(rank) = log(K) - log(probability)

which is a straight line with a negative slope.

The graph shows the expected probability in red and the observed in blue. There is a reasonably nice straight line for the blue points which shows there is something in the law.

The process works as follows...

The process requires the Text Mining Extension to be installed so ensure you have that if you want to run it. The process points to the RapidMiner Studio license agreements on the local disk, so ensure you change the location for the "Loop Files" operator in order to run it yourself. This operator reads all the documents it finds and then calls "Process Documents" to process them. Very light tokenizing and filtering is done inside this operator and the resulting word list is used to feed into the rest of the process. The word list gives the words and the number of times they appear across the whole corpus. Some processing of this makes an example set containing observed and expected probabilities. Of interest is the "Normalize" operator that makes probabilities and the "Generate Attributes" operator that calculates an expected probability using a macro containing the number of words found.
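Outside RapidMiner, the same calculation is only a few lines. Here is a hedged R sketch: the word counts are made up, and the expected probability uses one common normalisation (1/rank divided by the harmonic number of the vocabulary size), which may not be exactly the formula used inside the process.

    # Made-up word counts standing in for the word list produced by "Process Documents"
    counts <- c(the = 120, of = 60, and = 41, to = 30, a = 24, in = 19, is = 15)

    counts   <- sort(counts, decreasing = TRUE)
    rank     <- seq_along(counts)
    observed <- counts / sum(counts)            # the "Normalize" step: counts -> probabilities

    # One common form of the expected Zipf probability: proportional to 1/rank,
    # normalised by the harmonic number of the vocabulary size
    n        <- length(counts)
    expected <- (1 / rank) / sum(1 / (1:n))

    # Recreate the log-log plot: observed in blue, expected in red
    plot(log(rank), log(observed), col = "blue", pch = 16,
         xlab = "log(rank)", ylab = "log(probability)")
    points(log(rank), log(expected), col = "red", pch = 16)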

The plot can be recreated using the advanced plotting capabilities.

From here, more advanced things can be done, such as measuring differences between authors and texts, although care is needed to make sure the different texts have certain similarities to avoid drawing slightly wrong conclusions. It's also possible to fit a different law to the distribution of words. One such is the Zipf-Mandelbrot modification, which adds some additional parameters and which you can read about here (shameless plug).

In summary, I recreated the process in about half an hour. This shows how easy it can be to create powerful data mining processes using RapidMiner Studio without needing to write software.