Search this blog

Wednesday, 12 April 2017

Zipf's law for text

I haven't posted for a while, I've been busy with work related data science topics using R. However, I'm revisiting text mining for a work related topic and I thought I would revisit some of the things I used to do.

One fascinating topic (and the subject of my Master's dissertation) is Zipf's law. It basically says that for a text corpus there is a simple relation between the rank of a word and its frequency of occurrence. The most common word is given rank 1, the second rank 2 and so on. Zipf's law says that if you multiply the rank of a word by the number of times it appears, you will get a constant. In concrete terms, this means if the most common word appears 100 times, the second most common will appear 50 times, and the third most 33 times and so on.

Of course, it's not precise and this is where it's interesting to see how different texts by different authors vary. It's also possible to calculate an expected probability to see how close to the law a real text corpus actually is. To remind myself how to do this, I made a process here that calculates the observed and expected probabilities of a document corpus.

Here's the picture showing log(rank) against log(observed probability).

It's a log-log plot because the formula relating rank to probability is of the form.

rank = K/probability

and taking the log of both sides leads to

log(rank) = log(K) - log(probability)

which is a straight line with a negative slope.

The graph shows the expected probability in red and the observed in blue. There is a reasonably nice straight line for the blue points which shows there is something in the law.

The process works as follows...

The process requires the Text Mining Extension to be installed so ensure you have that if you want to run it. The process points to the RapidMiner Studio license agreements on the local disk, so ensure you change the location for the "Loop Files" operator in order to run it yourself. This operator reads all the documents it finds and then calls "Process Documents" to process them. Very light tokenizing and filtering is done inside this operator and the resulting word list is used to feed into the rest of the process. The word list gives the words and the number of times they appear across the whole corpus. Some processing of this makes an example set containing observed and expected probabilities. Of interest is the "Normalize" operator that makes probabilities and the "Generate Attributes" operator that calculates an expected probability using a macro containing the number of words found.

The plot can be recreated using the advanced plotting capabilities.

From here, more advanced things can be done such as measuring differences between authors and texts although care is needed to make sure the different texts have certain similarities to avoid getting slightly wrong conclusions. It's also possible to try and model a different law to the distribution of words. One such is the Zipf-Mandelbrot modification which adds some additional parameters and which you can read about here (shameless plug).

In summary, I recreated the process in about half an hour. This shows how easy it can be to create powerful data mining processes using RapidMiner Studio without needing to write software.

Friday, 10 June 2016

Using Genetic Algorithms to find global extremes in arbitrary functions

Genetic algorithms offer the possibility to find global maxima and minima in arbitrary functions as the inputs to the functions are varied. RapidMiner has a number of evolutionary operators and one in particular is "Optimize Parameters (Evolutionary)" that, on the face of it, allows the parameters of operators contained within it to be varied as a performance vector is calculated based on how the varied parameters cause the inner operators to behave. There is a slight difficulty because not all parameters can be exposed for control by the Optimize operator so a work-around is needed. We'll come back to that shortly.

Firstly, let's choose a function to optimize. A good candidate is the Rastrigin function that, in two dimensions, has the following form.

f(i1, i2) = 20 + i1*i1 + i2*i2 - 10*(cos(2*pi*i1) + cos(2*pi*i2))

Within the range -5.12 to 5.12 for both i1 and i2, the function has many local minima which makes finding the lowest a challenging problem for some techniques. The following graph shows the function.

By inspection and exploration we can find that the global minimum is 0 when i1 and i2 are both 0. The process to generate this data is here.

Now let's see if we can find this minimum using a RapidMiner process. The process is here.

The first part of the process shown above uses a "Generate Data by User Specification" operator to generate a small example set that needs a label and a regular attribute in order for the "Optimize Parameters (Evolutionary)" operator to work.

The inner operators inside the Optimize operator are shown below.

The two operators labelled i1 and i2 are there purely to allow parameters to be passed from the control part of the Optimize operator. This is the work around I mentioned earlier. Basically, operators i1 and i2 expose parameters that can be seen from the Optimize operator and can be varied. The parameter settings for the Optimize operator are shown in the following.

This shows that the i1.constant value is allowed to vary between -5.12 and 5.12; the Optimize operator chooses the values as it proceeds.

Returning to the inner operators, the values of the parameters are accessed using a "Generate Attributes" operator. The first two attributes show how to get the values from the i1 and i2 operators. The third attribute calculates the Rastrigin value and the fourth attribute copies this into a result attribute (not strictly necessary - this process used to have other functions which I deleted to make the cut down process for this blog post).

From this point, the "Optimize" operator needs a performance vector to work on. The simplest thing to use is the "Extract Performance" operator to extract a specific value from the example set containing the Rastrigin result. This is shown below and the optimization direction is set to minimize since we are looking for a minimum.

The final "Log" operator records what happens.

If we run this process the Log result is shown in the following plot.

This shows the performance tending to 0, the global minimum, as the process proceeds and this corresponds to i1 and i2 both being 0. Plotting the log result as a scatter plot shows clusters of blue points which show how the genetic algorithm successfully keeps to areas where there are minima.
Genetic algorithms are not guaranteed to find the global minimum and in fact, even the process in this blog occasionally does not find it. If you set the parameter "specify population size" in "Optimize Parameters (Evolutionary)" to a small number like 5, you should observe this because fewer individuals are used to test the solution space and so the chance of being near the global minimum is low. Note that a little known feature of the "Process" operator is that setting "random seed" to -1 causes a new random seed to be selected for each run thereby ensuring that each run produces different results.

Of course, there is no way to detect when the global extreme is not found in the more normal case when the answer is not known beforehand. As usual, be sceptical and alive to the possibility that the result is not the best and try different parameter settings to see how results change. As mentioned, the population size affects this as do the "crossover prob" and "tournament fraction" parameters. There is no free lunch of course and an exhaustive search will always take longer.

In summary, we can see that RapidMiner can, with a bit of a work around, be used to optimize arbitrary functions. My experiments show that it seems to do a good job in most cases. A future post will show the same thing using R.

Sunday, 21 February 2016

Reading from a SQL Server database

Reading from a SQL Server database is easy using R.

Here's a process that shows how to do this from within a RapidMiner process.

Of course, you can use the built in "Read Database" operator to read from a database, but there are restrictions in the community version. By using R you can get partially round the restrictions but you should always be aware of your license agreement. Just because you can get round the license does not mean that the terms no longer apply. If you do something that would normally trigger the purchase of an additional license then you still need to. I'm not a lawyer thankfully but, you have been warned.

Having said that, there are situations where you have to try things to prove viability and get political buy-in before committing to a more serious plan where money is to be spent. Political buy-in, as everyone knows, can sometimes takes a very long time and even the most trivial objection can completely de-rail progress. Removing the ability to make a full prototype is just such a potential trivial objection.

Having said all of that, the method the process uses here will have some subtle differences in the way it interacts with the database when compared to the "Read Database" operator. This means it might not work for some reason as yet unknown. Simple advice, don't rely on it.

Enough words, on with the process.

The process has two parts, the first sets some macros that are used within the second. It's a little known fact that you can use macros in this way but it's extremely powerful and allows the code to work in lots of places. The macros themselves are shown in the following table.

Change these to match what you have in your environment. Note that I am using SQL Server authentication so this means you have to set up your environment like this. I am led to believe that built-in authentication is possible but I have not tried it.

The R code itself is shown here.

Additional points:

  1. Install the RJDBC package into your environment, rJava may also be required.
  2. Download the Microsoft JDBC drivers from here (note that care is always needed with downloads such as these because the vendors keep changing their Web sites).
  3. If you are running on Ubuntu, the process will still work but there are some changes to do as shown in the R code.
  4. I have not tried it on a Mac.
  5. Change the query to whatever you want. The example here queries the system table.
The end result is an example set. The query shown in the example yields this.

You will see that the attribute names have been created automatically and a basic mapping to types has been done. The following shows part of the statistics for this example set.

One mapping that would need additional downstream work is the create_date attribute. It looks like it has been transformed into a polynominal. Closer inspection would, no doubt, reveal other foibles.

The example set can then be used in the normal way 

In summary, you can see that it is very easy to access SQL Server using R. It is therefore easy to do it from within RapidMiner.