Wednesday, 25 July 2012

Converting pdf to text

Sometimes the "process documents from files" operator fails when it encounters a "dodgy" pdf, and it is not possible to ignore this error using the "handle exception" operator (or at least I couldn't find a way).

Up to now I wasn't too bothered by this, but a recent thread on reddit motivated me to try to work round the problem.

I came up with a Groovy script that uses the third party tool "pdfbox" to convert pdf to text. This is done using a combination of Groovy script and command line execution.

To recreate this, follow these steps.

Step 1 - download pdfbox from here.

Install it and remember its location; this will be used later.

Step 2 - In the RapidMiner process where you want to access the pdfs, ensure the following macros are defined.
  • file_path - the full path to the pdf file
  • file_name - the file name of the pdf file without any folder information
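  • outputFileLocation - the name of the folder where you want the converted pdf to be placed, including the trailing path separator (the script appends the file name directly to it)
  • pdfboxLocation - the full path to the pdfbox jar file (the script passes this value straight to java -jar)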
Step 3 - Now create a Groovy script containing the following code.

// Read the macros defined in step 2
String file_name = operator.getProcess().macroHandler.getMacro("file_name");
String outputFileLocation = operator.getProcess().macroHandler.getMacro("outputFileLocation");
String file_path = operator.getProcess().macroHandler.getMacro("file_path");
String pdfboxLocation = operator.getProcess().macroHandler.getMacro("pdfboxLocation");
// Build the pdfbox command line; every path is quoted in case it contains spaces
String cmdLine = "java -jar " + "\"" + pdfboxLocation + "\"" + " ExtractText -force " + "\"" + file_path + "\" " + "\"" + outputFileLocation + file_name + ".txt" + "\"";

operator.getProcess().macroHandler.addMacro("cmdLine", cmdLine);

This stores the command line to be executed in a macro called cmdLine. It would be possible to run this from Groovy, but I found it easier and quicker simply to use the "Execute Program" operator with the command set to %{cmdLine}. This is step 4.

The command line runs the pdfbox tool, converts the pdf to text and creates a text file with the same name as the pdf but with .txt appended. The crucial thing is that it is now possible to use the "Handle Exception" operator to ignore failures of the pdfbox operation. I found that pdfs that had been created with the option to disallow copying would cause errors.
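For anyone who would rather run the conversion from code than through "Execute Program", here is a rough sketch in plain Java (most of which should also be acceptable to Groovy). It is not the script from this post: the class and method names are made up for illustration, the jar path is passed in as a parameter, and the ExtractText and -force arguments are simply copied from the command line above. Passing the arguments as a list means the manual quoting is no longer needed, and a failed conversion just returns false so the caller can skip that pdf.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of RapidMiner or pdfbox. */
final class PdfToText {

    /**
     * Builds the same command as the cmdLine macro, but as an argument list.
     * pdfboxJar is assumed to be the full path to the pdfbox jar file.
     */
    static List<String> buildCommand(String pdfboxJar, String pdfPath,
                                     String outputDir, String fileName) {
        // Output file: same name as the pdf with .txt appended
        String target = new File(outputDir, fileName + ".txt").getPath();
        return Arrays.asList("java", "-jar", pdfboxJar,
                "ExtractText", "-force", pdfPath, target);
    }

    /** Runs the command; true on exit code 0, false on any kind of failure. */
    static boolean convert(List<String> command) {
        try {
            Process p = new ProcessBuilder(command)
                    .inheritIO() // show the tool's own output and errors
                    .start();
            return p.waitFor() == 0;
        } catch (IOException e) {
            // The program could not be started at all
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller would loop over the pdfs, call PdfToText.convert(PdfToText.buildCommand(...)) for each one, and carry on to the next file whenever it returns false, which is roughly what "Handle Exception" does in the RapidMiner process.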

Monday, 16 July 2012

Chopping files into smaller bits

I had trouble processing a large csv file recently. It was nearly 100MB in size, and my laptop did not have the resources to process it and then insert the whole lot into a database in one go.

So I created a process to take the file and chop it up into smaller bits so I could process these and insert them into the database. This took time but at least it finished.

Here is an example process to chop csv files. This creates a large csv file by way of illustration and then proceeds to split it using the "loop batches" operator.

Remove the "generate dummy data" and "write dummy data" operators and change the macro "fileToRead" in the context to point to the location of the file you want to read.
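Outside RapidMiner, the same chopping idea can be sketched in a few lines of Java. This is only an illustration under some assumptions, not a translation of the process: the class name, the part-N.csv naming and the rows-per-chunk parameter are all invented here, the first line is assumed to be a header that should be repeated in every piece, and no quoted field is assumed to contain a line break. The file is streamed one row at a time, so it never has to fit in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.IntFunction;

/** Hypothetical helper; not part of RapidMiner. */
final class CsvChopper {

    /**
     * Copies rows from in into consecutive chunks of at most rowsPerChunk
     * data rows, writing the header line at the top of every chunk.
     * openChunk supplies the writer for chunk 0, 1, 2, ...
     * Returns the number of chunks written.
     */
    static int split(BufferedReader in, int rowsPerChunk, IntFunction<Writer> openChunk) {
        if (rowsPerChunk < 1) {
            throw new IllegalArgumentException("rowsPerChunk must be >= 1");
        }
        try {
            String header = in.readLine();
            if (header == null) {
                return 0; // empty input, nothing to write
            }
            int chunks = 0;
            int rowsInChunk = 0;
            Writer out = null;
            String row;
            while ((row = in.readLine()) != null) {
                if (out == null) {
                    // Start a new chunk only when there is a row to put in it
                    out = openChunk.apply(chunks++);
                    out.write(header + "\n");
                }
                out.write(row + "\n");
                if (++rowsInChunk == rowsPerChunk) {
                    out.close();
                    out = null;
                    rowsInChunk = 0;
                }
            }
            if (out != null) {
                out.close(); // last, partly filled chunk
            }
            return chunks;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** File-based wrapper: writes part-0.csv, part-1.csv, ... into outDir. */
    static int splitFile(Path input, Path outDir, int rowsPerChunk) {
        try (BufferedReader in = Files.newBufferedReader(input)) {
            return split(in, rowsPerChunk, i -> {
                try {
                    return Files.newBufferedWriter(outDir.resolve("part-" + i + ".csv"));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each part file can then be processed and inserted into the database on its own, which is the same trade of time for memory that the RapidMiner process makes.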

Sunday, 1 July 2012

Operators that deserve to be better known: part V

The "loop batches" operator splits an example set into batches for the inner operators to work on. The output is simply a copy of the full input example set. The results of the inner operators are not passed to the output, because the idea is for these to process the batches themselves, perhaps by writing to a file or to a database.

When writing a large example set to a database, machine resource limits can prevent this from working, so batching is a good way to proceed.

An example is provided. This takes an example set with 100 examples and uses the "loop batches" operator to write each batch to a file. A macro is used to make the file names unique.
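To make the batching idea concrete outside RapidMiner, here is a small Java sketch of what such a loop does conceptually. It is not how the operator is implemented: the class, the method names and the batch_N.csv naming are assumptions made for illustration, with the batch index playing the role of the macro that keeps file names unique. As with the operator, the input list is left untouched.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; only mimics the idea behind "loop batches". */
final class Batches {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so each batch is independent of the input list
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Unique file name per batch, standing in for the macro. */
    static String fileNameFor(int batchIndex) {
        return "batch_" + batchIndex + ".csv";
    }
}
```

With 100 examples and a batch size of 30 (an arbitrary choice for this sketch), partition gives three full batches and a final one of 10, and fileNameFor gives each of them its own file to be written to.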