
Wednesday 25 July 2012

Converting pdf to text

Sometimes the "Process Documents from Files" operator fails when it encounters a "dodgy" pdf, and the error cannot be ignored using the "Handle Exception" operator (or at least I couldn't find a way).

Up to now I wasn't too bothered by this, but a recent thread on reddit motivated me to try to work round the problem.

I came up with an approach that uses the third-party tool "pdfbox" to convert pdf to text. A Groovy script builds the command line, and the command is then executed as an external program.

To recreate this, follow these steps.

Step 1 - download pdfbox from here.

Install it and note its location; this gets used later.

Step 2 - In the RapidMiner process where you want to access the pdfs, ensure the following macros are defined.
  • file_path - the full path to the pdf file
  • file_name - the file name of the pdf file without any folder information
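  • outputFileLocation - the name of the folder where you want the converted text file to be placed (the script appends the file name directly to this value, so it should end with a path separator)
  • pdfboxLocation - the full path to the pdfbox jar file (the script passes this value directly to java -jar, so a folder name alone will not work)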
Step 3 - Now create a Groovy script containing the following code.

// Read the macros defined in step 2.
String file_name = operator.getProcess().macroHandler.getMacro("file_name");
String outputFileLocation = operator.getProcess().macroHandler.getMacro("outputFileLocation");
String file_path = operator.getProcess().macroHandler.getMacro("file_path");
String pdfboxLocation = operator.getProcess().macroHandler.getMacro("pdfboxLocation");

// Build the command line; each path is wrapped in double quotes so that spaces survive.
String cmdLine = "java -jar \"" + pdfboxLocation + "\" ExtractText -force \"" + file_path + "\" \"" + outputFileLocation + file_name + ".txt\"";

// Store the result in the cmdLine macro so that a later operator can use it.
operator.getProcess().macroHandler.addMacro("cmdLine", cmdLine);

This builds the command line and stores it in a macro called cmdLine. It would be possible to run the command from Groovy, but I found it easier and quicker simply to use the "Execute Program" operator with the command set to %{cmdLine} - this is step 4.
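For anyone who would rather launch the tool from code, one alternative to a single quoted string is to keep each argument separate, which sidesteps the manual quoting. The following is only a sketch in plain Java (Java rather than Groovy, so it can be compiled on its own), not what the process above does. The class name PdfBoxArgs and all the paths in main are made up for illustration; the ExtractText and -force arguments are simply copied from the command line above. A list like this could be handed to java.lang.ProcessBuilder.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the same command as cmdLine, one argument per entry.
final class PdfBoxArgs {

    static List<String> build(String pdfboxJar, String filePath,
                              String outputFolder, String fileName) {
        List<String> args = new ArrayList<String>();
        args.add("java");
        args.add("-jar");
        args.add(pdfboxJar);
        args.add("ExtractText");
        args.add("-force");
        args.add(filePath);
        // Paths.get inserts a separator if outputFolder has no trailing one.
        args.add(Paths.get(outputFolder, fileName + ".txt").toString());
        return args;
    }

    public static void main(String[] unused) {
        // All four values below are placeholders, not real locations.
        List<String> args = build("/opt/pdfbox/pdfbox.jar",
                "/data/in/report.pdf", "/data/out", "report.pdf");
        for (String arg : args) {
            System.out.println(arg);
        }
    }
}
```

Because the arguments are never glued into one string, spaces in folder names need no extra quoting, and a missing trailing separator on the output folder no longer matters.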

The command line runs the pdfbox tool, converts the pdf to text and creates a text file with the same name as the pdf but with .txt appended. The crucial thing is that it is now possible to use the "Handle Exception" operator to ignore failures of the pdfbox operation. I found that pdfs that had been created with the option to disallow copying would cause errors.
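The skip-on-failure idea can also be sketched outside RapidMiner. The plain Java below is an illustration only, with a made-up class name (TolerantRunner). It assumes that the tool reports a failed conversion through a non-zero exit code, which I have not checked for every pdfbox version. It runs a command and reports success or failure instead of throwing, so a caller could move on to the next pdf.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs an external command and never throws on failure.
final class TolerantRunner {

    // Returns true only if the command started and exited with code 0.
    static boolean run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .inheritIO() // forward the tool's output to this console
                    .start();
            return process.waitFor() == 0;
        } catch (IOException e) {
            // The program could not be started, for example java is not on the PATH.
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag and treat the run as failed.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java TolerantRunner <command> [args...]");
            return;
        }
        System.out.println(run(Arrays.asList(args)) ? "converted" : "skipped");
    }
}
```

Returning a boolean rather than letting the exception escape plays roughly the role that "Handle Exception" plays in the process: one bad pdf does not stop the remaining files.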


4 comments:

  1. Hi, I'm going crazy over this. I tried doing everything as you wrote, and also as a single line in the "Execute Program" operator, but nothing worked. I also tried "Execute Program" with runtime&exec(), and pointing at either the jar or the folder for pdfbox.

    cmd /c start java -jar %{path}pdfbox-0.7.3.jar ExtractText -force %{path}zzz.pdf %{path}aaa.txt

    If you can check this and show me where I'm going wrong, I would be happy!

    Thanks

  2. Hello

    I suggest adding the following line

    operator.logNote ("command line " + cmdLine);

    or alternatively what does the macro "cmdLine" have in it?

    Either way, report back with the result.

  3. I'm not a developer; I always use this free online pdf to text converter to convert pdf to text online.

    Replies
    1. It's amazing what open source is available.
