Data Science With RapidMiner: Converting pdf to text

Wednesday, 25 July 2012

Converting pdf to text

Sometimes the "Process Documents from Files" operator fails when it encounters a "dodgy" pdf, and it is not possible to ignore this error using the "Handle Exception" operator (or at least I couldn't find a way).

Up to now I wasn't too bothered by this, but a recent thread on reddit motivated me to try to work round the problem.

I came up with an approach that uses the third-party tool "pdfbox" to convert pdf to text. A Groovy script builds the command line, and the command is then executed outside the script.

To recreate this, follow these steps.

Step 1 - Download pdfbox from here.

Install it and remember its location; this will be used later.

Step 2 - In the RapidMiner process where you want to access the pdfs, ensure the following macros are defined.

file_path - the full path to the pdf file
file_name - the file name of the pdf file without any folder information
outputFileLocation - the folder where you want the converted text file to be placed, including the trailing path separator (the script appends the file name directly to it)
pdfboxLocation - the full path to the pdfbox jar file (the script passes it straight to java -jar)
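Because the script in step 3 simply glues these macro values together, a small sanity check on them can save some head scratching. The following is a rough sketch in plain Java, not part of the original process. The class MacroCheck, its method names and the example paths are all made up for illustration, and the rules it checks are assumptions read off the way the command line is built: the output folder gets the file name appended directly, and pdfboxLocation is handed to java -jar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of RapidMiner or pdfbox.
final class MacroCheck {

    // True when the value ends with a forward slash or a backslash.
    static boolean endsWithSeparator(String value) {
        return value.endsWith("/") || value.endsWith("\\");
    }

    // Returns a list of problems; an empty list means the values look usable.
    static List<String> problems(String filePath, String fileName,
                                 String outputFileLocation, String pdfboxLocation) {
        List<String> found = new ArrayList<>();
        if (fileName.contains("/") || fileName.contains("\\")) {
            found.add("file_name should not contain folder information");
        }
        if (!filePath.endsWith(fileName)) {
            found.add("file_path should end with file_name");
        }
        if (!endsWithSeparator(outputFileLocation)) {
            found.add("outputFileLocation should end with a path separator");
        }
        if (!pdfboxLocation.toLowerCase().endsWith(".jar")) {
            found.add("pdfboxLocation should point at the jar passed to java -jar");
        }
        return found;
    }

    public static void main(String[] args) {
        // Example values are made up; substitute your own paths.
        List<String> found = problems("C:\\pdfs\\report.pdf", "report.pdf",
                "C:\\text", "C:\\tools\\pdfbox");
        for (String problem : found) {
            System.out.println(problem);
        }
    }
}
```

With the made-up values in main, the check flags the output folder (no trailing separator) and the pdfbox location (a folder rather than a jar).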

Step 3 - Now create a Groovy script containing the following code.


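// Read the macros defined in step 2.
String file_name = operator.getProcess().macroHandler.getMacro("file_name");
String outputFileLocation = operator.getProcess().macroHandler.getMacro("outputFileLocation");
String file_path = operator.getProcess().macroHandler.getMacro("file_path");
String pdfboxLocation = operator.getProcess().macroHandler.getMacro("pdfboxLocation");

// Build the command line; each path is quoted in case it contains spaces.
// The output file is outputFileLocation + file_name + ".txt", so
// outputFileLocation needs to end with a path separator.
String cmdLine = "java -jar " +
    "\"" + pdfboxLocation + "\"" +
    " ExtractText -force " +
    "\"" + file_path + "\" " +
    "\"" + outputFileLocation + file_name + ".txt" + "\"";

// Store the command line in a macro so a later operator can execute it.
operator.getProcess().macroHandler.addMacro("cmdLine", cmdLine);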

This stores the command line to be executed in a macro called cmdLine.

Step 4 - Add an "Execute Program" operator with its command set to %{cmdLine}. It would be possible to run the command from Groovy instead, but I found the operator easier and quicker.
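For anyone curious about the alternative of launching the command from code, here is a minimal sketch in plain Java using the standard ProcessBuilder class. It is not what the process above does, and the class name RunExtractText and its methods are hypothetical. The arguments are passed as a list, so the manual quoting needed for the cmdLine string is not required here.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the process described above uses "Execute Program" instead.
final class RunExtractText {

    // Builds the same arguments as the cmdLine macro, one list entry per argument.
    static List<String> buildCommand(String pdfboxLocation, String filePath,
                                     String outputFileLocation, String fileName) {
        return Arrays.asList("java", "-jar", pdfboxLocation, "ExtractText", "-force",
                filePath, outputFileLocation + fileName + ".txt");
    }

    // Starts the command, waits for it to finish and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.inheritIO(); // show the tool's output in this program's console
        return builder.start().waitFor();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 4) {
            System.out.println(
                "usage: RunExtractText <pdfboxJar> <pdfPath> <outputFolder> <pdfName>");
            return;
        }
        int exitCode = run(buildCommand(args[0], args[1], args[2], args[3]));
        System.out.println("exit code " + exitCode);
    }
}
```

The caller would still have to decide what to do with the exit code or with an IOException, which is the part the "Execute Program" and "Handle Exception" operators take care of in the process.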

The command line runs the pdfbox tool, converts the pdf to text and creates a text file with the same name as the pdf but with .txt appended. The crucial thing is that it is now possible to use the "Handle Exception" operator to ignore failures of the pdfbox operation. I found that pdfs that had been created with the option to disallow copying would cause errors.
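The idea of ignoring a failed conversion and carrying on with the next pdf can be sketched outside RapidMiner as well. The plain Java below is only an illustration of that pattern, not of how "Handle Exception" is implemented. The class SkipFailures, the Converter interface and the file names are all hypothetical, and the fake converter in main merely pretends that a protected pdf fails.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of "try each pdf, skip the ones that fail".
final class SkipFailures {

    // Stand-in for converting one pdf; a real version might launch pdfbox.
    interface Converter {
        void convert(String pdfPath) throws Exception;
    }

    // Tries every pdf and returns the paths that failed instead of stopping.
    static List<String> convertAll(List<String> pdfPaths, Converter converter) {
        List<String> failed = new ArrayList<>();
        for (String path : pdfPaths) {
            try {
                converter.convert(path);
            } catch (Exception e) {
                failed.add(path); // note the failure and move on to the next pdf
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Fake converter: pretends that any pdf with "protected" in its name fails.
        Converter fake = new Converter() {
            public void convert(String pdfPath) throws Exception {
                if (pdfPath.contains("protected")) {
                    throw new Exception("cannot extract text");
                }
            }
        };
        List<String> failed =
            convertAll(Arrays.asList("a.pdf", "protected.pdf", "b.pdf"), fake);
        System.out.println("Skipped: " + failed); // prints "Skipped: [protected.pdf]"
    }
}
```

Keeping the list of skipped files makes it easy to look at them by hand afterwards.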

4 comments:

Anonymous, 30 June 2013 at 15:04
Hi, I'm going crazy over this. I tried doing everything as you wrote, and also as a single line in the "Execute Program" operator, but nothing worked. I also tried "Execute Program" with Runtime.exec(), and pointing it at either the pdfbox jar or its folder.

cmd /c start java -jar %{path}pdfbox-0.7.3.jar ExtractText -force %{path}zzz.pdf %{path}aaa.txt

If you can check this and show me where I'm wrong, I would be happy!

Thanks

Andrew, 1 July 2013 at 21:55
Hello

I suggest adding the following line

operator.logNote("command line " + cmdLine);

or alternatively what does the macro "cmdLine" have in it?

Either way, report back with the result.

DomFilk, 28 October 2015 at 14:47
I'm not a developer; I always use this free online pdf to text converter to convert pdf to text online.