Up to now, I wasn't too bothered by this, but a recent thread on Reddit motivated me to try to work around the problem.
I came up with a Groovy script that uses the third-party tool "pdfbox" to convert pdf to text. The conversion combines a Groovy script with command-line execution.
To recreate this, follow these steps.
Step 1 - download pdfbox from here.
Install it and remember its location; it will be used later.
Step 2 - In the RapidMiner process where you want to access the pdfs, ensure the following macros are defined.
- file_path - the full path to the pdf file
- file_name - the file name of the pdf file without any folder information
- outputFileLocation - the folder where you want the converted text file to be placed, including the trailing path separator (the script appends the file name to it directly)
- pdfboxLocation - the full path to the pdfbox jar file (the script passes this value directly to java -jar)
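As a rough illustration of what these four values might look like, here is a minimal standalone Groovy sketch that derives them from a pdf path. The helper name macrosFor and all of the paths are hypothetical and not part of the original process; in RapidMiner itself the macros would be defined in the process rather than in code.

```groovy
// Hypothetical helper (not part of the original process): derives the four
// macro values from a pdf path. All paths used below are placeholders.
Map<String, String> macrosFor(String pdfPath, String outputDir, String pdfboxJar) {
    File pdf = new File(pdfPath)
    // The script joins outputFileLocation and file_name by plain string
    // concatenation, so make sure the folder ends with a path separator.
    String dir = outputDir.endsWith(File.separator) ? outputDir : outputDir + File.separator
    return [
        file_path         : pdf.getPath(),   // full path to the pdf
        file_name         : pdf.getName(),   // name only, no folder information
        outputFileLocation: dir,
        pdfboxLocation    : pdfboxJar
    ]
}

def macros = macrosFor("/data/pdfs/report.pdf", "/data/text", "/opt/pdfbox/pdfbox-app.jar")
macros.each { name, value -> println "${name} = ${value}" }
```

Note that file_name keeps its .pdf extension, which is why the converted file ends up with a name ending in .pdf.txt.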
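Step 3 - Add a Groovy script to the process (for example, with the "Execute Script" operator) containing the following.
// Read the macros defined in step 2.
String file_name = operator.getProcess().macroHandler.getMacro("file_name");
String outputFileLocation = operator.getProcess().macroHandler.getMacro("outputFileLocation");
String file_path = operator.getProcess().macroHandler.getMacro("file_path");
String pdfboxLocation = operator.getProcess().macroHandler.getMacro("pdfboxLocation");
// Build the pdfbox command line, quoting each path in case it contains spaces.
String cmdLine = "java -jar " +
    "\"" + pdfboxLocation + "\"" +
    " ExtractText -force " +
    "\"" + file_path + "\" " +
    "\"" + outputFileLocation + file_name + ".txt" + "\"";
// Store the command in the cmdLine macro so that a later operator can run it.
operator.getProcess().macroHandler.addMacro("cmdLine", cmdLine);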
This stores the command line to be executed in a macro called cmdLine. It would be possible to run the command from Groovy, but I found it easier and quicker simply to use the "Execute Program" operator with the command set to %{cmdLine}; this is step 4. The command runs the pdfbox tool, converts the pdf to text, and creates a text file with the same name as the pdf but with .txt appended.
The crucial thing is that it is now possible to use the "Handle Exception" operator to ignore failures of the pdfbox conversion. I found that pdfs created with the option to disallow copying caused errors.
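For anyone who would rather run the conversion from Groovy directly, the following is a minimal sketch of how that might look outside RapidMiner. The function names buildCommand and convert and the example paths are hypothetical, and the sketch assumes that pdfbox signals a failed conversion through a non-zero exit code.

```groovy
// Builds the argument list for the pdfbox ExtractText command.
// Passing the arguments as a list avoids the manual quoting needed for cmdLine.
List<String> buildCommand(String pdfboxJar, String pdfPath, String outputDir, String fileName) {
    return ["java", "-jar", pdfboxJar, "ExtractText", "-force",
            pdfPath, outputDir + fileName + ".txt"]
}

// Runs the command and reports success or failure instead of throwing, which
// loosely mirrors combining "Execute Program" with "Handle Exception".
boolean convert(List<String> command) {
    try {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)      // merge stderr into stdout
                .start()
        // Drain the output so the child process cannot block on a full buffer.
        process.inputStream.eachLine { line -> println line }
        // Assumption: a non-zero exit code means the conversion failed.
        return process.waitFor() == 0
    } catch (IOException e) {
        println "Could not start the conversion: ${e.message}"
        return false
    }
}

// Placeholder paths, for illustration only.
def command = buildCommand("/opt/pdfbox/pdfbox-app.jar", "/data/pdfs/report.pdf",
                           "/data/text/", "report.pdf")
println command.join(" ")
```

Calling convert(command) would then return false for a pdf that pdfbox cannot process, so a loop over many files could log the failure and carry on with the next one.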