[R] Doing PDF OCR with R

Thu Aug 13 07:05:19 CEST 2015

Hi All,

I have been trying to do OCR within R (reading PDF files whose content is stored as scanned images). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/

This is a very good post.

Effectively 3 steps:

1. convert pdf to ppm (an image format)
2. convert ppm to tif ready for tesseract (using ImageMagick's convert)
3. convert tif to text file

The code for the above 3 steps, as per the linked post:

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
  })
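
For comparison, here is a minimal sketch of how the same three steps could be built as separate command/argument pairs and run with system2(), so that each exit status can be checked on its own. The helper names build_ocr_commands() and run_step() and the file "scans/book.pdf" are hypothetical; the executable paths are simply the ones from the code above. It assumes pdftoppm writes files named <root>-<page number>.ppm and that ImageMagick expands the *.ppm wildcard itself. This is only a sketch, not the method from the linked post.

```r
# Hypothetical helper (not from the linked post): build the three command
# lines for one PDF without running them, so each can be inspected first.
# The executable paths are the ones used in the code above.
build_ocr_commands <- function(pdf,
                               pdftoppm  = "F:/xpdf/bin64/pdftoppm.exe",
                               convert   = "F:/ImageMagick-6.9.1-Q16/convert.exe",
                               tesseract = "F:/Tesseract-OCR/tesseract.exe") {
  # "scans/book.pdf" -> "book"; used as the root name for all outputs
  stem <- tools::file_path_sans_ext(basename(pdf))
  tif  <- paste0(stem, ".tif")
  list(
    # pages 1-10 at 600 dpi, written as <stem>-<page>.ppm (assumed naming)
    to_ppm = list(command = pdftoppm,
                  args = c("-f", "1", "-l", "10", "-r", "600", pdf, stem)),
    # only this PDF's pages, not every .ppm in the working directory
    to_tif = list(command = convert,
                  args = c(paste0(stem, "-*.ppm"), tif)),
    # tesseract <image> <output base> -l <language>
    to_txt = list(command = tesseract,
                  args = c(tif, stem, "-l", "eng"))
  )
}

# Run one step and stop with a clear message if it exits with an error
run_step <- function(step) {
  status <- system2(step$command, args = shQuote(step$args))
  if (status != 0) {
    stop("step failed with exit status ", status, ": ", step$command)
  }
  invisible(status)
}

cmds <- build_ocr_commands("scans/book.pdf")
str(cmds$to_txt)                      # inspect the tesseract call before running it
# for (step in cmds) run_step(step)   # uncomment once the paths are correct
```

Keeping the arguments as a vector means each one is quoted individually, rather than the whole command line being wrapped in a single pair of quotes.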
The first two steps work fine. They take a good amount of time for a 4-page PDF, but I will look into scalability later; first I want to see whether this works at all.

While running the 3rd step, i.e.

shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))

I get this error:

Error: evaluation nested too deeply: infinite recursion / options(expressions=)?

Or

Tesseract is crashing.
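
To narrow this down, the Tesseract step could be run on its own with its console output captured, so that whatever Tesseract prints before it stops is visible from R. The following is only a sketch: capture_command() is a hypothetical helper, and "page.tif" stands in for one of the tif files produced by step 2.

```r
# Hypothetical helper: run one external command and keep everything it
# prints (stdout and stderr) together with its exit status.
capture_command <- function(command, args = character()) {
  out <- suppressWarnings(
    system2(command, args = args, stdout = TRUE, stderr = TRUE)
  )
  status <- attr(out, "status")   # NULL when the command exited with 0
  list(status = if (is.null(status)) 0L else as.integer(status),
       output = as.character(out))
}

# Example call (assumes page.tif exists in the working directory):
# res <- capture_command("F:/Tesseract-OCR/tesseract.exe",
#                        c("page.tif", "page", "-l", "eng"))
# res$status   # non-zero means Tesseract reported a failure
# res$output   # the messages Tesseract printed
```

If the same command also fails when typed directly at the Windows command prompt, the problem is more likely in Tesseract or the tif file than in R.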

Any workaround or root cause analysis would be appreciated.

Regards,
Anshuk Pal Chaudhuri
