[R] Doing PDF OCR with R
Anshuk Pal Chaudhuri
anshuk.p at motivitylabs.com
Thu Aug 13 07:05:19 CEST 2015
Hi All,
I have been trying to do OCR within R (reading PDF data which data as scanned image). Have been reading about this @ http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/
This a very good post.
Effectively 3 steps:
convert pdf to ppm (an image format)
convert ppm to tif ready for tesseract (using ImageMagick for convert)
convert tif to text file
The effective code for the above 3 steps as per the link post:
lapply(myfiles, function(i){
# convert pdf to ppm (an image format), just pages 1-10 of the PDF
# but you can change that easily, just remove or edit the
# -f 1 -l 10 bit in the line below
shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
# convert ppm to tif ready for tesseract
shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
# convert tif to text file
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
# delete tif file
file.remove(paste0(i, ".tif" ))
})
The first two steps are happening fine. (although taking good amount of time, for 4 pages of a pdf, but will look into the scalability part later, first trying if this works or not)
While running this, the first two steps work fine.
While runinng the 3rd step, i.e
**shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))**
I having this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Or
Tesseract is crashing.
Any workaround or root cause analysis would be appreciated.
Regards,
Anshuk Pal Chaudhuri
[[alternative HTML version deleted]]
More information about the R-help
mailing list