[R] Extracting the first currency value from PDF files
Rasmus Liland
jr@| @end|ng |rom po@teo@no
Wed May 13 16:17:06 CEST 2020
On 2020-05-13 06:44 -0700, Jeff Newmiller wrote:
> On May 13, 2020 6:33:03 AM PDT, Manish Mukherjee wrote:
> >
> > How to extract this value from a number
> > of PDF files and put it in a data frame.
>
> they could be part of embedded bitmaps.
Dear Manish and Jeff,
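I recently found the programs pdftoppm [1]
and Google's Tesseract [2] to be really useful
for reading text from PDFs formatted as "a
single column of text of variable sizes",
e.g. a receipt from a grocery store :)
pdftoppm renders each page to an image and
tesseract runs OCR on it, so this also works
when the text is an embedded bitmap: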
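folder <- "path/to/pdfs"
# Escape the dot so only names ending in ".pdf" match
pdfs <- list.files(folder, "\\.pdf$")
pdf <- pdfs[1]
# Render the pdf to png at 500 dpi, then OCR page 1 to stdout.
# -l nor: Norwegian language data
# --psm 4: assume a single column of text of variable sizes
cmd <-
  paste0("pdftoppm -png -r 500 ",
         shQuote(file.path(folder, pdf)), " /tmp/out && ",
         "tesseract /tmp/out-1.png - ",
         "-l nor --psm 4")
lines <- system(cmd, intern=TRUE)
# For all files: build one such command per pdf into x, then
# x <- lapply(x, system, intern=TRUE)
# names(x) <- pdfs
# saveRDS(x, "texts.rds")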
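To get from the OCR text to the first currency value per file in a data frame, something like the following might do. This is only a sketch: first_amount() is a made-up helper name, the sample lines are invented stand-ins for what system() returns, and the regular expression assumes amounts with exactly two decimals and an optional space, dot or comma as thousands separator. It will also happily match the start of a date such as 13.05.2020, so the pattern needs adapting to the actual receipts.

```r
# Hypothetical helper: first currency-like number in a vector of text lines.
# Assumes amounts such as "21,90", "99.90", "1 234,50" or "1,234.50".
first_amount <- function(lines) {
  pattern <- "[0-9]+([ .,][0-9]{3})*[.,][0-9]{2}"
  # regmatches() keeps only the lines that matched, in order
  hits <- regmatches(lines, regexpr(pattern, lines))
  if (length(hits) == 0) NA_character_ else hits[1]
}

# Invented sample text standing in for the OCR output of two pdfs
texts <- list(
  "a.pdf" = c("KIWI 123 OSLO", "Melk 1l    21,90", "SUM    1 234,50"),
  "b.pdf" = c("no amounts here")
)

# One row per file; amounts stay character because the decimal
# mark depends on the locale of the receipt
result <- data.frame(
  file = names(texts),
  amount = vapply(texts, first_amount, character(1)),
  stringsAsFactors = FALSE
)
print(result)
```

With the real data, texts would be the named list x built in the commented lines above.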
In any other case with a sensibly formatted
pdf, I would have used pdftotext [3] ...
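For that case, a minimal sketch of calling pdftotext from R could look like this. The function name pdf_text_lines() is made up for illustration, and it assumes pdftotext from poppler-utils is on the PATH; the -layout option and "-" for writing to stdout are described in the man page [3].

```r
# Hypothetical wrapper: return the text layer of a pdf, one element per line.
# Assumes the pdftotext binary (poppler-utils) is installed and on the PATH.
pdf_text_lines <- function(path) {
  # -layout keeps the physical layout; "-" writes the text to stdout
  system2("pdftotext", c("-layout", shQuote(path), "-"), stdout = TRUE)
}

# Usage, only attempted when the binary and the file are both present
f <- file.path("path/to/pdfs", "example.pdf")  # placeholder file name
if (nzchar(Sys.which("pdftotext")) && file.exists(f)) {
  lines <- pdf_text_lines(f)
  print(head(lines))
}
```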
Best,
Rasmus
[1] https://manpages.debian.org/buster/poppler-utils/pdftoppm.1.en.html
[2] https://manpages.debian.org/buster/tesseract-ocr/tesseract.1.en.html
[3] https://manpages.debian.org/buster/poppler-utils/pdftotext.1.en.html