[R] "Complex?" import of pdf files (criminal records) into R table
b.rowlingson at lancaster.ac.uk
Thu Oct 15 17:10:23 CEST 2009
On Thu, Oct 15, 2009 at 3:28 PM, Marc Schwartz <marc_schwartz at me.com> wrote:
> On Oct 15, 2009, at 3:43 AM, Biedermann, Jürgen wrote:
> You don't indicate the OS you are on, but you will want to get a hold of
> 'pdftotext', which is a command line application that can extract the
> textual content from the PDF files.
That's assuming the text is in the PDF as a text object. If it's a
scan of a paper document the chances are that all you have is an
image, in which case you need to do OCR (optical character
recognition) or get someone to type it all in again.
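
A quick way to tell the two cases apart is to run pdftotext and see whether anything readable comes back. The following is only a minimal Python sketch: the function names and the 20-character threshold are my own assumptions, and it presumes pdftotext is installed and on the PATH.

```python
import shutil
import subprocess


def extract_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def looks_like_scan(text, min_chars=20):
    """Heuristic: hardly any visible characters suggests an image-only PDF.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    visible = "".join(text.split())  # drop all whitespace, including form feeds
    return len(visible) < min_chars
```

If looks_like_scan(extract_text("some_file.pdf")) comes back True, the file most likely needs OCR before any further processing is worthwhile.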
Even if you can get all the text out with pdftotext, R might not be the
right tool for the job - I'd do this kind of text processing and
matching job in Python (and before Python, I'd have used Perl). But if
all you have is a wRench...
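
To give an idea of what the Python route might look like, here is a rough sketch. The actual layout of the records was not shown in this thread, so everything about the input format is an assumption: I pretend each record is a block of "Label: value" lines and that records are separated by blank lines. The function names are likewise made up for illustration.

```python
import csv
import io
import re

# Assumed line format: "Label: value" (the real records may look different)
FIELD = re.compile(r"^\s*([^:]+?)\s*:\s*(.*?)\s*$")


def parse_records(text):
    """Split text into blank-line-separated blocks and turn each into a dict."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        record = {}
        for line in block.splitlines():
            match = FIELD.match(line)
            if match:
                record[match.group(1)] = match.group(2)
        if record:
            records.append(record)
    return records


def to_csv(records):
    """Return the records as CSV text, one column per label seen."""
    fields = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)  # labels missing from a record become empty cells
    return buf.getvalue()
```

Writing the result of to_csv() to a file gives something read.csv() can load on the R side, so the matching and cleaning happens in Python and the analysis stays in R.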