[R] "Complex?" import of pdf files (criminal records) into R table

Marc Schwartz marc_schwartz at me.com
Thu Oct 15 17:23:47 CEST 2009


On Oct 15, 2009, at 10:10 AM, Barry Rowlingson wrote:

> On Thu, Oct 15, 2009 at 3:28 PM, Marc Schwartz  
> <marc_schwartz at me.com> wrote:
>> On Oct 15, 2009, at 3:43 AM, Biedermann, Jürgen wrote:
>
>> You don't indicate the OS you are on, but you will want to get hold of
>> 'pdftotext', which is a command line application that can extract the
>> textual content from the PDF files.
>
> That's assuming the text is in the PDF as a text object. If it's a
> scan of a paper document the chances are that all you have is an
> image, in which case you need to do OCR (optical character
> recognition) or get someone to type it all in again.

Good point... a scanned image would certainly complicate matters. Even
with OCR, you introduce the potential for error in the translation of
the image to text, and you risk formatting issues that can lead to
inconsistencies in page layouts.
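
One rough way to tell the two cases apart from within R is to run
'pdftotext' and see whether any text comes back. The sketch below is
only a heuristic under stated assumptions: 'pdftotext' is on the PATH,
the helper names (extract_pdf_text, looks_scanned) and the min_chars
threshold are made up for illustration, and a scan that already carries
an OCR text layer would still pass as text.

```r
# Run pdftotext on one file and capture its output as a character vector.
# "-layout" keeps the physical layout; "-" sends the text to stdout.
# Assumes the pdftotext executable is installed and on the PATH.
extract_pdf_text <- function(pdf_file) {
  system2("pdftotext",
          args = c("-layout", shQuote(pdf_file), "-"),
          stdout = TRUE)
}

# Heuristic: treat a PDF as "probably a scanned image" when the extracted
# text holds fewer than min_chars non-whitespace characters.
# min_chars is an arbitrary illustrative threshold, not a tested value.
looks_scanned <- function(text_lines, min_chars = 20) {
  n_chars <- sum(nchar(gsub("[[:space:]]", "", text_lines)))
  n_chars < min_chars
}
```

Usage would be along the lines of
looks_scanned(extract_pdf_text("record.pdf")), where "record.pdf" is a
placeholder file name. Files flagged this way would need OCR or manual
entry before any parsing in R.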

Cheers,

Marc
