[R] parsing pdf files

Mark Wardle mark at wardle.org
Sun Jan 10 18:06:33 CET 2010


[copied to list for posterity...]


Sorry, I was completely wrong. I've been using iText to split PDFs, fill in
forms and recombine them, so I assumed (wrongly) that text extraction was
also possible.

In fact, reading the mailing lists is quite informative: PDF is clearly
not designed for text extraction.

Try PDFBox's ExtractText utility instead:

http://pdfbox.apache.org/commandlineutilities/ExtractText

It can be run from the command line, so it could potentially be automated.
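
A rough sketch of how that automation might look from R, in case it helps.
It assumes that Java is on the PATH and that you have downloaded a PDFBox
"app" jar yourself. The function names (pdfbox_args, read_pdf_text) and the
file names in the usage comment are made up for illustration. It also
assumes the tool is invoked as "ExtractText <input> <output>", as on the
page above; other PDFBox releases may spell the command differently, so
check the documentation for your version.

```r
# Build the argument vector for: java -jar <jar> ExtractText <pdf> <txt>
# (hypothetical helper; the "ExtractText" invocation is assumed from the
# PDFBox page linked above)
pdfbox_args <- function(pdfbox_jar, pdf_file, txt_file) {
  c("-jar", pdfbox_jar, "ExtractText", pdf_file, txt_file)
}

# Run PDFBox on a PDF and return the extracted text as a character vector,
# one element per line, much like readLines() on a hand-saved text file.
read_pdf_text <- function(pdf_file, pdfbox_jar,
                          txt_file = tempfile(fileext = ".txt")) {
  args <- shQuote(pdfbox_args(pdfbox_jar, pdf_file, txt_file))
  status <- system2("java", args)
  if (status != 0) {
    stop("ExtractText failed with exit status ", status)
  }
  readLines(txt_file)
}

# Usage (placeholder file names):
# faculty <- read_pdf_text("faculty.pdf", "pdfbox-app.jar")
```

The result can then be parsed with the same code that currently handles the
text saved by hand from Acrobat, although the layout of the extracted text
may differ.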

Mark

2010/1/10 Mark Wardle <mark at wardle.org>:
> If you can use an R <-> Java interface, you could use iText to do this
> as long as the PDF is fairly sane.
>
> see http://itextpdf.com/
>
> It is what pdftk uses.
>
> b/w
>
> Mark
>
> 2010/1/9 David Kane <dave at kanecap.com>:
>> I have a pdf file that I would like to parse into R:
>>
>> http://www.williams.edu/Registrar/geninfo/faculty.pdf
>>
>> For now, I open the file in Acrobat by hand, then save it "as text"
>> and then use readLines(). That works fine but a) I am concerned that
>> some information may be lost and b) I may be doing this a lot, so I
>> would rather have R grab the information from the pdf file directly.
>>
>> So: is there something like readPDF() for R?
>>
>> Thanks,
>>
>> Dave Kane
>>
>> PS. If you're curious, here is the sort of work that I want to do with
>> this data:
>> http://www.ephblog.com/2010/01/08/class-update-and-faculty-ages/
>>



-- 
Dr. Mark Wardle
Specialist registrar, Neurology
Cardiff, UK


