[R] Extracting a chunk of text from a pdf file

Joshua Wiley jwiley.psych at gmail.com
Sun Sep 18 18:25:24 CEST 2011


On Sun, Sep 18, 2011 at 7:44 AM, Victor <vdemart at gmail.com> wrote:
> Unfortunately pdf2text doesn't seem to exist on either Linux or Mac OS X.

I think Jeff's main point was to search for software specific to your
task (converting a PDF to text).  Formatting will be lost, so once you
have your text files, I would use regular expressions to find the part
of the text you need.  Some general functions that seem like they
might be relevant:

## for getting the text into R
?readLines
?scan
## for finding the part you need
?regexp
?grep
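
To make those two steps concrete, here is a minimal sketch.  It assumes
the PDF has already been converted to plain text by some external tool
(pdftotext is one possibility, if it is available on your system), and
that the figure sits on a line of the form "Totale: 1,025,823" as in
your example.  The function name extract_totale and the sample lines
are made up for illustration, and none of this has been checked against
the actual file.

```r
## Hypothetical helper: return the number following "Totale:" in a
## character vector of text lines, or NA if no such line is found.
extract_totale <- function(lines) {
  ## keep only the matching piece of each line that contains it
  m <- regmatches(lines,
                  regexpr("Totale:[[:space:]]*[0-9][0-9,]*", lines))
  if (length(m) == 0L) return(NA_real_)
  ## drop the label, then the thousands separators
  num <- sub("Totale:[[:space:]]*", "", m[1])
  as.numeric(gsub(",", "", num, fixed = TRUE))
}

## Stand-in for the converter's output (assumed layout, not the real file)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Some header", "  Totale: 1,025,823  ", "Some footer"), tmp)

txt <- readLines(tmp)          # get the text into R
total <- extract_totale(txt)   # find the part you need
```

For many files, the same function could be applied over a vector of
file names, e.g. with sapply().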

Cheers,

Josh


> Ciao Vittorio
>
> On 17 Sep 2011, at 21:00, Jeff Newmiller wrote:
>
>> Doesn't seem like an R task, but see pdf2text? (From pdftools, UNIX command line tools)
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>     Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
>> Sent from my phone. Please excuse my brevity.
>>
>> Victor <vdemart at gmail.com> wrote:
>> In an R script I need to extract some figures from many web pages in PDF format. As an example, see http://www.terna.it/LinkClick.aspx?fileticket=TTQuOPUf%2fs0%3d&tabid=435&mid=3072 from which I would like to extract "Totale: 1,025,823".
>> Is there any solution?
>> Ciao
>> Vittorio
>>
>>
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


