[R] Reading a web page in pdf format

Marc Schwartz marc_schwartz at comcast.net
Wed May 9 19:08:21 CEST 2007


On Wed, 2007-05-09 at 10:55 -0500, Marc Schwartz wrote:
> On Wed, 2007-05-09 at 15:47 +0100, Vittorio wrote:
> > Each day the daily balance in the following link
> > 
> > http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf
> > 
> > is updated.
> > 
> > I would like to set up an R procedure, to be run daily on a server,
> > that reads the figures in just a couple of lines ("Industriale" and
> > "Termoelettrico", towards the end of the balance) and puts the data
> > in a table.
> > 
> > Is that possible? If yes, what R packages should I use?
> > 
> > Ciao
> > Vittorio
> 
> Vittorio,
> 
> Keep in mind that PDF files are often largely plain text, at least when
> their content streams are not compressed, as is the case here. Thus you
> can read the file in using readLines():
> 
> PDFFile <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")
> 
> # No clean up is needed: readLines() opens and closes the URL
> # connection itself, and unlink() only deletes local files
> 
> 
> > str(PDFFile)
>  chr [1:989] "%PDF-1.2" "6 0 obj" "<<" "/Length 7 0 R" ...
> 
> 
> # Now find the lines containing the values you wish
> # Use grep() with a regex for either term
> Lines <- grep("(Industriale|Termoelettrico)", PDFFile)
> 
> > Lines
> [1] 33 34
> 
> > PDFFile[Lines]
> [1] "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj"
> [2] "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"      
> 
> 
> # Now parse the values out of the lines
> Vals <- sub(".*\\((.*)\\).*", "\\1", PDFFile[Lines])
> 
> > Vals
> [1] "       46,6" "       99,3"
> 
> 
> # Now convert them to numeric
> # as.numeric() expects '.' as the decimal mark, so change the ',' first
> 
> > as.numeric(gsub(",", "\\.", Vals))
> [1] 46.6 99.3

Vittorio,

Just a quick tweak here, given the possibility that the order of the
values may be subject to change.

After reading the file and getting the lines, use:

# Use sub() with 2 back references, 1 for each value in the line
Vals <- sub(".*\\((.*)\\).*\\((.*)\\).*", "\\1 \\2", PDFFile[Lines])

> Vals
[1] "Industriale         46,6"    "Termoelettrico         99,3"
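
If you want to try the pattern without fetching the PDF, here is a small self-contained sketch that uses the two lines shown above as literal input. The names sample_lines, pattern, labels and values are mine and only for illustration; trimws() needs a reasonably recent R.

```r
# Two content-stream lines copied from the PDFFile[Lines] output above
# (sample data only, not fetched from the server)
sample_lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"
)

# First "(...)" is captured as the label, second "(...)" as the value
pattern <- ".*\\((.*)\\).*\\((.*)\\).*"

# Pull each back reference out separately and strip the padding
labels <- trimws(sub(pattern, "\\1", sample_lines))
values <- trimws(sub(pattern, "\\2", sample_lines))

print(labels)
print(values)
```

Keeping the label next to the value means nothing depends on the rows staying in the same order. Note that sub() returns a line unchanged if it does not contain two parenthesised strings, so it is worth checking the result if the PDF layout ever changes.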


This gives us the labels and the values. Now convert to a data frame and
then coerce the values to numeric:

DF <- read.table(textConnection(Vals))

> DF
              V1   V2
1    Industriale 46,6
2 Termoelettrico 99,3


DF$V2 <- as.numeric(sub(",", "\\.", DF$V2))

> DF
              V1   V2
1    Industriale 46.6
2 Termoelettrico 99.3
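
As an aside, read.table() can deal with the decimal comma itself through its 'dec' argument, which makes the separate sub() step optional. A minimal sketch, with Vals typed in by hand from the output above; the column names and the name DF2 are my own choices, not part of the code above:

```r
# Vals as produced by the sub() call above, entered literally here
Vals <- c("Industriale         46,6", "Termoelettrico         99,3")

# dec = "," tells read.table() to treat the comma as the decimal mark;
# text = lets us pass the character vector without textConnection()
DF2 <- read.table(text = Vals, dec = ",",
                  col.names = c("Sector", "Value"),
                  stringsAsFactors = FALSE)

str(DF2)
```

With stringsAsFactors = FALSE the labels stay as character strings rather than a factor, which tends to be easier to work with if you accumulate rows over many days.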


> str(DF)
'data.frame':   2 obs. of  2 variables:
 $ V1: Factor w/ 2 levels "Industriale",..: 1 2
 $ V2: num  46.6 99.3
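
Since the aim is a procedure that runs daily, the steps above could be wrapped up roughly as follows. This is only a sketch under some assumptions: the PDF stays uncompressed and keeps the "(label )Tj ... (value)Tj" layout on one line, and the function names extract_balance() and append_balance() and the file name "bilancio.csv" are hypothetical, chosen just for this example. The demo runs on the two lines shown earlier so that it works offline.

```r
# Hypothetical helper: pull label/value pairs out of PDF content-stream lines.
# Assumes each matching line holds "(label )Tj ... (value)Tj", as seen above.
extract_balance <- function(pdf_lines,
                            terms = c("Industriale", "Termoelettrico")) {
  hits <- grep(paste(terms, collapse = "|"), pdf_lines, value = TRUE)
  if (length(hits) == 0) stop("no matching lines; has the PDF layout changed?")
  pattern <- ".*\\((.*)\\).*\\((.*)\\).*"
  raw_values <- trimws(sub(pattern, "\\2", hits))
  data.frame(
    Date   = Sys.Date(),                              # recycled to every row
    Sector = trimws(sub(pattern, "\\1", hits)),
    Value  = as.numeric(sub(",", ".", raw_values, fixed = TRUE)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: append the rows to a CSV file, writing the header
# only when the file does not exist yet
append_balance <- function(df, csv_file = "bilancio.csv") {
  new_file <- !file.exists(csv_file)
  write.table(df, csv_file, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file)
}

# Demo on the two lines shown earlier (sample data only)
demo_lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"
)
balance <- extract_balance(demo_lines)
print(balance)

# In the daily job one would instead do something like:
# pdf_lines <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")
# append_balance(extract_balance(pdf_lines))
```

The script could then be run once a day with Rscript from cron or a similar scheduler. If the value cannot be parsed, as.numeric() gives NA with a warning, which is a useful signal that the layout of the PDF has changed.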


HTH,

Marc


