[R] Reading a web page in pdf format

Marc Schwartz marc_schwartz at comcast.net
Wed May 9 17:55:39 CEST 2007


On Wed, 2007-05-09 at 15:47 +0100, Vittorio wrote:
> Each day the daily balance in the following link
> 
> http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf
> 
> is updated.
> 
> I would like to set up an R procedure, to be run daily on a server,
> that reads the figures in just a couple of lines ("Industriale" and
> "Termoelettrico", towards the end of the balance) and puts the data
> in a table.
> 
> Is that possible? If yes, what R packages should I use?
> 
> Ciao
> Vittorio

Vittorio,

Keep in mind that PDF files are often largely plain text, at least when
their content streams are not compressed, as is the case for this one.
Thus you can read the file in using readLines():

PDFFile <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")

# No cleanup is needed: given a URL string, readLines() opens and
# closes the connection itself, so there is no local file to unlink()


> str(PDFFile)
 chr [1:989] "%PDF-1.2" "6 0 obj" "<<" "/Length 7 0 R" ...
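
The approach used here depends on the page text being stored uncompressed in
the file. Many PDF files compress their content streams, commonly with the
/FlateDecode filter, and then grep() will not find the labels. A minimal
sketch of a check for that, where has_compressed_streams() and the two
made-up example vectors are hypothetical names and data for illustration,
not part of the original procedure:

```r
# Hypothetical helper: reports whether any line of a PDF read via
# readLines() declares a Flate-compressed stream.
has_compressed_streams <- function(pdf_lines) {
  any(grepl("/FlateDecode", pdf_lines, fixed = TRUE, useBytes = TRUE))
}

# Two tiny made-up examples standing in for real PDF contents
plain_pdf  <- c("%PDF-1.2", "6 0 obj", "<<", "/Length 7 0 R", ">>")
packed_pdf <- c("%PDF-1.4", "6 0 obj",
                "<< /Filter /FlateDecode /Length 7 0 R >>")

has_compressed_streams(plain_pdf)   # FALSE
has_compressed_streams(packed_pdf)  # TRUE
```

If the check returns TRUE, the text would have to be extracted with a
dedicated PDF tool before the grep() step can work.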


# Now find the lines containing the values you wish
# Use grep() with a regex for either term
Lines <- grep("(Industriale|Termoelettrico)", PDFFile)

> Lines
[1] 33 34

> PDFFile[Lines]
[1] "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj"
[2] "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"      


# Now parse the values out of the lines
Vals <- sub(".*\\((.*)\\).*", "\\1", PDFFile[Lines])

> Vals
[1] "       46,6" "       99,3"
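
The sub() call keeps only the last parenthesised group on each line, because
the leading ".*" is greedy. If the labels are wanted alongside the values, one
possible sketch pulls out every "( ... )" group with gregexpr() and
regmatches(). The helper name extract_parens() is hypothetical, and the two
input lines are copied from the output shown earlier:

```r
pdf_lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"
)

# Hypothetical helper: returns, for each line, the trimmed contents of
# every "( ... )" group found on that line.
extract_parens <- function(x) {
  matches <- regmatches(x, gregexpr("\\([^)]*\\)", x))
  lapply(matches, function(s) trimws(gsub("^\\(|\\)$", "", s)))
}

parts  <- extract_parens(pdf_lines)
labels <- vapply(parts, function(p) p[1], character(1))
values <- vapply(parts, function(p) p[2], character(1))

labels  # "Industriale" "Termoelettrico"
values  # "46,6" "99,3"
```

This assumes each matching line holds exactly one label group followed by one
value group, as in the two lines above.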


# Now convert them to numeric
# The file uses a decimal comma, and as.numeric() expects a '.',
# so change the ',' to a '.' first

> as.numeric(gsub(",", ".", Vals, fixed = TRUE))
[1] 46.6 99.3
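
Since the goal is a table filled by a daily job, the steps above could be
wrapped into one function. What follows is only a sketch under assumptions:
parse_balance(), its day argument and the column names are hypothetical, and
the sample lines are the two shown earlier rather than a fresh download.

```r
# Hypothetical helper: turns the matching PDF lines into a small table
# with one row per item.
parse_balance <- function(pdf_lines, day = Sys.Date()) {
  pattern <- "(Industriale|Termoelettrico)"
  hits <- grep(pattern, pdf_lines, value = TRUE)
  # Label: the parenthesised group that holds one of the two terms
  item <- sub(".*\\((Industriale|Termoelettrico) *\\).*", "\\1", hits)
  # Value: the last parenthesised group on the line
  value <- sub(".*\\((.*)\\).*", "\\1", hits)
  data.frame(
    date  = rep(day, length(hits)),
    item  = item,
    value = as.numeric(sub(",", ".", value, fixed = TRUE)),
    stringsAsFactors = FALSE
  )
}

sample_lines <- c(
  "%PDF-1.2",
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"
)

balance <- parse_balance(sample_lines, day = as.Date("2007-05-09"))
balance$value  # 46.6 99.3
```

In the daily run, the input would instead come from readLines() on the URL,
and each day's rows could be appended to a file or database table of your
choosing.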


HTH,

Marc Schwartz


