[R] Reading a web page in pdf format
Marc Schwartz
marc_schwartz at comcast.net
Wed May 9 19:08:21 CEST 2007
On Wed, 2007-05-09 at 10:55 -0500, Marc Schwartz wrote:
> On Wed, 2007-05-09 at 15:47 +0100, Vittorio wrote:
> > Each day the daily balance in the following link
> >
> > http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf
> >
> > is updated.
> >
> > I would like to set up an R procedure, run daily on a server, that
> > reads the figures in just two lines ("Industriale" and
> > "Termoelettrico", towards the end of the balance) and puts the data
> > in a table.
> >
> > Is that possible? If yes, what R packages should I use?
> >
> > Ciao
> > Vittorio
>
> Vittorio,
>
> Keep in mind that PDF files are often largely plain text, at least when
> their content streams are not compressed. Thus you can read one in
> using readLines():
>
> PDFFile <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")
>
> # No clean up is needed: readLines() opens and closes the URL
> # connection itself, and no local copy of the file is created
>
>
> > str(PDFFile)
> chr [1:989] "%PDF-1.2" "6 0 obj" "<<" "/Length 7 0 R" ...
>
>
> # Now find the lines containing the values you wish
> # Use grep() with a regex for either term
> Lines <- grep("(Industriale|Termoelettrico)", PDFFile)
>
> > Lines
> [1] 33 34
>
> > PDFFile[Lines]
> [1] "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj"
> [2] "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
>
>
> # Now parse the values out of the lines
> Vals <- sub(".*\\((.*)\\).*", "\\1", PDFFile[Lines])
>
> > Vals
> [1] " 46,6" " 99,3"
>
>
> # Now convert them to numeric
> # need to change the ',' to a '.' at least in my locale
>
> > as.numeric(gsub(",", "\\.", Vals))
> [1] 46.6 99.3
Vittorio,

Just a quick tweak here, given the possibility that the order of the
values may be subject to change.

After reading the file and getting the lines, use:

# Use sub() with 2 back references, 1 for each value in the line
Vals <- sub(".*\\((.*)\\).*\\((.*)\\).*", "\\1 \\2", PDFFile[Lines])

> Vals
[1] "Industriale 46,6" "Termoelettrico 99,3"
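
For anyone who wants to try the two back references without fetching the file, here is a minimal sketch that runs offline. It assumes the two lines have the same layout as the PDFFile[Lines] output quoted earlier in this thread. SampleLines and Pairs are names made up for the illustration. The whitespace clean-up at the end is an extra step, added because the text inside the parentheses carries padding spaces.

```r
# Two content-stream lines copied from the PDFFile[Lines] output quoted above
SampleLines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)

# \\1 is the text inside the first pair of parentheses (the label),
# \\2 is the text inside the second pair (the value)
Pairs <- sub(".*\\((.*)\\).*\\((.*)\\).*", "\\1 \\2", SampleLines)

# The captured strings are padded with spaces, so collapse runs of
# whitespace into single spaces and trim both ends
Pairs <- gsub("[[:space:]]+", " ", Pairs)
Pairs <- sub("^ ", "", sub(" $", "", Pairs))

print(Pairs)
```

Each line holds exactly two parenthesised strings, so the greedy ".*" pieces can only match one way. A line with more than two such strings would need a stricter pattern.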

This gives us the labels and the values. Now convert to a data frame and
then coerce the values to numeric:

DF <- read.table(textConnection(Vals))

> DF
              V1   V2
1    Industriale 46,6
2 Termoelettrico 99,3

DF$V2 <- as.numeric(sub(",", "\\.", DF$V2))

> DF
              V1   V2
1    Industriale 46.6
2 Termoelettrico 99.3
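
As a possible shortcut, read.table() can be told about the decimal comma directly through its 'dec' argument, which may save the separate sub() step. This is a sketch that assumes the strings look like the Vals output shown above. DF2 and the column names "Label" and "Value" are made up for the illustration.

```r
# Label/value strings in the same form as the Vals output above
Vals <- c("Industriale 46,6", "Termoelettrico 99,3")

# dec = "," makes read.table() treat the comma as the decimal mark,
# so the second column is read as numeric straight away
DF2 <- read.table(text = Vals, dec = ",",
                  col.names = c("Label", "Value"),
                  stringsAsFactors = FALSE)

str(DF2)
```

This only works if every value in the column uses the comma consistently. A stray non-numeric entry would leave the column as character.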

> str(DF)
'data.frame':   2 obs. of  2 variables:
 $ V1: Factor w/ 2 levels "Industriale",..: 1 2
 $ V2: num  46.6 99.3
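
Since the goal is a daily run, the steps could be wrapped in one function along the following lines. This is only a sketch under assumptions: extract_balance() is a hypothetical helper, not something from a package, and it expects the PDF text to be uncompressed with each label and its value on the same line, as in the file seen today. It takes the lines as an argument so that it can be checked offline.

```r
# Hypothetical helper: 'pdf_lines' is a character vector such as the
# one returned by readLines() on the PDF
extract_balance <- function(pdf_lines,
                            labels = c("Industriale", "Termoelettrico")) {
  # Keep only the lines mentioning one of the labels
  pattern <- paste0("(", paste(labels, collapse = "|"), ")")
  hits <- grep(pattern, pdf_lines, value = TRUE)
  if (length(hits) == 0) {
    stop("no matching lines found; the PDF layout may have changed")
  }

  # Pull the two parenthesised strings out separately
  two_groups <- ".*\\((.*)\\).*\\((.*)\\).*"
  strip <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)
  label <- strip(sub(two_groups, "\\1", hits))
  value <- strip(sub(two_groups, "\\2", hits))

  # Swap the decimal comma for a point before converting
  data.frame(Date  = Sys.Date(),
             Label = label,
             Value = as.numeric(sub(",", ".", value, fixed = TRUE)),
             stringsAsFactors = FALSE)
}

# Offline check with the two lines quoted earlier in the thread
sample_lines <- c(
  "%PDF-1.2",
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)
today <- extract_balance(sample_lines)
print(today)
```

In the scheduled job one would pass the result of readLines() on the URL instead of sample_lines, and could append each day's rows to a file with write.table(..., append = TRUE).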
HTH,
Marc