[R] web scraping image

Curtis DeGasperi curtis.degasperi at gmail.com
Mon Jun 8 20:09:09 CEST 2015


Thanks to Jim's prompting, I came up with a fairly painless way to parse
the HTML without having to write any parsing code myself, using the
function getHTMLExternalFiles in the XML package. A working version of the
code follows:
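
First, a quick check that getHTMLExternalFiles() picks up the image path
on a single page (a sketch; the exact vector returned depends on the page
contents):

## Sanity check on one page
require(XML)
u <- 'http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500&agency_cd=USGS&format=img'
getHTMLExternalFiles(u)
## returns relative paths such as
## "/nwisweb/data/img/USGS.12144500.19581112.20140309..0.peak.pres.gif"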

## Code to process USGS peak flow data

require(dataRetrieval)
require(XML)

## Need to start with list of gauge ids to process

siteno <- c('12142000','12134500','12149000')

lstas <- length(siteno) # length of locator list

print(paste('Processing...',siteno[1], sep = ""))

datall <- readNWISpeak(siteno[1])

for (a in 2:lstas) {
  # Print station being processed
  print(paste('Processing...',siteno[a], sep = ""))

  dat <- readNWISpeak(siteno[a])

  datall <- rbind(datall,dat)

}

write.csv(datall, file = "usgs_peaks.csv")
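
An aside: the same table can be built without growing datall with rbind
inside a loop; a minimal sketch of the equivalent, assuming the same
siteno vector:

## Fetch each site's peaks into a list, then bind once
datall <- do.call(rbind, lapply(siteno, readNWISpeak))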

# Retrieve ascii text files and graphics
for (a in 1:lstas) {

  print(paste('Processing...',siteno[a], sep = ""))

  # Page that embeds the dynamically generated peak-flow graphic
  graphic.url <- paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',
                       siteno[a],'&agency_cd=USGS&format=img', sep = "")
  # Pull the relative path of the embedded image out of the page source
  usgs.img <- getHTMLExternalFiles(graphic.url)
  graphic.img <- paste('http://nwis.waterdata.usgs.gov',usgs.img, sep = "")

  peakfq.url <- paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',
                      siteno[a],'&agency_cd=USGS&format=hn2', sep = "")
  tab.url <- paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',
                   siteno[a],'&agency_cd=USGS&format=rdb', sep = "")

  graphic.fn <- paste('graphic_',siteno[a],'.gif', sep = "")
  peakfq.fn <- paste('peakfq_',siteno[a],'.txt', sep = "")
  tab.fn <- paste('tab_',siteno[a],'.txt', sep = "")

  # mode = 'wb' is needed so the gif is written as a binary file
  download.file(graphic.img,graphic.fn,mode='wb')
  download.file(peakfq.url,peakfq.fn)
  download.file(tab.url,tab.fn)
}
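
For reference, the manual alternative Jim describes in the quoted message
below - splitting the page source at double quotes and taking the token
after "img src=" - might look something like this (a sketch, assuming the
attribute appears literally as img src="..." in the page source):

## String-splitting sketch (no XML package needed)
page.src <- paste(readLines(graphic.url), collapse = " ")
tokens <- strsplit(page.src, '"', fixed = TRUE)[[1]]
img.path <- tokens[grep("img src=$", tokens) + 1]
graphic.img <- paste('http://nwis.waterdata.usgs.gov', img.path, sep = "")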

> ------------------------------
>
> Date: Fri, 5 Jun 2015 08:59:04 +1000
> From: Jim Lemon <drjimlemon at gmail.com>
> To: Curtis DeGasperi <curtis.degasperi at gmail.com>
> Cc: r-help mailing list <r-help at r-project.org>
> Subject: Re: [R] web scraping image
>
> Hi Curtis,
> I don't have the packages you are using, but tracing this indicates
> that the page source contains the relative path of the graphic, in
> this case:
>
> /nwisweb/data/img/USGS.12144500.19581112.20140309..0.peak.pres.gif
>
> and you already have the server URL:
>
> nwis.waterdata.usgs.gov
>
> Getting the path out of the page source isn't difficult: just split
> the text at double quotes and get the token following "img src=". If I
> understand the arguments of "download.file" correctly, the path is the
> graphic.fn argument and the server URL is the graphic.url argument. I
> would paste them together and display the result to make sure that it
> matches the image you want. When I did this, the correct image
> appeared in my browser. I'm using Google Chrome, so I don't have to
> prepend the http://.
>
> Jim
>
> On Fri, Jun 5, 2015 at 2:31 AM, Curtis DeGasperi
> <curtis.degasperi at gmail.com> wrote:
>> I'm working on a script that downloads data from the USGS NWIS server.
>> dataRetrieval makes it easy to quickly get the data in a neat tabular
>> format, but I was also interested in getting the tabular text files -
>> also fairly easy for me using download.file.
>>
>> However, I'm not skilled enough to work out how to download the nice
>> graphic files that can be produced dynamically from the USGS NWIS
>> server (for example:
>> http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500&agency_cd=USGS&format=img)
>>
>> My question is how do I get the image from this web page and save it
>> to a local directory? scrapeR returns the information from the page
>> and I suspect this is a possible solution path, but I don't know what
>> the next step is.
>>
>> My code provided below works from a list I've created of USGS flow
>> gauging stations.
>>
>> Curtis
>>
>> ## Code to process USGS daily flow data for high and low flow analysis
>> ## Need to start with list of gauge ids to process
>> ## Can't figure out how to automate download of images
>>
>> require(dataRetrieval)
>> require(data.table)
>> require(scrapeR)
>>
>> df <- read.csv("usgs_stations.csv", header=TRUE)
>>
>> lstas <-length(df$siteno) #length of locator list
>>
>> print(paste('Processing...',df$name[1],' ',df$siteno[1], sep = ""))
>>
>> datall <-  readNWISpeak(df$siteno[1])
>>
>> for (a in 2:lstas) {
>>   # Print station being processed
>>   print(paste('Processing...',df$name[a],' ',df$siteno[a], sep = ""))
>>
>>   dat<-  readNWISpeak(df$siteno[a])
>>
>>   datall <- rbind(datall,dat)
>>
>> }
>>
>> write.csv(datall, file = "usgs_peaks.csv")
>>
>> # Retrieve ascii text files and graphics
>>
>> for (a in 1:lstas) {
>>
>>   print(paste('Processing...',df$name[a],' ',df$siteno[a], sep = ""))
>>
>>   graphic.url <- paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',
>>                        df$siteno[a],'&agency_cd=USGS&format=img', sep = "")
>>   peakfq.url <- paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',
>>                       df$siteno[a],'&agency_cd=USGS&format=hn2', sep = "")
>>   tab.url <- paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',
>>                    df$siteno[a],'&agency_cd=USGS&format=rdb', sep = "")
>>
>>   graphic.fn <- paste('graphic_',df$siteno[a],'.gif', sep = "")
>>   peakfq.fn <- paste('peakfq_',df$siteno[a],'.txt', sep = "")
>>   tab.fn  <- paste('tab_',df$siteno[a],'.txt', sep = "")
>>
>>   # This apparently doesn't work - file is empty
>>   download.file(graphic.url,graphic.fn,mode='wb')
>>   download.file(peakfq.url,peakfq.fn)
>>   download.file(tab.url,tab.fn)
>> }
>>
>> # scrapeR
>> pageSource<-scrape(url="
http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500&agency_cd=USGS&format=img
",headers=TRUE,
>> parse=FALSE)
>> page<-scrape(object="pageSource")
