[R] scraping with session cookies

Duncan Temple Lang dtemplelang at ucdavis.edu
Wed Sep 19 08:15:59 CEST 2012


Hi,

The key is that you want to use the same curl handle
for both the postForm() and for getting the data document.

site = u =
"http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

library(RCurl)
curl = getCurlHandle(cookiefile = "", verbose = TRUE)

# Pass the handle explicitly so the session cookie is stored in it
postForm(site, disclaimer_action = "I Agree", curl = curl)

Now we have the cookie in the curl handle so we can use that same curl handle
to request the data document:

txt = getURLContent(u, curl = curl)
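
If the page still comes back as the disclaimer, it can help to check
whether the handle is really holding a cookie. A minimal sketch, assuming
getCurlInfo() reports the cookies under the name "cookielist" (the helper
name showCookies is made up for illustration):

```r
library(RCurl)

# Hypothetical helper: list the cookies currently held by a curl handle.
showCookies = function(curl) {
    # getCurlInfo() returns a named list of details about the handle;
    # the 'cookielist' element is assumed to hold one string per cookie
    getCurlInfo(curl)$cookielist
}

# Usage, once postForm() has been called with this same handle:
#   showCookies(curl)
```

An empty result after the postForm() call would suggest the form
submission did not go through the handle you are inspecting.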

Now we can use readHTMLTable() on the local document content:

library(XML)
tt = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
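
The steps above can be bundled into one function so that the handle
never gets separated from the requests. This is only a sketch: the
function names fetchTable and parseFirstTable are made up here, and it
assumes the site still expects the disclaimer_action field and still
serves the data at the same URL after the form is accepted.

```r
library(RCurl)
library(XML)

# Hypothetical helper: parse the first HTML table from page content held in memory.
parseFirstTable = function(txt) {
    readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
}

# Hypothetical helper: accept the disclaimer and fetch the table,
# using one curl handle for both requests.
fetchTable = function(url) {
    # cookiefile = "" enables cookie handling without reading a cookie file
    curl = getCurlHandle(cookiefile = "")

    # Submit the disclaimer form; the session cookie is kept in 'curl'
    postForm(url, disclaimer_action = "I Agree", curl = curl)

    # Request the data page with the same handle so the cookie is sent back
    txt = getURLContent(url, curl = curl)

    parseFirstTable(txt)
}

# Usage (needs network access):
#   tt = fetchTable(u)
```

Keeping the parsing step in its own function also lets you try it on a
saved copy of the page without touching the network.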



Rather than knowing how to post the form, I like to read
the form programmatically and generate an R function to do the submission
for me. The RHTMLForms package can do this.

library(RHTMLForms)
forms = getHTMLFormDescription(u, FALSE)
fun = createFunction(forms[[1]])

Then we can use

 fun(.curl = curl)

instead of

  postForm(site, disclaimer_action = "I Agree", curl = curl)

This helps to abstract the details of the form.

  D.

On 9/18/12 5:57 PM, CPV wrote:
> Hi, I am starting to code in R, and one of the things that I want to do is to
> scrape some data from the web.
> The problem that I am having is that I cannot get past the disclaimer
> page (which produces a session cookie). I have been able to collect some
> ideas and combine them in the code below, but I don't get past the
> disclaimer page.
> I am trying to agree to the disclaimer with postForm and write the cookie
> to a file, but I cannot do it successfully....
> The webpage cookies are written to the file but the value is FALSE... So
> any ideas of what I should do or what I am doing wrong?
> Thank you for your help,
> 
> library(RCurl)
> library(XML)
> 
> site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"
> 
> postForm(site, disclaimer_action="I Agree")
> 
> cf <- "cookies.txt"
> 
> no_cookie <- function() {
>         curlHandle <- getCurlHandle(cookiefile=cf, cookiejar=cf)
>         getURL(site, curl=curlHandle)
> 
>         rm(curlHandle)
>         gc()
> }
> 
> if ( file.exists(cf) == TRUE ) {
>         file.create(cf)
>         no_cookie()
> }
> allTables <- readHTMLTable(site)
> allTables
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
>