[R] RCurl unable to download a particular web page -- what is so special about this web page?

clair.crossupton at googlemail.com
Tue Jan 27 18:14:57 CET 2009


Cheers Duncan, that worked great:

> getURL("http://uk.youtube.com", httpheader = c("User-Agent" = "R (2.8.1)"))
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"
\"http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd\">\n\n\
[etc]
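
As an aside, the same header can be set once on a reusable curl handle
so that every request identifies itself. A minimal sketch, assuming
RCurl's libcurl-style 'useragent' option (its name can be checked with
listCurlOptions()):

  library(RCurl)
  # One handle created with a User-Agent, then reused across requests
  h <- getCurlHandle(useragent = "R (2.8.1)")
  getURL("http://uk.youtube.com", curl = h)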

May I ask if there was a specific manual you read to learn these
things? I do not think I could have worked that one out on my
own.

Thank you again for your time,
C.C

On 27 Jan, 16:46, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
> Some Web servers are strict. In this case, this one won't accept
> a request without being told who is asking, i.e. without a User-Agent header.
>
> If you use
>
>   getURL("http://www.youtube.com",
>            httpheader = c("User-Agent" = "R (2.9.0)"))
>
> you should get the contents of the page as expected.
>
> (Or with URL uk.youtube.com, etc.)
>
>   D.
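
One way to confirm that the User-Agent is what made the difference is
to capture the response status line. A minimal sketch using RCurl's
basicHeaderGatherer() (this check is not part of the original exchange):

  library(RCurl)
  h <- basicHeaderGatherer()
  getURL("http://uk.youtube.com",
         httpheader = c("User-Agent" = "R (2.9.0)"),
         headerfunction = h$update)
  h$value()["status"]   # expect "200" once the header is accepted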
>
>
>
> clair.crossup... at googlemail.com wrote:
> > Thank you. The output I get from that example is below:
>
> >> d = debugGatherer()
> >> getURL("http://uk.youtube.com",
> > +          debugfunction = d$update, verbose = TRUE )
> > [1] ""
> >> d$value()
>
> > text
> > "About to connect() to uk.youtube.com port 80 (#0)\n  Trying
> > 208.117.236.72... connected\nConnected to uk.youtube.com
> > (208.117.236.72) port 80 (#0)\nConnection #0 to host uk.youtube.com
> > left intact\n"
>
> > headerIn
> > "HTTP/1.1 400 Bad Request\r\nVia: 1.1 PFO-FIREWALL\r\nConnection: Keep-
> > Alive\r\nProxy-Connection: Keep-Alive\r\nTransfer-Encoding: chunked\r
> > \nExpires: Tue, 27 Apr 1971 19:44:06 EST\r\nDate: Tue, 27 Jan 2009
> > 15:31:25 GMT\r\nContent-Type: text/plain\r\nServer: Apache\r\nX-
> > Content-Type-Options: nosniff\r\nCache-Control: no-cache\r
> > \nCneonction: close\r\n\r\n"
>
> > headerOut
> > "GET / HTTP/1.1\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n"
>
> > dataIn
> > "0\r\n\r\n"
>
> > dataOut
> > ""
>
> > So the critical information from this is the '400 Bad Request'. A
> > Google search defines this for me as:
>
> >     The request could not be understood by the server due to malformed
> >     syntax. The client SHOULD NOT repeat the request without
> > modifications.
>
> > Looking through both sort(listCurlOptions()) and
> > http://curl.haxx.se/libcurl/c/curl_easy_setopt.htm doesn't really
> > help me this time (unless I missed something). Any advice?
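
In hindsight (the fix is at the top of this thread), the headerOut
above shows the tell-tale gap: the outgoing request carries no
User-Agent line at all. A sketch of re-running the same debug capture
with the header supplied, to see the request change (d2 is a
hypothetical name):

  d2 <- debugGatherer()
  getURL("http://uk.youtube.com",
         httpheader = c("User-Agent" = "R (2.8.1)"),
         debugfunction = d2$update, verbose = TRUE)
  d2$value()["headerOut"]   # now includes a "User-Agent: R (2.8.1)" line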
>
> > Thank you for your time,
> > C.C
>
> > P.S. I can get the download to work if I use:
> >> toString(readLines("http://www.uk.youtube.com"))
> > [1] "<html>, \t<head>, \t\t<title>OpenDNS</title>, \t</head>, ,
> > \t<body id=\"mainbody\" onLoad=\"testforbanner();\" style=\"margin:
> > 0px;\">, \t\t<script language=\"JavaScript\">, \t\t\tfunction
> > testforbanner() {, \t\t\t\tvar width;, \t\t\t\tvar height;, \t\t\t
> > \tvar x = 0;, \t\t\t\tvar isbanner = false;, \t\t\t\tvar bannersizes =
> > new Array(16), \t\t\t\tbannersizes[0] = [etc]
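
(Note that the page returned above is an OpenDNS landing page rather
than YouTube itself, so www.uk.youtube.com appears to have been answered
by an OpenDNS resolver rather than the real site.) For completeness,
base R's own HTTP client can also be told what User-Agent to send; a
sketch, assuming the 'HTTPUserAgent' option is honoured by this R
version:

  # Base R reads getOption("HTTPUserAgent") for its internal HTTP requests
  old <- options(HTTPUserAgent = "R (2.8.1)")
  txt <- readLines("http://uk.youtube.com")
  options(old)   # restore the previous setting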
>
> > On 27 Jan, 13:52, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
> >> clair.crossup... at googlemail.com wrote:
> >>> Thank you Duncan.
> >>> I remember seeing in your documentation that you have used this
> >>> 'verbose = TRUE' argument before when trying to see what is
> >>> going on. This is good. However, I have not been able to get it to
> >>> work for me. Does the output appear in R, or do you use some other
> >>> external window (e.g. an MS-DOS window)?
> >> The libcurl code typically defaults to printing to the console,
> >> so in the Windows GUI this will not show up. Running R from
> >> a shell (an MS-DOS window or a Unix-like shell) should
> >> cause the output to be displayed.
>
> >> A more general way, however, is to use the debugfunction
> >> option.
>
> >> d = debugGatherer()
>
> >> getURL("http://uk.youtube.com",
> >>          debugfunction = d$update, verbose = TRUE)
>
> >> When this completes, use
>
> >>   d$value()
>
> >> and you have the entire contents that would be displayed on the console.
>
> >>   D.
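
For reference, d$value() returns a named character vector, so
individual streams can be pulled out directly; a small sketch (the
names match the output quoted earlier in this thread):

  info <- d$value()
  names(info)             # "text", "headerIn", "headerOut", "dataIn", "dataOut"
  cat(info["headerIn"])   # just the response headers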
>
> >>>> library(RCurl)
> >>>> my.url <- 'http://www.nytimes.com/2009/01/07/technology/business-computing/07pro...
> >>>> getURL(my.url, verbose = TRUE)
> >>> [1] ""
> >>> I am having a problem with a new web page (http://uk.youtube.com/), but
> >>> if I can get this verbose output to work, then I think I will be able to
> >>> google the right action to take based on the information it gives.
> >>> Many thanks for your time,
> >>> C.C.
> >>> On 26 Jan, 16:12, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
> >>>> clair.crossup... at googlemail.com wrote:
> >>>>> Dear R-help,
> >>>>> There seems to be a web page I am unable to download using RCurl. I
> >>>>> don't understand why it won't download:
> >>>>>> library(RCurl)
> >>>>>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
> >>>>>> getURL(my.url)
> >>>>> [1] ""
> >>>>   I like the irony that RCurl seems to have difficulties downloading an
> >>>> article about R.  Good thing it is just a matter of additional arguments
> >>>> to getURL() or it would be bad news.
> >>>> The followlocation parameter defaults to FALSE, so
> >>>>    getURL(my.url, followlocation = TRUE)
> >>>> gets what you want.
> >>>> The way I found this is to call
> >>>>   getURL(my.url, verbose = TRUE)
> >>>> and take a look at the information being sent from R
> >>>> and received by R from the server.
> >>>> This gives
> >>>> * About to connect() to www.nytimes.com port 80 (#0)
> >>>> *   Trying 199.239.136.200... * connected
> >>>> * Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
> >>>>  > GET /2009/01/07/technology/business-computing/07program.html?_r=2
> >>>> HTTP/1.1
> >>>> Host: www.nytimes.com
> >>>> Accept: */*
> >>>> < HTTP/1.1 301 Moved Permanently
> >>>> < Server: Sun-ONE-Web-Server/6.1
> >>>> < Date: Mon, 26 Jan 2009 16:10:51 GMT
> >>>> < Content-length: 0
> >>>> < Content-type: text/html
> >>>> < Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t...
> >>>> <
> >>>> <
> >>>> And the 301 is the critical thing here.
> >>>>   D.
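
When following redirects it can be worth capping the redirect chain,
and the redirect target can also be inspected without following it. A
sketch, assuming libcurl's 'maxredirs' option as exposed through RCurl
(check listCurlOptions()):

  getURL(my.url, followlocation = TRUE, maxredirs = 10)

  # Or look at where the 301 points without following it:
  h <- basicHeaderGatherer()
  getURL(my.url, headerfunction = h$update)
  h$value()["Location"]   # the redirect target, if the server sent one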
> >>>>> Other web pages are OK to download, but this is the first time I have
> >>>>> been unable to download a web page using the very nice RCurl package.
> >>>>> While I can download the web page using RDCOMClient, I would like
> >>>>> to understand why it does not work as above.
> >>>>>> library(RDCOMClient)
> >>>>>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
> >>>>>> ie <- COMCreate("InternetExplorer.Application")
> >>>>>> txt <- list()
> >>>>>> ie$Navigate(my.url)
> >>>>> NULL
> >>>>>> while(ie[["Busy"]]) Sys.sleep(1)
> >>>>>> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
> >>>>>> txt
> >>>>> $`http://www.nytimes.com/2009/01/07/technology/business-computing/
> >>>>> 07program.html?_r=2`
> >>>>> [1] "Skip to article Try Electronic Edition Log ...
> >>>>> Many thanks for your time,
> >>>>> C.C
> >>>>> Windows Vista, running with administrator privileges.
> >>>>>> sessionInfo()
> >>>>> R version 2.8.1 (2008-12-22)
> >>>>> i386-pc-mingw32
> >>>>> locale:
> >>>>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
> >>>>> attached base packages:
> >>>>> [1] stats     graphics  grDevices utils     datasets  methods
> >>>>> base
> >>>>> other attached packages:
> >>>>> [1] RDCOMClient_0.92-0 RCurl_0.94-0
> >>>>> loaded via a namespace (and not attached):
> >>>>> [1] tools_2.8.1



