[R] RCurl unable to download a particular web page -- what is so special about this web page?
Duncan Temple Lang
duncan at wald.ucdavis.edu
Tue Jan 27 17:46:02 CET 2009
Some Web servers are strict. In this case, the server won't accept
a request unless it is told who is asking, i.e. via the User-Agent header.
If you use
getURL("http://www.youtube.com",
       httpheader = c("User-Agent" = "R (2.9.0)"))
you should get the contents of the page as expected.
(Or with the URL uk.youtube.com, etc.)
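To verify the fix, the response status line can be captured alongside the body; a sketch using RCurl's basicHeaderGatherer() (URL and User-Agent string are just the ones from this thread):

```r
library(RCurl)

# Collect the response headers while fetching the page:
h <- basicHeaderGatherer()
txt <- getURL("http://www.youtube.com",
              httpheader = c("User-Agent" = "R (2.9.0)"),
              headerfunction = h$update)
h$value()["status"]   # should now be 200 rather than the 400 seen earlier
```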
D.
clair.crossupton at googlemail.com wrote:
> Thank you. The output I get from that example is below:
>
>> d = debugGatherer()
>> getURL("http://uk.youtube.com",
> + debugfunction = d$update, verbose = TRUE )
> [1] ""
>> d$value()
>
> text
> "About to connect() to uk.youtube.com port 80 (#0)\n Trying
> 208.117.236.72... connected\nConnected to uk.youtube.com
> (208.117.236.72) port 80 (#0)\nConnection #0 to host uk.youtube.com
> left intact\n"
>
> headerIn
> "HTTP/1.1 400 Bad Request\r\nVia: 1.1 PFO-FIREWALL\r\n
> Connection: Keep-Alive\r\nProxy-Connection: Keep-Alive\r\n
> Transfer-Encoding: chunked\r\nExpires: Tue, 27 Apr 1971 19:44:06 EST\r\n
> Date: Tue, 27 Jan 2009 15:31:25 GMT\r\nContent-Type: text/plain\r\n
> Server: Apache\r\nX-Content-Type-Options: nosniff\r\n
> Cache-Control: no-cache\r\nCneonction: close\r\n\r\n"
>
> headerOut
> "GET / HTTP/1.1\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n"
>
> dataIn
> "0\r\n\r\n"
>
> dataOut
> ""
>
> So the critical information from this is the '400 Bad Request'. A
> Google search defines this for me as:
>
> The request could not be understood by the server due to malformed
> syntax. The client SHOULD NOT repeat the request without
> modifications.
>
>
> Looking through both sort(listCurlOptions()) and
> http://curl.haxx.se/libcurl/c/curl_easy_setopt.htm doesn't really
> help me this time (unless I missed something). Any advice?
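One way to narrow down the candidates is to filter the option names rather than scanning the whole sorted list; a sketch (the search strings are just guesses at likely keywords):

```r
library(RCurl)

# listCurlOptions() returns the names of all options getURL() accepts;
# grep for likely suspects instead of reading the full list:
grep("agent", listCurlOptions(), value = TRUE)
grep("header", listCurlOptions(), value = TRUE)
```

The second search turns up httpheader, which is the option used in the User-Agent fix above.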
>
> Thank you for your time,
> C.C
>
> P.S. I can get the download to work if I use:
>> toString(readLines("http://www.uk.youtube.com"))
> [1] "<html>, \t<head>, \t\t<title>OpenDNS</title>, \t</head>, ,
> \t<body id=\"mainbody\" onLoad=\"testforbanner();\" style=\"margin:
> 0px;\">, \t\t<script language=\"JavaScript\">, \t\t\tfunction
> testforbanner() {, \t\t\t\tvar width;, \t\t\t\tvar height;, \t\t\t
> \tvar x = 0;, \t\t\t\tvar isbanner = false;, \t\t\t\tvar bannersizes =
> new Array(16), \t\t\t\tbannersizes[0] = [etc]
>
>
>
>
> On 27 Jan, 13:52, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
>> clair.crossup... at googlemail.com wrote:
>>> Thank you Duncan.
>>> I remember seeing in your documentation that you have used this
>>> 'verbose=TRUE' argument in functions before when trying to see what is
>>> going on. This is good. However, I have not been able to get it to
>>> work for me. Does the output appear in R or do you use some other
>>> external window (e.g. an MS-DOS window)?
>> The libcurl code typically defaults to printing on the console.
>> So on the Windows GUI, this will not show up. Using
>> a shell (an MS-DOS window or a Unix-like shell) should
>> cause the output to be displayed.
>>
>> A more general way however is to use the debugfunction
>> option.
>>
>> d = debugGatherer()
>>
>> getURL("http://uk.youtube.com",
>> debugfunction = d$update, verbose = TRUE)
>>
>> When this completes, use
>>
>> d$value()
>>
>> and you have the entire contents that would be displayed on the console.
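The value is a named character vector whose fields are the ones shown in the reply above (text, headerIn, headerOut, dataIn, dataOut), so individual pieces can be pulled out directly; a sketch:

```r
library(RCurl)

d <- debugGatherer()
getURL("http://uk.youtube.com", debugfunction = d$update, verbose = TRUE)

# Inspect individual fields of the gathered debug output:
cat(d$value()["headerOut"])   # the request R sent
cat(d$value()["headerIn"])    # the server's response headers
```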
>>
>> D.
>>
>>
>>
>>>> library(RCurl)
>>>> my.url <- 'http://www.nytimes.com/2009/01/07/technology/business-computing/07pro...
>>>> getURL(my.url, verbose = TRUE)
>>> [1] ""
>>> I am having a problem with a new webpage (http://uk.youtube.com/) but
>>> if I can get this verbose output to work, then I think I will be able to
>>> google the right action to take based on the information it gives.
>>> Many thanks for your time,
>>> C.C.
>>> On 26 Jan, 16:12, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
>>>> clair.crossup... at googlemail.com wrote:
>>>>> Dear R-help,
>>>>> There seems to be a web page I am unable to download using RCurl. I
>>>>> don't understand why it won't download:
>>>>>> library(RCurl)
>>>>>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
>>>>>> getURL(my.url)
>>>>> [1] ""
>>>> I like the irony that RCurl seems to have difficulties downloading an
>>>> article about R. Good thing it is just a matter of additional arguments
>>>> to getURL() or it would be bad news.
>>>> The followlocation parameter defaults to FALSE, so
>>>> getURL(my.url, followlocation = TRUE)
>>>> gets what you want.
>>>> The way I found this is
>>>> getURL(my.url, verbose = TRUE)
>>>> and take a look at the information being sent from R
>>>> and received by R from the server.
>>>> This gives
>>>> * About to connect() to www.nytimes.com port 80 (#0)
>>>> * Trying 199.239.136.200... * connected
>>>> * Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
>>>> > GET /2009/01/07/technology/business-computing/07program.html?_r=2
>>>> HTTP/1.1
>>>> > Host: www.nytimes.com
>>>> > Accept: */*
>>>> < HTTP/1.1 301 Moved Permanently
>>>> < Server: Sun-ONE-Web-Server/6.1
>>>> < Date: Mon, 26 Jan 2009 16:10:51 GMT
>>>> < Content-length: 0
>>>> < Content-type: text/html
>>>> < Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t...
>>>> <
>>>> And the 301 is the critical thing here.
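A sketch of the call with the redirect handling made explicit (maxredirs is a standard libcurl option capping how many redirects are followed; the limit of 10 is an arbitrary choice, and the URL is the truncated one pasted in this thread):

```r
library(RCurl)

my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
# followlocation makes libcurl chase the 301's Location header;
# maxredirs guards against redirect loops.
page <- getURL(my.url, followlocation = TRUE, maxredirs = 10)
```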
>>>> D.
>>>>> Other web pages are ok to download but this is the first time I have
>>>>> been unable to download a web page using the very nice RCurl package.
>>>>> While I can download the webpage using RDCOMClient, I would like
>>>>> to understand why it doesn't work as above, please.
>>>>>> library(RDCOMClient)
>>>>>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
>>>>>> ie <- COMCreate("InternetExplorer.Application")
>>>>>> txt <- list()
>>>>>> ie$Navigate(my.url)
>>>>> NULL
>>>>>> while(ie[["Busy"]]) Sys.sleep(1)
>>>>>> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
>>>>>> txt
>>>>> $`http://www.nytimes.com/2009/01/07/technology/business-computing/
>>>>> 07program.html?_r=2`
>>>>> [1] "Skip to article Try Electronic Edition Log ...
>>>>> Many thanks for your time,
>>>>> C.C
>>>>> Windows Vista, running with administrator privileges.
>>>>>> sessionInfo()
>>>>> R version 2.8.1 (2008-12-22)
>>>>> i386-pc-mingw32
>>>>> locale:
>>>>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.
>>>>> 1252;LC_MONETARY=English_United Kingdom.
>>>>> 1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices utils datasets methods
>>>>> base
>>>>> other attached packages:
>>>>> [1] RDCOMClient_0.92-0 RCurl_0.94-0
>>>>> loaded via a namespace (and not attached):
>>>>> [1] tools_2.8.1
>>>>> ______________________________________________
>>>>> R-h... at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>