[R] How to download this data?

Duncan Temple Lang dtemplelang at ucdavis.edu
Sat Aug 3 15:58:42 CEST 2013


Hi Ron

  Yes, you can use ssl.verifypeer = FALSE.  Alternatively, you can use

   getURLContent(........,  cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

 to specify where libcurl can find the certificates to verify the SSL signature.
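 The two options can be put side by side in a short sketch. The helper name fetchPage
 and its verify argument are made up for illustration; the only RCurl call used is
 getURLContent, and the file name cacert.pem is the one given above (the guard falls back
 to no verification if your RCurl installation does not ship that file).

```r
# URL from this thread
u0 <- "https://www.theice.com/productguide/ProductSpec.shtml?specId=219#expiry"

# Path to the CA bundle inside the RCurl package; "" if RCurl or the file is missing
caFile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Hypothetical helper (not part of RCurl): prefer verification against the bundle,
# and only skip verification when asked to or when no bundle was found.
fetchPage <- function(url, verify = TRUE) {
  if (verify && nzchar(caFile)) {
    RCurl::getURLContent(url, cainfo = caFile)
  } else {
    RCurl::getURLContent(url, ssl.verifypeer = FALSE)  # less safe: no certificate check
  }
}

# Needs network access, so it is left commented out here:
# rawOrig <- fetchPage(u0)
```

 Pointing libcurl at the certificates keeps the SSL check in place, so it is usually the
 better choice when it works.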


 The error you are encountering appears to be coming from a garbled R expression. This may have
arisen as a result of an HTML mailer adding the <a href="....".... into the expression
where it found an https://...

 What we want to do is end up with a string of the form

   https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=adasdasdad?expiryDates=&specId=219

Here adasdasdad is a placeholder for the actual session id, which we assigned to jsession in a previous command.
So, take the literal text

   c("https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=",
     jsession,
     "?expiryDates=&specId=219")

and combine it into a single string with paste0 (pass the three pieces as separate arguments, or use collapse = "").

R needs the literal strings as they appear when you view the mail, not the markup that the mailer adds.
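As a small sketch of that step, here are three equivalent ways to build the URL. The value
of jsession below is only the placeholder from above, not a real session id, and the
variable names base and query are mine; the query string is the one from the original code.

```r
jsession <- "adasdasdad"   # placeholder; in practice this comes from the parsed page

base  <- "https://www.theice.com/productguide/ProductSpec.shtml;jsessionid="
query <- "?expiryDates=&specId=219"

# 1. paste0 with the pieces as separate arguments
u1 <- paste0(base, jsession, query)

# 2. paste0 on a character vector needs collapse = "" to give one string
u2 <- paste0(c(base, jsession, query), collapse = "")

# 3. sprintf with a %s slot for the session id, as in the original code
u3 <- sprintf("https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219",
              jsession)

identical(u1, u2) && identical(u1, u3)
```

Note that paste0(c(...)) without collapse returns the three pieces unchanged as a vector of
length 3, which is not what we want here.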


As to where I found this, it is in the source of the original HTML page in rawDoc. Evaluate

 scripts = getNodeSet(rawDoc, "//body//script")
 scripts[[ length(scripts) ]]

and look at the text, specifically the app.urls and its 'expiry' field.


<script type="text/javascript"><![CDATA[

        var app = {};
        app.isOption = false;
        app.urls = {
            'spec':'/productguide/ProductSpec.shtml;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?details=&specId=219',
            'data':'/productguide/ProductSpec.shtml;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?data=&specId=219',
            'confirm':'/reports/dealreports/getSampleConfirm.do;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?hubId=403&productId=254',
            'reports':'/productguide/ProductSpec.shtml;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?reports=&specId=219',
            'expiry':'/productguide/ProductSpec.shtml;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?expiryDates=&specId=219'
        };

        app.Router = Backbone.Router.extend({
            routes:{
                "spec":"spec",
                "data":"data",
                "confirm":"confirm",

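If you would rather pull the 'expiry' entry out of that script text programmatically than
copy it by eye, something along these lines might work. It is only a sketch: scriptText is
a shortened copy of the text shown above (on the real page you could get it with
xmlValue(scripts[[ length(scripts) ]])), and the regular expression assumes the entry keeps
the 'expiry':'...' form with single quotes.

```r
# Shortened stand-in for the JavaScript text of the last <script> node
scriptText <- paste0(
  "app.urls = { ",
  "'spec':'/productguide/ProductSpec.shtml;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?details=&specId=219', ",
  "'expiry':'/productguide/ProductSpec.shtml;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?expiryDates=&specId=219' ",
  "};")

# Capture the text between the quotes that follow 'expiry':
m <- regmatches(scriptText,
                regexec("'expiry'[[:space:]]*:[[:space:]]*'([^']+)'", scriptText))
expiryPath <- m[[1]][2]          # element 1 is the whole match, element 2 the captured group

# The path is relative to the site, so prepend the host
expiryURL <- paste0("https://www.theice.com", expiryPath)
expiryURL
```

This avoids hard-coding the query string, at the cost of depending on how the page happens
to write its JavaScript.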

On 8/3/13 1:05 AM, Ron Michael wrote:
> In the mean time I have this problem sorted out, hopefully I did it correctly. I have modified the line of your code as:
>  
> rawOrig = getURLContent("https://www.theice.com/productguide/ProductSpec.shtml?specId=219#expiry", ssl.verifypeer = FALSE)
>  
> However next I faced with another problem to executing:
>  > u = sprintf("<a href="https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219">https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219", jsession) 
> Error: unexpected symbol in "u = sprintf("<a href="https"
> 
> Can you or someone else help me to get out of this error?
>  
> Also, my another question is: from where you got the expression:
> "<a href="https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219">https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219"
>  
> I really appreciate if someone help me to understand that.
>  
> Thank you.
> 
> 
> ----- Original Message -----
> From: Ron Michael <ron_michael70 at yahoo.com>
> To: Duncan Temple Lang <dtemplelang at ucdavis.edu>; "r-help at r-project.org" <r-help at r-project.org>
> Cc: 
> Sent: Saturday, 3 August 2013 12:58 PM
> Subject: Re: [R] How to download this data?
> 
> Hello Duncan,
>  
> Thank you very much for your pointer.
>  
> However when I tried to run your code, I got following error:
>  > rawOrig = getURLContent("https://www.theice.com/productguide/ProductSpec.shtml?specId=219#expiry") 
> Error in function (type, msg, asError = TRUE)  : 
>   SSL certificate problem, verify that the CA cert is OK. Details:
> error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
> 
> Can someone help me to understand what could be the cause of this error?
>  
> Thank you.
> 
> 
> ----- Original Message -----
> From: Duncan Temple Lang <dtemplelang at ucdavis.edu>
> To: r-help at r-project.org
> Cc: 
> Sent: Saturday, 3 August 2013 4:33 AM
> Subject: Re: [R] How to download this data?
> 
> 
> That URL is an HTTPS (secure HTTP), not an HTTP.
> The XML parser cannot retrieve the file.
> Instead, use the RCurl package to get the file.
> 
> However, it is more complicated than that. If
> you look at source of the HTML page in a browser,
> you'll see a jsessionid and that is a session identifier.
> 
> The following retrieves the content of your URL and then
> parses it and extracts the value of the jsessionid.
> Then we create the full URL to the actual data page (which is actually in the HTML
> content but in JavaScript code)
> 
> library(RCurl)
> library(XML)
> 
> rawOrig = getURLContent("https://www.theice.com/productguide/ProductSpec.shtml?specId=219#expiry")
> rawDoc = htmlParse(rawOrig)
> tmp = getNodeSet(rawDoc, "//@href[contains(.,\040'jsessionid=')]")[[1]]
> jsession = gsub(".*jsessionid=([^?]+)?.*", "\\1", tmp)
> 
> u = sprintf("https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219", jsession)
> 
> doc = htmlParse(getURLContent(u))
> tbls = readHTMLTable(doc)
> data = tbls[[1]]
> 
> dim(data)
> 
> 
> I did this quickly so it may not be the best way or completely robust, but hopefully
> it gets the point across and does get the data.
> 
>   D.
> 
> On 8/2/13 2:42 PM, Ron Michael wrote:
>> Hi all,
>>   
>> I need to download the data from this web page:
>>   
>> https://www.theice.com/productguide/ProductSpec.shtml?specId=219#expiry
>>   
>> I used the function readHTMLTable() from package XML, however could not download that.
>>   
>> Can somebody help me how to get the data onto my R window?
>>   
>> Thank you.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> 
>


