[R] How to download this data?
Ron Michael
ron_michael70 at yahoo.com
Sat Aug 3 20:32:25 CEST 2013
Hi Duncan,
Thank you very much for your prompt help. Everything now works smoothly.
Thank you.
----- Original Message -----
From: Duncan Temple Lang <dtemplelang at ucdavis.edu>
To: Ron Michael <ron_michael70 at yahoo.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Saturday, 3 August 2013 7:43 PM
Subject: Re: [R] How to download this data?
Hi Ron
Yes, you can use ssl.verifypeer = FALSE. Alternatively, you can also use
getURLContent(........, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
to specify where libcurl can find the certificates to verify the SSL signature.
The error you are encountering appears to be coming from a garbled R expression. This may have
arisen as a result of an HTML mailer inserting <a href="...."> markup into the expression
where it found an https://... URL.
What we want to do is end up with a string of the form
https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=adasdasdad?expiryDates=&specId=219
We have to substitute, for the text adasdasdad, the value we assigned to jsession in a previous command.
So, take the literal text
c("https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=",
jsession,
"?expiryDates=&specId=219")
and combine it into a single string with paste0.
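A minimal sketch of that paste0 step, typed by hand so that no mailer markup can creep in. The value of jsession below is the placeholder adasdasdad from the example URL, not a real session id, and the query parameter is spelled expiryDates as in the original sprintf call. Note that paste0 needs collapse = "" to join the elements of a single vector; without it the three pieces come back unchanged.

```r
# Placeholder session id from the example URL; a real one comes from
# the page retrieved earlier.
jsession <- "adasdasdad"

parts <- c("https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=",
           jsession,
           "?expiryDates=&specId=219")

# collapse = "" joins the elements of the vector into one string
u <- paste0(parts, collapse = "")
print(u)
```

The resulting u can then be passed to getURLContent in place of the sprintf result.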
R needs the literal strings as they appear when you view the mail, not the markup the mailer adds.
As to where I found this, it is in the source of the original HTML page in rawDoc
scripts = getNodeSet(rawDoc, "//body//script")
scripts[[ length(scripts) ]]
and look at the text, specifically the app.urls and its 'expiry' field.
<script type="text/javascript"><![CDATA[
var app = {};
app.isOption = false;
app.urls = {
'spec':'/productguide/ProductSpec.shtml;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?details=&specId=219',
'data':'/productguide/ProductSpec.shtml;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?data=&specId=219',
'confirm':'/reports/dealreports/getSampleConfirm.do;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?hubId=403&productId=254',
'reports':'/productguide/ProductSpec.shtml;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?reports=&specId=219',
'expiry':'/productguide/ProductSpec.shtml;jsessionid=22E9BE9DB19FC6F3446C9ED4AFF2BE3F?expiryDates=&specId=219'
};
app.Router = Backbone.Router.extend({
routes:{
"spec":"spec",
"data":"data",
"confirm":"confirm",
On 8/3/13 1:05 AM, Ron Michael wrote:
> In the meantime I have sorted this problem out, hopefully correctly. I modified the line of your code to:
>
> rawOrig = getURLContent("https://www.theice.com/productguide/ProductSpec.shtml?specId=219#expiry", ssl.verifypeer = FALSE)
>
> However, I then ran into another problem when executing:
> > u = sprintf("<a href="https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219">https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219", jsession)
> Error: unexpected symbol in "u = sprintf("<a href="https"
>
> Can you or someone else help me to get out of this error?
>
> Also, another question: where did you get the expression:
> "<a href="https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219">https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219"
>
> I would really appreciate it if someone could help me understand that.
>
> Thank you.
>
>
> ----- Original Message -----
> From: Ron Michael <ron_michael70 at yahoo.com>
> To: Duncan Temple Lang <dtemplelang at ucdavis.edu>; "r-help at r-project.org" <r-help at r-project.org>
> Cc:
> Sent: Saturday, 3 August 2013 12:58 PM
> Subject: Re: [R] How to download this data?
>
> Hello Duncan,
>
> Thank you very much for your pointer.
>
> However, when I tried to run your code, I got the following error:
> > rawOrig = getURLContent("https://www.theice.com/productguide/ProductSpec.shtml?specId=219#expiry")
> Error in function (type, msg, asError = TRUE) :
> SSL certificate problem, verify that the CA cert is OK. Details:
> error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
>
> Can someone help me to understand what could be the cause of this error?
>
> Thank you.
>
>
> ----- Original Message -----
> From: Duncan Temple Lang <dtemplelang at ucdavis.edu>
> To: r-help at r-project.org
> Cc:
> Sent: Saturday, 3 August 2013 4:33 AM
> Subject: Re: [R] How to download this data?
>
>
> That URL is an HTTPS (secure HTTP) URL, not an HTTP one.
> The XML parser cannot retrieve the file.
> Instead, use the RCurl package to get the file.
>
> However, it is more complicated than that. If
> you look at the source of the HTML page in a browser,
> you'll see a jsessionid and that is a session identifier.
>
> The following retrieves the content of your URL and then
> parses it and extracts the value of the jsessionid.
> Then we create the full URL to the actual data page (which is actually in the HTML
> content but in JavaScript code)
>
> library(RCurl)
> library(XML)
>
> rawOrig = getURLContent("https://www.theice.com/productguide/ProductSpec.shtml?specId=219#expiry")
> rawDoc = htmlParse(rawOrig)
> tmp = getNodeSet(rawDoc, "//@href[contains(., 'jsessionid=')]")[[1]]
> jsession = gsub(".*jsessionid=([^?]+)?.*", "\\1", tmp)
>
> u = sprintf("https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=%s?expiryDates=&specId=219", jsession)
>
> doc = htmlParse(getURLContent(u))
> tbls = readHTMLTable(doc)
> data = tbls[[1]]
>
> dim(data)
>
>
> I did this quickly so it may not be the best way or completely robust, but hopefully
> it gets the point across and does get the data.
>
> D.
>
> On 8/2/13 2:42 PM, Ron Michael wrote:
>> Hi all,
>>
>> I need to download the data from this web page:
>>
>> https://www.theice.com/productguide/ProductSpec.shtml?specId=219#expiry
>>
>> I used the function readHTMLTable() from package XML, but could not download the data.
>>
>> Can somebody help me get the data into my R session?
>>
>> Thank you.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>