[R] RCurl and Google Scholar's EndNote references
Duncan Temple Lang
duncan at wald.ucdavis.edu
Fri Sep 18 06:39:26 CEST 2009
Hi Jarno
You've only told us half the story. You didn't show how you
i) performed the original query
ii) retrieved the URL you used in subsequent queries
But I can suggest a few possible problems.
a) specifying the cookiejar option tells libcurl where to write the
cookies that the particular curl handle has collected during its life.
These are written when the curl handle is destroyed.
So that wouldn't change the getURL() operation, just change what happens
when the curl handle is destroyed.
b) You probably mean to use cookiefile rather than cookiejar so that
the curl request would read existing cookies from a file.
But in that case, how did that file get created with the correct cookies?
c) libcurl will collect cookies in a curl handle as it receives them from a server
as part of a response. And it will use these in subsequent requests to that server.
But you must be using the same curl handle. Different curl handles are entirely
independent (unless one is copied from another).
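The cookie options in a) to c) can be sketched roughly as follows. This is only an illustration: the file name cookie_path is made up for the example, and the comments reflect libcurl's documented behaviour, where the cookie engine is off until an option such as cookiefile is set on the handle.

```r
library(RCurl)

# Hypothetical file name, chosen only for this illustration.
cookie_path <- file.path(tempdir(), "cookies.txt")

# cookiefile: file to read existing cookies from; setting it also
#             switches on libcurl's cookie engine for this handle.
# cookiejar:  file the handle's cookies are written to when the
#             handle is destroyed.
curl <- getCurlHandle(cookiefile = cookie_path, cookiejar = cookie_path)

# ... perform all related requests with curl = curl here ...

# Nothing is written to cookie_path while the handle is alive.
# Removing the handle and running the garbage collector destroys it,
# which is when libcurl writes the cookie jar.
rm(curl)
invisible(gc())
```

Pointing both options at the same file is one way to carry cookies from one R session to the next, but within a single session reusing the handle is enough.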
So a possible solution may be that you need to do the initial query with the same
curl handle.
So I would try something like:
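library(RCurl)
library(XML)
# cookiefile = "" switches on libcurl's cookie engine without reading a file
curl = getCurlHandle(cookiefile = "")
z = getForm("http://scholar.google.com/scholar", q = 'Frank Harrell', hl = 'en', btnG = 'Search',
            .opts = list(verbose = TRUE), curl = curl)
dd = htmlParse(z, asText = TRUE)
links = getNodeSet(dd, "//a[@href]")
# do something to identify the link you want and assign its URL to linkIWant
tmp = getURL(linkIWant, curl = curl)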
Note that we are using the same curl object in both requests.
This may not do what you want, but if you let us know the details
about how you are doing the preceding steps, we should be able to sort
things out.
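For the "identify the link you want" step, one possibility is to filter the links with XPath. The sketch below assumes the EndNote links can be recognised by "scholar.enw" in their href, as in the URL you posted. The HTML string is an invented stand-in for the real result page, and the names html, hrefs and linksIWant are only for illustration.

```r
library(XML)

# Invented stand-in for the search result page; the real markup may differ.
html <- '<html><body>
  <a href="/scholar?q=related:abc">Related articles</a>
  <a href="/scholar.enw?q=info:abc&amp;output=citation">Import into EndNote</a>
</body></html>'

dd <- htmlParse(html, asText = TRUE)

# Keep only the href values of links that point at the EndNote export.
hrefs <- xpathSApply(dd, "//a[contains(@href, 'scholar.enw')]",
                     xmlGetAttr, "href")

# Prefix the host so that the relative links become full URLs.
linksIWant <- paste0("http://scholar.google.com", hrefs)
```

Each element of linksIWant could then be passed to getURL() together with the curl handle that performed the search.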
D.
Jarno Tuimala wrote:
> Hi!
>
> I've performed a Google Scholar Search using a query, let's say "Frank
> Harrell", and parsed the links to the EndNote references from the resulting
> HTML code. Now I'd like to download all the references automatically. For
> this, I have tried to use RCurl, but I can't seem to get it working: I
> always get error code "403 Forbidden" from the web server.
>
> Initially I tried to do this without using cookies:
>
> library(RCurl)
> getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")
>
> or
>
> getURLContent("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")
> Error: Forbidden
> and then with cookies:
>
> getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0",
>        .opts=list(cookiejar="cookiejar.txt"))
>
> But they both consistently fail the same way. What am I doing wrong?
>
> sessionInfo()
> R version 2.9.0 (2009-04-17)
> i386-pc-mingw32
> locale:
> LC_COLLATE=Finnish_Finland.1252;LC_CTYPE=Finnish_Finland.1252;LC_MONETARY=Finnish_Finland.1252;LC_NUMERIC=C;LC_TIME=Finnish_Finland.1252
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] RCurl_0.98-1 bitops_1.0-4.1
>
> Thanks!
> Jarno