[R] RCurl and Google Scholar's EndNote references

Duncan Temple Lang duncan at wald.ucdavis.edu
Fri Sep 18 06:39:26 CEST 2009


Hi Jarno

You've only told us half the story. You didn't show how you
i) performed the original query
ii) retrieved the URL you used in subsequent queries


But I can suggest a few possible problems.

a) specifying the cookiejar option tells libcurl where to write the
   cookies that the particular curl handle has collected during its life.
   These are written when the curl handle is destroyed.
   So that wouldn't change the getURL() operation, just change what happens
   when the curl handle is destroyed.

b) You probably mean to use cookiefile rather than cookiejar so that
   the curl request would read existing cookies from a file.
   But in that case, how did that file get created with the correct cookies?

c) libcurl will collect cookies in a curl handle as it receives them from a server
   as part of a response. And it will use these in subsequent requests to that server.
   But you must be using the same curl handle.  Different curl handles are entirely
   independent (unless one is copied from another).
   So a possible solution is to do the initial query with the same
   curl handle that you use for the later requests.
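
The three points above can be sketched roughly as follows. This is only an illustration, not a tested fix for the Google Scholar case: the helper names make_session and fetch_pair and the cookie file location are hypothetical, and cookiefile = "" is an assumption added here, based on libcurl's documented behaviour that an empty file name switches on the cookie engine without reading any file.

```r
library(RCurl)

# Hypothetical location for the cookie file, used only for illustration.
cookie_path <- file.path(tempdir(), "cookies.txt")

# One handle for the whole session.
#   cookiefile = ""  : assumed here to enable the in-memory cookie engine
#                      without reading an existing file
#   cookiejar  = jar : file the cookies are written to when the handle
#                      is destroyed, not during a request
make_session <- function(jar = cookie_path) {
  getCurlHandle(cookiefile = "", cookiejar = jar)
}

# Hypothetical helper: two requests that share a single handle, so any
# cookies set by the first response are available to the second request.
fetch_pair <- function(first_url, second_url, curl = make_session()) {
  first  <- getURL(first_url,  curl = curl)
  second <- getURL(second_url, curl = curl)
  list(first = first, second = second)
}
```

Calling fetch_pair() with the search URL and then the reference URL would exercise the shared handle; whether the server then accepts the second request is a separate question.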


So I would try something like

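library(RCurl)
library(XML)

# One handle for the whole session so cookies carry over between requests
curl = getCurlHandle()
z = getForm("http://scholar.google.com/scholar", q = 'Frank Harrell', hl = 'en', btnG = 'Search',
              .opts = list(verbose = TRUE), curl = curl)

# z holds the HTML of the results page as a string
dd = htmlParse(z, asText = TRUE)
links = getNodeSet(dd, "//a[@href]")

# do something to identify the link you want and assign its URL to linkIWant

tmp = getURL(linkIWant, curl = curl)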


Note that we are using the same curl object in both requests.
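
One possible way to pick out the link is sketched below. It runs on a small made-up HTML fragment rather than a real results page, so the markup, the "scholar.enw" pattern used to recognise EndNote links, and the host name prepended to the relative link are all assumptions that would need checking against the actual page.

```r
library(XML)

# Made-up HTML standing in for a results page; real markup will differ.
page <- '<html><body>
  <a href="/scholar?q=related:abc">Related articles</a>
  <a href="/scholar.enw?q=info:abc&amp;output=citation">Import into EndNote</a>
  <a name="top">no href here</a>
</body></html>'

dd <- htmlParse(page, asText = TRUE)

# Pull the href attribute of every anchor that has one.
hrefs <- xpathSApply(dd, "//a[@href]", xmlGetAttr, "href")

# Keep only the links that look like EndNote exports (assumed pattern).
enw <- grep("scholar.enw", hrefs, fixed = TRUE, value = TRUE)

# The links in this fragment are relative, so prepend the host
# that was used for the search.
linkIWant <- paste0("http://scholar.google.com", enw[1])
```

The resulting linkIWant can then be passed to getURL() together with the same curl handle that performed the search.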


This may not do what you want, but if you let us know the details
about how you are doing the preceding steps, we should be able to sort
things out.

  D.


Jarno Tuimala wrote:
> Hi!
> 
> I've performed a Google Scholar Search using a query, let's say "Frank
> Harrell", and parsed the links to the EndNote references from the resulting
> HTML code. Now I'd like to download all the references automatically. For
> this, I have tried to use RCurl, but I can't seem to get it working: I
> always get error code "403 Forbidden" from the web server.
> 
> Initially I tried to do this without using cookies:
> 
> library(RCurl)
> getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")
> 
> or
> 
> getURLContent("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")
> Error: Forbidden
> 
> and then with cookies:
> 
> getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0",
>        .opts=list(cookiejar="cookiejar.txt"))
> 
> But they both consistently fail the same way. What am I doing wrong?
> 
> sessionInfo()
> R version 2.9.0 (2009-04-17)
> i386-pc-mingw32
> locale:
> LC_COLLATE=Finnish_Finland.1252;LC_CTYPE=Finnish_Finland.1252;LC_MONETARY=Finnish_Finland.1252;LC_NUMERIC=C;LC_TIME=Finnish_Finland.1252
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] RCurl_0.98-1   bitops_1.0-4.1
> 
> Thanks!
> Jarno
> 



