[R] regexp problem (was: Re: publication statistics from Web of Science)
baptiste auguie
ba208 at exeter.ac.uk
Thu Jan 15 11:19:37 CET 2009
Whoops, it seems I could use some help with regular expressions...
Consider the following two functions: the first builds a search string,
and the second retrieves the content from the resulting URL.
>
> makeURLsearch <- function(key, dates=c(NA, NA)){
>
>   # pad to two entries so a missing bound becomes an empty field, not "NA"
>   dates <- as.character(c(dates, NA, NA)[1:2])
>   dates[is.na(dates)] <- ""
>
>   base.search <- "http://scholar.google.co.uk/scholar?"
>   key.search <- paste("as_q=", key,"&", sep="")
>   other.search <- paste("num=10&btnG=Search+Scholar",
>                         "&as_epq=&as_oq=&as_eq=&as_occt=any",
>                         "&as_sauthors=&as_publication=&", sep="")
>   dates.search <- paste("as_ylo=", dates[1], "&as_yhi=", dates[2],
>                         "&as_allsubj=all&hl=en&lr=", sep="")
>
>   full.search <- paste(base.search, key.search, other.search,
>                        dates.search, sep="")
>   return(full.search)
> }
>
>
> makeURLsearch("plasmonics")
> # c(1980, NULL) would collapse to the length-one vector 1980
> makeURLsearch("photonics", c(1980, NA))
>
> retrieveNumberPublications <- function(url){
>
> x <- readLines(url)
> y <- grep('of about',x, value=TRUE)
> # this does not do what I wanted
> z <- gsub('of about\\s+</b>','\\1',y[1],perl=TRUE)
>
> # the bit to retrieve is the number below
> # <b>10</b> of about <b>21,900</b> for <b><b>photonics</b>
> z
> }
>
> retrieveNumberPublications( makeURLsearch("photonics", c(2008, NA)) )
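A side note on passing a single year bound: c() silently drops NULL, so a placeholder such as NA is needed to keep two positions in the dates vector. The check below is only a sketch; the variable names (dates, padded, and so on) are made up for illustration.

```r
# c() drops NULL elements, so this vector has length one
dates <- c(1980, NULL)
n.dates <- length(dates)

# indexing past the end gives NA, which paste() writes out literally
upper.field <- paste("as_yhi=", dates[2], sep="")

# one way to keep two positions: pad with NA, then blank out missing bounds
padded <- as.character(c(dates, NA, NA)[1:2])
padded[is.na(padded)] <- ""
query.part <- paste("as_ylo=", padded[1], "&as_yhi=", padded[2], sep="")

n.dates
upper.field
query.part
```

Whether the search page tolerates a literal "NA" in as_yhi is not something this checks; blanking the field simply matches the form of the example URL quoted further down.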
I can isolate the long line containing the result I want, but I cannot
single out the value itself, i.e. the 21,900 in
" <b>10</b> of about <b>21,900</b> for <b><b>photonics</b> ".
Any regexp guru to help me out? I've never got my head around these,
other than trivial cases.
Many thanks,
baptiste
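
For reference, a minimal sketch of one way to pull the number out, assuming the results line keeps exactly the form quoted above. The pattern in the gsub() call cannot match that line, because it expects "</b>" right after "of about" where the text has "<b>", and it refers to \\1 without defining a capture group. The helper name extractCount and the sample vector page are hypothetical; page stands in for what readLines(url) would return.

```r
# Hypothetical helper: `lines` is a character vector, e.g. from readLines(url)
extractCount <- function(lines){
  # the parentheses capture the digits and commas inside the <b>...</b>
  # that follows "of about"
  pattern <- "of about\\s*<b>([0-9,]+)</b>"
  hit <- grep(pattern, lines, value=TRUE, perl=TRUE)
  if (length(hit) == 0) return(NA_real_)  # no results line found

  # .* on both sides discards the rest of the line, leaving only the capture
  count.text <- sub(paste(".*", pattern, ".*", sep=""), "\\1", hit[1],
                    perl=TRUE)

  # drop the thousands separators before converting to a number
  as.numeric(gsub(",", "", count.text, fixed=TRUE))
}

# stand-in for a downloaded page, built from the fragment quoted above
page <- c("<html>",
          "<b>10</b> of about <b>21,900</b> for <b><b>photonics</b>",
          "</html>")
extractCount(page)
```

The function returns NA when no line matches, so a change in the page layout shows up as a missing value instead of a wrong number.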
On 15 Jan 2009, at 09:45, baptiste auguie wrote:
> For the record, I thought I'd share two findings:
>
> First, the Web of Science website does seem to have some sort of API,
> as discussed here:
>
> http://scientific.thomson.com/support/faq/webservices/
> It does not seem like a trivial thing to set up though.
>
> Second, because I could not pass the search term easily in the
> address, I looked into Google Scholar instead, where a typical search
> looks like:
> http://scholar.google.co.uk/scholar?as_q=plasmonics&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=1960&as_allsubj=all&hl=en&lr=
>
> Here it is trivial to create such a string with the desired keyword
> and dates, and retrieve the number of results using readLines(url) and
> grep.
>
>
> Thanks to Phil Spector for some pointers.
>
> Best wishes,
>
> baptiste
_____________________________
Baptiste Auguié
School of Physics
University of Exeter
Stocker Road,
Exeter, Devon,
EX4 4QL, UK
Phone: +44 1392 264187
http://newton.ex.ac.uk/research/emag