[R-sig-eco] Help with function to webscrap

Augusto Ribas ribas.aca at gmail.com
Tue Jun 26 20:57:18 CEST 2012


Hello.
I'm having problems with a function that does web scraping.
I have a long list like this example:

data<-data.frame(especie=c("Rana pipiens","Rana vaillanti",
"Ctenosaura similis","Bos taurus"),group=c("sapo","sapo","reptil","mamifero"))

Since some species names are out of date, I am trying to write a
function that checks the Catalogue of Life
(http://www.catalogueoflife.org/) and gets the current names.
This has some limitations, such as species whose names have been
split, but it helps as a first check.

So I wrote this function to scrape the data.
It is simple: it searches the site by building a link from the
keywords, then follows the first link in the list of results and gets
the accepted name and author, returning the results as a list.
For example:

> sp.check("Rana pipiens")
$sp.aceito
[1] "Lithobates pipiens"

$autor
[1] "Schreber, 1782"

But sometimes the function cannot access the internet and gives an error.

I wrote this function by copying some examples from forums, but I
have some questions:

01) How do I suppress the readLines() warnings?
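
On question 01, a minimal sketch, assuming the warnings are the usual
"incomplete final line" messages that readLines() emits when the input
does not end with a newline. The helper name read_page is hypothetical;
the temporary file only stands in for a web page so the example runs
offline.

```r
# Write a file whose last line has no trailing newline
tmp <- tempfile()
cat("first line\nsecond line", file = tmp)

# warn = FALSE silences the "incomplete final line" warning
pagina <- readLines(tmp, warn = FALSE)

# Hypothetical helper for a URL: create the connection explicitly and
# close it on exit, so R does not warn about unused connections later
read_page <- function(link) {
  con <- url(link)
  on.exit(close(con))
  readLines(con, warn = FALSE)
}
```

Wrapping a call in suppressWarnings() would also hide the messages, but
it hides every other warning as well, so warn = FALSE is the narrower
choice.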

02) How can I make the function try again when it cannot access the
internet, or just print something like "Can't access the internet", so
that when I run something like:

data$check<-NA
for(i in seq_len(nrow(data))) {
 data$check[i]<-sp.check(data$especie[i])$sp.aceito
 }

the loop does not stop?
I tested with a short list, but with 500 or more rows it usually stops
in the middle.
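
On question 02, one possible approach is to wrap the call in tryCatch()
and retry a few times. This is only a sketch: try_again and its
arguments (tries, wait, fallback) are hypothetical names, not part of
any package.

```r
# Hypothetical helper: call f(...) up to `tries` times, waiting `wait`
# seconds between attempts; return `fallback` if every attempt fails.
try_again <- function(f, ..., tries = 3, wait = 1, fallback = NA) {
  for (k in seq_len(tries)) {
    failed <- FALSE
    result <- tryCatch(f(...), error = function(e) {
      # report the problem instead of stopping the caller
      message("Attempt ", k, " failed: ", conditionMessage(e))
      failed <<- TRUE
      NULL
    })
    if (!failed) return(result)
    if (k < tries) Sys.sleep(wait)
  }
  fallback
}
```

Inside the loop, the call could then look like
try_again(sp.check, data$especie[i], fallback = list(sp.aceito = NA, autor = NA)),
so a row that cannot be fetched is stored as NA and the loop moves on
to the next species.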

03) Does anyone have an example of how to scrape the status of species
from http://www.iucnredlist.org/, since it does not use the keyword in
the link? Is there a simple tutorial for someone without any background
in programming or computer science?


Thanks for your attention.

#sp.check function

sp.check<-function(especie) {
#split species name
especie<-as.character(especie)

gen<-strsplit(especie,"\\ ")[[1]][1]
esp<-strsplit(especie,"\\ ")[[1]][2]

#making first link
link<-paste("http://www.catalogueoflife.org/col/search/all/key/",gen,"+",esp,"/match/1",sep="")
link <- iconv(link, 'latin1', 'UTF-8')
Encoding(link) <- 'bytes'

#reading table of results
pagina <- readLines(url(link))

n.linhas<-which(pagina%in%"        <td class=\"field_header_black\">")

#are there any results?
if(length(n.linhas)>0) {

pag.sp<-strsplit(pagina[n.linhas[1]+1],'\\"')[[1]][2]

#second link
link2 <- paste( "http://www.catalogueoflife.org",pag.sp,sep="")
link2 <- iconv(link2, 'latin1', 'UTF-8')
Encoding(link2) <- 'bytes'

#read the species page
pagina2 <- readLines(url(link2))

#get line of interest
linha2<-grep('(accepted name)',pagina2)
sp.final<-pagina2[linha2]

#get species name
corte1<-strsplit(sp.final,'<i>')[[1]][2]
sp.aceito<-strsplit(corte1,'</i>')[[1]][1]

#get author
corte2<-strsplit(sp.final,'\\(')[[1]][2]
autor<-strsplit(corte2,')')[[1]][1]
}else {
#no results: "Não encontrado" means "Not found"
sp.aceito<-"Não encontrado"
autor<-"Não encontrado"
}
return(list(sp.aceito=sp.aceito,autor=autor))
}
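
The two extraction steps at the end of the function can also be
isolated and tested without internet access. The sketch below is only
an illustration: parse_accepted is a hypothetical helper, and the
sample line is made up to match what the strsplit() calls above expect,
so the real page markup may differ.

```r
# Hypothetical helper: extract the accepted name and author from one
# HTML line. If a pattern does not match, sub() returns the line
# unchanged, so the result should still be checked.
parse_accepted <- function(linha) {
  # text between the first <i> and </i> is taken as the species name
  sp <- sub(".*?<i>(.*?)</i>.*", "\\1", linha, perl = TRUE)
  # text inside the first pair of parentheses is taken as the author
  au <- sub("^[^(]*\\(([^)]*)\\).*", "\\1", linha)
  list(sp.aceito = sp, autor = au)
}

# Made-up line, only for illustration
exemplo <- "<td><i>Lithobates pipiens</i> (Schreber, 1782) (accepted name)</td>"
res <- parse_accepted(exemplo)
```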

--
Regards,
Augusto C. A. Ribas

Personal site: http://augustoribas.heliohost.org
Lattes: http://lattes.cnpq.br/7355685961127056


