[R-sig-eco] Help with function to webscrap

Eduard Szöcs szoe8822 at uni-landau.de
Wed Jun 27 15:39:35 CEST 2012


Hi Augusto,

regarding question #3:
You could use the Red List API with the RCurl and XML packages.
Here is an example:

 > require(RCurl)
 > require(XML)
 > get_IUCN_status <- function(x) {
 +   # build the species URL: lower case, spaces replaced by hyphens
 +   spec <- tolower(x)
 +   spec <- gsub(" ", "-", spec)
 +   url <- paste("http://api.iucnredlist.org/go/", spec, sep="")
 +   # fetch the page (following redirects) and parse the HTML
 +   get <- getURL(url, followlocation = TRUE)
 +   h <- htmlParse(get)
 +   # extract the Red List category code (e.g. "EN")
 +   status <- xpathSApply(h, '//div[@id ="red_list_category_code"]', xmlValue)
 +   return(status)
 + }
 >
 > get_IUCN_status("Panthera uncia")
[1] "EN"
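
If you need this for a whole list of species (like the data frame in your
message), a minimal sketch is to loop over the names with sapply() and catch
failed requests, so one bad name or a dropped connection does not stop the
run (get_status_safe() and the NA fallback are just one way to do it):

 > get_status_safe <- function(x) {
 +   # return NA instead of an error if the request or the parsing fails
 +   tryCatch(get_IUCN_status(x), error = function(e) NA_character_)
 + }
 >
 > data$status <- sapply(as.character(data$especie), get_status_safe)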

For more resources just type 'webscraping R' in your favourite search 
engine.

HTH,

Eduard

On 26/06/12 20:57, Augusto Ribas wrote:
> Hello.
> I'm having problems with a function for web scraping.
> I have a long list like this example:
>
> data <- data.frame(especie = c("Rana pipiens", "Rana vaillanti",
>                    "Ctenosaura similis", "Bos taurus"),
>                    group = c("sapo", "sapo", "reptil", "mamifero"))
>
> And, as some of the species names are out of date, I am trying to make
> a function that checks the Catalogue of Life
> (http://www.catalogueoflife.org/) and gets the current names.
> This has some problems, such as species names that have been split, but
> it helps as a first check.
>
> So I made this function to scrape the data.
> It's simple: it searches the site by building a link with the keywords,
> then follows the first link in the list of results and gets the
> accepted name and author, returning the results as a list.
> For example:
>
>> sp.check("Rana pipiens")
> $sp.aceito
> [1] "Lithobates pipiens"
>
> $autor
> [1] "Schreber, 1782"
>
> But sometimes the function cannot access the internet and gives an error.
>
> I made this function by copying some examples from forums, but I
> have some doubts:
>
> 01) How do I suppress the readLines() warnings?
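
One way to do that (a minimal sketch, assuming the warnings are the usual
readLines() messages about an incomplete final line or an unclosed
connection) is to open the connection yourself, wrap the call in
suppressWarnings(), and close it when you are done:

con <- url(link)
pagina <- suppressWarnings(readLines(con))
close(con)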
>
> 02) How can I make the function try again when it cannot access the
> internet, or just print something like "Can't access internet", so that
> when I try something like:
>
> data$check <- NA
> for(i in 1:nrow(data)) {
>   data$check[i] <- sp.check(data$especie[i])
> }
>
> the loop doesn't stop?
> This example list is short, but with 500 or more lines the loop usually
> stops in the middle.
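
A common pattern (a minimal sketch; sp.check.safe(), the number of tries and
the fallback strings are not part of the original function) is to wrap
sp.check() in tryCatch() and retry a few times before giving up, so a single
failed request does not stop the loop:

sp.check.safe <- function(especie, tries = 3) {
  for (k in seq_len(tries)) {
    # NULL marks a failed attempt (e.g. no internet connection)
    res <- tryCatch(sp.check(especie), error = function(e) NULL)
    if (!is.null(res)) return(res)
    Sys.sleep(2)  # short pause before trying again
  }
  list(sp.aceito = "Can't access internet", autor = "Can't access internet")
}

Storing a single element in the loop, e.g.
data$check[i] <- sp.check.safe(data$especie[i])$sp.aceito, also keeps the
check column a plain character vector.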
>
> 03) Does anyone have an example of how to scrape the status of a
> species from http://www.iucnredlist.org/, since it does not use the
> keyword in the link? Is there a simple tutorial for someone without any
> background in programming or computer science?
>
>
> Well, thanks for your attention.
>
> # function sp.check
>
> sp.check<-function(especie) {
> #split species name
> especie<-as.character(especie)
>
> gen<-strsplit(especie,"\\ ")[[1]][1]
> esp<-strsplit(especie,"\\ ")[[1]][2]
>
> # making the first link
> link<-paste("http://www.catalogueoflife.org/col/search/all/key/",gen,"+",esp,"/match/1",sep="")
> link <- iconv(link, 'latin1', 'UTF-8')
> Encoding(link) <- 'bytes'
>
> #reading table of results
> pagina <- readLines(url(link))
>
> n.linhas<-which(pagina%in%"        <td class=\"field_header_black\">")
>
> # are there any results?
> if(length(n.linhas)>0) {
>
> pag.sp<-strsplit(pagina[n.linhas[1]+1],'\\"')[[1]][2]
>
> #second link
> link2 <- paste( "http://www.catalogueoflife.org",pag.sp,sep="")
> link2 <- iconv(link2, 'latin1', 'UTF-8')
> Encoding(link2) <- 'bytes'
> link2
>
> # read the species page
> pagina2 <- readLines(url(link2))
>
> #get line of interest
> linha2<-grep('(accepted name)',pagina2)
> sp.final<-pagina2[linha2]
>
> #get species name
> corte1<-strsplit(sp.final,'<i>')[[1]][2]
> sp.aceito<-strsplit(corte1,'</i>')[[1]][1]
>
> #get author
> corte2<-strsplit(sp.final,'\\(')[[1]][2]
> autor<-strsplit(corte2,')')[[1]][1]
> }else {
> sp.aceito<-c("Not found")
> autor<-c("Not found")
> }
> return(list(sp.aceito=sp.aceito,autor=autor))
> }
>
> --
> Thanks,
> Augusto C. A. Ribas
>
> Site Pessoal: http://augustoribas.heliohost.org
> Lattes: http://lattes.cnpq.br/7355685961127056
>


