[R-sig-eco] Webscraping the Plants Database

Chris Stubben stubben at lanl.gov
Thu Jan 3 18:36:08 CET 2013


Tim Seipel wrote
> Thanks Sarah,
> Didn't realize you can go through advanced search webpage to get all the 
> fields!

If you use the advanced search page, check the box in the bottom right
corner to "Display search URL for future use".  Depending on the fields you
select, that should give you something like...

http://plants.usda.gov/java/AdvancedSearchServlet?sciname=Astragalus%20miser&dsp_symbol=on&dsp_statefips=on&dsp_family=on&dsp_dur=on&dsp_grwhabt=on&dsp_nativestatuscode=on&dsp_fed_te_status=on&Synonyms=all&viewby=sciname

Just take that URL, move the sciname parameter to the end so the species
name can be pasted onto it, and then use the readHTMLTable function in the
XML package.

library(XML)

url <- "http://plants.usda.gov/java/AdvancedSearchServlet?dsp_symbol=on&dsp_statefips=on&dsp_family=on&dsp_dur=on&dsp_grwhabt=on&dsp_nativestatuscode=on&dsp_fed_te_status=on&Synonyms=all&viewby=sciname&sciname="

species <- "Astragalus miser"
url2 <- paste(url, species, sep="")
x <- readHTMLTable(url2)
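
Species names contain a space, and the search URL shown by the site writes it as %20. Pasting the raw name worked for me, but if it ever gives trouble, a small helper along these lines might be safer. This is only a sketch: plants_url() is a made-up name, not a function from the XML package, and base_url is simply the URL from above.

```r
# Base search URL from the message above; sciname is left last
# so the species name can be appended to it.
base_url <- "http://plants.usda.gov/java/AdvancedSearchServlet?dsp_symbol=on&dsp_statefips=on&dsp_family=on&dsp_dur=on&dsp_grwhabt=on&dsp_nativestatuscode=on&dsp_fed_te_status=on&Synonyms=all&viewby=sciname&sciname="

# plants_url() is a hypothetical helper, not part of any package.
plants_url <- function(species) {
  # URLencode() (utils package) percent-encodes the space in the name
  paste0(base_url, URLencode(species, reserved = TRUE))
}

plants_url("Astragalus miser")
```

The result can be passed to readHTMLTable in place of url2.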

These pages have lots of formatting tables, so the results are quite
messy. Sometimes it helps to count the number of rows in each table:

sapply(x, nrow)
NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL 
  65    1   56   46   43    9    1    1    1    1 

But once you find the table you need (#8), then you can just select that
directly.

x[[8]]

Or use the which argument to get table 8 directly.  I also dropped
everything after the first newline in each column name...
x2<-readHTMLTable(url2, which=8, stringsAsFactors=FALSE)
names(x2) <- gsub("(.*?)\r\n.*", "\\1", names(x2) )
x2
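
The table number is something I found by eye, so it could change if the page layout changes. One possible alternative is to pick the table by a column header instead of by position. The sketch below assumes the results table is the only one with a header containing "Symbol"; find_table() is a hypothetical helper, and the small list here just stands in for the output of readHTMLTable(url2).

```r
# find_table() is a hypothetical helper: it returns the position(s) of
# the tables whose column names contain the given text.
find_table <- function(tables, column = "Symbol") {
  has_col <- vapply(tables, function(tb) {
    # readHTMLTable can return NULL for tables it cannot parse
    is.data.frame(tb) && any(grepl(column, names(tb), fixed = TRUE))
  }, logical(1))
  which(has_col)
}

# Stand-in for the list returned by readHTMLTable(url2)
tables <- list(
  data.frame(V1 = c("menu", "links", "footer")),
  NULL,
  data.frame(Symbol = "ASMI9", Family = "Fabaceae")
)

idx <- find_table(tables)
tables[[idx[1]]]
```

Matching with fixed = TRUE on part of the name means the header still matches before the newlines are stripped from the column names.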

t(x2)
                   [,1]                                                    
Symbol             "ASMI9"                                                 
Scientific Name    "Astragalus miser"                                      
State and Province "USA (AZ, CO, ID, MT, NV, SD, UT, WA, WY), CAN (AB, BC)"
Family             "Fabaceae"                                              
Duration           "Perennial"                                             
Growth Habit       "Forb/herb"                                             
Native Status      "L48 (N), CAN (N)"                                      
Federal T/E Status ""     


You should be able to wrap that in a loop and go through your species
list.  For example, the next species:

species<- "Festuca idahoensis"
url2 <- paste(url, species, sep="")
x2<-readHTMLTable(url2, which=8, stringsAsFactors=FALSE)

names(x2) <- gsub("(.*?)\r\n.*", "\\1", names(x2) )

 t(x2)
                   [,1]                                                                    
Symbol             "FEID"                                                                  
Scientific Name    "Festuca idahoensis"                                                    
State and Province "USA (AZ, CA, CO, ID, MT, NM, NV, OR, SD, UT, WA, WY), CAN (AB, BC, SK)"
Family             "Poaceae"                                                               
Duration           "Perennial"                                                             
Growth Habit       "Graminoid"                                                             
Native Status      "L48 (N), CAN (N)"                                                      
Federal T/E Status ""       
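
Something like the following might work for the loop itself. It is a rough sketch rather than tested code against the live site: fetch_plants() and collect_species() are names I made up, table 8 is assumed to be the results table for every species, and species that fail to download or parse are silently skipped.

```r
# Search URL from earlier in this message, with sciname left last
base_url <- "http://plants.usda.gov/java/AdvancedSearchServlet?dsp_symbol=on&dsp_statefips=on&dsp_family=on&dsp_dur=on&dsp_grwhabt=on&dsp_nativestatuscode=on&dsp_fed_te_status=on&Synonyms=all&viewby=sciname&sciname="

# Hypothetical helper: fetch one species and tidy the column names.
# Assumes the results are always in table 8.
fetch_plants <- function(sp) {
  x <- XML::readHTMLTable(paste0(base_url, URLencode(sp)),
                          which = 8, stringsAsFactors = FALSE)
  names(x) <- gsub("(.*?)\r\n.*", "\\1", names(x))
  x
}

# Hypothetical helper: apply a fetch function to every species and
# stack the one-row results into a single data frame.
collect_species <- function(species, fetch = fetch_plants) {
  results <- lapply(species, function(sp) {
    # A failed request or missing table becomes NULL instead of
    # stopping the whole loop
    tryCatch(fetch(sp), error = function(e) NULL)
  })
  ok <- !vapply(results, is.null, logical(1))
  do.call(rbind, results[ok])
}

# Usage (needs the XML package and network access):
# all_species <- collect_species(c("Astragalus miser", "Festuca idahoensis"))
```

Passing the fetch function as an argument makes it easy to try the loop with a dummy function first, before sending requests to the site.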




Chris Stubben











