[R] Is there a way to extract some fields data from HTML pages through any R function ?
Martin Morgan
mtmorgan at fhcrc.org
Wed Jul 1 17:51:23 CEST 2009
Hi Maura --
mauede at alice.it wrote:
> I deal with a huge amount of Biology data stored in different databases.
> The databases belonging to the Bioconductor organization can be accessed through Bioconductor packages.
> Unluckily some useful data is stored in databases like, for instance, miRDB, miRecords, etc ... which offer just an
> interactive HTML interface. See for instance
> http://mirdb.org/cgi-bin/search.cgi,
> http://mirecords.umn.edu/miRecords/interactions.php?species=Homo+sapiens&mirna_acc=Any&targetgene_type=refseq_acc&targetgene_info=&v=yes&search_int=Search
>
> Downloading data manually from the web pages is a painstaking, time-consuming and error-prone activity.
> I came across a Python script that downloads (dumps) whole web pages into a text file that is then parsed.
> This is possible because Python has a library to access web pages.
> But I have no experience with Python programming, nor do I like a programming language whose syntax is indentation-sensitive.
>
> I am *hoping* that there exists some sort of web page / HTML connection from R ... is there?
Tools in R for this are the RCurl package and the XML package.
  library(RCurl)
  library(XML)
Typically this involves manual exploration of the web form. Then you
might query the web form
  result <- postForm("http://mirdb.org/cgi-bin/search.cgi",
                     searchType="miRNA", species="Human",
                     searchBox="hsa-let-7a", submitButton="Go")
and parse the results into a convenient structure
  html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
You can then use XPath (http://www.w3.org/TR/xpath, especially section
2.5) to explore and extract information, e.g.,
  ## second table, first row
  getNodeSet(html, "//table[2]/tr[1]")
  ## second table, makes subsequent paths shorter
  tbl <- getNodeSet(html, "//table[2]")[[1]]
  xget <- function(xml, path)       # a helper function
      unlist(xpathApply(xml, path, xmlValue))[-1]
  df <- data.frame(TargetRank=as.numeric(xget(tbl, "./tr/td[2]")),
                   TargetScore=as.numeric(xget(tbl, "./tr/td[3]")),
                   miRNAName=xget(tbl, "./tr/td[4]"),
                   GeneSymbol=xget(tbl, "./tr/td[5]"),
                   GeneDescription=xget(tbl, "./tr/td[6]"))
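To try the XPath helper without touching the network, here is a minimal
offline sketch that runs the same kind of expressions against a made-up
two-table page. The HTML string, its column layout and the gene names are
invented for illustration and are not the real miRDB markup; only the XML
package calls are the ones used above.

```r
library(XML)

## Toy page, invented for this example (NOT the real miRDB output).
## The first table stands in for page layout; the second holds the data.
page <- "<html><body>
<table><tr><td>layout</td></tr></table>
<table>
<tr><td>Detail</td><td>Rank</td><td>Score</td><td>Gene</td></tr>
<tr><td>x</td><td>1</td><td>98</td><td>GENE_A</td></tr>
<tr><td>x</td><td>2</td><td>95</td><td>GENE_B</td></tr>
</table>
</body></html>"

doc <- htmlTreeParse(page, asText=TRUE, useInternalNodes=TRUE)

## second table, as in the code above
tbl <- getNodeSet(doc, "//table[2]")[[1]]

## same helper: collect the cell text for a path; [-1] drops the first
## value, which here is the header row's cell
xget <- function(xml, path)
    unlist(xpathApply(xml, path, xmlValue))[-1]

rank <- as.numeric(xget(tbl, "./tr/td[2]"))
gene <- xget(tbl, "./tr/td[4]")
```

On a real page the column positions (td[2], td[4], ...) have to be read
off the actual table, so they will likely differ from this toy layout.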
There are many ways through this latter part, probably some much cleaner
than presented above. There are fairly extensive examples on each of the
relevant help pages, e.g., ?postForm.
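One possibly cleaner route, sketched here under assumptions, is
readHTMLTable() from the XML package, which converts a table node to a
data frame in one call. The page below is a made-up stand-in with header
cells marked up as <th>; the real page may be laid out differently, so
the header handling and column names would need checking against it.

```r
library(XML)

## Toy page, invented for this example (NOT the real miRDB output).
page <- "<html><body>
<table><tr><td>layout</td></tr></table>
<table>
<tr><th>Rank</th><th>Score</th><th>Gene</th></tr>
<tr><td>1</td><td>98</td><td>GENE_A</td></tr>
<tr><td>2</td><td>95</td><td>GENE_B</td></tr>
</table>
</body></html>"

doc <- htmlTreeParse(page, asText=TRUE, useInternalNodes=TRUE)
tbl <- getNodeSet(doc, "//table[2]")[[1]]

## header=TRUE treats the first row as column names;
## stringsAsFactors=FALSE keeps the cell text as character
df <- readHTMLTable(tbl, header=TRUE, stringsAsFactors=FALSE)

## cells arrive as text, so convert numeric columns explicitly
df$Rank  <- as.numeric(df$Rank)
df$Score <- as.numeric(df$Score)
```

This trades the per-column XPath calls for a single conversion, at the
cost of less control over which cells are picked up.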
Martin
> Thank you very much for any suggestion.
> Maura