[R] R: R: Is there a way to extract some fields data from HTML pages through any R function ?
Gabor Grothendieck
ggrothendieck at gmail.com
Mon Jul 6 13:49:50 CEST 2009
If the question is how to download an Excel file at a known location
and read it into an R data frame, then try this:
library(gdata)
URL <- "http://mirecords.umn.edu/miRecords/download_data.php?v=1"
DF <- read.xls(URL)
See ?read.xls for more info.
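
If read.xls() cannot fetch the URL directly, a two-step variant is to download the file first and then read the local copy. What follows is only a sketch under assumptions: `fetch_to_temp` is a hypothetical helper name (it is not part of gdata), and the commented-out call assumes the URL above still serves an Excel file.

```r
## Hypothetical helper (not part of gdata): download a URL to a temporary
## file and return the local path. mode = "wb" keeps binary files such as
## .xls intact on Windows.
fetch_to_temp <- function(url, ext = ".xls") {
    dest <- tempfile(fileext = ext)
    status <- download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
    if (status != 0)
        stop("download failed with status ", status)
    dest
}

## Intended use with the URL above (assumes the server still responds):
## library(gdata)
## DF <- read.xls(fetch_to_temp(URL))
```

Keeping the download separate from the parsing step makes it easier to see which of the two failed, and the local copy can be inspected by hand.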
On Mon, Jul 6, 2009 at 2:27 AM, <mauede at alice.it> wrote:
> It helps. But it is overly sophisticated.
> I have already downloaded and used the Excel file containing the validated stuff.
>
> Since there are R commands to download gzip as well as FASTA files, I wonder whether it is possible to
> automatically download the Excel file from http://mirecords.umn.edu/miRecords/download.php
> Actually the latter may not be the actual file URL because it is necessary to click on the word "here" to download the file.
>
> Thank you,
> Maura
>
> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> Sent: Sun 05/07/2009 21.42
> To: mauede at alice.it
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: R: [R] Is there a way to extract some fields data from HTML pages through any R function ?
>
> mauede at alice.it wrote:
>> I tried to apply the scheme you suggested to open the web page at
>> "http://mirecords.umn.edu/miRecords/index.php" and got the following:
>>
>>> result <- postForm("http://mirecords.umn.edu/miRecords/index.php",
>> + searchType="miRNA", species="Homo sapiens",
>> + searchBox="hsa-let-7a", submitButton="Search")
>
> What we are doing here is sometimes called 'screen scraping' -- figuring
> out how to extract information from a web page when the information is
> not presented in an alternative, more reliable, form. I offered this
> route as a response to your specific question, how to extract some
> fields from an HTML page, but maybe there is a better way that is
> specific to the resources and information you are trying to extract. For
> instance, I see on the web page above that there is a link 'Download
> validated targets' that leads to an Excel-style spreadsheet. Maybe that
> is a better route for this resource? I don't know.
>
> In terms of the problem you are encountering above, the fields
> searchType, species, searchBox, and submitButton were all defined on the
> web page of the resource you mentioned in a previous email; here you
> must look at the 'source' (e.g., right-click 'View Page Source' in
> Firefox) of the web page you are trying to scrape, and figure out the
> appropriate fields. This requires some familiarity with HTML and HTML
> forms, so that you can recognize what you are looking for. I think on
> this particular page you are likely to run into additional
> difficulties, because selection of a 'species' populates the 'mirna_acc'
> field with allowable values that combine the miRNA name with the number
> of validated targets that will be returned -- you almost need to know
> the answer before you can programmatically extract the data.
>
>>> html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
>> Unexpected end tag : a
>> error parsing attribute name
>> Opening and ending tag mismatch: strong and font
>> htmlParseStartTag: invalid element name
>> Unexpected end tag : a
>
> htmlTreeParse is very forgiving of malformed HTML, and it is telling
> you that it has parsed the document, even though it was formatted
> incorrectly.
>
>>> html <- htmlTreeParse(result, asText=FALSE, useInternalNodes=TRUE)
>
> There are too many parameters involved to try changing them arbitrarily;
> you must take it upon yourself to understand the functions and the
> correct way to use them.
>
> Hoping this helps,
>
> Martin
>
>> Error in htmlTreeParse(result, asText = FALSE, useInternalNodes = TRUE) :
>> File <html><!-- InstanceBegin template="/Templates/admin.dwt"
>> codeOutsideHTMLIsLocked="false" -->
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
>> <link href="style/link.css" rel="stylesheet" type="text/css">
>> <!-- InstanceParam name="nav_1" type="boolean" value="true" -->
>> <title>miRecords</title>
>> </head>
>> <body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0"
>> marginheight="0">
>> <table width="80" border="0" cellspacing="0" cellpadding="0">
>> <tr>
>> <td colspan="3"><img src="images/title.jpg" alt="" width=900
>> height=79 border="0"></a></td>
>> </tr>
>> <tr>
>> <td width="131" valign="bottom" bgcolor="#CCCCCC"menu""></td>
>> <td width="769" align="right" valign="middle" bgcolor="#CCCCCC"><a
>> href="redirect.php?s=l" class="menu">Validated Targets </a> | <a
>> href="redirect.php?s=p" class="menu">Predicted Targets </a> | <a
>> href="download.php" class="menu">Download Validated Targets </a> | <a
>> href="submit.php" class="m
>>
>> I am lost about how to proceed from the above.
>> My goal is still to get the VALIDATED miRNA identifier and string,
>> followed by its target gene's 3'UTR sequence.
>>
>> Thank you in advance,
>> Maura
>>
>> P.S. BioMart has been working fine since yesterday.
>>
>> -----Original Message-----
>> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
>> Sent: Wed 01/07/2009 17.51
>> To: mauede at alice.it
>> Cc: r-help at stat.math.ethz.ch
>> Subject: Re: [R] Is there a way to extract some fields data from HTML
>> pages through any R function ?
>>
>> Hi Maura --
>>
>> mauede at alice.it wrote:
>>> I deal with a huge amount of biology data stored in different databases.
>>> The databases belonging to the Bioconductor project can be accessed
>>> through Bioconductor packages.
>>> Unfortunately, some useful data is stored in databases such as
>>> miRDB, miRecords, etc., which offer only an
>>> interactive HTML interface. See for instance
>>> http://mirdb.org/cgi-bin/search.cgi,
>>>
>>> http://mirecords.umn.edu/miRecords/interactions.php?species=Homo+sapiens&mirna_acc=Any&targetgene_type=refseq_acc&targetgene_info=&v=yes&search_int=Search
>>>
>>> Downloading data manually from the web pages is a painstaking,
>>> time-consuming, and error-prone activity.
>>> I came across a Python script that downloads (dumps) whole web pages
>>> into a text file that is then parsed.
>>> This is possible because Python has a library to access web pages.
>>> But I have no experience with Python programming, nor do I like a
>>> programming language whose syntax is indentation-sensitive.
>>>
>>> I am *hoping* that there exists some sort of web page / HTML
>>> connection from R ... is there?
>>
>> Tools in R for this are the RCurl package and the XML package.
>>
>> library(RCurl)
>> library(XML)
>>
>> Typically this involves manual exploration of the web form. Then you
>> might query the web form
>>
>> result <- postForm("http://mirdb.org/cgi-bin/search.cgi",
>> searchType="miRNA", species="Human",
>> searchBox="hsa-let-7a", submitButton="Go")
>>
>> and parse the results into a convenient structure
>>
>> html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
>>
>> you can then use XPath (http://www.w3.org/TR/xpath, especially section
>> 2.5) to explore and extract information, e.g.,
>>
>> ## second table, first row
>> getNodeSet(html, "//table[2]/tr[1]")
>> ## second table, makes subsequent paths shorter
>> tbl <- getNodeSet(html, "//table[2]")[[1]]
>> xget <- function(xml, path) # a helper function
>>     unlist(xpathApply(xml, path, xmlValue))[-1]
>> df <- data.frame(TargetRank=as.numeric(xget(tbl, "./tr/td[2]")),
>>                  TargetScore=as.numeric(xget(tbl, "./tr/td[3]")),
>>                  miRNAName=xget(tbl, "./tr/td[4]"),
>>                  GeneSymbol=xget(tbl, "./tr/td[5]"),
>>                  GeneDescription=xget(tbl, "./tr/td[6]"))
>>
>> There are many ways through this latter part, probably some much cleaner
>> than presented above. There are fairly extensive examples on each of the
>> relevant help pages, e.g., ?postForm.
>>
>> Martin
>>
>>
>>> Thank you very much for any suggestion.
>>> Maura
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.