[BioC] How do I parse HTML table using RCurl?

Mon Mar 14 22:58:57 CET 2011

On Mon, Mar 14, 2011 at 1:18 PM, Ruppert Valentino <ruppert7 at hotmail.com> wrote:
>
>
> Hello,
>
> I am trying to write a script that will enter miRNA and get the predicted target genes for that miRNA. I am trying to use various software to do this, one of them is TargetScan. The problem is that I don't know how to parse the HTML output table so that I can get the target genes only.
>
> For example, I am searching for target genes for the miRNA mmu-miR-1 as follows:
>
> http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&gid=&mir_sc=&mir_c=&mir_nc=&mirg=mmu-miR-1
>
> This generates a table
>
>
>
> The script is:
>
> URL <- "http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&gid=&mir_sc=&mir_c=&mir_nc=&mirg=mmu-miR-1"
> dat <- readLines(URL)
>
>
> But I don't know how to parse the table into columns so that I can take the column entitled "Human ortholog of target gene", which would have the target genes.
>
>
> In the example above the first gene COL4A3 starts at HTML code:
>
> <td><a href="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=1285" target=new>COL4A3
>
>
>
> Is there any way to format such a table into columns then transpose the column entitled "Human ortholog of target gene" and pass that to a variable?
>
>
> Many thanks,
>
>

Hi,

In general, screen scraping is not the best solution--if the page
design changes, your code will break.

(If you just need to do this once, you could just copy and paste the
table into Excel.)

When faced with this type of situation, check whether the web site in
question has a programmatic interface or web service. Looking at it
briefly, it doesn't appear that this one does. However, they do make
all of their data available in CSV format, along with some Perl
scripts to do basic analysis:

http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_50

This may get you closer to what you want to do. Consider downloading
the data in CSV format and using R (or the Perl scripts in combination
with R) to recreate the table you got with your original query...from
there it's a simple matter (in R) to subset the column(s) you're
interested in.
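
As a rough sketch of that last subsetting step (the column names and
rows below are made up for illustration; I have not checked the actual
layout of the downloadable files, so adjust the names and the
separator to whatever the real file uses):

```r
# Toy stand-in for a downloaded data file. The column names ("miRNA",
# "Gene") and the rows are hypothetical, NOT the real TargetScan layout.
raw <- paste(
  "miRNA\tGene",
  "miR-1\tCOL4A3",
  "miR-1\tGENE_B",
  "miR-2\tGENE_C",
  sep = "\n"
)

# With a real download you would pass the file path instead of text=,
# and use read.csv() if the file is comma-separated.
predictions <- read.delim(text = raw, stringsAsFactors = FALSE)

# Keep the rows for the miRNA of interest and pull out the gene column.
targets <- unique(predictions$Gene[predictions$miRNA == "miR-1"])
print(targets)
```

The result is a plain character vector, so it can be assigned to a
variable and passed on to whatever comes next in your script.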

If that doesn't work out, Sean's suggestion to use the XML package is
a good one.
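
A minimal sketch of the XML route, assuming the XML package is
installed. The HTML string here is a made-up stand-in for the results
page (only the column title and COL4A3 come from your message), and I
have not checked which table on the real page holds the results, so
the which = 1 below is an assumption:

```r
library(XML)  # install.packages("XML") if it is not already installed

# Made-up HTML standing in for the results page; the second column and
# the second row are placeholders for illustration only.
html <- '<html><body><table>
<tr><th>Human ortholog of target gene</th><th>Gene name</th></tr>
<tr><td><a href="#" target="new">COL4A3</a></td><td>placeholder</td></tr>
<tr><td><a href="#" target="new">GENE_B</a></td><td>placeholder</td></tr>
</table></body></html>'

# Parse the HTML; for the real page you would give htmlParse() the
# downloaded page contents (or the URL) instead of this string.
doc <- htmlParse(html, asText = TRUE)

# Convert the first table in the document to a data frame.
tab <- readHTMLTable(doc, which = 1, header = TRUE,
                     stringsAsFactors = FALSE)

# Find the column by matching on its title rather than by position,
# so a reordering of the columns does not silently break the script.
col <- grep("ortholog", names(tab), ignore.case = TRUE)[1]
genes <- tab[[col]]
print(genes)
```

Matching on the column title is a little more robust than a fixed
index, but it will still break if the site changes its layout, as
noted above.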

Dan
