[BioC] How do I parse HTML table using RCurl?

Mon Mar 14 21:18:04 CET 2011

Hello,

I am trying to write a script that takes a miRNA as input and returns the predicted target genes for that miRNA. I am trying several tools for this, one of which is TargetScan. The problem is that I don't know how to parse the HTML output table so that I get only the target genes.

For example, I am searching for target genes of the miRNA mmu-miR-1 as follows:

http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&gid=&mir_sc=&mir_c=&mir_nc=&mirg=mmu-miR-1

This generates an HTML table.

The script is:

URL <- "http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&gid=&mir_sc=&mir_c=&mir_nc=&mirg=mmu-miR-1"
dat <- readLines(URL)

But I don't know how to parse the table into columns so that I can take the column entitled "Human ortholog of target gene", which contains the target genes.

In the example above, the entry for the first gene, COL4A3, begins with this HTML code:

<td><a href="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=1285" target=new>COL4A3

Is there any way to split such a table into columns, extract the column entitled "Human ortholog of target gene" as a vector, and assign it to a variable?
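
To illustrate what I am after, here is a rough sketch using the XML package on a small inline HTML fragment. The fragment is made up: only COL4A3, its link, and the column title come from the real output, while the second gene, the second column, and the use of <th> header cells are assumptions on my part. I don't know whether the real TargetScan page is laid out this way.

```r
library(XML)  # provides htmlParse(), xpathSApply(), xmlValue()

# Made-up stand-in for the TargetScan result page. Only COL4A3 and the
# column title are taken from the real output; the rest is hypothetical.
html <- '<html><body><table>
<tr><th>Human ortholog of target gene</th><th>Other column</th></tr>
<tr><td><a href="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&amp;Cmd=ShowDetailView&amp;TermToSearch=1285">COL4A3</a></td><td>x</td></tr>
<tr><td><a href="#">GENE2</a></td><td>y</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Find the position of the wanted column from the header cells
# (assumes the header row uses <th> elements).
headers <- trimws(xpathSApply(doc, "//table//tr[1]/th", xmlValue))
idx <- which(headers == "Human ortholog of target gene")

# Pull the text of that cell from every data row.
target_genes <- trimws(
  xpathSApply(doc, sprintf("//table//tr/td[%d]", idx), xmlValue)
)
print(target_genes)
```

For the real page I would presumably pass the URL (or the pasted output of readLines) to htmlParse() instead of the inline string, and adjust the XPath if the header row is built differently.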

Many thanks,