[Bioc-sig-seq] GenomicFeatures package

Martin Morgan mtmorgan at fhcrc.org
Sat Oct 31 22:15:40 CET 2009


Hi Steve --

Steve Lianoglou <mailinglist.honeypot at gmail.com> writes:

> Hi,
>
> On Oct 28, 2009, at 3:44 PM, Martin Morgan wrote:
> <snip>
>>>>
>>> It could be useful, but you'll need to get it into the shape
>>> expected by the
>>> GenomicFeatures functions. With specific regard to transcripts(),
>>> it will
>>> need a 'chrom' column with the chromosome names, 'name' column for
>>> the gene
>>> name, and 'txStart' and 'txEnd' for the start and end of the
>>> transcripts,
>>> using UCSC coordinate conventions.
>>>
>>> I'm now thinking that it would be more convenient for the user if
>>> there was
>>> a transcripts method on a RangesList object, which could provide the
>>> necessary information. This could be extracted from the RangedData,
>>> which
>>> rtracklayer can create from your GFF file, with one caveat: if the
>>> GFF file
>>> contains a mixture of features (genes, exons, etc) and relies on the
>>> hierarchical features of GFF, it will take more work to get things
>>> into the
>>> right shape.
>>>
>>> The question then is where would this new functionality belong? The
>>> future
>>> of the GenomicFeatures package was a bit uncertain, the last time I
>>> checked.
>>
>> The intention is that GenomicFeatures will mature to contain these
>> sorts of data structures and functions, so that it's easy to create or
>> retrieve transcript or exon information.
> </snip>

> I've actually been working on a package of my own that does a lot of
> this stuff, since I've been working lately with rna-seq data.
>
> It takes an input file similar to what you get from the ucsc table
> browser with gene, tx/cds -start/-end, exon, etc. columns and whips up
> an annotation package for the genome/annotation source you are using
> (info is stored in an SQLite database). You can then query the
> database for genes of interest by name, chromosome, or bounded (start/
> end) chromosome position for further downstream use.
>
> Usually results are returned to the user as lists of gene objects,
> which the user would have to whip into a data.frame format if that's
> what's desired.
>
> I'll post some code below to show you what I mean, and see if you all
> think it's helpful. I'm not sure that it's ready for public
> consumption, but I guess I just wanted to mention that it was my
> intention to develop this a bit further in the near future. If it's
> useful for other people, all the better.

Just wanted to acknowledge this email. I like how straight-forward
your interface is, and the use of appropriate classes. RTree and the
custom compile is a little problematic; I wonder whether it's really
necessary for data this size (i.e., not too large) or whether there
are other ways in which the query could be optimized (representing the
data in-memory and using IRanges::findOverlaps?).

What ever the case, it would be good to coordinate your activities
with what we do here in Seattle and with GenomicFeatures, so we'll try
and keep the channels of communication open.

Martin

> Unfortunately I need to put much of this on hold for the immediate
> future, since I have to prepare for my upcoming qualifying exam ...
>
> I realize it somehow does similar things to what the GenomicFeatures
> package might have done, but I needed this exact functionality for my
> immediate analysis needs, and thought I'd have a go at it. The SQLite
> database that's generated actually stores gene, transcript, and exon
> info in RTree indices[1], so ranged queries (like all genes/
> transcripts between some start/stop positions) are super fast. It
> requires a custom compile of SQLite, though
>
> Examples of how I use it are pasted below.
>
> -steve
>
> [1] RTree: http://www.sqlite.org/rtree.html
>
> ##############################
> # Using with refseq annotations for hg18
>
> R> library(GenomeAnnotations) # name of my library
> R> hg18.refseq <- GenomeDB("hg18", "refseq")
> R> d1r <- Gene("DICER1", genome=hg18.refseq)
>
> ## Getting transcript info
>
> R> transcripts(d1r)
> SimpleRangesList instance:
> $NM_030621
> IRanges instance:
>          start      end width
> [1]  94622320 94626752  4433
> [2]  94627120 94627200    81
> [3]  94627296 94627456   161
> [4]  94629976 94630248   273
> [5]  94631912 94632800   889
> [6]  94635872 94636024   153
> [7]  94639432 94640216   785
> [8]  94641160 94641336   177
> [9]  94641768 94641872   105
> ...       ...      ...   ...
> [20] 94660288 94660760   473
> [21] 94662672 94662840   169
> [22] 94665560 94665720   161
> [23] 94666144 94666280   137
> [24] 94667600 94667728   129
> [25] 94668608 94668768   161
> [26] 94669408 94669592   185
> [27] 94676752 94676848    97
> [28] 94677776 94677808    33
>
> <1 additional element>
>
> ## UTR Info
> R> getUtr3(d1r)
> $NM_030621
> IRanges instance:
>         start      end width
> [1] 94622320 94626587  4268
>
> $NM_177438
> IRanges instance:
>         start      end width
> [1] 94622320 94626587  4268
>
> R> getUtr5(d1r)
> $NM_030621
> IRanges instance:
>         start      end width
> [1] 94669548 94669592    45
> [2] 94676752 94676848    97
> [3] 94677776 94677808    33
>
> $NM_177438
> IRanges instance:
>         start      end width
> [1] 94669548 94669592    45
> [2] 94693320 94693512   193
>
>
> ## Get genes on chromosome 1
> R> genes.1 <- getGenesOnChromosome(hg18.refseq, 1)
> R> names(genes.1)[1:5]
> [1] "WASH5P"       "LOC100288778" "FAM138F"      "FAM138A"
> "FAM138C"
>
> ## Get genes on chr1 between pos 1e6 and 1e7
> R> genes.1.sub <- getGenesOnChromosome(hg18.refseq, 1, start=1e6,
> end=1e7)
> R> names(genes.1.sub)[1:5]
> [1] "C1orf159" "TTLL10"   "TTLL10"   "TNFRSF18" "TNFRSF18"
>
> #########################################
> # or, hg18 aceview annos
> R> hg18.ace <- GenomeDB("hg18", "aceview")
> R> d1a <- Gene("DICER1", genome=hg18.ace)
> R> transcripts(d1a)
> SimpleRangesList instance:
> $DICER1.cApr07
> IRanges instance:
>          start      end width
> [1]  94622320 94626752  4433
> [2]  94627120 94627200    81
> [3]  94627296 94627456   161
> [4]  94629976 94630248   273
> [5]  94631912 94632800   889
> [6]  94635872 94636024   153
> [7]  94639432 94640216   785
> [8]  94641160 94641336   177
> [9]  94641768 94641872   105
> ...       ...      ...   ...
> [19] 94653712 94653840   129
> [20] 94660288 94660760   473
> [21] 94662672 94662840   169
> [22] 94665560 94665720   161
> [23] 94666144 94666280   137
> [24] 94667600 94667728   129
> [25] 94668608 94668768   161
> [26] 94669408 94669592   185
> [27] 94693320 94693512   193
>
> <10 additional elements>
>
> R> getUtr3(d1a)
> $DICER1.cApr07
> IRanges instance:
>         start      end width
> [1] 94622320 94626587  4268
>
> $DICER1.bApr07
> IRanges instance:
>         start      end width
> [1] 94622320 94626587  4268
>
> .... etc ...

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list