On Sat, Oct 31, 2009 at 2:15 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote:

> Hi Steve --
>
> Steve Lianoglou <mailinglist.honeypot@gmail.com> writes:
>
> > Hi,
> >
> > On Oct 28, 2009, at 3:44 PM, Martin Morgan wrote:
> > <snip>
> >>>>
> >>> It could be useful, but you'll need to get it into the shape
> >>> expected by the
> >>> GenomicFeatures functions. With specific regard to transcripts(),
> >>> it will
> >>> need a 'chrom' column with the chromosome names, 'name' column for
> >>> the gene
> >>> name, and 'txStart' and 'txEnd' for the start and end of the
> >>> transcripts,
> >>> using UCSC coordinate conventions.
> >>>
> >>> I'm now thinking that it would be more convenient for the user if
> >>> there was
> >>> a transcripts method on a RangesList object, which could provide the
> >>> necessary information. This could be extracted from the RangedData,
> >>> which
> >>> rtracklayer can create from your GFF file, with one caveat: if the
> >>> GFF file
> >>> contains a mixture of features (genes, exons, etc) and relies on the
> >>> hierarchical features of GFF, it will take more work to get things
> >>> into the
> >>> right shape.
> >>>
> >>> The question then is where would this new functionality belong? The
> >>> future
> >>> of the GenomicFeatures package was a bit uncertain, the last time I
> >>> checked.
> >>
> >> The intention is that GenomicFeatures will mature to contain these
> >> sorts of data structures and functions, so that it's easy to create or
> >> retrieve transcript or exon information.
> > </snip>
>
> > I've actually been working on a package of my own that does a lot of
> > this stuff, since I've been working lately with rna-seq data.
> >
> > It takes an input file similar to what you get from the ucsc table
> > browser with gene, tx/cds -start/-end, exon, etc. columns and whips up
> > an annotation package for the genome/annotation source you are using
> > (info is stored in an SQLite database). You can then query the
> > database for genes of interest by name, chromosome, or bounded (start/
> > end) chromosome position for further downstream use.
> >
> > Usually results are returned to the user as lists of gene objects,
> > which the user would have to whip into a data.frame format if that's
> > what's desired.
> >
> > I'll post some code below to show you what I mean, and see if you all
> > think it's helpful. I'm not sure that it's ready for public
> > consumption, but I guess I just wanted to mention that it was my
> > intention to develop this a bit further in the near future. If it's
> > useful for other people, all the better.
>
> Just wanted to acknowledge this email. I like how straight-forward
> your interface is, and the use of appropriate classes. RTree and the
> custom compile is a little problematic;


I'm wondering if the custom compile is really necessary? Is it not possible
for rsqlite to provide that feature by default?


> I wonder whether it's really
> necessary for data this size (i.e., not too large) or whether there
> are other ways in which the query could be optimized (representing the
> data in-memory and using IRanges::findOverlaps?).
>
>
For optimizing in-memory queries, see the IntervalTree() function.
Precomputing the interval tree brings substantial time savings for multiple
queries on the same database.


> What ever the case, it would be good to coordinate your activities
> with what we do here in Seattle and with GenomicFeatures, so we'll try
> and keep the channels of communication open.
>
> Martin
>
> > Unfortunately I need to put much of this on hold for the immediate
> > future, since I have to prepare for my upcoming qualifying exam ...
> >
> > I realize it somehow does similar things to what the GenomicFeatures
> > package might have done, but I needed this exact functionality for my
> > immediate analysis needs, and thought I'd have a go at it. The SQLite
> > database that's generated actually stores gene, transcript, and exon
> > info in RTree indices[1], so ranged queries (like all genes/
> > transcripts between some start/stop positions) are super fast. It
> > requires a custom compile of SQLite, though
> >
> > Examples of how I use it are pasted below.
> >
> > -steve
> >
> > [1] RTree: http://www.sqlite.org/rtree.html
> >
> > ##############################
> > # Using with refseq annotations for hg18
> >
> > R> library(GenomeAnnotations) # name of my library
> > R> hg18.refseq <- GenomeDB("hg18", "refseq")
> > R> d1r <- Gene("DICER1", genome=hg18.refseq)
> >
> > ## Getting transcript info
> >
> > R> transcripts(d1r)
> > SimpleRangesList instance:
> > $NM_030621
> > IRanges instance:
> >          start      end width
> > [1]  94622320 94626752  4433
> > [2]  94627120 94627200    81
> > [3]  94627296 94627456   161
> > [4]  94629976 94630248   273
> > [5]  94631912 94632800   889
> > [6]  94635872 94636024   153
> > [7]  94639432 94640216   785
> > [8]  94641160 94641336   177
> > [9]  94641768 94641872   105
> > ...       ...      ...   ...
> > [20] 94660288 94660760   473
> > [21] 94662672 94662840   169
> > [22] 94665560 94665720   161
> > [23] 94666144 94666280   137
> > [24] 94667600 94667728   129
> > [25] 94668608 94668768   161
> > [26] 94669408 94669592   185
> > [27] 94676752 94676848    97
> > [28] 94677776 94677808    33
> >
> > <1 additional element>
> >
> > ## UTR Info
> > R> getUtr3(d1r)
> > $NM_030621
> > IRanges instance:
> >         start      end width
> > [1] 94622320 94626587  4268
> >
> > $NM_177438
> > IRanges instance:
> >         start      end width
> > [1] 94622320 94626587  4268
> >
> > R> getUtr5(d1r)
> > $NM_030621
> > IRanges instance:
> >         start      end width
> > [1] 94669548 94669592    45
> > [2] 94676752 94676848    97
> > [3] 94677776 94677808    33
> >
> > $NM_177438
> > IRanges instance:
> >         start      end width
> > [1] 94669548 94669592    45
> > [2] 94693320 94693512   193
> >
> >
> > ## Get genes on chromosome 1
> > R> genes.1 <- getGenesOnChromosome(hg18.refseq, 1)
> > R> names(genes.1)[1:5]
> > [1] "WASH5P"       "LOC100288778" "FAM138F"      "FAM138A"
> > "FAM138C"
> >
> > ## Get genes on chr1 between pos 1e6 and 1e7
> > R> genes.1.sub <- getGenesOnChromosome(hg18.refseq, 1, start=1e6,
> > end=1e7)
> > R> names(genes.1.sub)[1:5]
> > [1] "C1orf159" "TTLL10"   "TTLL10"   "TNFRSF18" "TNFRSF18"
> >
> > #########################################
> > # or, hg18 aceview annos
> > R> hg18.ace <- GenomeDB("hg18", "aceview")
> > R> d1a <- Gene("DICER1", genome=hg18.ace)
> > R> transcripts(d1a)
> > SimpleRangesList instance:
> > $DICER1.cApr07
> > IRanges instance:
> >          start      end width
> > [1]  94622320 94626752  4433
> > [2]  94627120 94627200    81
> > [3]  94627296 94627456   161
> > [4]  94629976 94630248   273
> > [5]  94631912 94632800   889
> > [6]  94635872 94636024   153
> > [7]  94639432 94640216   785
> > [8]  94641160 94641336   177
> > [9]  94641768 94641872   105
> > ...       ...      ...   ...
> > [19] 94653712 94653840   129
> > [20] 94660288 94660760   473
> > [21] 94662672 94662840   169
> > [22] 94665560 94665720   161
> > [23] 94666144 94666280   137
> > [24] 94667600 94667728   129
> > [25] 94668608 94668768   161
> > [26] 94669408 94669592   185
> > [27] 94693320 94693512   193
> >
> > <10 additional elements>
> >
> > R> getUtr3(d1a)
> > $DICER1.cApr07
> > IRanges instance:
> >         start      end width
> > [1] 94622320 94626587  4268
> >
> > $DICER1.bApr07
> > IRanges instance:
> >         start      end width
> > [1] 94622320 94626587  4268
> >
> > .... etc ...
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

	[[alternative HTML version deleted]]

