On Sat, Oct 31, 2009 at 2:15 PM, Martin Morgan wrote: > Hi Steve -- > > Steve Lianoglou writes: > > > Hi, > > > > On Oct 28, 2009, at 3:44 PM, Martin Morgan wrote: > > > >>>> > >>> It could be useful, but you'll need to get it into the shape > >>> expected by the > >>> GenomicFeatures functions. With specific regard to transcripts(), > >>> it will > >>> need a 'chrom' column with the chromosome names, 'name' column for > >>> the gene > >>> name, and 'txStart' and 'txEnd' for the start and end of the > >>> transcripts, > >>> using UCSC coordinate conventions. > >>> > >>> I'm now thinking that it would be more convenient for the user if > >>> there was > >>> a transcripts method on a RangesList object, which could provide the > >>> necessary information. This could be extracted from the RangedData, > >>> which > >>> rtracklayer can create from your GFF file, with one caveat: if the > >>> GFF file > >>> contains a mixture of features (genes, exons, etc) and relies on the > >>> hierarchical features of GFF, it will take more work to get things > >>> into the > >>> right shape. > >>> > >>> The question then is where would this new functionality belong? The > >>> future > >>> of the GenomicFeatures package was a bit uncertain, the last time I > >>> checked. > >> > >> The intention is that GenomicFeatures will mature to contain these > >> sorts of data structures and functions, so that it's easy to create or > >> retrieve transcript or exon information. > > > > > I've actually been working on a package of my own that does a lot of > > this stuff, since I've been working lately with rna-seq data. > > > > It takes an input file similar to what you get from the ucsc table > > browser with gene, tx/cds -start/-end, exon, etc. columns and whips up > > an annotation package for the genome/annotation source you are using > > (info is stored in an SQLite database). You can then query the > > database for genes of interest by name, chromosome, or bounded (start/ > > end) chromosome position for further downstream use. > > > > Usually results are returned to the user as lists of gene objects, > > which the user would have to whip into a data.frame format if that's > > what's desired. > > > > I'll post some code below to show you what I mean, and see if you all > > think it's helpful. I'm not sure that it's ready for public > > consumption, but I guess I just wanted to mention that it was my > > intention to develop this a bit further in the near future. If it's > > useful for other people, all the better. > > Just wanted to acknowledge this email. I like how straight-forward > your interface is, and the use of appropriate classes. RTree and the > custom compile is a little problematic; I'm wondering if the custom compile is really necessary? Is it not possible for rsqlite to provide that feature by default? > I wonder whether it's really > necessary for data this size (i.e., not too large) or whether there > are other ways in which the query could be optimized (representing the > data in-memory and using IRanges::findOverlaps?). > > For optimizing in-memory queries, see the IntervalTree() function. Precomputing the interval tree brings substantial time savings for multiple queries on the same database. > What ever the case, it would be good to coordinate your activities > with what we do here in Seattle and with GenomicFeatures, so we'll try > and keep the channels of communication open. > > Martin > > > Unfortunately I need to put much of this on hold for the immediate > > future, since I have to prepare for my upcoming qualifying exam ... > > > > I realize it somehow does similar things to what the GenomicFeatures > > package might have done, but I needed this exact functionality for my > > immediate analysis needs, and thought I'd have a go at it. The SQLite > > database that's generated actually stores gene, transcript, and exon > > info in RTree indices[1], so ranged queries (like all genes/ > > transcripts between some start/stop positions) are super fast. It > > requires a custom compile of SQLite, though > > > > Examples of how I use it are pasted below. > > > > -steve > > > > [1] RTree: http://www.sqlite.org/rtree.html > > > > ############################## > > # Using with refseq annotations for hg18 > > > > R> library(GenomeAnnotations) # name of my library > > R> hg18.refseq <- GenomeDB("hg18", "refseq") > > R> d1r <- Gene("DICER1", genome=hg18.refseq) > > > > ## Getting transcript info > > > > R> transcripts(d1r) > > SimpleRangesList instance: > > $NM_030621 > > IRanges instance: > > start end width > > [1] 94622320 94626752 4433 > > [2] 94627120 94627200 81 > > [3] 94627296 94627456 161 > > [4] 94629976 94630248 273 > > [5] 94631912 94632800 889 > > [6] 94635872 94636024 153 > > [7] 94639432 94640216 785 > > [8] 94641160 94641336 177 > > [9] 94641768 94641872 105 > > ... ... ... ... > > [20] 94660288 94660760 473 > > [21] 94662672 94662840 169 > > [22] 94665560 94665720 161 > > [23] 94666144 94666280 137 > > [24] 94667600 94667728 129 > > [25] 94668608 94668768 161 > > [26] 94669408 94669592 185 > > [27] 94676752 94676848 97 > > [28] 94677776 94677808 33 > > > > <1 additional element> > > > > ## UTR Info > > R> getUtr3(d1r) > > $NM_030621 > > IRanges instance: > > start end width > > [1] 94622320 94626587 4268 > > > > $NM_177438 > > IRanges instance: > > start end width > > [1] 94622320 94626587 4268 > > > > R> getUtr5(d1r) > > $NM_030621 > > IRanges instance: > > start end width > > [1] 94669548 94669592 45 > > [2] 94676752 94676848 97 > > [3] 94677776 94677808 33 > > > > $NM_177438 > > IRanges instance: > > start end width > > [1] 94669548 94669592 45 > > [2] 94693320 94693512 193 > > > > > > ## Get genes on chromosome 1 > > R> genes.1 <- getGenesOnChromosome(hg18.refseq, 1) > > R> names(genes.1)[1:5] > > [1] "WASH5P" "LOC100288778" "FAM138F" "FAM138A" > > "FAM138C" > > > > ## Get genes on chr1 between pos 1e6 and 1e7 > > R> genes.1.sub <- getGenesOnChromosome(hg18.refseq, 1, start=1e6, > > end=1e7) > > R> names(genes.1.sub)[1:5] > > [1] "C1orf159" "TTLL10" "TTLL10" "TNFRSF18" "TNFRSF18" > > > > ######################################### > > # or, hg18 aceview annos > > R> hg18.ace <- GenomeDB("hg18", "aceview") > > R> d1a <- Gene("DICER1", genome=hg18.ace) > > R> transcripts(d1a) > > SimpleRangesList instance: > > $DICER1.cApr07 > > IRanges instance: > > start end width > > [1] 94622320 94626752 4433 > > [2] 94627120 94627200 81 > > [3] 94627296 94627456 161 > > [4] 94629976 94630248 273 > > [5] 94631912 94632800 889 > > [6] 94635872 94636024 153 > > [7] 94639432 94640216 785 > > [8] 94641160 94641336 177 > > [9] 94641768 94641872 105 > > ... ... ... ... > > [19] 94653712 94653840 129 > > [20] 94660288 94660760 473 > > [21] 94662672 94662840 169 > > [22] 94665560 94665720 161 > > [23] 94666144 94666280 137 > > [24] 94667600 94667728 129 > > [25] 94668608 94668768 161 > > [26] 94669408 94669592 185 > > [27] 94693320 94693512 193 > > > > <10 additional elements> > > > > R> getUtr3(d1a) > > $DICER1.cApr07 > > IRanges instance: > > start end width > > [1] 94622320 94626587 4268 > > > > $DICER1.bApr07 > > IRanges instance: > > start end width > > [1] 94622320 94626587 4268 > > > > .... etc ... > > -- > Martin Morgan > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > [[alternative HTML version deleted]]