[Bioc-sig-seq] GenomicFeatures package

Seth Falcon sfalcon at fhcrc.org
Mon Nov 2 06:47:08 CET 2009


On 10/31/09 4:14 PM, Michael Lawrence wrote:
> On Sat, Oct 31, 2009 at 2:15 PM, Martin Morgan<mtmorgan at fhcrc.org>  wrote:
>
>> Hi Steve --
>>
>> Steve Lianoglou<mailinglist.honeypot at gmail.com>  writes:
>>
>>> Hi,
>>>
>>> On Oct 28, 2009, at 3:44 PM, Martin Morgan wrote:
>>> <snip>
>>>>>>
>>>>> It could be useful, but you'll need to get it into the shape
>>>>> expected by the
>>>>> GenomicFeatures functions. With specific regard to transcripts(),
>>>>> it will
>>>>> need a 'chrom' column with the chromosome names, 'name' column for
>>>>> the gene
>>>>> name, and 'txStart' and 'txEnd' for the start and end of the
>>>>> transcripts,
>>>>> using UCSC coordinate conventions.
>>>>>
>>>>> I'm now thinking that it would be more convenient for the user if
>>>>> there was
>>>>> a transcripts method on a RangesList object, which could provide the
>>>>> necessary information. This could be extracted from the RangedData,
>>>>> which
>>>>> rtracklayer can create from your GFF file, with one caveat: if the
>>>>> GFF file
>>>>> contains a mixture of features (genes, exons, etc) and relies on the
>>>>> hierarchical features of GFF, it will take more work to get things
>>>>> into the
>>>>> right shape.
>>>>>
>>>>> The question then is where would this new functionality belong? The
>>>>> future
>>>>> of the GenomicFeatures package was a bit uncertain, the last time I
>>>>> checked.
>>>>
>>>> The intention is that GenomicFeatures will mature to contain these
>>>> sorts of data structures and functions, so that it's easy to create or
>>>> retrieve transcript or exon information.
>>> </snip>
>>
>>> I've actually been working on a package of my own that does a lot of
>>> this stuff, since I've been working lately with rna-seq data.
>>>
>>> It takes an input file similar to what you get from the ucsc table
>>> browser with gene, tx/cds -start/-end, exon, etc. columns and whips up
>>> an annotation package for the genome/annotation source you are using
>>> (info is stored in an SQLite database). You can then query the
>>> database for genes of interest by name, chromosome, or bounded (start/
>>> end) chromosome position for further downstream use.
>>>
>>> Usually results are returned to the user as lists of gene objects,
>>> which the user would have to whip into a data.frame format if that's
>>> what's desired.
>>>
>>> I'll post some code below to show you what I mean, and see if you all
>>> think it's helpful. I'm not sure that it's ready for public
>>> consumption, but I guess I just wanted to mention that it was my
>>> intention to develop this a bit further in the near future. If it's
>>> useful for other people, all the better.
>>
>> Just wanted to acknowledge this email. I like how straight-forward
>> your interface is, and the use of appropriate classes. RTree and the
>> custom compile is a little problematic;
>
>
> I'm wondering if the custom compile is really necessary? Is it not possible
> for rsqlite to provide that feature by default?

As far as I know, recent versions of RSQLite have the RTree support 
enabled.  The following, taken from the RTree documentations 
(http://www.sqlite.org/rtree.html), seems to work (RSQLite_0.7-3 DBI_0.2-4):


library("RSQLite")
sql = "
CREATE VIRTUAL TABLE demo_index USING rtree(
    id,              -- Integer primary key
    minX, maxX,      -- Minimum and maximum X coordinate
    minY, maxY       -- Minimum and maximum Y coordinate
)
"

db <- dbConnect(SQLite())
dbGetQuery(db, sql)

ins1 <- "
INSERT INTO demo_index VALUES(
     1,                   -- Primary key
     -80.7749, -80.7747,  -- Longitude range
     30.3776, 30.3778     -- Latitude range
)
"

ins2 <- "
INSERT INTO demo_index VALUES(
     2,
     -81.0, -79.6,
     35.0, 36.2
)
"

dbGetQuery(db, ins1)
dbGetQuery(db, ins2)

sqlq <- "SELECT id FROM demo_index WHERE minX>=-81.08 AND maxX<=-80.58 
AND minY>=30.00  AND maxY<=35.44"

dbGetQuery(db, sqlq)


In any case, it would be great to continue the conversation.

+ seth



More information about the Bioc-sig-sequencing mailing list