On 05/12/2011 09:42 PM, Michael Lawrence wrote:
>
>
> On Thu, May 12, 2011 at 10:35 AM, Marc Carlson <mcarlson@fhcrc.org>
> wrote:
>
>     Hi Michael,
>
>     That is an interesting idea.  I like the idea of having more data
>     be available via FeatureDb, and I especially like the idea of
>     having useful transformations of the data it provides.  But I am a
>     little confused about one part of what you are suggesting.  Is
>     there a reason why we would want to add a bunch of stuff to
>     basically re-implement what the database does instead of just
>     writing some simple methods to allow the import of these other
>     kinds of files?
>
>     I can think of advantages to keeping the data container type
>     consistent (providing that it is not proving burdensome), since
>     SQL allows joins to be made across databases thus allowing a
>     collection of data that has all been stored in this way to be
>     easily linked together as needed.  But what is the advantage of
>     making a bunch of classes and methods that will allow us to
>     pretend that our bam and vcf files are actually databases?
>
>
> I don't want to pretend that they are first-class databases. It would 
> just be nice to have a common interface around these various data 
> sources. The simplest interface would allow range-based (GRanges) 
> queries and return a GRanges as the result. Each data source will have 
> its own specific parameters, but those could be fields of the 
> particular Db class and held constant across multiple range queries.
>
> The driving use-case for this is visualization. The user is looking at 
> a particular region of a track. That track has an underlying data 
> source, often on disk. We just need to query for the features in that 
> region.
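
The common interface described above (a range query in, a set of ranges out, with source-specific parameters held as fields of the object) could be sketched roughly as follows. This is an illustration in Python using only the standard library, not Bioconductor code: the names Range, RangeSource, ListSource and SqliteSource, and the table layout, are all hypothetical.

```python
from __future__ import annotations

import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A genomic interval; stands in for one element of a GRanges."""
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # inclusive


class RangeSource(ABC):
    """Hypothetical common interface: takes a range, returns ranges."""

    @abstractmethod
    def query(self, which: Range) -> list[Range]:
        """Return every feature overlapping `which`."""


class ListSource(RangeSource):
    """Features held in memory, e.g. after parsing a small file."""

    def __init__(self, features):
        self.features = list(features)

    def query(self, which):
        return [f for f in self.features
                if f.chrom == which.chrom
                and f.start <= which.end and f.end >= which.start]


class SqliteSource(RangeSource):
    """Features in a SQLite table. The table name is a source-specific
    parameter, fixed at construction and reused across queries."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table  # assumed to be a trusted identifier

    def query(self, which):
        sql = (f"SELECT chrom, start_pos, end_pos FROM {self.table} "
               "WHERE chrom = ? AND start_pos <= ? AND end_pos >= ? "
               "ORDER BY start_pos")
        rows = self.conn.execute(sql, (which.chrom, which.end, which.start))
        return [Range(*row) for row in rows]


# Two different back ends answer the same region query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature (chrom TEXT, start_pos INTEGER, end_pos INTEGER)")
conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?)",
    [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)])

sources = [
    ListSource([Range("chr1", 150, 250), Range("chr1", 900, 950)]),
    SqliteSource(conn, "feature"),
]
region = Range("chr1", 180, 520)
hits = [source.query(region) for source in sources]
```

A viewer would only ever call query() on whatever source backs a track, which is the visualization use case described above.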

Hi Michael,

In that case, we will have to write something that can match the 
features in format "x" to a GRanges object regardless of 
implementation.  So wouldn't it be simpler to just write a single set of 
import and display functions?  IOW, why not just write a simple set of 
import methods so that we can import data from GFF, BAM, VCF, etc., 
such that they all end up in the same basic container?  I understand 
that we could achieve the same end result by also using a more complex 
class structure, but what is the upside to adding that additional 
complexity?  I think the upside to using a single container type on the 
back end is that these containers can all be accessed under the hood in 
a similar manner, joined together as needed and even indexed when 
appropriate.  And I worry that if we take the more complicated approach 
and someone then wants to do operations like this, they may have to 
"reinvent the wheel" first before they can proceed.  With a 
single container approach, it seems that we get the benefits of previous 
generations of database features plus whatever other operations we also 
enable with our GRanges infrastructure on top.  But with the more 
complex design we are forced to only use whatever things we have 
re-implemented in R.  If for some reason there are unforeseen 
bottlenecks, I worry that we might have less flexibility with the design 
you are proposing than if we just did something simpler.

This "two layers of access" approach has already come in handy on 
several occasions in the GenomicFeatures package and it was convenient 
on those occasions to be able to choose the method (direct database 
access vs access as a GRanges object) that gave better performance.  It 
was nice to have a couple of alternative access options to choose from.  
And I will grant you that so far, in most of those cases the operations 
have been very simple and so the quickest option was actually just to 
use the very fast GRanges operations.  ;)  But as workflows get more 
complicated, it is easy to imagine that this might not always be the 
case.  And indeed the future will probably eventually involve bringing 
together many different kinds of these data resources in 
compound ways, and that starts to look more and more like a job for a 
traditional relational database.  It might be nice to be able to 
leverage that when the time comes.
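
To make the "two layers of access" point concrete, here is a minimal sketch using Python's standard sqlite3 module rather than the GenomicFeatures code itself. The table and column names (transcript, snp, start_pos, end_pos) are made up for illustration. It attaches a second database to one connection, lets SQL join across the two, and then repeats the same overlap in client code after loading both tables into memory.

```python
import sqlite3

# One connection, two databases: "main" plus an attached "variants".
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS variants")

conn.execute("CREATE TABLE main.transcript "
             "(name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)")
conn.execute("CREATE TABLE variants.snp (id TEXT, chrom TEXT, pos INTEGER)")
conn.executemany("INSERT INTO main.transcript VALUES (?, ?, ?, ?)",
                 [("txA", "chr1", 100, 200), ("txB", "chr1", 500, 600)])
conn.executemany("INSERT INTO variants.snp VALUES (?, ?, ?)",
                 [("rs1", "chr1", 150), ("rs2", "chr1", 550),
                  ("rs3", "chr1", 300), ("rs4", "chr2", 150)])

# Layer 1: direct database access; the join spans both databases.
via_sql = conn.execute("""
    SELECT t.name, s.id
    FROM main.transcript AS t
    JOIN variants.snp AS s
      ON s.chrom = t.chrom
     AND s.pos BETWEEN t.start_pos AND t.end_pos
    ORDER BY t.name, s.id
""").fetchall()

# Layer 2: load both tables, then compute the same overlap in memory.
transcripts = conn.execute(
    "SELECT name, chrom, start_pos, end_pos FROM main.transcript").fetchall()
snps = conn.execute("SELECT id, chrom, pos FROM variants.snp").fetchall()
in_memory = sorted(
    (name, snp_id)
    for name, chrom, start, end in transcripts
    for snp_id, snp_chrom, pos in snps
    if snp_chrom == chrom and start <= pos <= end)
```

Both routes give the same pairs here; which one is faster would depend on data size and indexing, which is the flexibility being argued for.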

Curious to hear what you think about this.


   Marc


>
>     Also what would the purpose of a SequenceDb object be?  The name
>     is generic enough that I am unable to guess what you have in mind.
>
>
> A data source for sequences. Could be implemented with a BSgenome 
> object, or an indexed FASTA file, etc.
>
>
>      Marc
>
>
>
>
>     On 05/12/2011 06:08 AM, Michael Lawrence wrote:
>
>         Hi guys,
>
>         I was just looking at the FeatureDb class in GenomicFeatures.
>         I'm wondering if we couldn't abstract that from its SQLite
>         implementation. There are many other sources of features,
>         e.g., files like BAM, VCF and even BED. If these are indexed
>         properly, we could make fast queries against them. So what we
>         really need is a class, named something like FeatureDb, that
>         returns, for a given 'which' (as a bare minimum), a GRanges.
>
>         I could also imagine having proxy FeatureDb objects that
>         transform the data on the way. Like a FeatureDb that will
>         return the coverage, using another FeatureDb as a source.
>         Caching could be implemented as part of the base class. I'm
>         also wondering whether these should be reference classes.
>         Then if some "parent" FeatureDb is modified, the down-stream
>         objects can be informed of the change.
>
>         And a SequenceDb would be nice, too.
>
>         I'll write up a prototype in the MutableRanges package (in the
>         bioc repo), but I'll call it RangeDb to avoid conflicts for
>         now.
>
>         Michael
>
>         _______________________________________________
>         Bioc-devel@r-project.org mailing list
>         https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>


