[Bioc-devel] range-directed metadata management

Thu Jul 10 23:16:55 CEST 2014

Hi,

On Thu, Jul 10, 2014 at 1:52 PM, Vincent Carey
<stvjc at channing.harvard.edu> wrote:
> a new, more inclusive GWAS catalog is available (GRASP, from Andrew Johnson
> at NHLBI), with 6 million records and voluminous metadata (though it seems
> sparse and perhaps can be trimmed/reshaped)
>
> i made a GRanges and it takes 3 minutes to load.  even after stripping all
> the metadata, a GRanges with 6 million records takes 20 seconds to load.
> that's probably acceptable, but a managed chromosome-specific distribution
> might be closer to interactive availability.
>
> the metadata probably would be best kept in SQLite.  it occurred to me to
> consider an arrangement in which we have the GRanges managing the ranges
> and a key to the database.  range operations can engender queries to
> retrieve metadata, metadata queries in the db can generate indices to
> retrieve matching ranges.
>
> is anyone doing something along these lines?

You might consider just stuffing it all in the database.

SQLite supports R-trees (a spatial index), so in theory you could get
fast overlap queries baked in, without needing a parallel GRanges
object to index into the database:
http://www.sqlite.org/rtree.html
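
To make that concrete, here is a rough sketch of the idea, in Python
only because its standard library ships an SQLite binding. It assumes
the SQLite build has the R*Tree module compiled in. The table names
(meta, pos), the column names, the helper functions and the toy
records are all made up for illustration and have nothing to do with
the actual GRASP schema. It uses rtree_i32 because the default rtree
stores coordinates as 32-bit floats, which cannot represent large
genomic positions exactly. Chromosome is kept in the metadata table
and filtered in the join, which is the simplest thing that works
rather than the fastest.

```python
import sqlite3


def build_catalog(records):
    """Load (chrom, start, end, trait, pvalue) tuples into an in-memory DB.

    Hypothetical layout: 'meta' holds the annotation columns and 'pos' is
    a 1-D integer R*Tree holding one closed interval [lo, hi] per record.
    The two tables share the same integer id.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE meta ("
        "id INTEGER PRIMARY KEY, chrom TEXT, trait TEXT, pvalue REAL)"
    )
    # rtree_i32 keeps coordinates as 32-bit integers instead of floats.
    con.execute("CREATE VIRTUAL TABLE pos USING rtree_i32(id, lo, hi)")
    for i, (chrom, start, end, trait, pvalue) in enumerate(records, start=1):
        con.execute(
            "INSERT INTO meta VALUES (?, ?, ?, ?)", (i, chrom, trait, pvalue)
        )
        con.execute("INSERT INTO pos VALUES (?, ?, ?)", (i, start, end))
    con.commit()
    return con


def overlaps(con, chrom, start, end):
    """Range query -> metadata: records overlapping [start, end] on chrom."""
    sql = """
        SELECT m.chrom, p.lo, p.hi, m.trait, m.pvalue
        FROM pos AS p
        JOIN meta AS m ON m.id = p.id
        WHERE p.lo <= ? AND p.hi >= ? AND m.chrom = ?
        ORDER BY p.lo
    """
    # Two closed intervals overlap when each starts before the other ends.
    return con.execute(sql, (end, start, chrom)).fetchall()


def ranges_for_trait(con, trait):
    """Metadata query -> ranges: coordinates of all records for a trait."""
    sql = """
        SELECT m.chrom, p.lo, p.hi
        FROM meta AS m
        JOIN pos AS p ON p.id = m.id
        WHERE m.trait = ?
        ORDER BY m.chrom, p.lo
    """
    return con.execute(sql, (trait,)).fetchall()


if __name__ == "__main__":
    # Toy records, invented for this example only.
    toy = [
        ("chr1", 100, 100, "height", 1e-9),
        ("chr1", 250, 250, "bmi", 3e-8),
        ("chr2", 120, 120, "height", 5e-10),
    ]
    con = build_catalog(toy)
    print(overlaps(con, "chr1", 50, 200))
    print(ranges_for_trait(con, "height"))
```

The two helpers cover both directions you mentioned: a range
operation that pulls back metadata, and a metadata query that pulls
back the matching ranges.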

Before the reboot of the GenomicFeatures package (we're talking around
2008/2009?) I was doing something like that for genomic annotations.

Hadley has abstracted db access in dplyr so that a database looks
like a data.frame and responds to all the "data manipulation verbs"
in the same way. That makes me think we could do the same: make the
database look essentially like a GRanges / VRanges object and get
cooking that way.

Hopefully this answer was at least minimally aligned in the direction
of what you were asking ;-)

-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech