[Bioc-devel] SQLite databases

Mon Jun 11 18:48:31 CEST 2007

Hi,

Francois Pepin <fpepin at cs.mcgill.ca> writes:
> I would personally very much appreciate something of the sort and I know
> several other of my collaborators would also.
>
> My personal favorite was the idea of a by-species package that would
> behave just like the chip annotation. To use EntrezID instead of the
> probe ids and to have all the xxxGO, xxxENZYME, xxxSYMBOL, etc.

This, in particular, is on the way.  We plan to have <what>EG.db
packages for <what> = human, mouse, and rat.  These will replace the
<what>LLMappings packages, be SQLite-based, and look as much as
possible like the standard chip packages in terms of the maps provided
and interface.

> On Mon, 2007-06-11 at 08:52 -0400, Sean Davis wrote:
>> Now that RSQLite and DBI are really beginning to merge with Bioconductor
>> tools, does it make sense to think about building data sources (SQLite
>> databases) as a base for further development?  As an example, might it
>> make sense to include all of the data available at the Entrez Gene ftp
>> site as a database file?  Does a repository of such database files (and
>> possibly supporting files) make sense?  Making such files is pretty
>> straightforward, but what makes the most sense for distribution?  A full
>> package with accessors, etc?  A simple sqlite file?  Something in between?
>> 
>> I may be asking questions for which the answers are already
>> known/decided, but it would be good to know anyway.

Our plan is to have all BioC annotation data packages be SQLite-based.
There is a package in devel called AnnotationDbi that implements an
interface for SQLite-based annotation packages, allowing them to be
used just like their environment-based cousins.  We are actively
working on this interface and on completing the set of SQLite-based
packages.

In the process of creating these packages, we are creating a new
package building pipeline where we generate larger intermediate DBs
from which the individual annotation packages are generated.  At least
in principle, these are along the lines of a SQLite DB containing data
from the Entrez Gene ftp site.
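
A minimal sketch of the "large intermediate DB, small per-package DB"
idea, written in Python with the standard sqlite3 module.  This is not
the actual Bioconductor build pipeline: the table names (gene_info,
probe_gene, probe_symbol), the chip names, the probe ids, and the
build_chip_db helper are all hypothetical and exist only for
illustration.

```python
import os
import sqlite3
import tempfile


def build_chip_db(big, chip, out_path):
    """Copy the rows for one chip from the intermediate DB into a new SQLite file."""
    # ATTACH creates the output file if it does not exist yet.
    big.execute("ATTACH DATABASE ? AS chipdb", (out_path,))
    try:
        big.execute(
            "CREATE TABLE chipdb.probe_symbol AS "
            "SELECT p.probe_id, g.symbol "
            "FROM probe_gene AS p "
            "JOIN gene_info AS g ON g.gene_id = p.gene_id "
            "WHERE p.chip = ?",
            (chip,),
        )
    finally:
        big.execute("DETACH DATABASE chipdb")


# Hypothetical intermediate DB; isolation_level=None means autocommit,
# so ATTACH/DETACH never run inside an open transaction.
big = sqlite3.connect(":memory:", isolation_level=None)
big.execute("CREATE TABLE gene_info (gene_id TEXT PRIMARY KEY, symbol TEXT)")
big.execute("CREATE TABLE probe_gene (chip TEXT, probe_id TEXT, gene_id TEXT)")
big.executemany(
    "INSERT INTO gene_info VALUES (?, ?)",
    [("7157", "TP53"), ("672", "BRCA1")],
)
big.executemany(
    "INSERT INTO probe_gene VALUES (?, ?, ?)",
    [
        ("chipA", "probe_1", "7157"),
        ("chipA", "probe_2", "672"),
        ("chipB", "probe_9", "7157"),
    ],
)

# Generate the small per-chip database file from the intermediate DB.
out_path = os.path.join(tempfile.mkdtemp(), "chipA.sqlite")
build_chip_db(big, "chipA", out_path)

# Reopen the generated file on its own, as a package would ship it.
small = sqlite3.connect(out_path)
chip_rows = small.execute(
    "SELECT probe_id, symbol FROM probe_symbol ORDER BY probe_id"
).fetchall()
small.close()
```

The generated file holds only the rows for one chip, so it can be
distributed on its own while the intermediate DB stays a build-time
artifact.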

Whether these intermediate DBs will be of use to others isn't clear to
me, but when our process gels a bit more, we will be happy to share
what we have.  Generally, I think it will be useful to distribute
SQLite DB versions of public annotation data since they:

   - support general SQL queries
   - work across platforms
   - can be accessed from just about any programming language
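
To illustrate the points above, here is a minimal sketch of querying
such a file from a language other than R, using Python's standard
sqlite3 module.  The gene_info table and its columns are an assumed
schema made up for this example, and an in-memory database stands in
for a distributed file.

```python
import sqlite3

# Assumed schema, for illustration only.  A real distributed file would be
# opened with sqlite3.connect("/path/to/file.sqlite") instead of ":memory:".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene_info (gene_id TEXT PRIMARY KEY, symbol TEXT, tax_id INTEGER)"
)
conn.executemany(
    "INSERT INTO gene_info VALUES (?, ?, ?)",
    [
        ("7157", "TP53", 9606),    # human
        ("672", "BRCA1", 9606),    # human
        ("22059", "Trp53", 10090), # mouse
    ],
)
conn.commit()


def symbols_for_taxon(conn, tax_id):
    """Return the gene symbols for one taxonomy id, sorted alphabetically."""
    rows = conn.execute(
        "SELECT symbol FROM gene_info WHERE tax_id = ? ORDER BY symbol",
        (tax_id,),
    )
    return [row[0] for row in rows]


human_symbols = symbols_for_taxon(conn, 9606)
```

The same SELECT statement would work unchanged from any other language
with SQLite bindings.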

But in terms of making things easily accessible to Bioconductor users,
simply making a SQLite DB file available is not, in general, going to
be enough.  If we want users to be able to access the data without
writing SQL, then we will need careful study of the DB schema and
interface classes that provide alternate query mechanisms.
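
As a rough sketch of what such an interface class could look like,
here is a read-only, dictionary-like wrapper around one SQLite table,
written in Python.  It is an analogy only, not AnnotationDbi's API: the
SqliteMap class, the gene_symbol table, and its columns are
hypothetical.

```python
import sqlite3
from collections.abc import Mapping


class SqliteMap(Mapping):
    """Read-only, dict-like view of a key -> value lookup stored in SQLite."""

    def __init__(self, conn, table, key_col, value_col):
        # Table and column names are interpolated into the SQL text, so they
        # must come from trusted code, never from user input.
        self._conn = conn
        self._table = table
        self._key = key_col
        self._value = value_col

    def __getitem__(self, key):
        row = self._conn.execute(
            f"SELECT {self._value} FROM {self._table} WHERE {self._key} = ?",
            (key,),
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __iter__(self):
        cursor = self._conn.execute(
            f"SELECT {self._key} FROM {self._table} ORDER BY {self._key}"
        )
        for (key,) in cursor:
            yield key

    def __len__(self):
        return self._conn.execute(
            f"SELECT COUNT(*) FROM {self._table}"
        ).fetchone()[0]


# Hypothetical table standing in for one annotation map.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_symbol (gene_id TEXT PRIMARY KEY, symbol TEXT)")
conn.executemany(
    "INSERT INTO gene_symbol VALUES (?, ?)",
    [("7157", "TP53"), ("672", "BRCA1")],
)
conn.commit()

symbol_map = SqliteMap(conn, "gene_symbol", "gene_id", "symbol")
tp53 = symbol_map["7157"]        # lookup by key, no SQL written by the user
missing = symbol_map.get("0")    # Mapping supplies get(), "in", keys(), items()
```

The user sees ordinary key lookups; the schema stays an implementation
detail that can change without breaking calling code.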

Best,

+ seth

-- 
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org