[BioC] Querying a HDF5 database from R

Fri Jan 17 19:32:12 CET 2014

Dear James!

On 08.01.2014, at 19:19, James at embl-heidelberg.de wrote:
> I have a database (22 GB) in SQLite that I query from R for numerical analysis. I'm considering converting the database to HDF5 for faster read times (because reading the population is slow). I have two questions about the rhdf5 package that I haven't been able to figure out from my own experimenting.
> 
> (i) Suppose that I save an R dataframe to a HDF file. Is it possible to read subsets of the dataframe based on variable names and variable values? Often, I don't won't to read the full dataframe into memory (~ 100 million observations and ~ 30 variables).
HDF5 does not support the R-specific structure 'data.frame'. In the standard settings,
data.frame's are stored as an HDF5 compound datatype. Reading compound data is
very inefficient. You should consider storing your data as arrays. This will provide you
with the most efficient reading and writing methods. And you are able to subset these
arrays. Subsetting is currently not supported for data.frames/Compound data types.

Subsetting with variable names is not part of rhdf5/HDF5. You can only subset using
Integers. You have to organize the mapping from names to indices yourself.

> (ii) I frequently use indexes in my SQLite database to quickly join related tables. Does rhdf5 have a similar feature? If not, will converting to a HDF5 database create substantial bottlenecks if I rely on these joins frequently?
HDF5 is not a database management system as is SQLite. If you need a 'join', you
should consider SQLite or other, more efficient database management systems.

> Thanks so much for your help.
> 
> -- output of sessionInfo(): 
> 
> N/A
> 
> --
> Sent via the guest posting facility at bioconductor.org.