[Bioc-devel] Incremental writing to HDF5 / DelayedMatrix

Martin Morgan martin.morgan at roswellpark.org
Thu Dec 21 13:16:41 CET 2017


On 12/21/2017 06:22 AM, Francesco Napolitano wrote:
> Hi,
> 
> I need to deal with very large matrices and I was thinking of using
> HDF5-based data models. However, from the documentation and examples
> that I have been looking at, I'm not quite sure how to do this.
> 
> My use case is as follows.
> I want to build a very large matrix one column at a time, and I need
> to write columns directly to disk since I would otherwise run out of
> memory. I need a format that, afterwards, will allow me to extract
> subsets of rows or columns and rank them. The subsets will be small
> enough to be loaded in memory. Can I achieve this with current HDF5
> support in R?

This is basically straightforward in rhdf5. The idea is to create a 
dataset large enough to hold all of your data

   library(rhdf5)
   fl <- tempfile()
   h5createFile(fl)

   nrow <- 10000
   ncol <- 100
   h5createDataset(fl, "big", c(nrow, ncol), showWarnings = FALSE)

then to fill it in chunks by specifying the row / column at which each 
write should start, and the 'count' of data points to write in each 
direction

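   chunk_ncol <- ncol / 10          # columns per chunk; assumes ncol is a multiple of 10
   j <- 1                           # which column to start writing?

   while (j <= ncol) {
     # example data for this chunk: nrow x chunk_ncol
     m <- matrix(seq(1, length.out = nrow * chunk_ncol), nrow)
     h5write(m, fl, "big", start = c(1, j), count = c(nrow, chunk_ncol))
     j <- j + chunk_ncol            # first column of the next chunk
   }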
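
If the number of columns is not a multiple of the chunk size, the last chunk has to be narrower. A minimal sketch of one way to handle that, assuming the same rhdf5 calls as above; the sizes and the fill values (each column holds its own column index) are made up for illustration:

```r
library(rhdf5)

fl <- tempfile(fileext = ".h5")
h5createFile(fl)

nrow <- 1000
ncol <- 25                 # deliberately not a multiple of chunk_ncol
chunk_ncol <- 10
h5createDataset(fl, "big", c(nrow, ncol), showWarnings = FALSE)

for (j in seq(1, ncol, by = chunk_ncol)) {
  n <- min(chunk_ncol, ncol - j + 1)     # columns in this chunk
  # stand-in data: every value in a column equals its column index
  vals <- as.numeric(rep(seq(j, length.out = n), each = nrow))
  m <- matrix(vals, nrow = nrow)
  h5write(m, fl, "big", start = c(1, j), count = c(nrow, n))
}

res <- h5read(fl, "big")   # small enough here to read back whole
dim(res)
```

The only change from the loop above is that 'count' is computed per chunk instead of being fixed.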

You can read arbitrary 'slabs'

   h5read(fl, "big", start = c(1, 1), count = c(5, 5))
   h5read(fl, "big", start = c(1, 9), count = c(5, 2))
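
For rows or columns that are not contiguous, h5read() also accepts an 'index' argument: a list with one element per dimension, where NULL selects everything along that dimension. A small sketch on a toy dataset (the dataset name "small" and the chosen rows / columns are only for illustration):

```r
library(rhdf5)

fl <- tempfile(fileext = ".h5")
h5createFile(fl)

m <- matrix(as.numeric(1:200), nrow = 20)    # 20 x 10 toy matrix
h5write(m, fl, "small")

rows <- c(2, 5, 11)
cols <- c(1, 4, 10)

# selected rows and columns only
sub <- h5read(fl, "small", index = list(rows, cols))

# all rows, selected columns
sub_cols <- h5read(fl, "small", index = list(NULL, cols))
```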

Probably you don't want to write 1 column at a time, but as many columns 
as comfortably fit into memory. This minimizes the number of R function 
calls needed to write / read the data.
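
As a rough way to pick the chunk width: a column of doubles takes about 8 bytes per row, so a memory budget translates into a number of columns. The 100 MiB budget below is an arbitrary assumption, and R will need some extra working memory on top of the raw data:

```r
nrow <- 10000
bytes_per_column <- nrow * 8           # a double takes 8 bytes
mem_budget <- 100 * 1024^2             # assumed budget: 100 MiB per chunk

# number of whole columns that fit in the budget
chunk_ncol <- floor(mem_budget / bytes_per_column)
chunk_ncol
```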

The HDF5Array package provides an easy abstraction for reading (writing 
is probably possible too, but it might be easier to understand the 
building blocks first).

 > library(HDF5Array)
 > hdf <- HDF5Array(fl, "big")
 > hdf
HDF5Matrix object of 10000 x 100 doubles:
            [,1]   [,2]   [,3] ...  [,99] [,100]
     [1,]      1  10001  20001   .  80001  90001
     [2,]      2  10002  20002   .  80002  90002
     [3,]      3  10003  20003   .  80003  90003
     [4,]      4  10004  20004   .  80004  90004
     [5,]      5  10005  20005   .  80005  90005
      ...      .      .      .   .      .      .
  [9996,]   9996  19996  29996   .  89996  99996
  [9997,]   9997  19997  29997   .  89997  99997
  [9998,]   9998  19998  29998   .  89998  99998
  [9999,]   9999  19999  29999   .  89999  99999
[10000,]  10000  20000  30000   .  90000 100000
 > hdf[1:5, 1:5]
DelayedMatrix object of 5 x 5 doubles:
       [,1]  [,2]  [,3]  [,4]  [,5]
[1,]     1 10001 20001 30001 40001
[2,]     2 10002 20002 30002 40002
[3,]     3 10003 20003 30003 40003
[4,]     4 10004 20004 30004 40004
[5,]     5 10005 20005 30005 40005
 > as.matrix(hdf[1:5, 1:5])
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,]    1 10001 20001 30001 40001
[2,]    2 10002 20002 30002 40002
[3,]    3 10003 20003 30003 40003
[4,]    4 10004 20004 30004 40004
[5,]    5 10005 20005 30005 40005
 > rowSums(hdf[1:5, 1:5])
[1] 100005 100010 100015 100020 100025
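
For the 'extract a subset and rank it' part of the question, one possible sketch is to realize only the columns of interest as an ordinary matrix and rank within each column using base R. The random data, the dataset size and the chosen columns are assumptions for illustration:

```r
library(rhdf5)
library(HDF5Array)

fl <- tempfile(fileext = ".h5")
h5createFile(fl)

set.seed(1)
m <- matrix(rnorm(50 * 8), nrow = 50)    # toy data standing in for the big matrix
h5write(m, fl, "big")

hdf <- HDF5Array(fl, "big")

cols <- c(2, 5, 7)
sub <- as.matrix(hdf[, cols])            # only these columns are loaded into memory
ranks <- apply(sub, 2, rank)             # rank the values within each column
head(ranks)
```

The same pattern works for a subset of rows, e.g. as.matrix(hdf[1:5, ]), as long as the subset fits in memory.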

Martin

> 
> Any help greatly appreciated.
> 
> thank you,
> Francesco
> 

