[Bioc-devel] Incremental writing to HDF5 / DelayedMatrix
Martin Morgan
martin.morgan at roswellpark.org
Thu Dec 21 13:16:41 CET 2017
On 12/21/2017 06:22 AM, Francesco Napolitano wrote:
> Hi,
>
> I need to deal with very large matrices and I was thinking of using
> HDF5-based data models. However, from the documentation and examples
> that I have been looking at, I'm not quite sure how to do this.
>
> My use case is as follows.
> I want to build a very large matrix one column at a time, and I need
> to write columns directly to disk since I would otherwise run out of
> memory. I need a format that, afterwards, will allow me to extract
> subsets of rows or columns and rank them. The subsets will be small
> enough to be loaded in memory. Can I achieve this with current HDF5
> support in R?
This is basically straightforward in rhdf5. The idea is to create a
dataset large enough to hold all of your data
library(rhdf5)
fl <- tempfile()
h5createFile(fl)
nrow <- 10000
ncol <- 100
h5createDataset(fl, "big", c(nrow, ncol), showWarnings = FALSE)
then to fill it in chunks by specifying the start row / column you'd
like to write to and the 'count' of data points you'd like to write in
each direction
chunk_ncol <- ncol / 10
j <- 1 # first column of the next chunk to write
while (j <= ncol) { # '<=' so the last column is written even if chunk_ncol is 1
    m <- matrix(seq(1, length.out = nrow * chunk_ncol), nrow)
    h5write(m, fl, "big", start = c(1, j), count = c(nrow, chunk_ncol))
    j <- j + chunk_ncol
}
You can read arbitrary 'slabs'
h5read(fl, "big", start = c(1, 1), count = c(5, 5))
h5read(fl, "big", start = c(1, 9), count = c(5, 2))
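For the "extract a subset and rank it" part of the question, here is a minimal sketch. It relies on the 'index' argument of h5read() (a list with one element per dimension, NULL meaning "everything"; it is an alternative to start / count) to pull out non-contiguous rows. The file 'fl2', its dimensions, and the choice of rows are made up for illustration.

```r
library(rhdf5)

# Small stand-in file so the example runs on its own (illustrative sizes).
fl2 <- tempfile(fileext = ".h5")
h5createFile(fl2)
h5createDataset(fl2, "big", c(100, 20), showWarnings = FALSE)
h5write(matrix(as.numeric(seq_len(100 * 20)), 100), fl2, "big")

# Read three non-contiguous rows and all columns into memory.
rows <- c(3, 50, 97)
sub <- h5read(fl2, "big", index = list(rows, NULL))

# The subset is an ordinary matrix, so base R ranking applies;
# here each column of the subset is ranked separately.
ranked <- apply(sub, 2, rank)
```

Only the requested rows are returned to R, so 'sub' stays small regardless of how large the on-disk dataset is.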
Probably you don't want to write 1 column at a time, but as many columns
as comfortably fit into memory. This minimizes the number of R function
calls needed to write / read the data.
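If the columns really are produced one at a time, one way to follow this advice is to collect them in an in-memory buffer and write the buffer whenever it fills up. The sketch below assumes a hypothetical make_column() standing in for whatever computes a column, and a buffer size (chunk_ncol) chosen by hand; neither comes from rhdf5.

```r
library(rhdf5)

nrow <- 1000
ncol <- 25
chunk_ncol <- 10   # columns held in memory per write (assumed budget)

fl3 <- tempfile(fileext = ".h5")
h5createFile(fl3)
h5createDataset(fl3, "big", c(nrow, ncol), showWarnings = FALSE)

# Hypothetical stand-in for the real per-column computation.
make_column <- function(j) rep(as.numeric(j), nrow)

buffer <- matrix(0, nrow, chunk_ncol)
n_buffered <- 0L
for (j in seq_len(ncol)) {
    n_buffered <- n_buffered + 1L
    buffer[, n_buffered] <- make_column(j)
    # Flush when the buffer is full or the last column has been produced.
    if (n_buffered == chunk_ncol || j == ncol) {
        first <- j - n_buffered + 1L   # dataset column where this chunk starts
        h5write(buffer[, seq_len(n_buffered), drop = FALSE], fl3, "big",
                start = c(1, first), count = c(nrow, n_buffered))
        n_buffered <- 0L
    }
}
```

The final flush handles a partial buffer, so ncol does not need to be a multiple of chunk_ncol.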
The HDF5Array package provides an easy abstraction for reading (probably
writing is possible too, but it might be easier to understand the
building blocks first).
> library(HDF5Array)
> hdf <- HDF5Array(fl, "big")
> hdf
HDF5Matrix object of 10000 x 100 doubles:
           [,1]   [,2]   [,3] ...  [,99] [,100]
    [1,]      1  10001  20001   .  80001  90001
    [2,]      2  10002  20002   .  80002  90002
    [3,]      3  10003  20003   .  80003  90003
    [4,]      4  10004  20004   .  80004  90004
    [5,]      5  10005  20005   .  80005  90005
     ...      .      .      .   .      .      .
 [9996,]   9996  19996  29996   .  89996  99996
 [9997,]   9997  19997  29997   .  89997  99997
 [9998,]   9998  19998  29998   .  89998  99998
 [9999,]   9999  19999  29999   .  89999  99999
[10000,]  10000  20000  30000   .  90000 100000
> hdf[1:5, 1:5]
DelayedMatrix object of 5 x 5 doubles:
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,]     1 10001 20001 30001 40001
[2,]     2 10002 20002 30002 40002
[3,]     3 10003 20003 30003 40003
[4,]     4 10004 20004 30004 40004
[5,]     5 10005 20005 30005 40005
> as.matrix(hdf[1:5, 1:5])
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,]     1 10001 20001 30001 40001
[2,]     2 10002 20002 30002 40002
[3,]     3 10003 20003 30003 40003
[4,]     4 10004 20004 30004 40004
[5,]     5 10005 20005 30005 40005
> rowSums(hdf[1:5, 1:5])
[1] 100005 100010 100015 100020 100025
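On the writing side, HDF5Array exports writeHDF5Array(), which saves an array-like object to an HDF5 file and returns an HDF5-backed object pointing at it. The sketch below uses a small matrix, with a file and dataset name ('fl4', "small") invented for the example. It illustrates the round trip, not incremental writing: for data that never fits in memory, the chunked rhdf5 approach above still applies.

```r
library(HDF5Array)

m <- matrix(as.numeric(1:200), nrow = 20)   # 20 x 10 in-memory matrix
fl4 <- tempfile(fileext = ".h5")

# Write to disk; the return value is backed by the new file.
hdf2 <- writeHDF5Array(m, filepath = fl4, name = "small")

# Subsetting is delayed; as.matrix() realizes just the subset in memory.
sub <- as.matrix(hdf2[c(2, 5, 11), ])

# Rank within each row of the subset (apply() returns one column per row,
# hence the transpose).
ranked <- t(apply(sub, 1, rank))
```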
Martin
>
> Any help greatly appreciated.
>
> thank you,
> Francesco