[Bioc-devel] Incremental writing to HDF5 / DelayedMatrix

Francesco Napolitano franapoli at gmail.com
Thu Dec 21 14:11:16 CET 2017


That seems to solve my problem. I will try it this way, thank you very much.
Francesco

On Thu, Dec 21, 2017 at 1:16 PM, Martin Morgan
<martin.morgan at roswellpark.org> wrote:
> On 12/21/2017 06:22 AM, Francesco Napolitano wrote:
>>
>> Hi,
>>
>> I need to deal with very large matrices and I was thinking of using
>> HDF5-based data models. However, from the documentation and examples
>> that I have been looking at, I'm not quite sure how to do this.
>>
>> My use case is as follows.
>> I want to build a very large matrix one column at a time, and I need
>> to write columns directly to disk since I would otherwise run out of
>> memory. I need a format that, afterwards, will allow me to extract
>> subsets of rows or columns and rank them. The subsets will be small
>> enough to be loaded in memory. Can I achieve this with current HDF5
>> support in R?
>
>
> This is basically straightforward in rhdf5. The idea is to create a dataset
> large enough to contain all of your data:
>
>   library(rhdf5)
>   fl <- tempfile()
>   h5createFile(fl)
>
>   nrow <- 10000
>   ncol <- 100
>   h5createDataset(fl, "big", c(nrow, ncol), showWarnings = FALSE)
>
> then to fill it in chunks by specifying the start row / column you'd like
> to write to and the 'count' of data points to write in each direction:
>
>   chunk_ncol <- ncol / 10
>   j <- 1                           # which column to start writing?
>
>   while (j <= ncol) {
>     m <- matrix(seq(1, length.out = nrow * chunk_ncol), nrow)
>     h5write(m, fl, "big", start = c(1, j), count = c(nrow, chunk_ncol))
>     j <- j + chunk_ncol
>   }
>
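As a sketch of the original use case (columns produced one at a time), the loop above can be adapted to buffer a fixed number of columns in memory and flush each buffer with a single h5write() call. The names `make_column()` and `buffer_ncol` are hypothetical and only stand in for the real per-column computation and memory budget. Setting `chunk` to match the buffer shape is an assumption about a reasonable layout, not a tuned recommendation.

```r
library(rhdf5)

fl <- tempfile(fileext = ".h5")
h5createFile(fl)

nrow <- 1000
ncol <- 25          # deliberately not a multiple of buffer_ncol
buffer_ncol <- 10   # hypothetical: number of columns held in memory at once

# Hypothetical generator standing in for the real per-column computation.
make_column <- function(j) as.numeric(seq_len(nrow)) + (j - 1) * nrow

# Assumption: HDF5 chunks are aligned with the blocks written below.
h5createDataset(fl, "big", c(nrow, ncol), storage.mode = "double",
                chunk = c(nrow, buffer_ncol))

j <- 1
while (j <= ncol) {
  n <- min(buffer_ncol, ncol - j + 1)      # the last block may be narrower
  buf <- matrix(0, nrow, n)
  for (k in seq_len(n)) {
    buf[, k] <- make_column(j + k - 1)     # fill the buffer column by column
  }
  # One write per buffer: start at column j, write n columns.
  h5write(buf, fl, "big", start = c(1, j), count = c(nrow, n))
  j <- j + n
}

# Read the whole dataset back to check the result (small enough here).
full <- h5read(fl, "big")
```

Computing the block width with `min()` keeps the final write inside the dataset when `ncol` is not a multiple of the buffer size.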
> You can read arbitrary 'slabs':
>
>   h5read(fl, "big", start = c(1, 1), count = c(5, 5))
>   h5read(fl, "big", start = c(1, 9), count = c(5, 2))
>
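For the "extract a subset of rows and rank them" part of the question, h5read() also accepts an `index` list (one element per dimension, `NULL` meaning "all"), which allows non-contiguous rows or columns. The following is a minimal sketch on a small made-up dataset; the data, the selected `rows`, and the choice to rank within each row are illustrative assumptions.

```r
library(rhdf5)

fl <- tempfile(fileext = ".h5")
h5createFile(fl)

# Small made-up matrix standing in for the large on-disk dataset.
set.seed(1)
m <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)
h5write(m, fl, "big")

# Hypothetical selection of non-contiguous rows; NULL selects all columns.
rows <- c(3, 50, 120)
sub <- h5read(fl, "big", index = list(rows, NULL))

# Rank the values within each selected row. apply() returns the ranks
# as columns, so transpose to get one row of ranks per selected row.
ranks <- t(apply(sub, 1, rank))
```

Only the requested rows are returned to R, so `sub` stays small even when the dataset on disk is not.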
> Probably you don't want to write 1 column at a time, but as many columns as
> comfortably fit into memory. This minimizes the number of R function calls
> needed to write / read the data.
>
> The HDF5Array package provides an easy abstraction for reading (probably
> writing is possible too, but it might be easier to understand the building
> blocks first).
>
>> library(HDF5Array)
>> hdf <- HDF5Array(fl, "big")
>> hdf
> HDF5Matrix object of 10000 x 100 doubles:
>            [,1]   [,2]   [,3] ...  [,99] [,100]
>     [1,]      1  10001  20001   .  80001  90001
>     [2,]      2  10002  20002   .  80002  90002
>     [3,]      3  10003  20003   .  80003  90003
>     [4,]      4  10004  20004   .  80004  90004
>     [5,]      5  10005  20005   .  80005  90005
>      ...      .      .      .   .      .      .
>  [9996,]   9996  19996  29996   .  89996  99996
>  [9997,]   9997  19997  29997   .  89997  99997
>  [9998,]   9998  19998  29998   .  89998  99998
>  [9999,]   9999  19999  29999   .  89999  99999
> [10000,]  10000  20000  30000   .  90000 100000
>> hdf[1:5, 1:5]
> DelayedMatrix object of 5 x 5 doubles:
>       [,1]  [,2]  [,3]  [,4]  [,5]
> [1,]     1 10001 20001 30001 40001
> [2,]     2 10002 20002 30002 40002
> [3,]     3 10003 20003 30003 40003
> [4,]     4 10004 20004 30004 40004
> [5,]     5 10005 20005 30005 40005
>> as.matrix(hdf[1:5, 1:5])
>      [,1]  [,2]  [,3]  [,4]  [,5]
> [1,]    1 10001 20001 30001 40001
> [2,]    2 10002 20002 30002 40002
> [3,]    3 10003 20003 30003 40003
> [4,]    4 10004 20004 30004 40004
> [5,]    5 10005 20005 30005 40005
>> rowSums(hdf[1:5, 1:5])
> [1] 100005 100010 100015 100020 100025
>
> Martin
>
>>
>> Any help greatly appreciated.
>>
>> thank you,
>> Francesco