[Bioc-devel] Incremental writing to HDF5 / DelayedMatrix
Francesco Napolitano
franapoli at gmail.com
Thu Dec 21 14:11:16 CET 2017
That seems to solve my problem. I will try it this way; thank you very much.
Francesco
On Thu, Dec 21, 2017 at 1:16 PM, Martin Morgan
<martin.morgan at roswellpark.org> wrote:
> On 12/21/2017 06:22 AM, Francesco Napolitano wrote:
>>
>> Hi,
>>
>> I need to deal with very large matrices and I was thinking of using
>> HDF5-based data models. However, from the documentation and examples
>> that I have been looking at, I'm not quite sure how to do this.
>>
>> My use case is as follows.
>> I want to build a very large matrix one column at a time, and I need
>> to write columns directly to disk since I would otherwise run out of
>> memory. I need a format that, afterwards, will allow me to extract
>> subsets of rows or columns and rank them. The subsets will be small
>> enough to be loaded in memory. Can I achieve this with current HDF5
>> support in R?
>
>
> This is basically straightforward in rhdf5. The idea is to create a dataset
> large enough to contain your total data
>
> library(rhdf5)
> fl <- tempfile()
> h5createFile(fl)
>
> nrow <- 10000
> ncol <- 100
> h5createDataset(fl, "big", c(nrow, ncol), showWarnings = FALSE)
>
> then to fill it in chunks by specifying the start row / column you'd like to
> write to and the 'count' of data points you'd like to write in each
> direction
>
> chunk_ncol <- ncol / 10
> j <- 1 # first column of the next chunk to write
>
> while (j <= ncol) { # '<=' so the last column is reached even if chunk_ncol is 1
>     m <- matrix(seq(1, length.out = nrow * chunk_ncol), nrow)
>     h5write(m, fl, "big", start = c(1, j), count = c(nrow, chunk_ncol))
>     j <- j + chunk_ncol
> }
>
> You can read arbitrary 'slabs'
>
> h5read(fl, "big", start = c(1, 1), count = c(5, 5))
> h5read(fl, "big", start = c(1, 9), count = c(5, 2))
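
For the "extract a subset and rank it" part of the original question, a minimal sketch follows. It assumes the subset fits in memory and that "rank" means ranking values within each selected column; `rank_subset` is a hypothetical helper, not part of rhdf5. It uses the `index` argument of `h5read()` rather than `start` / `count`, so the selected rows and columns do not have to be contiguous.

```r
# Hypothetical helper (not part of rhdf5): read a subset of rows / columns
# from a 2-D HDF5 dataset and rank the values within each selected column.
# Assumes the rhdf5 package is installed and the subset fits in memory.
rank_subset <- function(file, name, rows = NULL, cols = NULL) {
  # index = list(rows, cols); a NULL entry selects everything in that dimension
  block <- rhdf5::h5read(file, name, index = list(rows, cols))
  ranks <- apply(block, 2, rank)   # rank within each column of the subset
  dim(ranks) <- dim(block)         # keep matrix shape even for a single row
  ranks
}
```

With the file created above, `rank_subset(fl, "big", rows = c(1, 50, 9000), cols = 1:3)` would return a 3 x 3 matrix of within-column ranks.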
>
> Probably you don't want to write 1 column at a time, but as many columns as
> comfortably fit into memory. This minimizes the number of R function calls
> needed to write / read the data.
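
One way to pick "as many columns as comfortably fit into memory" is to derive the chunk width from a memory budget. The sketch below is only an illustration: `plan_col_chunks` and the 64 MB default are assumptions, and it presumes the data are doubles (8 bytes each). It also covers the case where the number of columns is not a multiple of the chunk width, which the loop above does not.

```r
# Hypothetical helper (not part of rhdf5): plan column chunks so that each
# chunk of doubles stays within a memory budget given in bytes.
plan_col_chunks <- function(n_row, n_col, max_bytes = 64 * 1024^2) {
  bytes_per_col <- 8 * n_row                     # one double takes 8 bytes
  chunk_ncol <- max(1, min(n_col, floor(max_bytes / bytes_per_col)))
  start <- seq(1, n_col, by = chunk_ncol)        # first column of each chunk
  count <- pmin(chunk_ncol, n_col - start + 1)   # last chunk may be narrower
  data.frame(start = start, count = count)
}

# Example: 10000 rows, 105 columns, budget for 10 columns at a time
chunks <- plan_col_chunks(n_row = 10000, n_col = 105,
                          max_bytes = 10 * 8 * 10000)

# Each row of 'chunks' then supplies the column part of one h5write() call:
# for (i in seq_len(nrow(chunks))) {
#   m <- ...  # a 10000 x chunks$count[i] matrix for this chunk
#   h5write(m, fl, "big", start = c(1, chunks$start[i]),
#           count = c(10000, chunks$count[i]))
# }
```

Here `chunks` has 11 rows: ten chunks of 10 columns and a final chunk of 5 columns starting at column 101.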
>
> The HDF5Array package provides an easy abstraction for reading (probably
> writing is possible too, but it might be easier to understand the building
> blocks first).
>
>> library(HDF5Array)
>> hdf <- HDF5Array(fl, "big")
>> hdf
> HDF5Matrix object of 10000 x 100 doubles:
> [,1] [,2] [,3] ... [,99] [,100]
> [1,] 1 10001 20001 . 80001 90001
> [2,] 2 10002 20002 . 80002 90002
> [3,] 3 10003 20003 . 80003 90003
> [4,] 4 10004 20004 . 80004 90004
> [5,] 5 10005 20005 . 80005 90005
> ... . . . . . .
> [9996,] 9996 19996 29996 . 89996 99996
> [9997,] 9997 19997 29997 . 89997 99997
> [9998,] 9998 19998 29998 . 89998 99998
> [9999,] 9999 19999 29999 . 89999 99999
> [10000,] 10000 20000 30000 . 90000 100000
>> hdf[1:5, 1:5]
> DelayedMatrix object of 5 x 5 doubles:
> [,1] [,2] [,3] [,4] [,5]
> [1,] 1 10001 20001 30001 40001
> [2,] 2 10002 20002 30002 40002
> [3,] 3 10003 20003 30003 40003
> [4,] 4 10004 20004 30004 40004
> [5,] 5 10005 20005 30005 40005
>> as.matrix(hdf[1:5, 1:5])
> [,1] [,2] [,3] [,4] [,5]
> [1,] 1 10001 20001 30001 40001
> [2,] 2 10002 20002 30002 40002
> [3,] 3 10003 20003 30003 40003
> [4,] 4 10004 20004 30004 40004
> [5,] 5 10005 20005 30005 40005
>> rowSums(hdf[1:5, 1:5])
> [1] 100005 100010 100015 100020 100025
>
> Martin
>
>>
>> Any help greatly appreciated.
>>
>> thank you,
>> Francesco
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>