[R] How to save very large matrix?
Hervé Pagès
hpages at fhcrc.org
Wed Oct 30 00:14:13 CET 2013
Hi Petar,
If you're going to share this matrix across R sessions, save()/load() is
probably one of your best options.
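For illustration only, here is a minimal sketch of the save()/load() round trip. It uses a small stand-in matrix, and the temporary file path and the name `m_saved` are assumptions made for the example:

```r
# Small stand-in matrix; the real one would be far larger.
m <- matrix(runif(20), nrow = 4)

# Hypothetical file location, used only for this sketch.
path <- tempfile(fileext = ".RData")

# save() stores the object together with its name.
save(m, file = path)

# Keep a copy for comparison, then drop the original.
m_saved <- m
rm(m)

# load() recreates `m` under its original name in the current environment.
load(path)
stopifnot(identical(m, m_saved))
```

Note that load() restores the object under the name it was saved with, so an existing object called `m` would be overwritten.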
Otherwise, you could try the rhdf5 package from Bioconductor:
1. Install the package with:
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
2. Then:
library(rhdf5)
h5createFile("my_big_matrix.h5")
# write a matrix
my_big_matrix <- matrix(runif(5000*10000), nrow=5000)
attr(my_big_matrix, "scale") <- "liter"
h5write(my_big_matrix, "my_big_matrix.h5", "my_big_matrix")  # takes 1 min.
# file size on disk is 248M
# read a matrix
my_big_matrix <- h5read("my_big_matrix.h5", "my_big_matrix")  # takes 7.4 sec.
Multiply the above numbers (obtained on a laptop with a traditional
hard drive) by 100 for your monster matrix, or less if you have super
fast I/O.
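As a rough back-of-the-envelope version of that scaling, assuming that time and file size grow linearly with the number of elements (an assumption, not something I measured), the arithmetic looks like this. The variable names are made up for the sketch:

```r
# Figures measured on the 5000 x 10000 test matrix above.
write_min_small <- 1      # minutes to write
read_sec_small  <- 7.4    # seconds to read
disk_mb_small   <- 248    # file size on disk, in MB

# In-memory size of the test matrix: 8 bytes per double.
mem_mb_small <- 5000 * 10000 * 8 / 1e6   # 400 MB

# Scale factor suggested above for the full matrix.
scale <- 100

write_min_big <- write_min_small * scale       # 100 minutes
read_min_big  <- read_sec_small * scale / 60   # about 12.3 minutes
disk_gb_big   <- disk_mb_small * scale / 1000  # 24.8 GB

round(c(write_min = write_min_big,
        read_min  = read_min_big,
        disk_gb   = disk_gb_big), 1)
```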
Two advantages of using the HDF5 format: (1) it should not be too hard
to use the HDF5 C library in the C code you're going to use to read the
matrix, and (2) my understanding is that HDF5 is good at letting you
access arbitrary slices of the data, so chunk processing should be easy
and efficient:
http://www.hdfgroup.org/HDF5/
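To give an idea of what slice access can look like from the R side, here is a small sketch that uses the `index` argument of h5read() in rhdf5 to read only a block of rows. The temporary file, the dataset name "m" and the tiny 20 x 10 matrix are assumptions made for the example:

```r
library(rhdf5)

# Hypothetical temporary file and a tiny 20 x 10 stand-in matrix.
h5file <- tempfile(fileext = ".h5")
h5createFile(h5file)
m <- matrix(as.numeric(1:200), nrow = 20)
h5write(m, h5file, "m")

# Read rows 5 to 8 only; NULL means "all elements" along that dimension.
block <- h5read(h5file, "m", index = list(5:8, NULL))

# The block read from disk matches the same rows of the in-memory matrix.
stopifnot(all(dim(block) == c(4, 10)),
          all(block == m[5:8, ]))
```

Looping over successive row ranges in this way would let you process the matrix chunk by chunk without ever holding all of it in memory.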
Cheers,
H.
On 10/29/2013 02:34 PM, Petar Milin wrote:
> Hello,
>
> On Oct 29, 2013, at 10:16 PM, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>
>> On 29/10/2013 20:42, Rui Barradas wrote:
>>> Hello,
>>>
>>> You can use the append = TRUE argument of write.table to write the
>>> matrix in chunks. Something like the following.
>>
>> That was going to be my suggestion. But the reason long vectors have not been implemented is that it is rather implausible to be useful. A text file with the values of such a numeric matrix is likely to be 100GB. What are you going to do with such a file? For transfer to another program I would seriously consider a binary format (e.g. use writeBin), as it is the conversion to and from text that is time-consuming.
>
> I need to submit it to a cluster analysis (k-means). From an independent source I have been advised to use a k-means algorithm written in C, which is very fast and efficient. It asks for a txt file as input.
>
> I tried a few options in R, where I am more comfortable, but a solution never came, even after many hours.
>
> Thanks!
> Best,
> PM
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319