[Bioc-devel] Memory limits for individual objects

Vincent Carey stvjc at channing.harvard.edu
Wed Jul 29 17:20:53 CEST 2009

GGtools uses snpMatrix snp.matrix instances extensively, held together
in a list of snp.matrices.  You might be able to decompose the data in
a similar way, keeping a very large quantity of genotype data in one
object without any single entity having more than 2^31-1 (approx. 2e9)
elements.  On a machine with adequate RAM, list(integer(2e9),
integer(2e9)) can be constructed.
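
A minimal sketch of that decomposition, in base R only and with toy
dimensions so it runs anywhere.  The names chunks, offsets and get_snp
are hypothetical helpers for illustration, not part of GGtools or
snpMatrix, and plain raw matrices stand in for snp.matrix instances.

```r
# Toy sizes; in practice each chunk might hold one chromosome's SNPs.
n_samples <- 4L
snps_per_chunk <- c(3L, 5L, 2L)

# Each chunk is a separate raw matrix, so the per-vector element limit
# applies to each chunk individually, not to the collection.
chunks <- lapply(snps_per_chunk, function(p) {
  matrix(as.raw(sample(0:3, n_samples * p, replace = TRUE)),
         nrow = n_samples, ncol = p)
})

# Cumulative column offsets map a global SNP index to (chunk, local column).
offsets <- c(0L, cumsum(snps_per_chunk))

get_snp <- function(chunks, offsets, j) {
  k <- findInterval(j - 1L, offsets)   # index of the chunk holding column j
  chunks[[k]][, j - offsets[k]]
}

# Total element count, summed as doubles to avoid integer overflow.
total_elements <- sum(as.numeric(sapply(chunks, length)))

snp4 <- get_snp(chunks, offsets, 4L)   # first column of the second chunk
```

Only the global-to-local index arithmetic is new here; any computation
that works one SNP (or one chunk) at a time can loop over the list.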

On Wed, Jul 29, 2009 at 8:06 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Hi Tim --
> Tim Rayner <tfrayner at gmail.com> writes:
>> Hi,
>> I'm running R 2.9.0 on Mac OS X, and attempting to import a large
>> number of SNP calls into a snp.matrix object as supported by the
>> snpMatrix package. I've run into a problem where I'd like the final
>> matrix object to contain around 5e9 elements, but of course the
>> maximum vector (and matrix) size in R is 2^31-1 (approx. 2e9), and I
>> get the dreaded "allocMatrix: too many elements specified" error
>> message. An obvious workaround is to split the analysis up into parts
>> which will fit within this limit, but I feel I should ask whether
>> there's a better way. I'm using a 64-bit build of R and I was
>> wondering whether anyone had experience changing the indexing of R
>> vectors and matrices from signed 32-bit integers to signed 64-bit
>> integers? Or should I just head over to R-devel directly (in which
>> case, apologies for the mispost)?
>> There was a not-terribly-helpful exchange regarding this question on
>> the main Bioconductor list last year:
>> http://www.nabble.com/allocMatrix-limits-td18763791.html
> I'm not speaking with too much authority here, not having looked in
> detail into snpMatrix.
> It would be a significant task (not impossible, but very challenging
> if one aims for portability) to change this limitation in R. The fastest
> way forward is probably to use some on-disk storage (I like the ncdf
> package for large numeric matrices) coupled with access to slices at
> a time.  Alternatively you can manage your own memory via external
> pointers, etc., but this requires that you implement whatever
> matrix-like functionality you want, losing all of the hard work others
> have done. There are packages in the bioc repository that have
> addressed these issues to one degree or another, including
> BufferedMatrix and externalVector.
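
The on-disk idea can be sketched with base R connections alone.  This
is not the ncdf, BufferedMatrix or externalVector API; it assumes a
flat binary file holding one byte per genotype in column-major order,
and read_snp_slice is a hypothetical helper name.

```r
# Toy data written to a temporary file; a real file could exceed 2^31-1 bytes.
n_samples <- 4L
n_snps <- 10L
geno <- matrix(as.raw(sample(0:3, n_samples * n_snps, replace = TRUE)),
               nrow = n_samples)

path <- tempfile(fileext = ".bin")
writeBin(as.vector(geno), path)   # one byte per genotype, column-major

# Read SNP columns first..last without loading the rest of the file.
read_snp_slice <- function(path, n_samples, first, last) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # Byte offset of column `first`; computed as a double, so it is not
  # capped at the 32-bit integer range.
  seek(con, where = (first - 1) * n_samples)
  n <- (last - first + 1L) * n_samples
  matrix(readBin(con, what = "raw", n = n), nrow = n_samples)
}

slice <- read_snp_slice(path, n_samples, first = 3L, last = 5L)
```

Only the requested slice is ever held in memory, so each slice just
needs to stay under the per-vector limit.
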
> An interesting activity might be to augment the snpMatrix package with
> external pointer memory management, because as you say this really is
> a case where the data quickly hit the R limit, and restricting focus
> to snpMatrix delimits the functionality that would need to be provided
> by your code.
> Another possibility is to scrap the strict 'matrix' representation;
> maybe the Rle class from IRanges would be a very effective compression
> tool, leading to straightforward and efficient algorithms for basic
> calculations, and allowing not too expensive expansion of slices of
> the Rle to full vectors for more elaborate computation.
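
A rough illustration of the run-length idea, using base R's rle() as a
stand-in rather than the IRanges Rle class (whose API is not assumed
here).  The 0/1/2 genotype coding and the expand_slice helper are
assumptions made for this sketch.

```r
# Calls for one SNP coded 0/1/2, with long runs of identical values.
calls <- rep(c(0L, 1L, 0L, 2L), times = c(500L, 20L, 300L, 5L))

r <- rle(calls)   # keeps only run values and run lengths

# Basic summaries can be computed on the runs without expanding them.
allele_count <- sum(r$values * r$lengths)

# Expand only positions first..last to a full vector.
expand_slice <- function(r, first, last) {
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1L
  keep <- which(ends >= first & starts <= last)   # runs overlapping the slice
  len <- pmin(ends[keep], last) - pmax(starts[keep], first) + 1L
  rep(r$values[keep], times = len)
}

part <- expand_slice(r, 495L, 525L)
```

How much this saves depends entirely on how long the runs are in real
genotype data; with few repeated neighbours the run-length form can be
larger than the plain vector.
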
> Martin
>> Since the snpMatrix package stores each element as type 'raw', the
>> final memory consumption for the snp.matrix object should only be a
>> handful of gigabytes, readily available in modern desktop computers.
>> It would be nice to be able to use that memory.
>> Many thanks,
>> Tim Rayner
>> Bioinformatician - Smith Lab
>> Cambridge Institute for Medical Research
>> University of Cambridge
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793

Vincent Carey, PhD
Biostatistics, Channing Lab
617 525 2265
