[Bioc-devel] Memory limits for individual objects

Martin Morgan mtmorgan at fhcrc.org
Wed Jul 29 17:06:44 CEST 2009


Hi Tim --

Tim Rayner <tfrayner at gmail.com> writes:

> Hi,
>
> I'm running R 2.9.0 on Mac OS X, and attempting to import a large
> number of SNP calls into a snp.matrix object as supported by the
> snpMatrix package. I've run into a problem where I'd like the final
> matrix object to contain around 5e9 elements, but of course the
> maximum vector (and matrix) size in R is 2^31-1 (approx. 2e9), and I
> get the dreaded "allocMatrix: too many elements specified" error
> message. An obvious workaround is to split the analysis up into parts
> which will fit within this limit, but I feel I should ask whether
> there's a better way. I'm using a 64-bit build of R and I was
> wondering whether anyone had experience changing the indexing of R
> vectors and matrices from signed 32-bit integers to signed 64-bit
> integers? Or should I just head over to R-devel directly (in which
> case, apologies for the mispost)?
>
> There was a not-terribly-helpful exchange regarding this question on
> the main Bioconductor list last year:
>
> http://www.nabble.com/allocMatrix-limits-td18763791.html

I'm not speaking with too much authority here, not having looked in
detail into snpMatrix.

It would be a significant task (not impossible, but very challenging
if one aims for portability) to change this limitation in R. The
fastest way forward is probably to use some on-disk storage (I like
the ncdf package for large numeric matrices) coupled with access to
one slice at a time. Alternatively you can manage your own memory via
external pointers, etc., but this requires that you implement whatever
matrix-like functionality you want, losing all of the hard work others
have done. There are packages in the bioc repository that have
addressed these issues to one degree or another, including
BufferedMatrix and externalVector.
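
To make the on-disk idea concrete, here is a minimal sketch, written
in Python purely for illustration. It is not how ncdf, BufferedMatrix
or externalVector work internally; the class DiskByteMatrix and all
of its methods are hypothetical names. It assumes one byte per
element (as with 'raw' storage) and a column-major layout, so that a
block of adjacent columns is one contiguous read. Only the requested
slice is ever held in memory, and file offsets are not tied to a
32-bit index.

```python
import os
import tempfile


class DiskByteMatrix:
    """Column-major matrix of single bytes kept in a flat file.

    Hypothetical helper for illustration only.
    """

    def __init__(self, path, nrow, ncol):
        self.path, self.nrow, self.ncol = path, nrow, ncol

    @classmethod
    def create(cls, path, nrow, ncol):
        # Reserve nrow * ncol bytes on disk without building them in memory.
        with open(path, "wb") as fh:
            fh.truncate(nrow * ncol)
        return cls(path, nrow, ncol)

    def _check(self, first_col, n_cols):
        if first_col < 0 or n_cols < 0 or first_col + n_cols > self.ncol:
            raise IndexError("column range out of bounds")

    def write_columns(self, first_col, data):
        # data holds whole columns, laid out one after another.
        if len(data) % self.nrow != 0:
            raise ValueError("data must contain whole columns")
        self._check(first_col, len(data) // self.nrow)
        with open(self.path, "r+b") as fh:
            fh.seek(first_col * self.nrow)
            fh.write(data)

    def read_columns(self, first_col, n_cols):
        # Return a list of n_cols bytes objects, each of length nrow.
        self._check(first_col, n_cols)
        with open(self.path, "rb") as fh:
            fh.seek(first_col * self.nrow)
            block = fh.read(n_cols * self.nrow)
        return [block[i:i + self.nrow] for i in range(0, len(block), self.nrow)]


def demo():
    # Tiny 3 x 4 example; a real matrix would simply use larger dimensions.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "calls.bin")
        m = DiskByteMatrix.create(path, nrow=3, ncol=4)
        m.write_columns(1, bytes([1, 2, 3, 4, 5, 6]))  # fills columns 1 and 2
        return os.path.getsize(path), m.read_columns(1, 2)
```

Calling demo() builds a 12-byte file and reads back the two columns
that were written. The same seek-and-read pattern applies whatever
the matrix size, since only the slice passes through memory.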

An interesting project might be to augment the snpMatrix package with
external pointer memory management. As you say, this really is a case
where the data quickly hit the R limit, and restricting focus to
snpMatrix limits the functionality that your code would need to
provide.

Another possibility is to scrap the strict 'matrix' representation;
maybe the Rle class from IRanges would be a very effective compression
tool, leading to straightforward and efficient algorithms for basic
calculations, and allowing not-too-expensive expansion of slices of
the Rle to full vectors for more elaborate computation.
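
The run-length idea can be sketched as follows, again in Python
purely for illustration. This is not the IRanges Rle API; the class
name and its methods (count, expand) are hypothetical. The sketch
stores one value and one length per run, answers a basic calculation
(counting a value) directly from the runs, and expands only a
requested slice back to a full vector.

```python
from bisect import bisect_right
from itertools import accumulate, groupby


class Rle:
    """Run-length encoded vector (illustrative sketch only)."""

    def __init__(self, values):
        self.values = []   # one entry per run
        self.lengths = []  # length of each run
        for value, run in groupby(values):
            self.values.append(value)
            self.lengths.append(sum(1 for _ in run))
        # ends[i] is the index one past the last element of run i.
        self.ends = list(accumulate(self.lengths))

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def count(self, value):
        # Basic calculations work on runs, not on individual elements.
        return sum(n for v, n in zip(self.values, self.lengths) if v == value)

    def expand(self, start, stop):
        # Return elements start..stop-1 as a plain list.
        if not 0 <= start <= stop <= len(self):
            raise IndexError("slice out of bounds")
        out = []
        i = bisect_right(self.ends, start)  # run containing position start
        pos = start
        while pos < stop:
            take = min(self.ends[i], stop) - pos
            out.extend([self.values[i]] * take)
            pos += take
            i += 1
        return out
```

For example, Rle([0, 0, 0, 1, 1, 2, 0, 0]) holds four runs, and
expand(2, 6) rebuilds only the four elements in that slice. How well
this compresses depends entirely on how long the runs in the real
data are.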

Martin

> Since the snpMatrix package stores each element as type 'raw', the
> final memory consumption for the snp.matrix object should only be a
> handful of gigabytes, readily available in modern desktop computers.
> It would be nice to be able to use that memory.
>
> Many thanks,
>
> Tim Rayner
>
>
> Bioinformatician - Smith Lab
> Cambridge Institute for Medical Research
> University of Cambridge
>
> _______________________________________________
> Bioc-devel at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


