[R] Can R handle a matrix with 8 billion entries?

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Aug 10 07:16:43 CEST 2011


On Wed, 10 Aug 2011, David Winsemius wrote:

>
> On Aug 9, 2011, at 11:38 PM, Chris Howden wrote:
>
>> Hi,
>> 
>> I’m trying to do a hierarchical cluster analysis in R with a Big Data set.
>> I’m running into problems using the dist() function.
>> 
>> I’ve been looking at a few threads about R’s memory and have read the
>> memory limits section in R help. However I’m no computer expert so I’m
>> hoping I’ve misunderstood something and R can handle my Big Data set,
>> somehow. Although at the moment I think my dataset is simply too big and
>> there is no way around it, but I’d like to be proved wrong!
>> 
>> My data set has 90523 rows of data and 24 columns.
>> 
>> My understanding is that this means the distance matrix has a min of
>> 90523^2 elements which is 8194413529. Which roughly translates as 8GB of

A bit less than half that: it is symmetric.

>> memory being required (if I assume each entry requires 1 bit).

Hmm, that would be a 0/1 distance: there are simpler methods to 
cluster such distances.
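
To put rough numbers on the two remarks above, here is a sketch of the arithmetic. It assumes dist() stores one 8-byte double per distinct pair of rows (the lower triangle only), not one bit per entry; the variable names are illustrative.

```r
# Storage needed by a dist object for n rows (lower triangle only).
n <- 90523                       # rows in the data set, from the post
n_pairs <- n * (n - 1) / 2       # distinct pairs, computed in double precision
bytes <- n_pairs * 8             # assumption: one 8-byte double per pair

n_pairs                          # 4097161503
n_pairs > .Machine$integer.max   # TRUE: more than 2^31 - 1 elements
bytes / 2^30                     # roughly 30.5 GiB
```

So the symmetric form is about half of 90523^2 entries, but at 8 bytes each it still needs far more than 8GB and exceeds the quoted vector-length limit.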

>> I only have 4GB on a 32bit build of windows and R. So there is no 
>> way that’s going to work.
>> 
>> So then I thought of getting access to a more powerful computer, and maybe
>> using cloud computing.
>> 
>> However the R memory limit help mentions  “On all builds of R, the maximum
>> length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9”. Now as the
>> distance matrix I require has more elements than this does this mean it’s
>> too big for R no matter what I do?
>
> Yes. Vector indexing is done with 4 byte integers.

Only if you need the full distance matrix in memory at one time, which you do 
not for hierarchical clustering (itself a highly dubious method for 
more than a few hundred points).
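
As one illustration of not holding the full matrix at once, the sketch below computes each point's nearest neighbour from blocks of rows, so only a block_size x n slab of distances exists at any time. This is not necessarily the approach meant above and it is not a complete clustering method. The function name, the block_size argument and the choice of Euclidean distance are all assumptions made for the example.

```r
# Hypothetical helper: nearest-neighbour index and distance for every row of x,
# computed one block of rows at a time (Euclidean distance assumed).
nearest_neighbour_blockwise <- function(x, block_size = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  sq_norms <- rowSums(x^2)
  nn_index <- integer(n)
  nn_dist <- numeric(n)
  for (start in seq(1, n, by = block_size)) {
    rows <- start:min(start + block_size - 1, n)
    # Squared distances for this block: |a|^2 + |b|^2 - 2 a.b
    d2 <- outer(sq_norms[rows], sq_norms, "+") -
      2 * tcrossprod(x[rows, , drop = FALSE], x)
    d2[cbind(seq_along(rows), rows)] <- Inf    # ignore each point's distance to itself
    idx <- apply(d2, 1, which.min)
    nn_index[rows] <- idx
    # pmax() guards against tiny negative values from rounding
    nn_dist[rows] <- sqrt(pmax(d2[cbind(seq_along(rows), idx)], 0))
  }
  list(index = nn_index, dist = nn_dist)
}

# Small simulated data set with 24 columns, as in the post.
set.seed(1)
x <- matrix(rnorm(500 * 24), ncol = 24)
nn <- nearest_neighbour_blockwise(x, block_size = 64)
```

Peak memory is governed by block_size times the number of rows, so the block can be shrunk until a slab fits; the price is more passes through the loop.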

>
> -- 
>
> David Winsemius, MD
> West Hartford, CT

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

