[R] Can R handle a matrix with 8 billion entries?

Chris Howden chris at trickysolutions.com.au
Wed Aug 10 05:38:59 CEST 2011


Hi,

I’m trying to do a hierarchical cluster analysis in R with a Big Data set.
I’m running into problems using the dist() function.

I’ve been looking at a few threads about R’s memory and have read the
memory limits section in R help. However, I’m no computer expert, so I’m
hoping I’ve misunderstood something and R can handle my Big Data set
somehow. At the moment I think my data set is simply too big and there is
no way around it, but I’d like to be proved wrong!

My data set has 90523 rows of data and 24 columns.

My understanding is that this means the distance matrix has a minimum of
90523^2 = 8194413529 elements, which roughly translates to 8GB of memory
being required (if I assume each entry requires 1 byte). I only have 4GB
on a 32-bit build of Windows and R, so there is no way that’s going to
work.
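
The arithmetic above can be checked directly in R. This is only a rough
sketch under two assumptions that are not stated in the message: that each
distance is stored as an 8-byte double (R's numeric type), and that dist()
keeps only the lower triangle, i.e. n*(n-1)/2 values rather than n^2. The
variable names are purely illustrative.

```r
# Back-of-the-envelope size estimate for the distance matrix (a sketch).
n <- 90523                      # rows in the data set (kept as a double, so no integer overflow)

full_elems <- n^2               # full square matrix, as assumed in the message
tri_elems  <- n * (n - 1) / 2   # lower triangle only (assumed dist() storage)

bytes_per_double <- 8           # assumption: one double-precision value per distance

full_gb <- full_elems * bytes_per_double / 1024^3
tri_gb  <- tri_elems  * bytes_per_double / 1024^3

cat(sprintf("full matrix:    %.0f elements, about %.1f GB\n", full_elems, full_gb))
cat(sprintf("lower triangle: %.0f elements, about %.1f GB\n", tri_elems, tri_gb))
```

Under these assumptions the requirement is well above the 8GB estimated
from 1 byte per entry, so the conclusion about a 4GB machine does not change.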

So then I thought of getting access to a more powerful computer, and maybe
using cloud computing.

However, the R memory limit help mentions “On all builds of R, the maximum
length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9”. Since the
distance matrix I require has more elements than this, does this mean it’s
too big for R no matter what I do?
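
A quick way to compare the problem size against the quoted limit is
sketched below. It assumes the distances are held in a single vector of
n*(n-1)/2 elements (lower triangle only) and uses .Machine$integer.max as
a stand-in for the 2^31 - 1 figure from the help page. The names n,
tri_elems and max_n are illustrative only.

```r
# Compare the number of pairwise distances with the quoted vector-length limit.
n <- 90523                           # rows in the data set
max_len <- .Machine$integer.max      # 2^31 - 1, the limit quoted from the help page

tri_elems <- n * (n - 1) / 2         # assumed length of the distance vector
exceeds   <- tri_elems > max_len     # TRUE if one vector cannot hold all distances

# Largest number of rows whose pairwise distances still fit in one vector:
# solve n * (n - 1) / 2 <= max_len for n.
max_n <- floor((1 + sqrt(1 + 8 * max_len)) / 2)

cat("elements needed:", format(tri_elems, big.mark = ","), "\n")
cat("exceeds limit:  ", exceeds, "\n")
cat("max rows:       ", max_n, "\n")
```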

Any ideas would be welcome.

Thanks.


Chris Howden



