[R] Vector size limit for table() in R-2.15.1
Prof Brian Ripley
ripley at stats.ox.ac.uk
Fri Aug 10 07:05:53 CEST 2012
As the posting guide asked you to before posting, try R-patched. That
has the NEWS items

  • duplicated(), unique() and similar now support vectors of
    lengths above 2^29 on 64-bit platforms.

  • unique() and similar would infinite-loop if called on a vector of
    length > 2^29 (but reported that the vector was too long for 2^30
    or more).

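
A minimal sketch of why the second item bites at 2^29 rather than 2^30. It assumes that d->M in the MKsetup() quoted below is a 32-bit int and that overflow wraps in two's complement (formally, signed overflow is undefined behaviour in C). For n > 2^29, n2 = 2*n exceeds 2^30, so M has to double past 2^30, wraps to -2^31 and then to 0, and the loop never ends. The names wrap_double and mksetup_steps are made up for this illustration, and the wraparound is emulated in 64-bit arithmetic so that the sketch itself is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Emulate `x *= 2` on a 32-bit two's-complement int without relying on
   undefined behaviour: multiply in 64 bits, then wrap by hand. */
static int32_t wrap_double(int32_t x)
{
    int64_t wide = (int64_t) x * 2;
    wide &= INT64_C(0xFFFFFFFF);            /* keep the low 32 bits */
    if (wide >= INT64_C(0x80000000))        /* reinterpret as signed */
        wide -= INT64_C(0x100000000);
    return (int32_t) wide;
}

/* Run the doubling loop from the quoted MKsetup() with 32-bit M and n2.
   Returns the number of doublings, or -1 if the loop is still running
   after max_steps iterations. The final M is stored in *M_out. */
static int mksetup_steps(int32_t n, int max_steps, int32_t *M_out)
{
    assert(max_steps >= 0);
    int32_t n2 = wrap_double(n);            /* int n2 = 2 * n; */
    int32_t M = 2;
    int steps = 0;
    while (M < n2) {
        if (steps == max_steps) {           /* give up: loop is stuck */
            *M_out = M;
            return -1;
        }
        M = wrap_double(M);                 /* d->M *= 2; */
        steps++;
    }
    *M_out = M;
    return steps;
}
```

With a cap of 100 iterations, mksetup_steps(536870912, 100, &M) finishes after 29 doublings with M = 2^30, while mksetup_steps(536870913, 100, &M) hits the cap and returns -1 with M stuck at 0.
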
If you want to work on such large datasets, you might want to consider
using R-devel which has a number of enhancements already with more in
the pipeline.
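
As a rough illustration of what sizing a hash table for more than 2^29 elements involves (this is not the code in R-patched or R-devel, and WideHashData and mksetup_wide are hypothetical names), the same doubling loop terminates if M and n2 are kept in a 64-bit type. It covers only the choice of table size; the hash functions and index types would need the same treatment.

```c
#include <stdint.h>

/* Hypothetical stand-in for the sizing fields of a hash table;
   not R's HashData. */
typedef struct {
    int64_t M;   /* table size, a power of 2 */
    int     K;   /* log2(M) */
} WideHashData;

/* Choose M as the smallest power of 2 not less than 2*n, with K = log2(M).
   Returns 0 on success, -1 if n is negative or so large that 2*n would
   not fit comfortably in an int64_t. */
static int mksetup_wide(int64_t n, WideHashData *d)
{
    if (n < 0 || n > (INT64_C(1) << 61))
        return -1;
    int64_t n2 = 2 * n;      /* at most 2^62, so no overflow */
    d->M = 2;
    d->K = 1;
    while (d->M < n2) {      /* M never exceeds 2^62 here */
        d->M *= 2;
        d->K += 1;
    }
    return 0;
}
```

For a length of 711 million, as in the matrix below, this gives M = 2^31 and K = 31, a table size that no longer fits in a 32-bit int.
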
On 10/08/2012 01:29, Sean Ruddy wrote:
> Hi,
>
> First, thanks in advance. Some useful info:
>
>> version
> platform x86_64-unknown-linux-gnu
> arch x86_64
> os linux-gnu
> system x86_64, linux-gnu
> version.string R version 2.15.1 (2012-06-22)
>
> I'm trying to use the table() function on a 2-column matrix that has 711
> million rows (see below). However, it freezes. If I subset the matrix to
> at most 2^29 rows (500+ million), then table() finishes in minutes. As
> soon as I go larger than that, beginning with 2^29+1 rows, it gets stuck,
> i.e. nothing happens even after hours of running. I assume it has
> something to do with memory, since I believe that's the 32-bit limit, but
> I'm running on a 64-bit machine.
>
> Here's the matrix:
>
>> head(DRI.mtx)
>
> POSITION BP
> 38076904 C
> 38076905 C
> 38076906 A
> 38076907 T
> 38076908 C
> 38076909 C
>
>
> The result from table (if the matrix has less than 2^29 rows) is
>
>> head(table(DRI.mtx))
>
>            BP
> POSITION     A  C  G  N  T
>   115247036 17  0  0  0  0
>   115247037 31  0  0  0  0
>   115247038 46  0  0  0  0
>   115247039  0  0 54  0  0
>   115247040  0  0  1  0 66
>   115247041  0  0  0  0 78
>
>
> I've tracked the problem down to the C file "unique.c". table() calls
> factor(), which calls unique(), which I believe is implemented in
> "unique.c". Browsing through the C file, I found an if statement that
> checks whether the size of the vector is larger than 2^30. If TRUE, it
> gives the error message "too large for hashing". I do not get any error
> message when I run table() on the full matrix, but I wonder if maybe I
> should, and if the limit of 2^30 is too high and should be lowered. Maybe
> it's just my setup, or maybe it has nothing to do with unique.c. I don't
> know.
>
> Here's the part of unique.c I was referring to:
>
> /*
> Choose M to be the smallest power of 2
> not less than 2*n and set K = log2(M).
> Need K >= 1 and hence M >= 2, and 2^M <= 2^31 -1, hence n <= 2^30.
>
> Dec 2004: modified from 4*n to 2*n, since in the worst case we have
> a 50% full table, and that is still rather efficient -- see
> R. Sedgewick (1998) Algorithms in C++ 3rd edition p.606.
> */
> static void MKsetup(int n, HashData *d)
> {
>     int n2 = 2 * n;
>     if(n < 0 || n > 1073741824) /* protect against overflow to -ve */
>         error(_("length %d is too large for hashing"), n);
>     d->M = 2;
>     d->K = 1;
>     while (d->M < n2) {
>         d->M *= 2;
>         d->K += 1;
>     }
> }
>
> "n" is, I presume, the number of rows of the matrix, so I don't see why
> this wouldn't run properly. I'm not sure what is causing the problem in
> unique.c, and I have no idea how to troubleshoot it.
>
> I have a workaround that reads in chunks at a time, but I'm very
> interested in why there appears to be a limit at 2^29 when, according to
> unique.c, it should be twice that.
>
> Thanks for the help.
>
> -Sean
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595