[R] Vector size limit for table() in R-2.15.1

Sean Ruddy sruddy17 at gmail.com
Fri Aug 10 07:30:17 CEST 2012


Thanks for the help, all! Good to know that there's an answer. Unfortunately, I don't have the rights to install programs, so I wasn't able to try R-devel. I had never heard of R-patched, but I'm guessing I can't install that either. I'll see if I can get someone to do that.

Much appreciated!

Sean



On Aug 9, 2012, at 10:05 PM, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:

> As the posting guide asked you to before posting, try R-patched.  That has the NEWS items
> 
>    • duplicated(), unique() and similar now support vectors of
>      lengths above 2^29 on 64-bit platforms.
> 
>    • unique() and similar would infinite-loop if called on a vector of
>      length > 2^29 (but reported that the vector was too long for 2^30
>      or more).
> 
> If you want to work on such large datasets, you might want to consider using R-devel which has a number of enhancements already with more in the pipeline.
> 
> On 10/08/2012 01:29, Sean Ruddy wrote:
>> Hi,
>> 
>> First, thanks in advance. Some useful info:
>> 
>>> version
>> platform       x86_64-unknown-linux-gnu
>> arch           x86_64
>> os             linux-gnu
>> system         x86_64, linux-gnu
>> version.string R version 2.15.1 (2012-06-22)
>> 
>> I'm trying to use the table() function on a 2-column matrix that has 711
>> million rows (see below). However, it freezes. If I subset the matrix to
>> 2^29 rows or fewer (about 537 million), the table() function finishes in
>> minutes. As soon as I go larger than that, beginning with 2^29+1 rows, it
>> gets stuck, i.e., nothing happens even after hours of running. I assume it
>> has something to do with memory, since I believe that's the 32-bit limit,
>> but I'm running on a 64-bit machine.
>> 
>> Here's the matrix:
>> 
>>> head(DRI.mtx)
>> 
>>   POSITION BP
>>  38076904  C
>>  38076905  C
>>  38076906  A
>>  38076907  T
>>  38076908  C
>>  38076909  C
>> 
>> 
>> The result from table() (if the matrix has no more than 2^29 rows) is
>> 
>>> head(table(DRI.mtx))
>> 
>>                   BP
>> POSITION     A C  G N  T
>>   115247036 17 0  0 0  0
>>   115247037 31 0  0 0  0
>>   115247038 46 0  0 0  0
>>   115247039  0 0 54 0  0
>>   115247040  0 0  1 0 66
>>   115247041  0 0  0 0 78
>> 
>> 
>> I've tracked the problem down to the C file "unique.c". table() calls
>> factor(), which calls unique(), which I believe calls into "unique.c".
>> Browsing through the C file, I found an if statement that checks whether
>> the size of the vector is larger than 2^30. If TRUE, it gives the error
>> message "too large for hashing". I do not get any error message when I run
>> table() on the full matrix, but I wonder whether I should, and whether the
>> limit of 2^30 is too high and should be lowered. Maybe it's just my setup,
>> or maybe it has nothing to do with unique.c. I don't know.
>> 
>> Here's the part of unique.c I was referring to:
>> 
>> /*
>>   Choose M to be the smallest power of 2
>>   not less than 2*n and set K = log2(M).
>>   Need K >= 1 and hence M >= 2, and 2^M <= 2^31 -1, hence n <= 2^30.
>> 
>>   Dec 2004: modified from 4*n to 2*n, since in the worst case we have
>>   a 50% full table, and that is still rather efficient -- see
>>   R. Sedgewick (1998) Algorithms in C++ 3rd edition p.606.
>> */
>> static void MKsetup(int n, HashData *d)
>> {
>>     int n2 = 2 * n;
>>     if(n < 0 || n > 1073741824) /* protect against overflow to -ve */
>>         error(_("length %d is too large for hashing"), n);
>>     d->M = 2;
>>     d->K = 1;
>>     while (d->M < n2) {
>>         d->M *= 2;
>>         d->K += 1;
>>     }
>> }
>> 
>> I presume "n" is the number of rows of the matrix, so I don't see why this
>> wouldn't run properly. I'm not sure what is causing the problem in the
>> unique.c file, and I have no idea how to troubleshoot it.
>> 
>> I have a workaround that reads in chunks at a time, but I'm very
>> interested in why there appears to be a limit at 2^29 when, according to
>> the unique.c file, it should be twice that.
>> 
>> Thanks for the help.
>> 
>> -Sean
> 
> 
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
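
A possible reading of the 2^29 boundary discussed in this thread, offered as a sketch under stated assumptions rather than as a description of the R sources: in the quoted MKsetup(), M is doubled until it is no longer smaller than n2 = 2*n. For n = 2^29 the loop stops at M = 2^30, but for n = 2^29+1 it would need M = 2^31, which does not fit in a signed 32-bit int. Assuming d->M is a 32-bit int and that overflow wraps in two's complement (formally, signed overflow is undefined in C), M would go negative and then to 0, after which M < n2 stays true forever. That would be consistent with a hang and no error message. The helpers wrap32() and mksetup_doublings() are hypothetical names used only for this illustration; the wraparound is emulated in 64-bit arithmetic so the sketch itself stays well defined, and it only covers 0 <= n < 2^30.

```c
#include <stdint.h>

/* Hypothetical helper: reduce a 64-bit value to what a 32-bit
   two's-complement int would hold after wrapping. */
static int64_t wrap32(int64_t v)
{
    uint32_t u = (uint32_t)v;   /* conversion to unsigned is modulo 2^32 */
    return (u >= 0x80000000u) ? (int64_t)u - 4294967296LL : (int64_t)u;
}

/* Hypothetical helper: replay the doubling loop from the quoted MKsetup()
   with M emulated as a wrapping 32-bit int.
   Assumes 0 <= n < 2^30, so that 2*n itself still fits in 32 bits.
   Returns the number of doublings performed, or -1 if the loop has not
   finished after max_iter doublings (i.e. it looks like it never will). */
static int mksetup_doublings(int32_t n, int max_iter)
{
    int64_t n2 = 2 * (int64_t)n;
    int64_t M = 2;
    int iter = 0;

    while (M < n2) {
        if (iter++ >= max_iter)
            return -1;          /* give up: M is stuck below n2 */
        M = wrap32(M * 2);      /* 2^30 -> -2^31 -> 0 -> 0 -> ... */
    }
    return iter;
}
```

Under these assumptions, mksetup_doublings(536870912, 64) returns 29 (M ends at 2^30), while mksetup_doublings(536870913, 64) returns -1 because the emulated M wraps to 0 and never catches up with n2.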


