[R] Memory management in R
Lorenzo Isella
lorenzo.isella at gmail.com
Sat Oct 9 15:45:03 CEST 2010
Hi David,
I am replying to you and to the other people who provided some insight
into my problems with grepl.
Well, at least we now know that the bug is reproducible.
The sequence I am postprocessing is indeed a strange one, probably
pathological to some extent. Nevertheless, the fact that grepl crashes
when it is given a long (but not huge) chunk of repeated data has to be
acknowledged.
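One thing that may sidestep the error, assuming the pattern is meant as a literal string and not as a regular expression: grepl() takes a fixed = TRUE argument, in which case no regular expression is compiled at all. A minimal sketch, reusing the construction from the failing example (the variable name long is made up for illustration):

```r
# Same construction as the failing example: 240 blocks joined by "#".
eprs <- paste(rep("aaaaaaaaaa", 240), collapse = "#")

# fixed = TRUE asks for literal matching, so the pattern is never
# compiled as a regular expression.
ok_240 <- grepl(eprs, eprs, fixed = TRUE)

# A considerably longer string of the same shape, as a sanity check.
long <- paste(rep("aaaaaaaaaa", 2000), collapse = "#")
ok_2000 <- grepl(long, long, fixed = TRUE)

print(c(ok_240, ok_2000))
```

This only addresses the error itself. Literal matching on a pasted string can still match across element boundaries unless the separators are handled with care.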
Now, my problem is the following: given a potentially long string (or,
before that, a sequence in which every element has been generated via the
hash function, algo='crc32', of the digest package), how can I, starting
from an arbitrary position i along the list, find the shortest
substring in the future of i (i.e. in the interval i:end of the series)
that has not occurred in the past of i (i.e. in 1:(i-1))?
Efficiency is not the main point here: I need to run this code only once
to get what I need, but it cannot crash on a 2000-entry string.
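To make the question concrete, here is a minimal sketch of the computation I have in mind. It works directly on the vector of hashes rather than on a pasted string, so no regular expression is involved. The function name shortest_new_block and the toy vectors are made up for illustration, and I am assuming that an earlier occurrence only counts if it lies entirely within 1:(i-1).

```r
# Length of the shortest block x[i:(i + len - 1)] that does not occur as a
# contiguous run inside x[1:(i - 1)]. Returns NA if every block starting at
# i (up to the end of x) has already been seen in the past.
# Assumption: an earlier occurrence must end at or before position i - 1.
shortest_new_block <- function(x, i) {
  n <- length(x)
  # Start positions in the past where the current block still matches.
  starts <- seq_len(i - 1)
  for (len in seq_len(n - i + 1)) {
    # Keep only candidates whose match of length `len` ends before i.
    starts <- starts[starts + len - 1 <= i - 1]
    # Keep only candidates that agree with the newly added element.
    starts <- starts[x[starts + len - 1] == x[i + len - 1]]
    if (length(starts) == 0) return(len)
  }
  NA_integer_
}

# Toy data: "a" "b" has been seen before position 3, "a" "b" "c" has not.
toy <- c("a", "b", "a", "b", "c")
res_toy <- shortest_new_block(toy, 3)

# A clump of 2000 identical entries, similar in spirit to my data.
clump <- rep("12653a6", 2000)
res_clump <- shortest_new_block(clump, 1000)

print(c(res_toy, res_clump))
```

Each extension of the block costs at most one pass over the past, so a single position i needs on the order of n^2 comparisons in the worst case. That is slow but should be bearable for fewer than 2000 entries.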
Cheers
Lorenzo
On 10/09/2010 01:30 AM, David Winsemius wrote:
>> What puzzles me is that the list is not really long (less than 2000
>> entries) and I have not experienced the same problem even with longer
>> lists.
>
> But maybe your loop terminated earlier in them? Somewhere between
> 11*225 and 11*240 the grepping machine gives up:
>
> > eprs <- paste(rep("aaaaaaaaaa", 225), collapse="#")
> > grepl(eprs, eprs)
> [1] TRUE
>
> > eprs <- paste(rep("aaaaaaaaaa", 240), collapse="#")
> > grepl(eprs, eprs)
> Error in grepl(eprs, eprs) :
> invalid regular expression
> 'aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaa
>
> In addition: Warning message:
> In grepl(eprs, eprs) : regcomp error: 'Out of memory'
>
> The complexity of the problem may depend on the distribution of values.
> You have a very skewed distribution with the vast majority being in the
> same value as appeared in your error message :
>
> > table(x)
> x
>  12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
>     1419      299        1        1        1        3        1        1
> ac76183b b955be36 c600173a e96f6bbd e9c56275
>        1       30        5        1        9
>
> And you have 1159 of them in one clump (which would seem to be somewhat
> improbable under a random null hypothesis):
>
> > max(rle(x)$lengths)
> [1] 1159
> > which(rle(x)$lengths == 1159)
> [1] 123
> > rle(x)$values[123]
> [1] "12653a6"
>
> HTH (although I think it means you need to construct a different
> implementation strategy);
>
> David.
>
>
>> Many thanks
>>
>> Lorenzo
>>
>>