[R] Memory management in R

Sat Oct 9 16:48:19 CEST 2010

On Oct 9, 2010, at 9:45 AM, Lorenzo Isella wrote:

> Hi David,
> I am replying to you and to the other people who provided some  
> insight into my problems with grepl.
> Well, at least we now know that the bug is reproducible.
> Indeed it is a strange sequence the one I am postprocessing,  
> probably pathological to some extent, nevertheless the problem is  
> given by grepl crushing when a long (but not huge) chunk of repeated  
> data is loaded has to be acknowledged.
> Now, my problem is the following: given a potentially long string  
> (or before that a sequence, where every element has been generated  
> via the hash function, algo='crc32' of the digest package), how can  
> I, starting from an arbitrary position i along the list, calculate  
> the shortest substring in the future of i (i.e. the interval i:end  
> of the series) that has not occurred in the past of i (i.e. [1:i-1])?

Maybe you should work on a less convoluted explanation of the test? Or  
perhaps a couple of compact examples, preferably in R-copy-paste format?

> Efficiency is not the main point here, I need to run this code only  
> once to get what I need, but it cannot crush on a 2000-entry string.

My suggestion is to explore other alternatives. (I will admit that I  
don't yet fully understand the test that you are applying.) The two  
that have occurred to me are Biostrings which I have already mentioned  
and rle() which I have illustrated the use of but not referenced as an  
avenue. The Biostrings package is part of bioConductor (part of the R  
universe) although you should be prepared for a coffee break when you  
install it if you haven't gotten at least bioClite already installed.  
When I installed it last night it had 54 other package dependents also  
downloaded and installed. It seems to me that taking advantage of the  
coding resources in the molecular biology domain that are currently  
directed at decoding the information storage mechanism of life might  
be a smart strategy. You have not described the domain you are working  
in but I would guess that the "digest" package might be biological in  
primary application? So forgive me if I am preaching to the choir.

The rle option also occurred to me but it might take a smarter coder  
than I to fully implement it. (But maybe Holtman would be up to it.  
He's a _lot_ smarter than I.)  In your example the long "x" string is  
faithfully represented by two aligned vectors, each 197 characters in  
length. The long repeat sequence that broke the grepl mechanism are  
just one pair of values.
 > rle(x)
Run Length Encoding
   lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
   values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...

So maybe as soon as you got to a bundle that was greater than 1/2 the  
overall length (as happened in the "x" case) you could stop, since it  
could not have "occurred before".

-- 
David.

> Cheers
>
> Lorenzo
>
>
> On 10/09/2010 01:30 AM, David Winsemius wrote:
>
>>> What puzzles me is that the list is not really long (less than 2000
>>> entries) and I have not experienced the same problem even with  
>>> longer
>>> lists.
>>
>> But maybe your loop terminated in them eaarlier/ Someplace between
>> 11*225 and 11*240 the grepping machine gives up:
>>
>> > eprs <- paste(rep("aaaaaaaaaa", 225), collapse="#")
>> > grepl(eprs, eprs)
>> [1] TRUE
>>
>> > eprs <- paste(rep("aaaaaaaaaa", 240), collapse="#")
>> > grepl(eprs, eprs)
>> Error in grepl(eprs, eprs) :
>> invalid regular expression
>> 'aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaa
>>
>> In addition: Warning message:
>> In grepl(eprs, eprs) : regcomp error: 'Out of memory'
>>
>> The complexity of the problem may depend on the distribution of  
>> values.
>> You have a very skewed distribution with the vast majority being in  
>> the
>> same value as appeared in your error message :
>>
>> > table(x)
>> x
>> 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e  
>> abddf3d1
>> 1419 299 1 1 1 3 1 1
>> ac76183b b955be36 c600173a e96f6bbd e9c56275
>> 1 30 5 1 9
>>
>> And you have 1159 of them in one clump (which would seem to be  
>> somewhat
>> improbably under a random null hypothesis:
>>
>> > max(rle(x)$lengths)
>> [1] 1159
>> > which(rle(x)$lengths == 1159)
>> [1] 123
>> > rle(x)$values[123]
>> [1] "12653a6"
>>
>> HTH (although I think it means you need to construct a different
>> implementation strategy);
>>
>> David.
>>
>>
>>> Many thanks
>>>
>>> Lorenzo
>>>
>>>

David Winsemius, MD
West Hartford, CT