[R] Memory management in R

Sat Oct 9 01:30:45 CEST 2010

On Oct 8, 2010, at 6:42 PM, Lorenzo Isella wrote:

> Thanks for lending a helping hand.
> I put together a self-contained example. Basically, it all relies on  
> a couple of functions, where one function simply iterates the  
> application of the other function.
> I am trying to implement the so-called Lempel-Ziv entropy estimator.  
> The idea is to choose a position i along a string x (standing for a  
> time series) and find the length of the shortest string starting  
> from i which has never occurred before i.
> Please find below the R snippet which requires an input file (a  
> simple text file) you can download from
>
> http://dl.dropbox.com/u/5685598/time_series25_.dat
>
> What puzzles me is that the list is not really long (less than 2000  
> entries) and I have not experienced the same problem even with  
> longer lists.

But maybe your loop terminated earlier in those cases. Somewhere between  
11*225 and 11*240 characters the grepping machine gives up:

 > eprs <- paste(rep("aaaaaaaaaa", 225), collapse="#")
 > grepl(eprs, eprs)
[1] TRUE

 > eprs <- paste(rep("aaaaaaaaaa", 240), collapse="#")
 > grepl(eprs, eprs)
Error in grepl(eprs, eprs) :
   invalid regular expression  
'aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaa
In addition: Warning message:
In grepl(eprs, eprs) : regcomp error:  'Out of memory'
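
Since the error comes from compiling the pattern as a regular expression (regcomp), one thing worth trying is literal matching. A minimal sketch, assuming no regex features are actually needed in the pattern; `fixed = TRUE` is a documented argument of grepl, and `res` is just a name chosen here:

```r
# Same 240-repeat string that failed above (11*240 - 1 characters).
eprs <- paste(rep("aaaaaaaaaa", 240), collapse = "#")

# fixed = TRUE treats the pattern as a literal string, so no regular
# expression is compiled and the regcomp step is skipped entirely.
res <- grepl(eprs, eprs, fixed = TRUE)
```

This only changes how the match is done; it does nothing about the cost of rebuilding and searching long strings inside a loop.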

The complexity of the problem may depend on the distribution of  
values. You have a very skewed distribution, with the vast majority  
of entries taking the same value that appeared in your error message:

 > table(x)
x
  12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
     1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
        1       30        5        1        9

And you have 1159 of them in one clump (which would seem to be  
somewhat improbable under a random null hypothesis):

 > max(rle(x)$lengths)
[1] 1159
 > which(rle(x)$lengths == 1159)
[1] 123
 > rle(x)$values[123]
[1] "12653a6"
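
Note that the 123 above is the index of the run, not a position in x. A small sketch of how the run index can be turned into a starting position, using a made-up toy vector (`x_toy`, `k` and `run_start` are names chosen here for illustration):

```r
# Toy vector standing in for the real series (made up for illustration).
x_toy <- c("a", "a", "b", "a", "a", "a")

r <- rle(x_toy)               # runs: "a" x2, "b" x1, "a" x3
k <- which.max(r$lengths)     # index of the longest run

# Each run starts one past the end of the previous run.
run_start <- cumsum(c(1, head(r$lengths, -1)))[k]

run_value  <- r$values[k]     # the repeated value
run_length <- r$lengths[k]    # how many times it repeats
```

On the real data the same steps would give the position in x where the clump of 1159 begins.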

HTH (although I think it means you need to construct a different  
implementation strategy);
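
One possible direction, offered only as a sketch and not tested on your data: skip the pasted strings altogether and compare the elements of x directly. The function name `entropy_lz_vec` is made up here. It assumes x has no NA values, and it compares whole elements, so unlike grepl on a pasted string it cannot match part of a token.

```r
# Hypothetical replacement for entropy_lz(): length of the shortest
# sequence starting at position i that does not occur entirely before i.
entropy_lz_vec <- function(x, i) {
  n <- length(x)
  starts <- seq_len(i - 1)   # candidate start positions in the past
  len <- 0
  repeat {
    len <- len + 1
    pos <- i + len - 1       # element that extends the candidate sequence
    if (pos > n) break       # ran off the end of the series
    # a match of length len must end at or before position i - 1
    starts <- starts[starts + len - 1 <= i - 1]
    # keep only the starts whose len-th element agrees with x[pos]
    starts <- starts[x[starts + len - 1] == x[pos]]
    if (length(starts) == 0) break
  }
  len
}

# Small made-up example: at i = 3 both "a" and "a","b" occur earlier,
# while "a","b","c" does not.
entropy_lz_vec(c("a", "b", "a", "b", "c"), 3)
```

Because the surviving start positions are carried over from one length to the next, each step is a single vectorised comparison and no regular expression is ever compiled.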

David.

> Many thanks
>
> Lorenzo
>
> ######################################
>
> total_entropy_lz <- function(x){
>   if (length(x)==1){
>     print("sequence too short")
>     return("error")
>   } else{
>     n <- length(x)
>     prefactor <- 1/(n*log(n)/log(2))
>     n_seq <- seq(n)
>     entropy_list <- n_seq
>     for (i in n_seq){
>       entropy_list[i] <- entropy_lz(x,i)
>     }
>   }
>   total_entropy <- 1/(prefactor*sum(entropy_list))
>   return(total_entropy)
> }
>
> entropy_lz <- function(x,i){
>   past <- x[1:i-1]
>   n <- length(x)
>   lp <- length(past)
>   future <- x[i:n]
>   go_on <- 1
>   count_len <- 0
>   past_string <- paste(past, collapse="#")
>   while (go_on>0){
>     new_seq <- x[i:(i+count_len)]
>     fut_string <- paste(new_seq, collapse="#")
>     count_len <- count_len+1
>     if (grepl(fut_string,past_string)!=1){
>       go_on <- -1
>     }
>   }
>   return(count_len)
> }
>
> x <- scan("time_series25_.dat", what="")
>
> S <- total_entropy_lz(x)
>
> On 10/08/2010 07:30 PM, jim holtman wrote:
>> More specificity: how long is the string, what is the pattern you are
>> matching against?  It sounds like you might have a complex pattern
>> that in trying to match the string might be doing a lot of
>> backtracking and such.  There is an O'Reilly book, Mastering Regular
>> Expressions, that might help you understand what might be happening.
>> So if you can provide a better example than just the error message, it
>> would be helpful.
>>
>> On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isella <lorenzo.isella at gmail.com> wrote:
>>> Dear All,
>>> I am experiencing some problems with a script of mine.
>>> It crashes with this message
>>>
>>> Error in grepl(fut_string, past_string) :
>>>  invalid regular expression
>>> '12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
>>> Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
>>> In addition: Warning message:
>>> In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
>>> Execution halted
>>>
>>> To make a long story short, I use some functions which eventually call
>>> grepl on very long strings to check whether a certain substring is part
>>> of a longer string.
>>> Now, the script technically works (it never crashes when I run it on a
>>> smaller dataset) and the problem does not seem to be RAM memory (I have
>>> several GB of RAM on my machine and its consumption never shoots up so
>>> my machine never resorts to swap memory).
>>> So (though I am not an expert) it looks like the problem is some
>>> limitation of grepl or R memory management.
>>> Any idea about how I could tackle this problem or how I can profile my
>>> code to fix it (though it really seems to me that I have to find a way
>>> to allow R to process longer strings).
>>> Any suggestion is appreciated.
>>> Cheers
>>>
>>> Lorenzo
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>

David Winsemius, MD
West Hartford, CT