[R] Memory management in R
David Winsemius
dwinsemius at comcast.net
Sat Oct 9 01:30:45 CEST 2010
On Oct 8, 2010, at 6:42 PM, Lorenzo Isella wrote:
> Thanks for lending a helping hand.
> I put together a self-contained example. Basically, it all relies on
> a couple of functions, where one function simply iterates the
> application of the other function.
> I am trying to implement the so-called Lempel-Ziv entropy estimator.
> The idea is to choose a position i along a string x (standing for a
> time series) and find the length of the shortest string starting
> from i which has never occurred before i.
> Please find below the R snippet which requires an input file (a
> simple text file) you can download from
>
> http://dl.dropbox.com/u/5685598/time_series25_.dat
>
> What puzzles me is that the list is not really long (less than 2000
> entries) and I have not experienced the same problem even with
> longer lists.
But maybe your loop terminated in them earlier? Someplace between a
pattern length of 11*225 and 11*240 characters the grepping machine
gives up:
> eprs <- paste(rep("aaaaaaaaaa", 225), collapse="#")
> grepl(eprs, eprs)
[1] TRUE
> eprs <- paste(rep("aaaaaaaaaa", 240), collapse="#")
> grepl(eprs, eprs)
Error in grepl(eprs, eprs) :
invalid regular expression
'aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaa
In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'
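Since the pattern here contains no regex metacharacters, one possible way around the regcomp limit is to ask for a literal match with `fixed = TRUE`, which is a documented argument of grepl() and skips regular-expression compilation altogether. A minimal sketch, reusing the same 240-token string; the variable name `hit` is only for illustration, and whether this is fast enough inside the full loop is not tested here:

```r
# Build the same 240-token string that failed above.
eprs <- paste(rep("aaaaaaaaaa", 240), collapse = "#")

# fixed = TRUE requests a literal substring match, so the pattern
# is never compiled as a regular expression.
hit <- grepl(eprs, eprs, fixed = TRUE)
hit
```

This only sidesteps the compilation error; the cost of repeatedly pasting and searching long strings remains.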
The complexity of the problem may depend on the distribution of
values. You have a very skewed distribution, with the vast majority
of entries taking the same value that appeared in your error message:
> table(x)
x
 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
    1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
       1       30        5        1        9
And you have 1159 of them in one clump (which would seem to be
somewhat improbable under a random null hypothesis):
> max(rle(x)$lengths)
[1] 1159
> which(rle(x)$lengths == 1159)
[1] 123
> rle(x)$values[123]
[1] "12653a6"
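Note that the 123 above is the index of the run, not its position in x. For readers unfamiliar with rle(), here is a small sketch on made-up data (the vector `x_toy` and the other names are hypothetical, not taken from the posted file) showing how to recover the value, length, and starting position of the longest run:

```r
# Toy series (hypothetical data, not the poster's file).
x_toy <- c("a", "b", "b", "b", "a", "a", "c")

r <- rle(x_toy)                  # run values and run lengths
longest <- which.max(r$lengths)  # index of the first longest run

run_value  <- r$values[longest]   # value forming the longest run
run_length <- r$lengths[longest]  # length of that run

# Starting position of that run in the original vector:
# total length of all earlier runs, plus one.
run_start <- sum(r$lengths[seq_len(longest - 1)]) + 1
```

Here the longest run is three "b" values starting at position 2.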
HTH (although I think it means you need to construct a different
implementation strategy);
David.
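One possible different strategy, offered only as a sketch and not as a tested replacement: compare the vector elements directly instead of pasting them into strings and calling grepl. The function names `occurs_in` and `entropy_lz_vec` are made up for illustration. The sketch assumes x contains no NA values, follows the posted entropy_lz in searching only x[1:(i-1)], and is not tuned for speed. A side effect of comparing whole elements is that a token can no longer match the tail of a longer token, which substring matching on the pasted string does allow.

```r
# Does `needle` occur as a contiguous block inside `haystack`?
# Both are vectors; comparison is element-wise, no regex involved.
occurs_in <- function(needle, haystack) {
  k <- length(needle)
  m <- length(haystack)
  if (k == 0L) return(TRUE)
  if (k > m) return(FALSE)
  # Candidate start positions: where the first element matches.
  starts <- which(haystack[seq_len(m - k + 1L)] == needle[1L])
  for (s in starts) {
    if (all(haystack[s:(s + k - 1L)] == needle)) return(TRUE)
  }
  FALSE
}

# Length of the shortest block starting at i that does not occur
# entirely within x[1:(i-1)]; the search stops at the end of x.
entropy_lz_vec <- function(x, i) {
  n <- length(x)
  past <- x[seq_len(i - 1L)]
  len <- 1L
  while (i + len - 1L <= n &&
         occurs_in(x[i:(i + len - 1L)], past)) {
    len <- len + 1L
  }
  len
}

# Small usage example on made-up data.
x_demo <- c("a", "b", "a", "b", "c")
sapply(seq_along(x_demo), function(i) entropy_lz_vec(x_demo, i))
```

On the demo vector, position 3 gives 3: "a" and "a","b" both occur earlier, while "a","b","c" does not.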
> Many thanks
>
> Lorenzo
>
> ######################################
>
> total_entropy_lz <- function(x){
>   if (length(x)==1){
>     print("sequence too short")
>     return("error")
>   } else{
>     n <- length(x)
>     prefactor <- 1/(n*log(n)/log(2))
>     n_seq <- seq(n)
>     entropy_list <- n_seq
>     for (i in n_seq){
>       entropy_list[i] <- entropy_lz(x,i)
>     }
>   }
>   total_entropy <- 1/(prefactor*sum(entropy_list))
>   return(total_entropy)
> }
>
> entropy_lz <- function(x,i){
>   # note: 1:i-1 parses as (1:i)-1, i.e. 0:(i-1); the 0 index is dropped
>   past <- x[1:i-1]
>   n <- length(x)
>   lp <- length(past)
>   future <- x[i:n]
>   go_on <- 1
>   count_len <- 0
>   past_string <- paste(past, collapse="#")
>   while (go_on>0){
>     # note: i+count_len can run past n, which yields NA elements
>     new_seq <- x[i:(i+count_len)]
>     fut_string <- paste(new_seq, collapse="#")
>     count_len <- count_len+1
>     if (grepl(fut_string,past_string)!=1){
>       go_on <- -1
>     }
>   }
>   return(count_len)
> }
>
> x <- scan("time_series25_.dat", what="")
>
> S <- total_entropy_lz(x)
>
> On 10/08/2010 07:30 PM, jim holtman wrote:
>> More specificity: how long is the string, what is the pattern you are
>> matching against? It sounds like you might have a complex pattern
>> that in trying to match the string might be doing a lot of
>> backtracking and such. There is an O'Reilly book, Mastering Regular
>> Expressions, that might help you understand what might be happening.
>> So if you can provide a better example than just the error message, it
>> would be helpful.
>>
>> On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isella <lorenzo.isella at gmail.com> wrote:
>>> Dear All,
>>> I am experiencing some problems with a script of mine.
>>> It crashes with this message
>>>
>>> Error in grepl(fut_string, past_string) :
>>> invalid regular expression
>>> '12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
>>> Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
>>> In addition: Warning message:
>>> In grepl(fut_string, past_string) : regcomp error: 'Out of memory'
>>> Execution halted
>>>
>>> To make a long story short, I use some functions which eventually call
>>> grepl on very long strings to check whether a certain substring is part
>>> of a longer string.
>>> Now, the script technically works (it never crashes when I run it on a
>>> smaller dataset) and the problem does not seem to be RAM memory (I have
>>> several GB of RAM on my machine and its consumption never shoots up so
>>> my machine never resorts to swap memory).
>>> So (though I am not an expert) it looks like the problem is some
>>> limitation of grepl or R memory management.
>>> Any idea about how I could tackle this problem or how I can profile my
>>> code to fix it (though it really seems to me that I have to find a way
>>> to allow R to process longer strings).
>>> Any suggestion is appreciated.
>>> Cheers
>>>
>>> Lorenzo
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>
David Winsemius, MD
West Hartford, CT