[R] Memory management in R

Sat Oct 9 22:23:03 CEST 2010

> My suggestion is to explore other alternatives. (I will admit that I
> don't yet fully understand the test that you are applying.)

Hi,
I am trying to partially implement the Lempel Ziv compression algorithm.
The point is that compressibility and entropy of a time series are 
related, hence my final goal is to evaluate the entropy of a time series.
You can find more at

http://bit.ly/93zX4T
http://en.wikipedia.org/wiki/LZ77_and_LZ78
http://bit.ly/9NgIFt

The two that
> have occurred to me are Biostrings which I have already mentioned and
> rle() which I have illustrated the use of but not referenced as an
> avenue. The Biostrings package is part of bioConductor (part of the R
> universe) although you should be prepared for a coffee break when you
> install it if you haven't gotten at least bioClite already installed.
> When I installed it last night it had 54 other package dependents also
> downloaded and installed. It seems to me that taking advantage of the
> coding resources in the molecular biology domain that are currently
> directed at decoding the information storage mechanism of life might be
> a smart strategy. You have not described the domain you are working in
> but I would guess that the "digest" package might be biological in
> primary application? So forgive me if I am preaching to the choir.
>
> The rle option also occurred to me but it might take a smarter coder
> than I to fully implement it. (But maybe Holtman would be up to it. He's
> a _lot_ smarter than I.) In your example the long "x" string is
> faithfully represented by two aligned vectors, each 197 characters in
> length. The long repeat sequence that broke the grepl mechanism are just
> one pair of values.
>  > rle(x)
> Run Length Encoding
> lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
> values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...
>
> So maybe as soon as you got to a bundle that was greater than 1/2 the
> overall length (as happened in the "x" case) you could stop, since it
> could not have "occurred before".
>

I doubt that rle() can be deployed to replace Lempel-Ziv (LZ) algorithm 
in a trivial way. As a less convoluted example, consider the series

x <- c("d","a","b","d","a","b","e","z")

If i=4 and therefore the i-th element is the second 'd' in the series, 
the shortest series starting from i=4 that I do not see in the past of 
'd' is

"d","a","b","e", whose length is equal to 4 and that is the value 
returned by the function below.
The frustrating thing is that I already have the tools I need, just they 
crash for reasons beyond my control on relatively short series.
If anyone can make the function below more robust, that is really a big 
help for me.
Cheers

Lorenzo

###########################################################
entropy_lz <- function(x,i){

past <- x[1:i-1]

n <- length(x)

lp <- length(past)

future <- x[i:n]

go_on <- 1

count_len <- 0

past_string <- paste(past, collapse="#")

while (go_on>0){

new_seq <- x[i:(i+count_len)]

fut_string <- paste(new_seq, collapse="#")

count_len <- count_len+1

if (grepl(fut_string,past_string)!=1){

go_on <- -1

}
}
return(count_len)

}

x <- c("c","a","b","c","a","b","e","z")

S <- entropy_lz(x,4)