[R] Memory management in R
Lorenzo Isella
lorenzo.isella at gmail.com
Sat Oct 9 22:23:03 CEST 2010
> My suggestion is to explore other alternatives. (I will admit that I
> don't yet fully understand the test that you are applying.)
Hi,
I am trying to partially implement the Lempel Ziv compression algorithm.
The point is that compressibility and entropy of a time series are
related, hence my final goal is to evaluate the entropy of a time series.
You can find more at
http://bit.ly/93zX4T
http://en.wikipedia.org/wiki/LZ77_and_LZ78
http://bit.ly/9NgIFt
The two that
> have occurred to me are Biostrings which I have already mentioned and
> rle() which I have illustrated the use of but not referenced as an
> avenue. The Biostrings package is part of bioConductor (part of the R
> universe) although you should be prepared for a coffee break when you
> install it if you haven't gotten at least bioClite already installed.
> When I installed it last night it had 54 other package dependents also
> downloaded and installed. It seems to me that taking advantage of the
> coding resources in the molecular biology domain that are currently
> directed at decoding the information storage mechanism of life might be
> a smart strategy. You have not described the domain you are working in
> but I would guess that the "digest" package might be biological in
> primary application? So forgive me if I am preaching to the choir.
>
> The rle option also occurred to me but it might take a smarter coder
> than I to fully implement it. (But maybe Holtman would be up to it. He's
> a _lot_ smarter than I.) In your example the long "x" string is
> faithfully represented by two aligned vectors, each 197 characters in
> length. The long repeat sequence that broke the grepl mechanism are just
> one pair of values.
> > rle(x)
> Run Length Encoding
> lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
> values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...
>
> So maybe as soon as you got to a bundle that was greater than 1/2 the
> overall length (as happened in the "x" case) you could stop, since it
> could not have "occurred before".
>
I doubt that rle() can be deployed to replace Lempel-Ziv (LZ) algorithm
in a trivial way. As a less convoluted example, consider the series
x <- c("d","a","b","d","a","b","e","z")
If i=4 and therefore the i-th element is the second 'd' in the series,
the shortest series starting from i=4 that I do not see in the past of
'd' is
"d","a","b","e", whose length is equal to 4 and that is the value
returned by the function below.
The frustrating thing is that I already have the tools I need, just they
crash for reasons beyond my control on relatively short series.
If anyone can make the function below more robust, that is really a big
help for me.
Cheers
Lorenzo
###########################################################
entropy_lz <- function(x,i){
past <- x[1:i-1]
n <- length(x)
lp <- length(past)
future <- x[i:n]
go_on <- 1
count_len <- 0
past_string <- paste(past, collapse="#")
while (go_on>0){
new_seq <- x[i:(i+count_len)]
fut_string <- paste(new_seq, collapse="#")
count_len <- count_len+1
if (grepl(fut_string,past_string)!=1){
go_on <- -1
}
}
return(count_len)
}
x <- c("c","a","b","c","a","b","e","z")
S <- entropy_lz(x,4)
More information about the R-help
mailing list