[R] Memory management in R
David Winsemius
dwinsemius at comcast.net
Sun Oct 10 01:57:33 CEST 2010
On Oct 9, 2010, at 4:23 PM, Lorenzo Isella wrote:
>
>> My suggestion is to explore other alternatives. (I will admit that I
>> don't yet fully understand the test that you are applying.)
>
> Hi,
> I am trying to partially implement the Lempel Ziv compression
> algorithm.
> The point is that compressibility and entropy of a time series are
> related, hence my final goal is to evaluate the entropy of a time
> series.
> You can find more at
>
> http://bit.ly/93zX4T
> http://en.wikipedia.org/wiki/LZ77_and_LZ78
> http://bit.ly/9NgIFt
>
>
>
>
> The two that
>> have occurred to me are Biostrings which I have already mentioned and
>> rle() which I have illustrated the use of but not referenced as an
>> avenue. The Biostrings package is part of bioConductor (part of the R
>> universe) although you should be prepared for a coffee break when you
>> install it if you haven't gotten at least bioClite already installed.
>> When I installed it last night it had 54 other package dependents
>> also
>> downloaded and installed. It seems to me that taking advantage of the
>> coding resources in the molecular biology domain that are currently
>> directed at decoding the information storage mechanism of life
>> might be
>> a smart strategy. You have not described the domain you are working
>> in
>> but I would guess that the "digest" package might be biological in
>> primary application? So forgive me if I am preaching to the choir.
>>
>> The rle option also occurred to me but it might take a smarter coder
>> than I to fully implement it. (But maybe Holtman would be up to it.
>> He's
>> a _lot_ smarter than I.) In your example the long "x" string is
>> faithfully represented by two aligned vectors, each 197 characters in
>> length. The long repeat sequence that broke the grepl mechanism are
>> just
>> one pair of values.
>> > rle(x)
>> Run Length Encoding
>> lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
>> values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...
>>
>> So maybe as soon as you got to a bundle that was greater than 1/2 the
>> overall length (as happened in the "x" case) you could stop, since it
>> could not have "occurred before".
>>
>
> I doubt that rle() can be deployed to replace Lempel-Ziv (LZ)
> algorithm in a trivial way. As a less convoluted example, consider
> the series
>
> x <- c("d","a","b","d","a","b","e","z")
>
> If i=4 and therefore the i-th element is the second 'd' in the
> series, the shortest series starting from i=4 that I do not see in
> the past of 'd' is
>
> "d","a","b","e", whose length is equal to 4 and that is the value
> returned by the function below.
> The frustrating thing is that I already have the tools I need, just
> they crash for reasons beyond my control on relatively short series.
> If anyone can make the function below more robust, that is really a
> big help for me.
I already offered the Biostrings package. It provides more robust
methods for string matching than does grepl. Is there a reason that
you choose not to?
--
David.
> Cheers
>
> Lorenzo
>
> ###########################################################
> entropy_lz <- function(x,i){
>
> past <- x[1:i-1]
>
> n <- length(x)
>
> lp <- length(past)
>
> future <- x[i:n]
>
> go_on <- 1
>
> count_len <- 0
>
> past_string <- paste(past, collapse="#")
>
> while (go_on>0){
>
> new_seq <- x[i:(i+count_len)]
>
> fut_string <- paste(new_seq, collapse="#")
>
> count_len <- count_len+1
>
> if (grepl(fut_string,past_string)!=1){
>
> go_on <- -1
>
> }
> }
> return(count_len)
>
> }
>
> x <- c("c","a","b","c","a","b","e","z")
>
> S <- entropy_lz(x,4)
David Winsemius, MD
West Hartford, CT
More information about the R-help
mailing list