[R] Memory management in R

David Winsemius dwinsemius at comcast.net
Sun Oct 10 01:57:33 CEST 2010


On Oct 9, 2010, at 4:23 PM, Lorenzo Isella wrote:

>
>> My suggestion is to explore other alternatives. (I will admit that I
>> don't yet fully understand the test that you are applying.)
>
> Hi,
> I am trying to partially implement the Lempel-Ziv compression
> algorithm. The point is that the compressibility and the entropy of a
> time series are related, hence my final goal is to evaluate the
> entropy of a time series.
> You can find more at
>
> http://bit.ly/93zX4T
> http://en.wikipedia.org/wiki/LZ77_and_LZ78
> http://bit.ly/9NgIFt
>
>> The two that have occurred to me are Biostrings, which I have already
>> mentioned, and rle(), which I have illustrated the use of but not
>> referenced as an avenue. The Biostrings package is part of
>> Bioconductor (part of the R universe), although you should be prepared
>> for a coffee break when you install it if you haven't already
>> installed at least biocLite. When I installed it last night, 54 other
>> packages were also downloaded and installed as dependencies. It seems
>> to me that taking advantage of the coding resources in the molecular
>> biology domain that are currently directed at decoding the information
>> storage mechanism of life might be a smart strategy. You have not
>> described the domain you are working in, but I would guess that the
>> "digest" package might be biological in primary application? So
>> forgive me if I am preaching to the choir.
>>
>> The rle option also occurred to me, but it might take a smarter coder
>> than I to fully implement it. (But maybe Holtman would be up to it.
>> He's a _lot_ smarter than I.) In your example the long "x" string is
>> faithfully represented by two aligned vectors, each of length 197.
>> The long repeat sequence that broke the grepl mechanism is just one
>> pair of values.
>> > rle(x)
>> Run Length Encoding
>> lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
>> values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...
>>
>> So maybe as soon as you got to a bundle that was greater than 1/2 the
>> overall length (as happened in the "x" case) you could stop, since it
>> could not have "occurred before".
>>
>
> I doubt that rle() can be deployed to replace the Lempel-Ziv (LZ)
> algorithm in a trivial way. As a less convoluted example, consider
> the series
>
> x <- c("d","a","b","d","a","b","e","z")
>
> If i=4, so that the i-th element is the second 'd' in the series, the
> shortest sequence starting at i=4 that does not occur in the past of
> that 'd' is
>
> "d","a","b","e", whose length is 4, and that is the value returned by
> the function below.
> The frustrating thing is that I already have the tools I need, but
> they crash for reasons beyond my control on relatively short series.
> If anyone can make the function below more robust, that would be a
> really big help for me.

I already offered the Biostrings package. It provides more robust
methods for string matching than grepl does. Is there a reason that
you chose not to use it?
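
If staying in base R matters, one thing that may be worth trying first:
by default grepl() treats its pattern as a regular expression, while
fixed = TRUE makes it match the pattern as a literal string. Below is a
minimal sketch using three of the hash values from the rle() output
quoted above. The variable names are only illustrative, and whether
literal matching avoids the failure seen on the long series has not
been checked here.

```r
# Tokens taken from the rle(x) output quoted above.
past_string <- paste(c("5d64d58a", "ac76183b", "202fbcc4"), collapse = "#")
fut_string  <- paste(c("ac76183b", "202fbcc4"), collapse = "#")

# Literal substring search: the pattern is not compiled as a regex.
found_literal <- grepl(fut_string, past_string, fixed = TRUE)

# Why the distinction matters: "." is a wildcard in a regular
# expression but an ordinary character when fixed = TRUE.
dot_as_regex   <- grepl("a.c", "abc")
dot_as_literal <- grepl("a.c", "abc", fixed = TRUE)
```

Note that with "#" as separator a literal match can still start in the
middle of a token when tokens differ in length; with equal-length hash
values that should not arise.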

-- 
David.
> Cheers
>
> Lorenzo
>
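> ###########################################################
> # Length of the shortest sequence starting at position i of x
> # that does not occur in the elements before position i.
> entropy_lz <- function(x, i) {
>   # x[1:i-1] parses as x[(1:i) - 1]; seq_len(i - 1) states the intent.
>   past <- x[seq_len(i - 1)]
>   past_string <- paste(past, collapse = "#")
>   count_len <- 0
>   go_on <- TRUE
>   while (go_on) {
>     # Candidate sequence of length count_len + 1 starting at i.
>     new_seq <- x[i:(i + count_len)]
>     fut_string <- paste(new_seq, collapse = "#")
>     count_len <- count_len + 1
>     # Stop at the first candidate that is not found in the past.
>     if (!grepl(fut_string, past_string)) {
>       go_on <- FALSE
>     }
>   }
>   return(count_len)
> }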
>
> x <- c("d","a","b","d","a","b","e","z")
>
> S <- entropy_lz(x,4)
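
One way the function above might be made less dependent on grepl() is
to compare windows of the vector directly instead of pasting everything
into one long string. What follows is only a sketch under stated
assumptions: the names occurs_in and entropy_lz_vec are hypothetical,
x is assumed to contain no NA values, and the double loop is written
for clarity rather than speed.

```r
# Hypothetical helper: does `pattern` occur as a contiguous run in `past`?
occurs_in <- function(pattern, past) {
  k <- length(pattern)
  if (k > length(past)) return(FALSE)
  for (s in seq_len(length(past) - k + 1)) {
    # Compare one window of `past` element by element with `pattern`.
    if (isTRUE(all(past[s:(s + k - 1)] == pattern))) return(TRUE)
  }
  FALSE
}

# Hypothetical variant of entropy_lz that never builds a search string.
entropy_lz_vec <- function(x, i) {
  n <- length(x)
  stopifnot(i >= 1, i <= n)
  past <- x[seq_len(i - 1)]   # elements strictly before position i
  len <- 1
  # Grow the candidate while it fits inside x and is found in the past.
  while (i + len - 1 <= n && occurs_in(x[i:(i + len - 1)], past)) {
    len <- len + 1
  }
  len
}
```

For the eight-element example series, entropy_lz_vec(x, 4) returns 4.
Because whole elements are compared, no separator character is needed
and no regular expression is involved.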

David Winsemius, MD
West Hartford, CT


