[Rd] HOW TO AVOID LOOPS

Sun Apr 13 05:45:18 CEST 2008

Taking Hadley's clue, I guess

x * unlist(lapply(rle(x)$lengths, seq_len))

is faster than my previous suggestion (which Dan had inspired in the
first place). 

For 

> A vector of the following format:
> (0,0,1,0,0,0,3,0,0,0,2,0,1,0,0,0,0,0,6)

one might

> z <- numeric(length(x))
> r <- rle(x)
> i <- r$values==1
> z[cumsum(r$lengths)[i]] <- r$lengths[i]
> z
 [1] 0 0 1 0 0 0 3 0 0 0 2 0 1 0 0 0 0 0 0 6

but since 'rle' keeps coming up, perhaps you're really wanting to know
basic things about run-length encoding, such as the length of each run
of 1's

> r$lengths[r$values==1]
[1] 1 3 2 1 6

? If space and time are an issue, you might also consider reprsenting
your data as 'raw' to save space

> x <- rbinom(1000000, 1, .5)
> object.size(x)
[1] 8000040
> system.time(x*unlist(lapply(rle(x)$lengths, seq_len)), gcFirst=TRUE)
   user  system elapsed 
  1.800   0.000   1.798 

> x <- as.raw(x)
> head(x)
[1] 01 01 01 01 00 01
> object.size(x)
[1] 1000040
> system.time((x==1)*unlist(lapply(rle(x)$lengths, seq_len)), gcFirst=TRUE)
   user  system elapsed 
  1.730   0.000   1.730 

(the timings are quite variable; perhaps they're about equal?)

Martin

"carlos martinez" <martinezbula at earthlink.net> writes:

> Appreciate the ingenious and effective suggestions and feedback from:
>
> Dan Davison
> Vincent Goulet
> Martin Morgan
> Hadley Wickham
>
> The variety of technical approaches proposes so far are clear prove of the
> strong and flexible capabilites of the R system, and specially the dynamics
> and technical understanding of the R user base.
>
> We tested all four recommendations with an input vector of more than 850000
> components, and got time-responses from about 40-second to 20-seconds.
>
> All four approches produced the desired vector. The Wickham's approach
> produced and extra vector, but the second vector included the correct
> format.
>
> Just one additional follow up, to obtain from the same input vector:
> c(0,0,1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1,1)
>
> A vector of the following format:
> (0,0,1,0,0,0,3,0,0,0,2,0,1,0,0,0,0,0,6)
>
> Will be easier and more efficient to start from the original input vector,
> or start from the above second vector
> (0,0,1,0,1,2,3,0,0,1,2,0,1,0,1,2,3,4,5,6)
>
> Thanks for your responses.
>
> -------------------------------------------------------------------------
> Hadley Wickham Approach
>
> How about:
>
> unlist(lapply(split(x, cumsum(x == 0)), seq_along)) - 1
>
> Hadley
> --------------------------------------------------------------------------
> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org] 
> Sent: Saturday, April 12, 2008 5:00 PM
> To: Dan Davison
> Cc: martinezbula at earthlink.net
> Subject: Re: [Rd] HOW TO AVOID LOOPS
>
> (anonymous 'off-list' response; some extra calcs but tidy)
>
>> x=c(0,0,1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1,1)
>> x * unlist(lapply(rle(x)$lengths, seq))
>  [1] 0 0 1 0 1 2 3 0 0 1 2 0 1 0 1 2 3 4 5 6
>
>
> Dan Davison <davison at stats.ox.ac.uk> writes:
>
>> On Sat, Apr 12, 2008 at 06:45:00PM +0100, Dan Davison wrote:
>>> On Sat, Apr 12, 2008 at 01:30:13PM -0400, Vincent Goulet wrote:
>>> > Le sam. 12 avr. à 12:47, carlos martinez a écrit :
>>> > >> Looking for a simple, effective a minimum execution time solution.
>>> > >>
>>> > >> For a vector as:
>>> > >>
>>> > >> c(0,0,1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1,1)
>>> > >>
>>> > > To transform it to the following vector without using any loops:
>>> > >
>>> > >> (0,0,1,0,1,2,3,0,0,1,2,0,1,0,1,2,3,4,5,6)
>>> > >>
>>> > > Appreciate any suggetions.
>>> > 
>>> > This does it -- but it is admittedly ugly:
>>> > 
>>> >  > x <- c(0,0,1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1,1)
>>> >  > ind <- which(x == 0)
>>> >  > unlist(lapply(mapply(seq, ind, c(tail(ind, -1) - 1, length(x))),
>>> > function(y) cumsum(x[y])))
>>> >   [1] 0 0 1 0 1 2 3 0 0 1 2 0 1 0 1 2 3 4 5 6
>>> > 
>>> > (The mapply() part is used to create the indexes of each sequence 
>>> > in x starting with a 0. The rest is then straightforward.)
>>> 
>>> 
>>> Here's my effort. Maybe a bit easier to digest? Only one *apply so
> probably more efficient.
>>> 
>>> function(x=c(0,0,1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1,1)) {
>>>     d <- diff(c(0,x,0))
>>>     starts <- which(d == 1)
>>>     ends <- which(d == -1)
>>>     x[x == 1] <- unlist(lapply(ends - starts, function(n) 1:n))
>>>     x
>>> }
>>> 
>>
>> Come to think of it, I suggest using the existing R function rle(), rather
> than my dodgy substitute.
>>
>> e.g.
>>
>> g <- function(x=c(0,0,1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1,1)) {
>>
>>     runs <- rle(x)
>>     runlengths <- runs$lengths[runs$values == 1]
>>     x[x == 1] <- unlist(lapply(runlengths, function(n) 1:n))
>>     x
>> }
>>
>> Dan
>>
>> p.s. R-help would perhaps have been more appropriate than R-devel
>>
>>
>>> Dan
>>> 
>>> 
>>> > 
>>> > HTH
>>> > 
>>> > ---
>>> >    Vincent Goulet, Associate Professor
>>> >    École d'actuariat
>>> >    Université Laval, Québec
>>> >    Vincent.Goulet at act.ulaval.ca   http://vgoulet.act.ulaval.ca
>>> > 
>>> > ______________________________________________
>>> > R-devel at r-project.org mailing list
>>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview
> Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793
>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793