[R] algorithm help

Fri Jan 7 09:58:59 CET 2011

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/06/2011 11:57 PM, (Ted Harding) wrote:
> On 06-Jan-11 22:16:38, array chip wrote:
>> Hi, I am seeking help on designing an algorithm to identify the
>> locations of stretches of 1s in a vector of 0s and 1s. Below is
>> an simple example:
>>
>>> dat<-as.data.frame(cbind(a=c(F,F,T,T,T,T,F,F,T,T,F,T,T,T,T,F,F,F,F,T)
>>   ,b=c(4,12,13,16,18,20,28,30,34,46,47,49,61,73,77,84,87,90,95,97)))
>>
>>> dat
>>    a  b
>> 1  0  4
>> 2  0 12
>> 3  1 13
>> 4  1 16
>> 5  1 18
>> 6  1 20
>> 7  0 28
>> 8  0 30
>> 9  1 34
>> 10 1 46
>> 11 0 47
>> 12 1 49
>> 13 1 61
>> 14 1 73
>> 15 1 77
>> 16 0 84
>> 17 0 87
>> 18 0 90
>> 19 0 95
>> 20 1 97
>>
>> In this dataset, "b" is sorted and denotes the location for each
>> number in "a". 
>> So I would like to find the starting & ending locations for each
>> stretch of 1s within "a", also counting the number of 1s in each
>> stretch as well.
>> Hope the results from the algorithm would be:
>>
>> stretch   start   end   No.of.1s
>> 1         13      20    4
>> 2         34      46    2
>> 3         49      77    4
>> 4         97      97    1
>>
>> I can imagine using for loops can do the job, but I feel it's not a
>> clever way to do this. Is there an efficient algorithm that can do
>> this fast?
>>
>> Thanks for any suggestions.
>> John
> 
> The basic information you need can be got using rle() ("run length
> encoding"). See '?rle'. In your example:
> 
>   rle(dat$a)
>   # Run Length Encoding
>   #   lengths: int [1:8] 2 4 2 2 1 4 4 1
>   #   values : num [1:8] 0 1 0 1 0 1 0 1
>   ## Note: F -> 0, T -> 1
> 
> The following has a somewhat twisted logic at the end, and may
> be flawed, but you can probably adapt it!
> 
>   L <- rle(dat$a)$lengths
>   V <- rle(dat$a)$values
>   pos <- c(1,cumsum(L))
>   V1 <- c(-1,V)
>   1+pos[V1==0]
>   # [1]  3  9 12 20
>   ## Positions in the series dat$a where each run of "T" (i.e. 1)
>   ##   starts

A different approach would be to use the diff() function:

Where

> diff(dat$a)
 [1]  0  1  0  0  0 -1  0  1  0 -1  1  0  0  0 -1  0  0  0  1

is not equal 0, the value is changing from 0 to 1 or one to 0.
The indices of the first new value can be found by:

> which(diff(dat$a)!=0) + 1
[1]  3  7  9 11 12 16 20

where it is changing from 0 to 1 is at

> which(diff(dat$a)==1) + 1
[1]  3  9 12 20

where it is changing from 1 to 0 is at

> which(diff(dat$a)==-1) + 1
[1]  7 11 16

By taking into consideration if the first value and the last values are
0 or 1, you can calculate the length.

Cheers,

Rainer

> 
> Hoping this helps,
> Ted.
> 
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <ted.harding at wlandres.net>
> Fax-to-email: +44 (0)870 094 0861
> Date: 06-Jan-11                                       Time: 22:57:44
> ------------------------------ XFMail ------------------------------
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

- -- 
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Natural Sciences Building
Office Suite 2039
Stellenbosch University
Main Campus, Merriman Avenue
Stellenbosch
South Africa

Tel:        +33 - (0)9 53 10 27 44
Cell:       +27 - (0)8 39 47 90 42
Fax (SA):   +27 - (0)8 65 16 27 82
Fax (D) :   +49 - (0)3 21 21 25 22 44
Fax (FR):   +33 - (0)9 58 10 27 44
email:      Rainer at krugs.de

Skype:      RMkrug
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0m1dMACgkQoYgNqgF2egoQbACcCB3iFQ6SKYfL4KVX8AMAN9Gp
1awAn0Z+8KXnOmwCLu61gihc8xZIT++j
=O+xA
-----END PGP SIGNATURE-----