[BioC] extracting regions of consecutive values from dataframe

Fri May 30 13:47:47 CEST 2008

On Fri, May 30, 2008 at 6:35 AM, Niels Høgslund <nj at birc.au.dk> wrote:
> Hi,
>
> I have a lot of data frames looking like this (SNP chromosome position and a
> local state ID):
>
>     Position  State
> 1    3088998  0
> 2    4215064  6
> 3    5034491  6
> 4    5211912  6
> 5    5697261  6
> 6    5809727  0
> 7    6818872  NA
> 8    6867391  0
> 9    7346904  1
> 10   7347824  1
> 11   7358232  1
> 12   7833686  1
> 13   8295795  0
> 14  10755448  0
> 15  10919778  NA
> 16  11217061  3
> 17  12463350  3
> 18  13678626  0
> 19  13892992  0
> 20  13965452  0
> 21  13969222  0
> ........
>
> Now, I want to collapse or summarize consecutive occurrences of a state into
> a region with a start+end position,
> i.e. something like this:
>
>     Position  State
> 2    4215064  6
> 5    5697261  6
> 9    7346904  1
> 12   7833686  1
> 16  11217061  3
> 17  12463350  3
>
> Can anyone help me with this?

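The rle() function (run-length encoding) is one way to do this: it
reports the value and length of each run of consecutive identical
values.  You will need to write a little wrapper function to turn those
runs into start and end positions, but rle() should get you going.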
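
A minimal sketch of such a wrapper follows. The function name
collapse_runs, the drop_states argument and the output column names are
hypothetical, chosen only for illustration. It assumes the rows are
already sorted by Position and that State 0 and NA are background to be
dropped, which is what the desired output in the question suggests. It
returns one row per region rather than the two-rows-per-region layout
shown in the question.

```r
# Hypothetical helper: collapse runs of identical State into regions.
# Assumes df is sorted by Position; State 0 and NA are treated as
# background (an assumption read off the example in the question).
collapse_runs <- function(df, drop_states = 0) {
  runs   <- rle(df$State)            # each NA forms its own run of length 1
  ends   <- cumsum(runs$lengths)     # row index of the last SNP in each run
  starts <- ends - runs$lengths + 1  # row index of the first SNP in each run
  keep   <- !is.na(runs$values) & !(runs$values %in% drop_states)
  data.frame(
    State = runs$values[keep],
    Start = df$Position[starts[keep]],
    End   = df$Position[ends[keep]],
    nSNP  = runs$lengths[keep]
  )
}

# The 21 rows quoted in the question
snps <- data.frame(
  Position = c(3088998, 4215064, 5034491, 5211912, 5697261, 5809727,
               6818872, 6867391, 7346904, 7347824, 7358232, 7833686,
               8295795, 10755448, 10919778, 11217061, 12463350,
               13678626, 13892992, 13965452, 13969222),
  State    = c(0, 6, 6, 6, 6, 0, NA, 0, 1, 1, 1, 1, 0, 0, NA, 3, 3,
               0, 0, 0, 0)
)

regions <- collapse_runs(snps)
print(regions)
```

On the quoted data this gives three regions: state 6 from 4215064 to
5697261, state 1 from 7346904 to 7833686, and state 3 from 11217061 to
12463350.  Because rle() never merges NA values into a run, an NA
between two identical states splits them into separate regions.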

Sean