[BioC] extracting regions of consecutive values from dataframe
Herve Pages
hpages at fhcrc.org
Sat May 31 02:51:48 CEST 2008
Hi Niels,
You can do this:
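  # Small example with the same layout as your data
  df0 <- data.frame(
      Position=c(2, 5, 8, 9, 15, 17, 20, 21, 24, 25),
      State=as.character(c(0, 6, 6, 6, 1, 1, 0, 3, 3, 2))
  )
  # Group the positions by state, then take the min and max of each group
  x <- split(df0$Position, df0$State)
  df1 <- data.frame(start=sapply(x, min), end=sapply(x, max), State=names(x))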
Now 'df1' contains one row per state, with the smallest ('start') and
largest ('end') position at which this state occurs:
  > df1
    start end State
  0     2  20     0
  1    15  17     1
  2    25  25     2
  3    21  24     3
  6     5   9     6
Note that state 0 seems to be special in your data because the positions at
which it occurs are interlaced with the positions at which other states occur,
so its range (2 to 20 here) overlaps the ranges of states 6 and 1.
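If separate runs of the same state should stay separate regions, one
possible alternative is run-length encoding with rle(). What follows is only
a minimal sketch under that assumption: the names 'runs', 'ends', 'starts'
and 'df2' are made up for illustration, and dropping state 0 is a guess
based on the desired output in your mail.

```r
# Same example data; stringsAsFactors=FALSE keeps State as a character vector
df0 <- data.frame(
    Position=c(2, 5, 8, 9, 15, 17, 20, 21, 24, 25),
    State=as.character(c(0, 6, 6, 6, 1, 1, 0, 3, 3, 2)),
    stringsAsFactors=FALSE
)

# rle() describes State as runs of identical consecutive values
runs <- rle(df0$State)
ends <- cumsum(runs$lengths)         # row index of the last element of each run
starts <- ends - runs$lengths + 1    # row index of the first element of each run

# One row per run: first and last position, the state, and the run length
df2 <- data.frame(
    start=df0$Position[starts],
    end=df0$Position[ends],
    State=runs$values,
    n=runs$lengths,
    stringsAsFactors=FALSE
)

# Assumption: runs of state 0 (and NA states) are not wanted as regions
df2 <- df2[!is.na(df2$State) & df2$State != "0", ]
```

For the example data this leaves four regions: 5 to 9 (state 6), 15 to 17
(state 1), 21 to 24 (state 3) and 25 to 25 (state 2). rle() puts each NA in
a run of its own, so an NA row splits a region rather than being skipped.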
Cheers,
H.
Niels Høgslund wrote:
> Hi,
>
> I have a lot of data frames looking like this (SNP chromosome position
> and a local state ID):
>
>    Position State
> 1   3088998     0
> 2   4215064     6
> 3   5034491     6
> 4   5211912     6
> 5   5697261     6
> 6   5809727     0
> 7   6818872    NA
> 8   6867391     0
> 9   7346904     1
> 10  7347824     1
> 11  7358232     1
> 12  7833686     1
> 13  8295795     0
> 14 10755448     0
> 15 10919778    NA
> 16 11217061     3
> 17 12463350     3
> 18 13678626     0
> 19 13892992     0
> 20 13965452     0
> 21 13969222     0
> ........
>
> Now, I want to collapse or summarize consecutive occurrences of a state
> into a region with a start+end position,
> i.e. something like this:
>
>    Position State
> 2   4215064     6
> 5   5697261     6
> 9   7346904     1
> 12  7833686     1
> 16 11217061     3
> 17 12463350     3
>
> Can anyone help me with this?
>
> Thanks in advance.....
>
>
>
> Niels Høgslund
> BiRC -Bioinformatics Research Center
> Høegh-Guldbergs Gade 10
> DK-8000 Århus C
> Denmark
> phone: +45 89423100
> mail: nj at birc.au.dk
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>