[BioC] extracting regions of consecutive values from dataframe
Vincent Carey 525-2265
stvjc at channing.harvard.edu
Fri May 30 13:59:22 CEST 2008
On Fri, 30 May 2008, Sean Davis wrote:
> > On Fri, May 30, 2008 at 6:35 AM, Niels Høgslund <nj at birc.au.dk> wrote:
> > Hi,
> >
> > I have a lot of data frames looking like this (SNP chromosome position and a
> > local state ID):
> >
> > Position State
> > 1 3088998 0
> > 2 4215064 6
> > 3 5034491 6
> > 4 5211912 6
> > 5 5697261 6
> > 6 5809727 0
> > 7 6818872 NA
> > 8 6867391 0
> > 9 7346904 1
> > 10 7347824 1
> > 11 7358232 1
> > 12 7833686 1
> > 13 8295795 0
> > 14 10755448 0
> > 15 10919778 NA
> > 16 11217061 3
> > 17 12463350 3
> > 18 13678626 0
> > 19 13892992 0
> > 20 13965452 0
> > 21 13969222 0
> > ........
> >
> > Now, I want to collapse or summarize consecutive occurences of a state into
> > a region with a start+end position,
> > i.e. something like this:
> >
> > Position State
> > 2 4215064 6
> > 5 5697261 6
> > 9 73469041 1
> > 12 7833686 1
> > 16 11217061 3
> > 17 12463350 3
> >
> > Can anyone help me with this?
>
> The rle() function is one way to do this. You will need to write a
> little wrapper function to do exactly what you want, but rle() should
> get you going.
>
> Sean
>
Indeed. It seems that the combination of rle and split
will do the job -- but split reorders the data, so we
have a subroutine split.preserveord in s2reg below. The
following dputs the example and the code with the
print of the result. There must be a better way.
> dput(y)
structure(list(Position = c(3088998L, 4215064L, 5034491L, 5211912L,
5697261L, 5809727L, 6867391L, 7346904L, 7347824L, 7358232L, 7833686L,
8295795L, 10755448L, 11217061L, 12463350L, 13678626L, 13892992L,
13965452L, 13969222L), State = c(0L, 6L, 6L, 6L, 6L, 0L, 0L,
1L, 1L, 1L, 1L, 0L, 0L, 3L, 3L, 0L, 0L, 0L, 0L)), .Names = c("Position",
"State"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 9L, 10L, 11L,
12L, 13L, 14L, 15L, 17L, 18L, 19L, 20L, 21L, 22L), class = "data.frame", na.action = structure(c(7L,
8L, 16L), .Names = c("7", "8", "16"), class = "omit"))
> dput(s2reg)
function (df)
{
# assumes df[,1] is position, df[,2] is state
# and contiguous rows sharing value of state are to be grouped
# first a tweak of split()
split.preserveord = function(x, disc) {
tmpn <- unique(disc)
ansscr <- split(x, disc)
ord <- match(tmpn, names(ansscr))
ansscr[ord]
}
# now get rle of state
rr = rle(df[, 2])
# etc.
tags = make.unique(as.character(rr$value))
ftags = rep(tags, rr$len)
sdf = split.preserveord(df, ftags)
ra = sapply(sdf, function(x) range(x[, 1]))
rownames(ra) = c("start", "end")
cbind(t(ra), state = rr$val)
}
> s2reg(y)
start end state
0 3088998 3088998 0
6 4215064 5697261 6
0.1 5809727 6867391 0
1 7346904 7833686 1
0.2 8295795 10755448 0
3 11217061 12463350 3
0.3 13678626 13969222 0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
-------------- next part --------------
The information transmitted in this electronic communica...{{dropped:18}}
More information about the Bioconductor
mailing list