[BioC] extracting regions of consecutive values from dataframe

Fri May 30 13:59:22 CEST 2008

On Fri, 30 May 2008, Sean Davis wrote:

> > On Fri, May 30, 2008 at 6:35 AM, Niels HÃ¸gslund <nj at birc.au.dk> wrote:
> > Hi,
> >
> > I have a lot of data frames looking like this (SNP chromosome position and a
> > local state ID):
> >
> >        Position        State
> > 1       3088998 0
> > 2       4215064 6
> > 3       5034491 6
> > 4       5211912 6
> > 5       5697261 6
> > 6       5809727 0
> > 7       6818872 NA
> > 8       6867391 0
> > 9       7346904 1
> > 10      7347824 1
> > 11      7358232 1
> > 12      7833686 1
> > 13      8295795 0
> > 14      10755448        0
> > 15      10919778        NA
> > 16      11217061        3
> > 17      12463350        3
> > 18      13678626        0
> > 19      13892992        0
> > 20      13965452        0
> > 21      13969222        0
> > ........
> >
> > Now, I want to collapse or summarize consecutive occurences of a state into
> > a region with a start+end position,
> > i.e. something like this:
> >
> >        Position        State
> > 2       4215064 6
> > 5       5697261 6
> > 9       73469041        1
> > 12      7833686 1
> > 16      11217061        3
> > 17      12463350        3
> >
> > Can anyone help me with this?
>
> The rle() function is one way to do this.  You will need to write a
> little wrapper function to do exactly what you want, but rle() should
> get you going.
>
> Sean
>

Indeed.  It seems that the combination of rle and split
will do the job -- but split reorders the data, so we
have a subroutine split.preserveord in s2reg below.  The
following dputs the example and the code with the
print of the result.  There must be a better way.

> dput(y)
structure(list(Position = c(3088998L, 4215064L, 5034491L, 5211912L,
5697261L, 5809727L, 6867391L, 7346904L, 7347824L, 7358232L, 7833686L,
8295795L, 10755448L, 11217061L, 12463350L, 13678626L, 13892992L,
13965452L, 13969222L), State = c(0L, 6L, 6L, 6L, 6L, 0L, 0L,
1L, 1L, 1L, 1L, 0L, 0L, 3L, 3L, 0L, 0L, 0L, 0L)), .Names = c("Position",
"State"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 9L, 10L, 11L,
12L, 13L, 14L, 15L, 17L, 18L, 19L, 20L, 21L, 22L), class = "data.frame", na.action = structure(c(7L,
8L, 16L), .Names = c("7", "8", "16"), class = "omit"))
> dput(s2reg)
function (df)
{
# assumes df[,1] is position, df[,2] is state
# and contiguous rows sharing value of state are to be grouped
# first a tweak of split()
    split.preserveord = function(x, disc) {
        tmpn <- unique(disc)
        ansscr <- split(x, disc)
        ord <- match(tmpn, names(ansscr))
        ansscr[ord]
    }
# now get rle of state
    rr = rle(df[, 2])
# etc.
    tags = make.unique(as.character(rr$value))
    ftags = rep(tags, rr$len)
    sdf = split.preserveord(df, ftags)
    ra = sapply(sdf, function(x) range(x[, 1]))
    rownames(ra) = c("start", "end")
    cbind(t(ra), state = rr$val)
}
> s2reg(y)
       start      end state
0    3088998  3088998     0
6    4215064  5697261     6
0.1  5809727  6867391     0
1    7346904  7833686     1
0.2  8295795 10755448     0
3   11217061 12463350     3
0.3 13678626 13969222     0

>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
-------------- next part --------------

The information transmitted in this electronic communica...{{dropped:18}}