[R] 'Record' row values every time the binary value in a column changes

Dennis Murphy djmuser at gmail.com
Wed Apr 20 19:39:27 CEST 2011


Hi:

Here are a couple more options using packages plyr and data.table. The
labels in the second part are changed because they didn't make sense
in a 2M line file (well, mine may not either, but it's a start). You
can always change them to something more pertinent.

# Question 1:
Table <- data.frame(binary, chromosome = Chromosome, start)

library(plyr)
(df <- ddply(Table, .(chromosome, binary), summarise,
             position_start = min(start),
             position_end = max(start)))
  chromosome binary position_start position_end
1          1      0             20           36
2          1      1             12           18
3          2      0             17           19
4          2      1             12           16

library(data.table)
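# Pass the key and grouping columns as a character vector; a single
# 'a, b' string with a space after the comma can fail to match the names.
dTable <- data.table(Table, key = c('chromosome', 'binary'))
(dt <- dTable[, list(position_start = min(start),
                     position_end = max(start)),
              by = c('chromosome', 'binary')])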
     chromosome binary position_start position_end
[1,]          1      0             20           36
[2,]          1      1             12           18
[3,]          2      0             17           19
[4,]          2      1             12           16
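
One caveat, as far as I can tell: grouping on (chromosome, binary) gives
one row per combination, so if the same binary value recurs in several
separate runs within a chromosome (quite possible in a 2M line file), those
runs would be merged into a single block. Below is a minimal base R sketch
that numbers the runs first and then summarises per run. The names change,
block and blocks are mine, not from the original post, and the sketch
assumes the rows are already sorted by chromosome and start.

```r
binary     <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0)
Chromosome <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
start      <- c(12, 17, 18, 20, 25, 36, 12, 15, 16, 17, 19)
Table <- data.frame(chromosome = Chromosome, start, binary)

# A new block begins on the first row and whenever the chromosome or the
# binary value differs from the previous row.
change <- c(TRUE, diff(Table$chromosome) != 0 | diff(Table$binary) != 0)
block  <- cumsum(change)   # run id: 1, 2, 3, ... in row order

# One summary row per run (assumption: rows sorted by chromosome, start).
blocks <- data.frame(
  chromosome     = as.vector(tapply(Table$chromosome, block, function(x) x[1])),
  position_start = as.vector(tapply(Table$start, block, min)),
  position_end   = as.vector(tapply(Table$start, block, max)),
  binary         = as.vector(tapply(Table$binary, block, function(x) x[1]))
)
blocks
```

On the example data this gives the same four blocks, listed in the order
they occur in the file rather than sorted by binary.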

# Question 2:

For plyr, it's easy to write a function that takes a generic input data frame
(in this case, a single line) and then outputs a data frame with
positions and labels.

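tfun <- function(df) {
     # df is one row of the block summary; expand it to one row per position
     n <- with(df, position_end - position_start + 1)
     position <- with(df, seq(position_start, position_end))
     # letters has only 26 entries, so blocks longer than 26 get labels ending in NA
     value <- paste(df$chromosome, df$binary, letters[seq_len(n)], sep = '.')
     data.frame(chromosome = df$chromosome, position, value, binary = df$binary)
}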

# Then:

> ddply(df, .(chromosome, binary), tfun)
   chromosome position value binary
1           1       20 1.0.a      0
2           1       21 1.0.b      0
3           1       22 1.0.c      0
4           1       23 1.0.d      0
5           1       24 1.0.e      0
6           1       25 1.0.f      0
7           1       26 1.0.g      0
8           1       27 1.0.h      0
9           1       28 1.0.i      0
10          1       29 1.0.j      0
11          1       30 1.0.k      0
12          1       31 1.0.l      0
13          1       32 1.0.m      0
14          1       33 1.0.n      0
15          1       34 1.0.o      0
16          1       35 1.0.p      0
17          1       36 1.0.q      0
18          1       12 1.1.a      1
19          1       13 1.1.b      1
20          1       14 1.1.c      1
21          1       15 1.1.d      1
22          1       16 1.1.e      1
23          1       17 1.1.f      1
24          1       18 1.1.g      1
25          2       17 2.0.a      0
26          2       18 2.0.b      0
27          2       19 2.0.c      0
28          2       12 2.1.a      1
29          2       13 2.1.b      1
30          2       14 2.1.c      1
31          2       15 2.1.d      1
32          2       16 2.1.e      1

# For data.table, one can apply the internals of tfun directly:

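# Repeating chromosome and binary inside list() duplicates the grouping
# columns in the result (chromosome.1 and binary.1 in the output below).
dt[, list(chromosome = chromosome,
          position = seq(position_start, position_end),
          value = paste(chromosome, binary,
                        letters[1:(position_end - position_start + 1)],
                        sep = '.'),
          binary = binary),
   by = c('chromosome', 'binary')]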
   chromosome binary chromosome.1 position value binary.1
            1      0            1       20 1.0.a        0
            1      0            1       21 1.0.b        0
            1      0            1       22 1.0.c        0
            1      0            1       23 1.0.d        0
            1      0            1       24 1.0.e        0
            1      0            1       25 1.0.f        0
            1      0            1       26 1.0.g        0
            1      0            1       27 1.0.h        0
            1      0            1       28 1.0.i        0
            1      0            1       29 1.0.j        0
            1      0            1       30 1.0.k        0
            1      0            1       31 1.0.l        0
            1      0            1       32 1.0.m        0
            1      0            1       33 1.0.n        0
            1      0            1       34 1.0.o        0
            1      0            1       35 1.0.p        0
            1      0            1       36 1.0.q        0
            1      1            1       12 1.1.a        1
            1      1            1       13 1.1.b        1
            1      1            1       14 1.1.c        1
            1      1            1       15 1.1.d        1
            1      1            1       16 1.1.e        1
            1      1            1       17 1.1.f        1
            1      1            1       18 1.1.g        1
            2      0            2       17 2.0.a        0
            2      0            2       18 2.0.b        0
            2      0            2       19 2.0.c        0
            2      1            2       12 2.1.a        1
            2      1            2       13 2.1.b        1
            2      1            2       14 2.1.c        1
            2      1            2       15 2.1.d        1
            2      1            2       16 2.1.e        1
cn chromosome binary   chromosome position value   binary
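
If the second data set already exists, another option is to look each of
its rows up against the block summary instead of expanding every block.
Here is a rough sketch, assuming a hypothetical second data frame called
second with columns chromosome and position (the values below are made up)
and a hand-typed copy of the summary called blocks. The helper lookup is
also my own name. Positions that fall between blocks get NA. The row-by-row
mapply() is only for clarity and would likely be slow on 2M rows.

```r
# Hand-typed copy of the block summary from Question 1.
blocks <- data.frame(chromosome     = c(1, 1, 2, 2),
                     binary         = c(0, 1, 0, 1),
                     position_start = c(20, 12, 17, 12),
                     position_end   = c(36, 18, 19, 16))

# Hypothetical second data set (illustrative values only).
second <- data.frame(chromosome = c(1, 1, 1, 2, 2),
                     position   = c(13, 19, 25, 14, 18))

# Return the binary value of the block containing (chr, pos), or NA if
# the position lies outside every block on that chromosome.
lookup <- function(chr, pos) {
  hit <- blocks$chromosome == chr &
         blocks$position_start <= pos &
         pos <= blocks$position_end
  if (any(hit)) blocks$binary[which(hit)[1]] else NA_real_
}

second$binary <- mapply(lookup, second$chromosome, second$position)
second
```

Position 19 on chromosome 1 lies between the blocks 12-18 and 20-36, so
it comes back as NA; whether such gaps should instead inherit a
neighbouring block's value depends on what the data mean.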

HTH,
Dennis

On Wed, Apr 20, 2011 at 2:01 AM, baboon2010 <nielsvanderaa at live.be> wrote:
> My question is twofold.
>
> Part 1:
> My data looks like this:
>
> (example set, real data has 2*10^6 rows)
> binary<-c(1,1,1,0,0,0,1,1,1,0,0)
> Chromosome<-c(1,1,1,1,1,1,2,2,2,2,2)
> start<-c(12,17,18,20,25,36,12,15,16,17,19)
> Table<-cbind(Chromosome,start,binary)
>      Chromosome start binary
>  [1,]          1    12      1
>  [2,]          1    17      1
>  [3,]          1    18      1
>  [4,]          1    20      0
>  [5,]          1    25      0
>  [6,]          1    36      0
>  [7,]          2    12      1
>  [8,]          2    15      1
>  [9,]          2    16      1
> [10,]          2    17      0
> [11,]          2    19      0
>
> As output I need a shortlist for each binary block, giving me the starting
> and ending position of each block.
> For this example, it would look like this:
>     Chromosome2 position_start position_end binary2
> [1,]           1             12           18       1
> [2,]           1             20           36       0
> [3,]           2             12           16       1
> [4,]           2             17           19       0
>
> Part 2:
> Based on the output of part 1, I need to assign the binary to rows of
> another data set. If the position value in this second data set falls in one
> of the blocks defined in the shortlist made in part 1, the binary value of the
> shortlist should be assigned to an extra column for this row. This would
> look something like this:
>     Chromosome3 position Value binary3
>  [1,] "1"         "12"     "a"   "1"
>  [2,] "1"         "13"     "b"   "1"
>  [3,] "1"         "14"     "c"   "1"
>  [4,] "1"         "15"     "d"   "1"
>  [5,] "1"         "16"     "e"   "1"
>  [6,] "1"         "18"     "f"   "1"
>  [7,] "1"         "20"     "g"   "0"
>  [8,] "1"         "21"     "h"   "0"
>  [9,] "1"         "22"     "i"   "0"
> [10,] "1"         "23"     "j"   "0"
> [11,] "1"         "25"     "k"   "0"
> [12,] "1"         "35"     "l"   "0"
> [13,] "2"         "12"     "m"   "1"
> [14,] "2"         "13"     "n"   "1"
> [15,] "2"         "14"     "o"   "1"
> [16,] "2"         "15"     "p"   "1"
> [17,] "2"         "16"     "q"   "1"
> [18,] "2"         "17"     "s"   "0"
> [19,] "2"         "18"     "d"   "0"
> [20,] "2"         "19"     "f"   "0"
>
>
> Many thanks in advance,
>
> Niels
>