[R] 'Record' row values every time the binary value in a column changes
Dennis Murphy
djmuser at gmail.com
Wed Apr 20 19:39:27 CEST 2011
Hi:
Here are a couple more options using packages plyr and data.table. The
labels in the second part are changed because they didn't make sense
in a 2M line file (well, mine may not either, but it's a start). You
can always change them to something more pertinent.
## Question 1:
Table <- data.frame(binary, chromosome = Chromosome, start)
library(plyr)
(df <- ddply(Table, .(chromosome, binary), summarise,
             position_start = min(start),
             position_end = max(start)))
  chromosome binary position_start position_end
1          1      0             20           36
2          1      1             12           18
3          2      0             17           19
4          2      1             12           16
library(data.table)
# key and by are given as character vectors of column names
dTable <- data.table(Table, key = c('chromosome', 'binary'))
(dt <- dTable[, list(position_start = min(start),
                     position_end = max(start)),
              by = c('chromosome', 'binary')])
     chromosome binary position_start position_end
[1,]          1      0             20           36
[2,]          1      1             12           18
[3,]          2      0             17           19
[4,]          2      1             12           16
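
Both summaries above return one row per (chromosome, binary) pair. That matches the requested blocks only when each binary value forms a single contiguous run within a chromosome, as it does in the example data. If the value can switch back and forth within a chromosome, something run-based is needed. The following is a minimal sketch in base R, not code from the thread: block_ranges is a hypothetical helper name, and it assumes at least one row and that rows are already sorted by position within each chromosome.

```r
# Example data from the original question
binary     <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0)
Chromosome <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
start      <- c(12, 17, 18, 20, 25, 36, 12, 15, 16, 17, 19)
Table <- data.frame(chromosome = Chromosome, start, binary)

# Hypothetical helper (not from the thread): one output row per run of
# consecutive rows sharing the same chromosome and binary value.
# Assumes nrow(d) >= 1 and rows sorted by position within chromosome.
block_ranges <- function(d) {
  n <- nrow(d)
  # TRUE where a new block begins: the first row, or any row whose
  # chromosome or binary value differs from the row before it
  new_block <- c(TRUE,
                 d$chromosome[-1] != d$chromosome[-n] |
                 d$binary[-1]     != d$binary[-n])
  first <- which(new_block)          # first row of each block
  last  <- c(first[-1] - 1L, n)      # last row of each block
  data.frame(chromosome     = d$chromosome[first],
             position_start = d$start[first],
             position_end   = d$start[last],
             binary         = d$binary[first])
}

blocks <- block_ranges(Table)
print(blocks)
```

On the example data this gives the four blocks listed in the original question, in file order (12-18, 20-36, 12-16, 17-19). The comparison is vectorised, so it should scale to a few million rows without an explicit loop.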
## Question 2:
For plyr, it's easy to write a function that takes a generic input data frame
(in this case, a single row of df, one per block) and then outputs a data
frame with positions and labels.
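tfun <- function(df) {
    # n is the number of positions in the block (named n to avoid masking base::diff)
    n <- with(df, position_end - position_start + 1)
    position <- with(df, seq(position_start, position_end))
    # labels use letters, so blocks longer than 26 positions get labels ending in NA
    value <- paste(df$chromosome, df$binary, letters[seq_len(n)], sep = '.')
    data.frame(chromosome = df$chromosome, position, value, binary = df$binary)
}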
# Then:
> ddply(df, .(chromosome, binary), tfun)
   chromosome position value binary
1           1       20 1.0.a      0
2           1       21 1.0.b      0
3           1       22 1.0.c      0
4           1       23 1.0.d      0
5           1       24 1.0.e      0
6           1       25 1.0.f      0
7           1       26 1.0.g      0
8           1       27 1.0.h      0
9           1       28 1.0.i      0
10          1       29 1.0.j      0
11          1       30 1.0.k      0
12          1       31 1.0.l      0
13          1       32 1.0.m      0
14          1       33 1.0.n      0
15          1       34 1.0.o      0
16          1       35 1.0.p      0
17          1       36 1.0.q      0
18          1       12 1.1.a      1
19          1       13 1.1.b      1
20          1       14 1.1.c      1
21          1       15 1.1.d      1
22          1       16 1.1.e      1
23          1       17 1.1.f      1
24          1       18 1.1.g      1
25          2       17 2.0.a      0
26          2       18 2.0.b      0
27          2       19 2.0.c      0
28          2       12 2.1.a      1
29          2       13 2.1.b      1
30          2       14 2.1.c      1
31          2       15 2.1.d      1
32          2       16 2.1.e      1
# For data.table, one can apply the internals of tfun directly:
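# chromosome and binary appear both in j and in by, which is why the
# output below carries the extra chromosome.1 and binary.1 columns
dt[, list(chromosome = chromosome,
          position = seq(position_start, position_end),
          value = paste(chromosome, binary,
                        letters[1:(position_end - position_start + 1)],
                        sep = '.'),
          binary = binary),
   by = c('chromosome', 'binary')]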
chromosome binary chromosome.1 position value binary.1
         1      0            1       20 1.0.a        0
         1      0            1       21 1.0.b        0
         1      0            1       22 1.0.c        0
         1      0            1       23 1.0.d        0
         1      0            1       24 1.0.e        0
         1      0            1       25 1.0.f        0
         1      0            1       26 1.0.g        0
         1      0            1       27 1.0.h        0
         1      0            1       28 1.0.i        0
         1      0            1       29 1.0.j        0
         1      0            1       30 1.0.k        0
         1      0            1       31 1.0.l        0
         1      0            1       32 1.0.m        0
         1      0            1       33 1.0.n        0
         1      0            1       34 1.0.o        0
         1      0            1       35 1.0.p        0
         1      0            1       36 1.0.q        0
         1      1            1       12 1.1.a        1
         1      1            1       13 1.1.b        1
         1      1            1       14 1.1.c        1
         1      1            1       15 1.1.d        1
         1      1            1       16 1.1.e        1
         1      1            1       17 1.1.f        1
         1      1            1       18 1.1.g        1
         2      0            2       17 2.0.a        0
         2      0            2       18 2.0.b        0
         2      0            2       19 2.0.c        0
         2      1            2       12 2.1.a        1
         2      1            2       13 2.1.b        1
         2      1            2       14 2.1.c        1
         2      1            2       15 2.1.d        1
         2      1            2       16 2.1.e        1
cn chromosome binary chromosome position value binary
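
The expansions above enumerate every position inside each block. Part 2 of the question instead starts from an existing second data set and asks which block each of its rows falls into. One possible way to do that lookup directly is sketched below in base R with findInterval(). This is only an illustration under stated assumptions: assign_binary and the small second data frame are hypothetical (the rows are modelled on the question's example), blocks are assumed not to overlap within a chromosome, and positions that fall between blocks are given NA.

```r
# Block summary, as in the expected Part 1 output of the question
blocks <- data.frame(chromosome     = c(1, 1, 2, 2),
                     position_start = c(12, 20, 12, 17),
                     position_end   = c(18, 36, 16, 19),
                     binary         = c(1, 0, 1, 0))

# Hypothetical second data set (a few rows modelled on the question)
second <- data.frame(chromosome = c(1, 1, 1, 2, 2),
                     position   = c(13, 19, 25, 16, 18),
                     value      = c("b", "x", "k", "q", "d"))

# Hypothetical helper: add a binary column to `second` by looking up the
# block that contains each position; NA when no block contains it.
assign_binary <- function(second, blocks) {
  out <- rep(NA_real_, nrow(second))
  for (chr in unique(second$chromosome)) {
    b <- blocks[blocks$chromosome == chr, ]
    if (nrow(b) == 0L) next
    b <- b[order(b$position_start), ]   # findInterval needs sorted breaks
    rows <- which(second$chromosome == chr)
    pos  <- second$position[rows]
    # index of the last block starting at or before pos (0 if there is none)
    idx <- findInterval(pos, b$position_start)
    # the position is inside that block only if it does not exceed its end
    inside <- idx > 0L & pos <= b$position_end[pmax(idx, 1L)]
    out[rows[inside]] <- b$binary[idx[inside]]
  }
  second$binary <- out
  second
}

result <- assign_binary(second, blocks)
print(result)
```

Here position 19 on chromosome 1 lies in the gap between the blocks 12-18 and 20-36, so it gets NA; the other rows pick up the binary value of their block. The loop runs once per chromosome, not once per row, which keeps it practical for large files.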
HTH,
Dennis
On Wed, Apr 20, 2011 at 2:01 AM, baboon2010 <nielsvanderaa at live.be> wrote:
> My question is twofold.
>
> Part 1:
> My data looks like this:
>
> (example set, real data has 2*10^6 rows)
> binary<-c(1,1,1,0,0,0,1,1,1,0,0)
> Chromosome<-c(1,1,1,1,1,1,2,2,2,2,2)
> start<-c(12,17,18,20,25,36,12,15,16,17,19)
> Table<-cbind(Chromosome,start,binary)
>       Chromosome start binary
>  [1,]          1    12      1
>  [2,]          1    17      1
>  [3,]          1    18      1
>  [4,]          1    20      0
>  [5,]          1    25      0
>  [6,]          1    36      0
>  [7,]          2    12      1
>  [8,]          2    15      1
>  [9,]          2    16      1
> [10,]          2    17      0
> [11,]          2    19      0
>
> As output I need a shortlist for each binary block: giving me the starting
> and ending position of each block.
> For this example, that would look like this:
>      Chromosome2 position_start position_end binary2
> [1,]           1             12           18       1
> [2,]           1             20           36       0
> [3,]           2             12           16       1
> [4,]           2             17           19       0
>
> Part 2:
> Based on the output of part 1, I need to assign the binary to rows of
> another data set. If the position value in this second data set falls in one
> of the blocks defined in the shortlist made in part 1, the binary value of the
> shortlist should be assigned to an extra column for this row. This would
> look something like this:
>       Chromosome3 position Value binary3
>  [1,] "1"         "12"     "a"   "1"
>  [2,] "1"         "13"     "b"   "1"
>  [3,] "1"         "14"     "c"   "1"
>  [4,] "1"         "15"     "d"   "1"
>  [5,] "1"         "16"     "e"   "1"
>  [6,] "1"         "18"     "f"   "1"
>  [7,] "1"         "20"     "g"   "0"
>  [8,] "1"         "21"     "h"   "0"
>  [9,] "1"         "22"     "i"   "0"
> [10,] "1"         "23"     "j"   "0"
> [11,] "1"         "25"     "k"   "0"
> [12,] "1"         "35"     "l"   "0"
> [13,] "2"         "12"     "m"   "1"
> [14,] "2"         "13"     "n"   "1"
> [15,] "2"         "14"     "o"   "1"
> [16,] "2"         "15"     "p"   "1"
> [17,] "2"         "16"     "q"   "1"
> [18,] "2"         "17"     "s"   "0"
> [19,] "2"         "18"     "d"   "0"
> [20,] "2"         "19"     "f"   "0"
>
>
> Many thanks in advance,
>
> Niels