[R] Conditional editing of rows in a data frame
David Winsemius
dwinsemius at comcast.net
Thu Jan 28 14:26:57 CET 2010
On Jan 28, 2010, at 7:05 AM, Irene Gallego Romero wrote:
> Dear R users,
>
> I have a dataframe (main.table) with ~30,000 rows and 6 columns, of
> which here are a few rows:
>
> id chr window gene xp.norm xp.top
> 129 1_32 1 32 TAS1R1 1.28882115 FALSE
> 130 1_32 1 32 ZBTB48 1.28882115 FALSE
> 131 1_32 1 32 KLHL21 1.28882115 FALSE
> 132 1_32 1 32 PHF13 1.28882115 FALSE
> 133 1_33 1 33 PHF13 1.02727430 FALSE
> 134 1_33 1 33 THAP3 1.02727430 FALSE
> 135 1_33 1 33 DNAJC11 1.02727430 FALSE
> 136 1_33 1 33 CAMTA1 1.02727430 FALSE
> 137 1_34 1 34 CAMTA1 1.40312732 TRUE
> 138 1_35 1 35 CAMTA1 1.52104538 FALSE
> 139 1_36 1 36 CAMTA1 1.04853732 FALSE
> 140 1_37 1 37 CAMTA1 0.64794094 FALSE
> 141 1_38 1 38 CAMTA1 1.23026086 TRUE
> 142 1_38 1 38 VAMP3 1.23026086 TRUE
> 143 1_38 1 38 PER3 1.23026086 TRUE
> 144 1_39 1 39 PER3 1.18154967 TRUE
> 145 1_39 1 39 UTS2 1.18154967 TRUE
> 146 1_39 1 39 TNFRSF9 1.18154967 TRUE
> 147 1_39 1 39 PARK7 1.18154967 TRUE
> 148 1_39 1 39 ERRFI1 1.18154967 TRUE
> 149 1_40 1 40 no_gene 1.79796879 FALSE
> 150 1_41 1 41 SLC45A1 0.20193560 FALSE
>
> I want to create two new columns, xp.bg and xp.n.top, using the
> following criteria:
>
> If gene is the same in consecutive rows, xp.bg is the minimum value of
> xp.norm in those rows; if gene is not the same, xp.bg is simply the
> value of xp.norm for that row;
Assuming that gene values are adjacent in a dataframe named df1, then
this would work:
df1$xp.bg<- with(df1, ave(xp.norm, gene, FUN=min))
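A minimal sketch of what ave() does here, on a small made-up data frame (the name toy and its values are illustrative only, not taken from the table above). Note that ave() groups by the value of gene wherever it occurs, which is why the adjacency assumption matters:

```r
# Toy data: hypothetical values chosen only to illustrate the grouping
toy <- data.frame(
  gene    = c("A", "B", "B", "B", "C"),
  xp.norm = c(1.2, 0.9, 0.4, 1.1, 2.0),
  stringsAsFactors = FALSE
)

# ave() returns a vector as long as its input; each element is replaced
# by FUN applied to all values sharing that element's grouping value.
toy$xp.bg <- with(toy, ave(xp.norm, gene, FUN = min))

print(toy)
```

The three "B" rows all receive the smallest xp.norm among them, while the single-row genes keep their own value.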
>
> Likewise, if there's a run of contiguous xp.top = TRUE values,
> xp.n.top is the minimum value in that range, and if xp.top is false or
> NA, xp.n.top is NA, or 0 (I don't care).
df1$seqgrp <- c(0, diff(df1$xp.top))
df1$seqgrp2 <- cumsum(df1$seqgrp != 0)
df1$xp.n.top <- with(df1, ave(xp.norm, seqgrp2, FUN=min))
is.na(df1$xp.n.top) <- !df1$xp.top
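The same steps can be traced on a small made-up example (toy, change and run.id are illustrative names, and the sketch assumes xp.top contains no NA values, since an NA would propagate through cumsum()):

```r
# Toy data: hypothetical values, with two separate runs of TRUE
toy <- data.frame(
  xp.norm = c(1.0, 1.4, 1.5, 1.2, 1.1, 1.8),
  xp.top  = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

# diff() treats the logicals as 0/1, so a non-zero difference marks a
# switch between FALSE and TRUE; the leading 0 keeps the length unchanged.
change <- c(0, diff(toy$xp.top))

# Counting the switches so far gives every contiguous run its own label.
run.id <- cumsum(change != 0)

# Minimum of xp.norm within each run, then blank out the FALSE runs.
toy$xp.n.top <- ave(toy$xp.norm, run.id, FUN = min)
is.na(toy$xp.n.top) <- !toy$xp.top

print(cbind(toy, change, run.id))
```

Rows 4 and 5 form one run of TRUE and share the smaller of their two xp.norm values; the isolated TRUE in row 2 keeps its own.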
> df1$xp.bg<- with(df1, ave(xp.norm, gene, FUN=min))
> df1
      id chr window    gene   xp.norm xp.top seqgrp seqgrp2 xp.n.top     xp.bg
129 1_32   1     32  TAS1R1 1.2888211  FALSE      0       0       NA 1.2888211
130 1_32   1     32  ZBTB48 1.2888211  FALSE      0       0       NA 1.2888211
131 1_32   1     32  KLHL21 1.2888211  FALSE      0       0       NA 1.2888211
132 1_32   1     32   PHF13 1.2888211  FALSE      0       0       NA 1.0272743
133 1_33   1     33   PHF13 1.0272743  FALSE      0       0       NA 1.0272743
134 1_33   1     33   THAP3 1.0272743  FALSE      0       0       NA 1.0272743
135 1_33   1     33 DNAJC11 1.0272743  FALSE      0       0       NA 1.0272743
136 1_33   1     33  CAMTA1 1.0272743  FALSE      0       0       NA 0.6479409
137 1_34   1     34  CAMTA1 1.4031273   TRUE      1       1 1.403127 0.6479409
138 1_35   1     35  CAMTA1 1.5210454  FALSE     -1       2       NA 0.6479409
139 1_36   1     36  CAMTA1 1.0485373  FALSE      0       2       NA 0.6479409
140 1_37   1     37  CAMTA1 0.6479409  FALSE      0       2       NA 0.6479409
141 1_38   1     38  CAMTA1 1.2302609   TRUE      1       3 1.181550 0.6479409
142 1_38   1     38   VAMP3 1.2302609   TRUE      0       3 1.181550 1.2302609
143 1_38   1     38    PER3 1.2302609   TRUE      0       3 1.181550 1.1815497
144 1_39   1     39    PER3 1.1815497   TRUE      0       3 1.181550 1.1815497
145 1_39   1     39    UTS2 1.1815497   TRUE      0       3 1.181550 1.1815497
146 1_39   1     39 TNFRSF9 1.1815497   TRUE      0       3 1.181550 1.1815497
147 1_39   1     39   PARK7 1.1815497   TRUE      0       3 1.181550 1.1815497
148 1_39   1     39  ERRFI1 1.1815497   TRUE      0       3 1.181550 1.1815497
149 1_40   1     40 no_gene 1.7979688  FALSE     -1       4       NA 1.7979688
150 1_41   1     41 SLC45A1 0.2019356  FALSE      0       4       NA 0.2019356
And if the adjacent-gene assumption of the first request above were
not met, then the first portion of this method could be used instead
to create group indices.
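One possible way to build such group indices for a character column is rle(), sketched below on made-up data. This is a swapped-in alternative to the diff()/cumsum() steps, not the code from the reply, and the names toy, runs and gene.run are illustrative. It assumes gene is a character vector (convert a factor with as.character() first):

```r
# Toy data: hypothetical values, with "no_gene" scattered between genes
toy <- data.frame(
  gene    = c("no_gene", "A", "A", "no_gene", "B", "no_gene"),
  xp.norm = c(1.8, 1.0, 0.6, 0.2, 1.3, 0.9),
  stringsAsFactors = FALSE
)

# rle() finds runs of identical consecutive values; repeating each run's
# position by its length labels every row with the run it belongs to.
runs <- rle(toy$gene)
gene.run <- rep(seq_along(runs$lengths), times = runs$lengths)

# Group on the run label, so non-adjacent "no_gene" rows stay separate.
toy$xp.bg <- ave(toy$xp.norm, gene.run, FUN = min)

print(cbind(toy, gene.run))
```

Grouping on gene directly would give every "no_gene" row the value 0.2; grouping on the run label leaves each isolated "no_gene" row with its own xp.norm.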
--
David.
>
> So, in the above example,
> xp.bg for rows 136:141 should be 0.64794094, and is equal to xp.norm
> for all other rows,
> xp.n.top for row 137 is 1.40312732, 1.18154967 for rows 141:148, and
> 0/NA for all other rows.
>
> Is there a way to combine indexing and if statements or some such to
> accomplish this? I want to do this without using split(main.table,
> main.table$gene), because there's about 20,000 unique entries for
> gene, and one of the entries, no_gene, is repeated throughout. I
> thought briefly of subsetting the rows where xp.top is TRUE, but I
> then don't know how to set the range for min, so that it only looks at
> what would originally have been consecutive rows, and searching the
> help has not proved particularly useful.
>
> Thanks in advance,
> Irene Gallego Romero
David Winsemius, MD
Heritage Laboratories
West Hartford, CT