[R] Conditional editing of rows in a data frame

Thu Jan 28 14:26:57 CET 2010

On Jan 28, 2010, at 7:05 AM, Irene Gallego Romero wrote:

> Dear R users,
>
> I have a dataframe (main.table) with ~30,000 rows and 6 columns, of
> which here are a few rows:
>
>      id chr window         gene     xp.norm    xp.top
> 129 1_32   1     32       TAS1R1  1.28882115     FALSE
> 130 1_32   1     32       ZBTB48  1.28882115     FALSE
> 131 1_32   1     32       KLHL21  1.28882115     FALSE
> 132 1_32   1     32        PHF13  1.28882115     FALSE
> 133 1_33   1     33        PHF13  1.02727430     FALSE
> 134 1_33   1     33        THAP3  1.02727430     FALSE
> 135 1_33   1     33      DNAJC11  1.02727430     FALSE
> 136 1_33   1     33       CAMTA1  1.02727430     FALSE
> 137 1_34   1     34       CAMTA1  1.40312732      TRUE
> 138 1_35   1     35       CAMTA1  1.52104538     FALSE
> 139 1_36   1     36       CAMTA1  1.04853732     FALSE
> 140 1_37   1     37       CAMTA1  0.64794094     FALSE
> 141 1_38   1     38       CAMTA1  1.23026086      TRUE
> 142 1_38   1     38        VAMP3  1.23026086      TRUE
> 143 1_38   1     38         PER3  1.23026086      TRUE
> 144 1_39   1     39         PER3  1.18154967      TRUE
> 145 1_39   1     39         UTS2  1.18154967      TRUE
> 146 1_39   1     39      TNFRSF9  1.18154967      TRUE
> 147 1_39   1     39        PARK7  1.18154967      TRUE
> 148 1_39   1     39       ERRFI1  1.18154967      TRUE
> 149 1_40   1     40      no_gene  1.79796879     FALSE
> 150 1_41   1     41      SLC45A1  0.20193560     FALSE
>
> I want to create two new columns, xp.bg and xp.n.top, using the
> following criteria:
>
> If gene is the same in consecutive rows, xp.bg is the minimum value of
> xp.norm in those rows; if gene is not the same, xp.bg is simply the
> value of xp.norm for that row;

Assuming that each gene occupies a single run of adjacent rows in a
data frame named df1, this would work:

df1$xp.bg <- with(df1, ave(xp.norm, gene, FUN = min))
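As a small illustration of what ave() does in that line, here is a sketch on made-up values (the data frame `toy` and its contents are hypothetical, not taken from main.table). ave() returns a vector the same length as its input, with each element replaced by the minimum over all rows that share the same gene, whether or not those rows are adjacent:

```r
# Toy data; names and values are illustrative only.
toy <- data.frame(
  gene    = c("A", "B", "B", "C", "B"),
  xp.norm = c(1.2, 0.9, 0.7, 1.5, 1.1),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each level of the grouping variable and
# returns a vector aligned with the original rows.
toy$xp.bg <- with(toy, ave(xp.norm, gene, FUN = min))
toy$xp.bg
```

Row 5 shares its gene with rows 2 and 3, so it is grouped with them even though row 4 lies in between; this is why the adjacency assumption matters.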

>
> Likewise, if there's a run of contiguous xp.top = TRUE values,
> xp.n.top is the minimum value in that range, and if xp.top is false or
> NA, xp.n.top is NA, or 0 (I don't care).

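# Mark where xp.top changes from the previous row (1, -1, or 0)
df1$seqgrp <- c(0, diff(df1$xp.top))
# Number the runs: the counter increases at every change
df1$seqgrp2 <- cumsum(df1$seqgrp != 0)
df1$xp.n.top <- with(df1, ave(xp.norm, seqgrp2, FUN = min))
# Set xp.n.top to NA wherever xp.top is FALSE
is.na(df1$xp.n.top) <- !df1$xp.top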

 > df1$xp.bg<- with(df1, ave(xp.norm, gene, FUN=min))
 > df1
       id chr window    gene   xp.norm xp.top seqgrp seqgrp2 xp.n.top     xp.bg
129 1_32   1     32  TAS1R1 1.2888211  FALSE      0       0       NA 1.2888211
130 1_32   1     32  ZBTB48 1.2888211  FALSE      0       0       NA 1.2888211
131 1_32   1     32  KLHL21 1.2888211  FALSE      0       0       NA 1.2888211
132 1_32   1     32   PHF13 1.2888211  FALSE      0       0       NA 1.0272743
133 1_33   1     33   PHF13 1.0272743  FALSE      0       0       NA 1.0272743
134 1_33   1     33   THAP3 1.0272743  FALSE      0       0       NA 1.0272743
135 1_33   1     33 DNAJC11 1.0272743  FALSE      0       0       NA 1.0272743
136 1_33   1     33  CAMTA1 1.0272743  FALSE      0       0       NA 0.6479409
137 1_34   1     34  CAMTA1 1.4031273   TRUE      1       1 1.403127 0.6479409
138 1_35   1     35  CAMTA1 1.5210454  FALSE     -1       2       NA 0.6479409
139 1_36   1     36  CAMTA1 1.0485373  FALSE      0       2       NA 0.6479409
140 1_37   1     37  CAMTA1 0.6479409  FALSE      0       2       NA 0.6479409
141 1_38   1     38  CAMTA1 1.2302609   TRUE      1       3 1.181550 0.6479409
142 1_38   1     38   VAMP3 1.2302609   TRUE      0       3 1.181550 1.2302609
143 1_38   1     38    PER3 1.2302609   TRUE      0       3 1.181550 1.1815497
144 1_39   1     39    PER3 1.1815497   TRUE      0       3 1.181550 1.1815497
145 1_39   1     39    UTS2 1.1815497   TRUE      0       3 1.181550 1.1815497
146 1_39   1     39 TNFRSF9 1.1815497   TRUE      0       3 1.181550 1.1815497
147 1_39   1     39   PARK7 1.1815497   TRUE      0       3 1.181550 1.1815497
148 1_39   1     39  ERRFI1 1.1815497   TRUE      0       3 1.181550 1.1815497
149 1_40   1     40 no_gene 1.7979688  FALSE     -1       4       NA 1.7979688
150 1_41   1     41 SLC45A1 0.2019356  FALSE      0       4       NA 0.2019356
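The question mentions that xp.top may also contain NA. diff() and cumsum() both propagate NA, so the run numbering would break from the first NA onward. A minimal sketch of one way around that, on made-up vectors (the values and the helper names `top` and `run.id` are hypothetical), assuming NA should be treated the same as FALSE:

```r
# Toy vectors; values are illustrative only.
xp.norm <- c(1.0, 1.4, 1.5, 0.6, 1.2, 1.1, 1.8)
xp.top  <- c(FALSE, TRUE, FALSE, NA, TRUE, TRUE, FALSE)

# Assumption: NA counts as FALSE, so diff() and cumsum() never see an NA.
top <- !is.na(xp.top) & xp.top

# A new run starts wherever the flag differs from the previous row.
run.id <- cumsum(c(0, diff(top)) != 0)

# Minimum of xp.norm within each run, then blank out the non-TRUE runs.
xp.n.top <- ave(xp.norm, run.id, FUN = min)
is.na(xp.n.top) <- !top
xp.n.top
```

Here row 2 forms a run of its own and rows 5 and 6 form another, so only those three positions keep a value.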

And if the adjacent-gene assumption of the first request above were
not met, then the first portion of this method (marking the changes
and taking their cumsum) could be used instead to create group
indices.
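A minimal sketch of that idea applied to gene, on made-up data (the data frame `toy` and the columns `gene.run` and `xp.bg` below are hypothetical): each row's gene is compared with the previous row's, and the cumulative sum of the changes gives one index per run of consecutive identical genes. Whether this matches the intended result on main.table is an assumption to check against the real data.

```r
# Toy data; names and values are illustrative only.
toy <- data.frame(
  gene    = c("A", "B", "B", "C", "B"),
  xp.norm = c(1.2, 0.9, 0.7, 1.5, 1.1),
  stringsAsFactors = FALSE
)

# as.character() guards against gene being stored as a factor.
g <- as.character(toy$gene)

# TRUE wherever the gene differs from the one on the previous row.
changed <- c(FALSE, g[-1] != g[-length(g)])

# One index per run of consecutive identical genes.
toy$gene.run <- cumsum(changed)

# Minimum within each run rather than within each gene.
toy$xp.bg <- with(toy, ave(xp.norm, gene.run, FUN = min))
toy
```

With this grouping the last row, which repeats an earlier gene after a gap, keeps its own xp.norm instead of inheriting the minimum from the earlier run.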

-- 
David.

>
> So, in the above example,
> xp.bg for rows 136:141 should be 0.64794094, and is equal to xp.norm
> for all other rows,
> xp.n.top for row 137 is 1.40312732, 1.18154967 for rows 141:148, and
> 0/NA for all other rows.
>
> Is there a way to combine indexing and if statements or some such to
> accomplish this? I want to do this without using split(main.table,
> main.table$gene), because there's about 20,000 unique entries for
> gene, and one of the entries, no_gene, is repeated throughout. I
> thought briefly of subsetting the rows where xp.top is TRUE, but I
> then don't know how to set the range for min, so that it only looks at
> what would originally have been consecutive rows, and searching the
> help has not proved particularly useful.
>
> Thanks in advance,
> Irene Gallego Romero

David Winsemius, MD
Heritage Laboratories
West Hartford, CT