[R] Conditional editing of rows in a data frame

Thu Jan 28 14:34:56 CET 2010

If DF is your data frame then:

DF$xp.bg <- ave(DF$xp.norm, DF$gene, FUN = min)

will create a new column in which each row holds the minimum xp.norm
over all rows with the same gene.  ave does use split internally, but
I think it is worth trying anyway since it's only one short line of
code.
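
To see what that line does, here is a minimal sketch on a toy data
frame. The values are made up for illustration and are not taken from
main.table; only the column names gene, xp.norm and xp.bg follow the
post.

```r
# Toy data frame; the values are invented for illustration only
DF <- data.frame(
  gene    = c("A", "A", "B", "A", "C"),
  xp.norm = c(1.2, 0.7, 1.5, 0.9, 0.3),
  stringsAsFactors = FALSE
)

# For every row, take the minimum xp.norm over all rows sharing its gene
DF$xp.bg <- ave(DF$xp.norm, DF$gene, FUN = min)

print(DF)
```

Note that the fourth row is grouped with the first two even though it
is not adjacent to them: ave groups by value, not by position.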

See help(ave)
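
If only consecutive rows should be grouped, as the question below
describes, one possible variation is to build a run identifier with
rle and pass that to ave instead of the gene column. This is only a
sketch under assumptions: run_id is a hypothetical helper name, the
toy values are invented, and it has not been tried on the real
30,000-row table.

```r
# Toy data frame; the values are invented for illustration only
DF <- data.frame(
  gene    = c("PHF13", "CAMTA1", "CAMTA1", "CAMTA1", "no_gene", "CAMTA1"),
  xp.norm = c(1.0, 1.4, 0.6, 1.2, 1.8, 0.9),
  xp.top  = c(FALSE, TRUE, TRUE, FALSE, NA, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: give each run of identical consecutive values
# its own integer id (1, 2, 3, ...)
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# xp.bg: minimum xp.norm within each run of consecutive equal genes
DF$xp.bg <- ave(DF$xp.norm, run_id(DF$gene), FUN = min)

# xp.n.top: minimum xp.norm within each run of consecutive TRUE values;
# rows where xp.top is FALSE or NA get NA
top <- !is.na(DF$xp.top) & DF$xp.top
DF$xp.n.top <- ifelse(top, ave(DF$xp.norm, run_id(top), FUN = min), NA)

print(DF)
```

Here the last CAMTA1 row keeps its own xp.norm because it is separated
from the earlier CAMTA1 run by the no_gene row.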

On Thu, Jan 28, 2010 at 7:05 AM, Irene Gallego Romero <ig247 at cam.ac.uk> wrote:
> Dear R users,
>
> I have a dataframe (main.table) with ~30,000 rows and 6 columns, of
> which here are a few rows:
>
>      id chr window         gene     xp.norm    xp.top
> 129 1_32   1     32       TAS1R1  1.28882115     FALSE
> 130 1_32   1     32       ZBTB48  1.28882115     FALSE
> 131 1_32   1     32       KLHL21  1.28882115     FALSE
> 132 1_32   1     32        PHF13  1.28882115     FALSE
> 133 1_33   1     33        PHF13  1.02727430     FALSE
> 134 1_33   1     33        THAP3  1.02727430     FALSE
> 135 1_33   1     33      DNAJC11  1.02727430     FALSE
> 136 1_33   1     33       CAMTA1  1.02727430     FALSE
> 137 1_34   1     34       CAMTA1  1.40312732      TRUE
> 138 1_35   1     35       CAMTA1  1.52104538     FALSE
> 139 1_36   1     36       CAMTA1  1.04853732     FALSE
> 140 1_37   1     37       CAMTA1  0.64794094     FALSE
> 141 1_38   1     38       CAMTA1  1.23026086      TRUE
> 142 1_38   1     38        VAMP3  1.23026086      TRUE
> 143 1_38   1     38         PER3  1.23026086      TRUE
> 144 1_39   1     39         PER3  1.18154967      TRUE
> 145 1_39   1     39         UTS2  1.18154967      TRUE
> 146 1_39   1     39      TNFRSF9  1.18154967      TRUE
> 147 1_39   1     39        PARK7  1.18154967      TRUE
> 148 1_39   1     39       ERRFI1  1.18154967      TRUE
> 149 1_40   1     40      no_gene  1.79796879     FALSE
> 150 1_41   1     41      SLC45A1  0.20193560     FALSE
>
> I want to create two new columns, xp.bg and xp.n.top, using the
> following criteria:
>
> If gene is the same in consecutive rows, xp.bg is the minimum value of
> xp.norm in those rows; if gene is not the same, xp.bg is simply the
> value of xp.norm for that row;
>
> Likewise, if there's a run of contiguous xp.top = TRUE values,
> xp.n.top is the minimum value in that range, and if xp.top is false or
> NA, xp.n.top is NA, or 0 (I don't care).
>
> So, in the above example,
> xp.bg for rows 136:141 should be 0.64794094, and is equal to xp.norm
> for all other rows,
> xp.n.top for row 137 is 1.40312732, 1.18154967 for rows 141:148, and
> 0/NA for all other rows.
>
> Is there a way to combine indexing and if statements or some such to
> accomplish this? I want to do this without using split(main.table,
> main.table$gene), because there's about 20,000 unique entries for
> gene, and one of the entries, no_gene, is repeated throughout. I
> thought briefly of subsetting the rows where xp.top is TRUE, but I
> then don't know how to set the range for min, so that it only looks at
> what would originally have been consecutive rows, and searching the
> help has not proved particularly useful.
>
> Thanks in advance,
> Irene Gallego Romero
>
>
> --
> Irene Gallego Romero
> Leverhulme Centre for Human Evolutionary Studies
> University of Cambridge
> Fitzwilliam St
> Cambridge
> CB1 3QH
> UK
> email: ig247 at cam.ac.uk