[R] Iteratively subsetting data by factor level across multiple variables

William Dunlap wdunlap at tibco.com
Thu Jan 15 22:46:20 CET 2015


There are lots of ways to do this; you have to decide how you want to
organize the results. Here are two ways that use only core R packages.
Many people like the plyr package for this
split-data/analyze-parts/combine-results sort of thing.

> df <- data.frame(x=1:27, response=log2(1:27),
+                  g1=rep(letters[1:2], len=27),
+                  g2=rep(LETTERS[24:26], c(10,10,7)))
> s <- split(seq_len(nrow(df)), df[c("g1","g2")])
> mean(subset(df, df$g1=="a" & df$g2=="Z")$response)
[1] 4.578656
> # the a.Z entry matches the previous result
> vapply(s, function(si) mean(df$response[si]), FUN.VALUE=0)
     a.X      b.X      a.Y      b.Y      a.Z      b.Z
1.976834 2.381378 3.880430 3.976834 4.578656 4.581611
> # regression example
> coef(lm(response ~ x, data=subset(df, df$g1=="a" & df$g2=="Z")))
(Intercept)           x
 3.12905040  0.06040022
> vapply(s, function(si) coef(lm(response ~ x, data=df[si,])),
+        FUN.VALUE=rep(0,2))
                  a.X       b.X        a.Y        b.Y        a.Z        b.Z
(Intercept) 0.0862735 0.6882213 2.40741927 2.50763309 3.12905040 3.13556268
x           0.3781121 0.2821928 0.09820075 0.09182506 0.06040022 0.06025202
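
The split above crosses g1 with g2, giving one group per combination of
levels. The original question instead asks for one subset per level of each
variable taken separately. A minimal sketch of that, using the data frame
from the question; the names vars, idx and means are mine, for illustration
only:

```r
# Data frame from the original question
df <- data.frame(variation = c("A", "B", "C", "D"),
                 element1  = as.factor(c(0, 1, 0, 1)),
                 element2  = as.factor(c(0, 0, 1, 1)),
                 response  = c(4, 2, 6, 2))

# Grouping variables to loop over (illustrative name)
vars <- c("element1", "element2")

# One list entry per variable; each entry holds one vector of
# row indices per level of that variable
idx <- lapply(df[vars], function(f) split(seq_len(nrow(df)), f))

# Mean response within each level of each variable
means <- lapply(idx, function(s)
    vapply(s, function(si) mean(df$response[si]), FUN.VALUE = 0))

# Flatten to a named vector with names of the form variable.level
unlist(means)
```

Adding a variable name to vars, or more levels to a factor, should need no
other change. Replacing mean() with another function of df[si, ] gives other
per-subset analyses.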


For the particular case of fitting the same regression to each part of a
partition of the data, you can use lm() once, which gives the same
coefficients organized in a different way:
> coef(lm(response ~ x * (g1:g2) - x - 1, data=df))
   g1a:g2X    g1b:g2X    g1a:g2Y    g1b:g2Y    g1a:g2Z    g1b:g2Z
0.08627350 0.68822126 2.40741927 2.50763309 3.12905040 3.13556268
 x:g1a:g2X  x:g1b:g2X  x:g1a:g2Y  x:g1b:g2Y  x:g1a:g2Z  x:g1b:g2Z
0.37811212 0.28219281 0.09820075 0.09182506 0.06040022 0.06025202
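
As a check on the claim above, the following sketch compares the single
lm() fit with the per-group fits, and also shows what I believe is the
analogous one-call form for plain group means (a cell-means model with no
intercept). The names per_group, pooled, cell_means and group_means are
mine, for illustration only:

```r
# Same example data as in the session above
df <- data.frame(x = 1:27, response = log2(1:27),
                 g1 = rep(letters[1:2], len = 27),
                 g2 = rep(LETTERS[24:26], c(10, 10, 7)))
s <- split(seq_len(nrow(df)), df[c("g1", "g2")])

# Separate regression within each g1/g2 combination
per_group <- vapply(s, function(si) coef(lm(response ~ x, data = df[si, ])),
                    FUN.VALUE = rep(0, 2))

# One fit with a separate intercept and slope for each combination
pooled <- coef(lm(response ~ x * (g1:g2) - x - 1, data = df))

# First six coefficients are the intercepts, last six the slopes
stopifnot(isTRUE(all.equal(unname(pooled[1:6]),
                           unname(per_group["(Intercept)", ]))))
stopifnot(isTRUE(all.equal(unname(pooled[7:12]),
                           unname(per_group["x", ]))))

# Group means from one lm() call, compared with the split/vapply means
cell_means  <- coef(lm(response ~ g1:g2 - 1, data = df))
group_means <- vapply(s, function(si) mean(df$response[si]), FUN.VALUE = 0)
stopifnot(isTRUE(all.equal(unname(cell_means), unname(group_means))))
```

Only the coefficients are compared here. Standard errors from the single fit
use a residual variance pooled over all groups, so they will generally
differ from those of the separate fits.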



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Jan 15, 2015 at 11:42 AM, Reid Bryant <reidbryant at gmail.com> wrote:

> Hi R experts!
>
> I would like to have a scripted solution that will iteratively subset data
> across many variables per factor level of each variable.
>
> To illustrate, if I create a dataframe (df) by:
>
> variation <- c("A","B","C","D")
> element1 <- as.factor(c(0,1,0,1))
> element2 <- as.factor(c(0,0,1,1))
> response <- c(4,2,6,2)
> df <- data.frame(variation,element1,element2,response)
>
> I would like a function that would allow me to subset the data into four
> groups and perform analysis across the groups.  One group for each of the
> two factor levels across two variables.  In this example it's fairly easy
> because I only have two variables with two levels each, but I would
> like this to be extendable across situations where I am dealing with more
> than 2 variables and/or more than two factor levels per variable.  I am
> looking for a result that will mimic the output of the following:
>
> element1_level0 <- subset(df,df$element1=="0")
> element1_level1 <- subset(df,df$element1=="1")
> element2_level0 <- subset(df,df$element2=="0")
> element2_level1 <- subset(df,df$element2=="1")
>
> The purpose would be to perform analysis on the df across each subset.
> Simplistically this could be represented as follows:
>
> mean(element1_level0$response)
> mean(element1_level1$response)
> mean(element2_level0$response)
> mean(element2_level1$response)
>
> Thanks,
> Reid
