[R] Calculating subsets "on the fly" with ddply

Wed Feb 3 23:13:15 CET 2010

Hi,

[I sent this to the plyr mailing list (late) last night, but it seems
to be lost in the moderation queue, so here's a shot to the broadeR
community]

Apologies in advance for being more verbose than necessary, but I'm
not even sure how to ask this question in the context of plyr, so ...
here goes.

As meaningless as this might be to do with the `iris` data, the spirit
of it is what I'm trying to do with some other data. I get close to
what I think I need, but it appears I'm running into some internal
evaluation/`substitute`/`parse`/scoping issues ... or my lack of
understanding of how this really should be done.

Essentially, I'd like to compute some summary statistics on grouped
subsets of data. So, for the iris data, let me try to take the mean of
Petal.Width on subsets of the data grouped by:

("some range" of Sepal.Length, and Species).

The "normal" ddply invocation would look like so:

R> library(plyr)
R> my <- ddply(iris, .(w=Sepal.Length < 5.5, Species), transform,
               grmean=mean(Petal.Width))
R> head(my)
      w Sepal.Length Sepal.Width Petal.Length Petal.Width    Species   grmean
1 FALSE          5.8         4.0          1.2         0.2     setosa 0.260000
2 FALSE          5.7         4.4          1.5         0.4     setosa 0.260000
3 FALSE          5.7         3.8          1.7         0.3     setosa 0.260000
4 FALSE          5.5         4.2          1.4         0.2     setosa 0.260000
5 FALSE          5.5         3.5          1.3         0.2     setosa 0.260000
6 FALSE          7.0         3.2          4.7         1.4 versicolor 1.347727

Although this appears to work, I'm not sure if the .(w= ...) is correct.

Is that how it should be done? Meaning, should we put "live"
expressions in the .variables parameter of ddply?

Moving on ... I actually want 5.5 to be passed in "on the fly". For
instance, this works:

R> val <- 5.5
R> my.2 <- ddply(iris, .(w=Sepal.Length < val, Species), transform,
                 grmean=mean(Petal.Width))
R> identical(my.2, my)
[1] TRUE

But what I really want is this to be part of some function that lets
me pick any value for `val` ... this doesn't work:

my.function <- function(df, my.val) {
  ddply(df, .(w=Sepal.Length < my.val, Species), transform,
        grmean=mean(Petal.Width))
}

R> my.function(iris, 5.5)
Error in eval(expr, envir, enclos) : object 'my.val' not found

I can work around this by editing `df` in my function to add a w
column first:

my.function2 <- function(df, my.val) {
 df$w <- df$Sepal.Length < my.val
 ddply(df, .(w, Species), transform, grmean=mean(Petal.Width))
}

R> my2 <- my.function2(iris, 5.5)
R> head(my2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species     w   grmean
1          5.8         4.0          1.2         0.2     setosa FALSE 0.260000
2          5.7         4.4          1.5         0.4     setosa FALSE 0.260000
3          5.7         3.8          1.7         0.3     setosa FALSE 0.260000
4          5.5         4.2          1.4         0.2     setosa FALSE 0.260000
5          5.5         3.5          1.3         0.2     setosa FALSE 0.260000
6          7.0         3.2          4.7         1.4 versicolor FALSE 1.347727
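
As a sanity check on the grmean values from that workaround, the same
per-group mean can be reproduced without plyr. This is only a minimal
sketch, assuming base R's `ave()`; the name `group_mean_base` is
hypothetical and not part of plyr. It also leaves the rows in their
original order, whereas ddply returns them ordered by group:

```r
# Hypothetical helper (not from plyr): same grouping as my.function2,
# but using only base R.
group_mean_base <- function(df, my.val) {
  # Build the logical grouping column inside the function, where
  # my.val is an ordinary local variable.
  df$w <- df$Sepal.Length < my.val
  # ave() gives each row the mean of Petal.Width within its
  # (w, Species) group; rows stay in their original order.
  df$grmean <- ave(df$Petal.Width, df$w, df$Species, FUN = mean)
  df
}

res <- group_mean_base(iris, 5.5)
# Look at the setosa rows with Sepal.Length >= 5.5
head(res[!res$w & res$Species == "setosa", ])
```

The point is the same as in my.function2: the threshold is turned into a
real column before any grouping happens, so nothing has to look up
`my.val` later in a different environment.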

Is that the "right" way to do it? Should I transform my data.frame
first before calling ddply?

If you've come this far, thanks for bearing with me,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact