[R] aggregate, by, *apply

Thu Sep 16 01:20:45 CEST 2010

On Sep 15, 2010, at 5:45 PM, Mark Ebbert wrote:

> Dear R gurus,
>
> I regularly come across a situation where I would like to apply a  
> function to a subset of data in a dataframe, but I have not found an  
> R function to facilitate exactly what I need. More specifically, I'd  
> like my function to have a context of where the data it's analyzing  
> came from. Here is an example:
>

> ### BEGIN ###
> func<-function(x){
> 	m<-median(x$x)

> 	if(m > 2 & m < x$y){
> 		return(T)
> 	}
> 	return(F)
> }
>

The semantic question is: what are you trying to test when you say "m <  
x$y"? "m" is a scalar and x$y is a vector. By default only the first  
element of x$y will be compared (and inside aggregate() x$y is not  
actually accessible in that manner, since each column is passed separately.)
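
To make the comparison question concrete, here is a minimal sketch of what a scalar-versus-vector comparison returns and how to reduce it to the single TRUE/FALSE that if() expects. The values and the names m, y, cmp, etc. are made up for illustration; which reduction is right depends on what you actually mean to test.

```r
m <- 5                      # a scalar, e.g. a group median
y <- c(34, 34, 35)          # a vector, like x$y within one group

cmp <- m < y                # elementwise: a logical vector of length 3
print(cmp)

# if() wants exactly one TRUE/FALSE, so state the intent explicitly:
all_below   <- all(m < y)   # m is below every element of y
any_below   <- any(m < y)   # m is below at least one element of y
first_below <- m < y[1]     # compare against the first element only
```

Recent versions of R are stricter about this than older ones and refuse a condition of length greater than one instead of silently using the first element, so spelling out all(), any() or an explicit index is the safer habit.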

> tmp<-data.frame(x=1:10,y=c(rep(34,3),rep(35,3),rep(34,4)),
>                 z=c(rep("a",3),rep("b",3),rep("c",4)))
> res<-aggregate(tmp,list(z),func)

I see Dennis has tried to move you forward to the plyr strategy, but  
some of us are mired in the traditional ways:

?split  # returns a list of dataframe segments defined by a factor

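 > func<-function(x){
+ 	# [[ ]] extracts each column as a plain vector
+ 	m<-median(x[["x"]], na.rm=TRUE)
+ 	# only the first element of y is compared
+ 	if(m > 2 && m < x[["y"]][1]){
+ 		return(TRUE)
+ 	}
+ 	return(FALSE)
+ }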
 >
 > tmp<-data.frame(x=1:10,y=c(rep(34,3),rep(35,3),rep(34,4)),
 +                 z=c(rep("a",3),rep("b",3),rep("c",4)))
 > res<-lapply(split(tmp,list(tmp$z)), func)
 > res
$a
[1] FALSE

$b
[1] TRUE

$c
[1] TRUE
> ### END ###
>
> The values in the example are trivial, but the problem is that only  
> one column is passed to my function at a time, so I can't determine  
> how 'm' relates to 'x$y'. Any tips/guidance is appreciated.
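
Since the complaint is that only one column reaches the function at a time, here is a sketch of the same split-and-apply idea written so the function sees the whole sub-dataframe. The name func2 is hypothetical, and the sketch assumes the intended test is that the median of x exceeds 2 and lies below every y value in the group; adjust the all() if you meant something else.

```r
tmp <- data.frame(x = 1:10,
                  y = c(rep(34, 3), rep(35, 3), rep(34, 4)),
                  z = c(rep("a", 3), rep("b", 3), rep("c", 4)))

# func2 is a hypothetical variant: it receives a full sub-dataframe,
# so both columns x and y are available at once.
func2 <- function(d) {
  m <- median(d$x, na.rm = TRUE)
  m > 2 && all(m < d$y)   # assumption: compare against every y in the group
}

# split() + sapply() gives a named logical vector, one entry per level of z
res <- sapply(split(tmp, tmp$z), func2)

# by() applies the same function per group and returns a "by" object
res_by <- by(tmp, tmp$z, func2)
```

Both calls hand func2 one sub-dataframe per level of z, which is the "context" aggregate() does not provide because it works column by column.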
-- 

David Winsemius, MD
West Hartford, CT