[R] mean for subset

Matthew Dowle mdowle at mdowle.plus.com
Wed Jan 6 12:09:05 CET 2010


As can data.table (i.e. do 'having' in one statement) :

> DT = data.table(DF)
> DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]
      NAME n       V2
[1,] James 3 64.00000
[2,]   Tom 3 78.66667
>

but data.table isn't restricted to SQL functions (such as avg),  any R 
functions can be used, sometimes for their side effects (such as plotting) 
rather than just returning data.

Further data.table has a thing called 'join inherited scoping'.   Say we 
knew the specific groups,  we can go directly to them (without even looking 
at the rest of the data in the table) in very short and convenient syntax, 
which also happens to run quickly on large data sets (but can be useful just 
for the syntax alone) :

> setkey(DT,NAME)
> DT[c("James","Tom"),mean(SCORE),mult="all"]
      NAME       V1
[1,] James 64.00000
[2,]   Tom 78.66667
>

Notice there is no "group by" or even a "by" in the above.  It inherits the 
scope from the join because mult="all" means that "James" matches to 
multiple rows, as does "Tom", creating two groups.  It does it by binary 
search to the beginning of each group,  binary search to the end of the 
group,  and runs the R expression inside the scope of that group.

An example of join inherited scoping for the side effects only :

> pdf("out.pdf")
> DT[c("James","Tom"),plot(SCORE),mult="all"]
NULL data table
> dev.off()
# out.pdf now contains 2 plots

which you couldn't do in SQL because SQL has no plotting (or any of R's 
other packages).

It aims to do this quickly.  Where 'quickly' means 1) shorter code is 
quicker to write, read, debug and maintain  and also  2) quicker to compute, 
and its 1 that often dominates 2.

Finally, consider the following two statements which are both equivalent :

> sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) = 3")
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667
> DT[ J(DT[,length(NAME),by="NAME"][V1==3,NAME]), mean(SCORE), mult="all"]
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667

Now ok I hear you groaning (!) that the 2nd looks (on first glance) ugly, 
but bear with me ... in the SQL solution do you know for sure that 
avg(SCORE) isn't computed wastefully for the all the groups that don't have 
count(*)=3 ?  It might well do the 'group by' first for all the groups, then 
do the 'having' afterwards as a 'where' on the result.  It might depend on 
the particular SQL database being used (mySQL, sqllite, etc) or the 
installation parameters, any indexes etc.  Some investigation would be 
required (taking time) if someone doesn't already know.  In the data.table 
however,  the syntax explictly makes it clear than mean(SCORE) is only 
computed for the particular groups.  For certain, always.   Maybe this 
particular example is not a good one, but I'm trying to demonstrate an 
overall syntax which is scalable (i.e. this syntax can do more complicated 
things that SQL can't, or can't do well).   Notice that the method earlier 
on i.e. "DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]"  is simpler 
but wasteful as it does compute mean(SCORE) for all the groups.  But the 
syntax explicity conveys what is being done, and the user has the choice.


"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message 
news:971536df1001051122l58389037p4e16288aedfdeb07 at mail.gmail.com...
Here is the solution using sqldf which can do it in one statement:

> # read in data
> Lines <- "OBS     NAME   SCORE
+ 1          Tom       92
+ 2          Tom       88
+ 3          Tom       56
+ 4          James    85
+ 5          James    75
+ 6          James    32
+ 7          Dawn     56
+ 8          Dawn     91
+ 9          Clara     95
+ 10        Clara     84"
>
> DF <- read.table(textConnection(Lines), header = TRUE)
>
> # run
> library(sqldf)
> sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) = 3")
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667


On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Have a look at this post and the rest of that thread:
>
> https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html
>
> On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <gps at asu.edu> wrote:
>> Hello, does anyone know how to take the mean for a subset of 
>> observations?
>> For example, suppose my data looks like this:
>>
>> OBS NAME SCORE
>> 1 Tom 92
>> 2 Tom 88
>> 3 Tom 56
>> 4 James 85
>> 5 James 75
>> 6 James 32
>> 7 Dawn 56
>> 8 Dawn 91
>> 9 Clara 95
>> 10 Clara 84
>>
>> Is there a way to get the mean of the SCORE variable by NAME but only 
>> when
>> the number of observations is equal to 3? In other words, is there a way 
>> to
>> get the mean of the SCORE variable for Tom and James, but not for Dawn 
>> and
>> Clara? Thank you.
>>
>> --
>> Geoffrey Smith
>> Visiting Assistant Professor
>> Department of Finance
>> W. P. Carey School of Business
>> Arizona State University
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>



More information about the R-help mailing list