[R] mean for subset

Matthew Dowle mdowle at mdowle.plus.com
Thu Jan 7 19:40:26 CET 2010


Did you really say you're using Word's mail merge to construct "hundreds" of 
pages of R code which you then paste in to R ? It sounds like you just 
missed somehow how to create a function in R.  Did you fully read the book 
Introduction to R ?  Did you know R can read xls directly, and connect to 
spreadsheets as if they were databases, see ?odbcConnectExcel.

Your graphs may exist and be beautiful but are they correct ?   This link 
contains a formal discussion of the topic :
http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html


"Jerry Floren" <jerry.floren at state.mn.us> wrote in message 
news:1262877373634-1008892.post at n4.nabble.com...
>
> As a novice R user, I face a similar challenge. I am almost afraid to 
> share
> with this group how I solved it. About 65 labs in our proficiency program
> submit data on individual Excel spreadsheets with triple replicates. There
> always are a few labs that do not complete the full set of three 
> replicates,
> and I do not want their data included in my analysis.
>
> First, I combine all the individual spreadsheets into one large Excel
> spreadsheet. The replicates are in three columns: rep1, rep2, and rep3. I
> sort on each individual rep column in Excel. Then I go to both the top and
> the bottom of the list.
>
> For example, I sort on rep1 and go to the top of the list to delete any 
> rows
> where a value for rep1 was not recorded. Then I go to the bottom of the 
> list
> and delete any rows where rep1 is text instead of a number, for example,
> <0.001. I should say that the labs are instructed that they must complete
> all three replicates, and they must not enter results as text. Next I 
> repeat
> the process for rep2 and rep3.
>
> I'll do a little more work in Excel on the large, combined table with all
> the lab data. I calculate in Excel the mean, standard deviation, and
> coefficient of variation for each of the three reps. Finally, I filter all
> the data and delete duplicate rows. This is necessary as I sometimes
> accidentally copy the same spreadsheet two times from a lab into my large
> table. Finally, I save the cleaned up table in *.csv format that is easily
> read into R.
>
> I know that R can do all of these things, but if you are just learning how
> to use R it might be easier to do some initial work in Excel, or a similar
> spreadsheet, before running your data through R.
>
> I also use MS-Word's mail merge feature to generate my code. I'll get 
> three
> or four pages of code doing what I want for a single analytical test, for
> example, calcium. Then I'll use the mail merge feature to generate 
> hundreds
> of pages of code with the other analytical tests (nitrogen, phosphorus,
> potassium, etc.). I just copy and paste the large, merged Word document 
> into
> R. R cranks away for 30 minutes and I end up with several large tables 
> (and
> these get additional editing in Ecel) and hundreds of beautiful graphs 
> that
> would take weeks to create in Excel.
>
> I was amazed that Word would work. I expected all of Word's special print
> control codes would mess things up. I just recently received a new laptop
> computer, and now I have an occassional problem with Word's "pretty print
> quotes," but if you know about that problem, it is easy to fix.
>
> Jerry Floren
> Minnesota Department of Agriculture
>
>
>
>
>
> Matthew Dowle-3 wrote:
>>
>>
>> As can data.table (i.e. do 'having' in one statement) :
>>
>>> DT = data.table(DF)
>>> DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]
>>       NAME n       V2
>> [1,] James 3 64.00000
>> [2,]   Tom 3 78.66667
>>>
>>
>> but data.table isn't restricted to SQL functions (such as avg),  any R
>> functions can be used, sometimes for their side effects (such as 
>> plotting)
>> rather than just returning data.
>>
>> Further data.table has a thing called 'join inherited scoping'.   Say we
>> knew the specific groups,  we can go directly to them (without even
>> looking
>> at the rest of the data in the table) in very short and convenient 
>> syntax,
>> which also happens to run quickly on large data sets (but can be useful
>> just
>> for the syntax alone) :
>>
>>> setkey(DT,NAME)
>>> DT[c("James","Tom"),mean(SCORE),mult="all"]
>>       NAME       V1
>> [1,] James 64.00000
>> [2,]   Tom 78.66667
>>>
>>
>> Notice there is no "group by" or even a "by" in the above.  It inherits
>> the
>> scope from the join because mult="all" means that "James" matches to
>> multiple rows, as does "Tom", creating two groups.  It does it by binary
>> search to the beginning of each group,  binary search to the end of the
>> group,  and runs the R expression inside the scope of that group.
>>
>> An example of join inherited scoping for the side effects only :
>>
>>> pdf("out.pdf")
>>> DT[c("James","Tom"),plot(SCORE),mult="all"]
>> NULL data table
>>> dev.off()
>> # out.pdf now contains 2 plots
>>
>> which you couldn't do in SQL because SQL has no plotting (or any of R's
>> other packages).
>>
>> It aims to do this quickly.  Where 'quickly' means 1) shorter code is
>> quicker to write, read, debug and maintain  and also  2) quicker to
>> compute,
>> and its 1 that often dominates 2.
>>
>> Finally, consider the following two statements which are both equivalent 
>> :
>>
>>> sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
>>> 3")
>>    NAME avg(SCORE)
>> 1 James   64.00000
>> 2   Tom   78.66667
>>> DT[ J(DT[,length(NAME),by="NAME"][V1==3,NAME]), mean(SCORE), mult="all"]
>>    NAME avg(SCORE)
>> 1 James   64.00000
>> 2   Tom   78.66667
>>
>> Now ok I hear you groaning (!) that the 2nd looks (on first glance) ugly,
>> but bear with me ... in the SQL solution do you know for sure that
>> avg(SCORE) isn't computed wastefully for the all the groups that don't
>> have
>> count(*)=3 ?  It might well do the 'group by' first for all the groups,
>> then
>> do the 'having' afterwards as a 'where' on the result.  It might depend 
>> on
>> the particular SQL database being used (mySQL, sqllite, etc) or the
>> installation parameters, any indexes etc.  Some investigation would be
>> required (taking time) if someone doesn't already know.  In the 
>> data.table
>> however,  the syntax explictly makes it clear than mean(SCORE) is only
>> computed for the particular groups.  For certain, always.   Maybe this
>> particular example is not a good one, but I'm trying to demonstrate an
>> overall syntax which is scalable (i.e. this syntax can do more 
>> complicated
>> things that SQL can't, or can't do well).   Notice that the method 
>> earlier
>> on i.e. "DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]"  is
>> simpler
>> but wasteful as it does compute mean(SCORE) for all the groups.  But the
>> syntax explicity conveys what is being done, and the user has the choice.
>>
>>
>> "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message
>> news:971536df1001051122l58389037p4e16288aedfdeb07 at mail.gmail.com...
>> Here is the solution using sqldf which can do it in one statement:
>>
>>> # read in data
>>> Lines <- "OBS     NAME   SCORE
>> + 1          Tom       92
>> + 2          Tom       88
>> + 3          Tom       56
>> + 4          James    85
>> + 5          James    75
>> + 6          James    32
>> + 7          Dawn     56
>> + 8          Dawn     91
>> + 9          Clara     95
>> + 10        Clara     84"
>>>
>>> DF <- read.table(textConnection(Lines), header = TRUE)
>>>
>>> # run
>>> library(sqldf)
>>> sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
>>> 3")
>>    NAME avg(SCORE)
>> 1 James   64.00000
>> 2   Tom   78.66667
>>
>>
>> On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck
>> <ggrothendieck at gmail.com> wrote:
>>> Have a look at this post and the rest of that thread:
>>>
>>> https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html
>>>
>>> On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <gps at asu.edu> wrote:
>>>> Hello, does anyone know how to take the mean for a subset of
>>>> observations?
>>>> For example, suppose my data looks like this:
>>>>
>>>> OBS NAME SCORE
>>>> 1 Tom 92
>>>> 2 Tom 88
>>>> 3 Tom 56
>>>> 4 James 85
>>>> 5 James 75
>>>> 6 James 32
>>>> 7 Dawn 56
>>>> 8 Dawn 91
>>>> 9 Clara 95
>>>> 10 Clara 84
>>>>
>>>> Is there a way to get the mean of the SCORE variable by NAME but only
>>>> when
>>>> the number of observations is equal to 3? In other words, is there a 
>>>> way
>>>> to
>>>> get the mean of the SCORE variable for Tom and James, but not for Dawn
>>>> and
>>>> Clara? Thank you.
>>>>
>>>> --
>>>> Geoffrey Smith
>>>> Visiting Assistant Professor
>>>> Department of Finance
>>>> W. P. Carey School of Business
>>>> Arizona State University
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
> -- 
> View this message in context: 
> http://n4.nabble.com/mean-for-subset-tp999254p1008892.html
> Sent from the R help mailing list archive at Nabble.com.
>



More information about the R-help mailing list