[R] mean for subset
Jerry Floren
jerry.floren at state.mn.us
Thu Jan 7 21:16:55 CET 2010
Thanks for your insights Matthew. Actually, some of the merged documents are
over 1,000 pages. I have never had a programming class, and I had one
college statistics course in 1975. I might be in over my head, but R along
with Word's mail merge has allowed me to put together some pretty useful
reports.
Thanks for your suggestions; I am sure they will help improve my R skill
set.
Jerry Floren
Minnesota Department of Agriculture
Matthew Dowle-3 wrote:
>
>
> Did you really say you're using Word's mail merge to construct "hundreds"
> of
> pages of R code which you then paste in to R ? It sounds like you just
> missed somehow how to create a function in R. Did you fully read the book
> Introduction to R ? Did you know R can read xls directly, and connect to
> spreadsheets as if they were databases, see ?odbcConnectExcel.
>
> Your graphs may exist and be beautiful but are they correct ? This link
> contains a formal discussion of the topic :
> http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html
>
>
> "Jerry Floren" <jerry.floren at state.mn.us> wrote in message
> news:1262877373634-1008892.post at n4.nabble.com...
>>
>> As a novice R user, I face a similar challenge. I am almost afraid to
>> share
>> with this group how I solved it. About 65 labs in our proficiency program
>> submit data on individual Excel spreadsheets with triple replicates.
>> There
>> always are a few labs that do not complete the full set of three
>> replicates,
>> and I do not want their data included in my analysis.
>>
>> First, I combine all the individual spreadsheets into one large Excel
>> spreadsheet. The replicates are in three columns: rep1, rep2, and rep3. I
>> sort on each individual rep column in Excel. Then I go to both the top
>> and
>> the bottom of the list.
>>
>> For example, I sort on rep1 and go to the top of the list to delete any
>> rows
>> where a value for rep1 was not recorded. Then I go to the bottom of the
>> list
>> and delete any rows where rep1 is text instead of a number, for example,
>> <0.001. I should say that the labs are instructed that they must complete
>> all three replicates, and they must not enter results as text. Next I
>> repeat
>> the process for rep2 and rep3.
>>
>> I'll do a little more work in Excel on the large, combined table with all
>> the lab data. I calculate in Excel the mean, standard deviation, and
>> coefficient of variation for each of the three reps. Finally, I filter
>> all
>> the data and delete duplicate rows. This is necessary as I sometimes
>> accidentally copy the same spreadsheet two times from a lab into my large
>> table. Finally, I save the cleaned up table in *.csv format that is
>> easily
>> read into R.
>>
>> I know that R can do all of these things, but if you are just learning
>> how
>> to use R it might be easier to do some initial work in Excel, or a
>> similar
>> spreadsheet, before running your data through R.
>>
>> I also use MS-Word's mail merge feature to generate my code. I'll get
>> three
>> or four pages of code doing what I want for a single analytical test, for
>> example, calcium. Then I'll use the mail merge feature to generate
>> hundreds
>> of pages of code with the other analytical tests (nitrogen, phosphorus,
>> potassium, etc.). I just copy and paste the large, merged Word document
>> into
>> R. R cranks away for 30 minutes and I end up with several large tables
>> (and
>> these get additional editing in Ecel) and hundreds of beautiful graphs
>> that
>> would take weeks to create in Excel.
>>
>> I was amazed that Word would work. I expected all of Word's special print
>> control codes would mess things up. I just recently received a new laptop
>> computer, and now I have an occassional problem with Word's "pretty print
>> quotes," but if you know about that problem, it is easy to fix.
>>
>> Jerry Floren
>> Minnesota Department of Agriculture
>>
>>
>>
>>
>>
>> Matthew Dowle-3 wrote:
>>>
>>>
>>> As can data.table (i.e. do 'having' in one statement) :
>>>
>>>> DT = data.table(DF)
>>>> DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]
>>> NAME n V2
>>> [1,] James 3 64.00000
>>> [2,] Tom 3 78.66667
>>>>
>>>
>>> but data.table isn't restricted to SQL functions (such as avg), any R
>>> functions can be used, sometimes for their side effects (such as
>>> plotting)
>>> rather than just returning data.
>>>
>>> Further data.table has a thing called 'join inherited scoping'. Say we
>>> knew the specific groups, we can go directly to them (without even
>>> looking
>>> at the rest of the data in the table) in very short and convenient
>>> syntax,
>>> which also happens to run quickly on large data sets (but can be useful
>>> just
>>> for the syntax alone) :
>>>
>>>> setkey(DT,NAME)
>>>> DT[c("James","Tom"),mean(SCORE),mult="all"]
>>> NAME V1
>>> [1,] James 64.00000
>>> [2,] Tom 78.66667
>>>>
>>>
>>> Notice there is no "group by" or even a "by" in the above. It inherits
>>> the
>>> scope from the join because mult="all" means that "James" matches to
>>> multiple rows, as does "Tom", creating two groups. It does it by binary
>>> search to the beginning of each group, binary search to the end of the
>>> group, and runs the R expression inside the scope of that group.
>>>
>>> An example of join inherited scoping for the side effects only :
>>>
>>>> pdf("out.pdf")
>>>> DT[c("James","Tom"),plot(SCORE),mult="all"]
>>> NULL data table
>>>> dev.off()
>>> # out.pdf now contains 2 plots
>>>
>>> which you couldn't do in SQL because SQL has no plotting (or any of R's
>>> other packages).
>>>
>>> It aims to do this quickly. Where 'quickly' means 1) shorter code is
>>> quicker to write, read, debug and maintain and also 2) quicker to
>>> compute,
>>> and its 1 that often dominates 2.
>>>
>>> Finally, consider the following two statements which are both equivalent
>>> :
>>>
>>>> sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
>>>> 3")
>>> NAME avg(SCORE)
>>> 1 James 64.00000
>>> 2 Tom 78.66667
>>>> DT[ J(DT[,length(NAME),by="NAME"][V1==3,NAME]), mean(SCORE),
>>>> mult="all"]
>>> NAME avg(SCORE)
>>> 1 James 64.00000
>>> 2 Tom 78.66667
>>>
>>> Now ok I hear you groaning (!) that the 2nd looks (on first glance)
>>> ugly,
>>> but bear with me ... in the SQL solution do you know for sure that
>>> avg(SCORE) isn't computed wastefully for the all the groups that don't
>>> have
>>> count(*)=3 ? It might well do the 'group by' first for all the groups,
>>> then
>>> do the 'having' afterwards as a 'where' on the result. It might depend
>>> on
>>> the particular SQL database being used (mySQL, sqllite, etc) or the
>>> installation parameters, any indexes etc. Some investigation would be
>>> required (taking time) if someone doesn't already know. In the
>>> data.table
>>> however, the syntax explictly makes it clear than mean(SCORE) is only
>>> computed for the particular groups. For certain, always. Maybe this
>>> particular example is not a good one, but I'm trying to demonstrate an
>>> overall syntax which is scalable (i.e. this syntax can do more
>>> complicated
>>> things that SQL can't, or can't do well). Notice that the method
>>> earlier
>>> on i.e. "DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]" is
>>> simpler
>>> but wasteful as it does compute mean(SCORE) for all the groups. But the
>>> syntax explicity conveys what is being done, and the user has the
>>> choice.
>>>
>>>
>>> "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message
>>> news:971536df1001051122l58389037p4e16288aedfdeb07 at mail.gmail.com...
>>> Here is the solution using sqldf which can do it in one statement:
>>>
>>>> # read in data
>>>> Lines <- "OBS NAME SCORE
>>> + 1 Tom 92
>>> + 2 Tom 88
>>> + 3 Tom 56
>>> + 4 James 85
>>> + 5 James 75
>>> + 6 James 32
>>> + 7 Dawn 56
>>> + 8 Dawn 91
>>> + 9 Clara 95
>>> + 10 Clara 84"
>>>>
>>>> DF <- read.table(textConnection(Lines), header = TRUE)
>>>>
>>>> # run
>>>> library(sqldf)
>>>> sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
>>>> 3")
>>> NAME avg(SCORE)
>>> 1 James 64.00000
>>> 2 Tom 78.66667
>>>
>>>
>>> On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck
>>> <ggrothendieck at gmail.com> wrote:
>>>> Have a look at this post and the rest of that thread:
>>>>
>>>> https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html
>>>>
>>>> On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <gps at asu.edu> wrote:
>>>>> Hello, does anyone know how to take the mean for a subset of
>>>>> observations?
>>>>> For example, suppose my data looks like this:
>>>>>
>>>>> OBS NAME SCORE
>>>>> 1 Tom 92
>>>>> 2 Tom 88
>>>>> 3 Tom 56
>>>>> 4 James 85
>>>>> 5 James 75
>>>>> 6 James 32
>>>>> 7 Dawn 56
>>>>> 8 Dawn 91
>>>>> 9 Clara 95
>>>>> 10 Clara 84
>>>>>
>>>>> Is there a way to get the mean of the SCORE variable by NAME but only
>>>>> when
>>>>> the number of observations is equal to 3? In other words, is there a
>>>>> way
>>>>> to
>>>>> get the mean of the SCORE variable for Tom and James, but not for Dawn
>>>>> and
>>>>> Clara? Thank you.
>>>>>
>>>>> --
>>>>> Geoffrey Smith
>>>>> Visiting Assistant Professor
>>>>> Department of Finance
>>>>> W. P. Carey School of Business
>>>>> Arizona State University
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>> --
>> View this message in context:
>> http://n4.nabble.com/mean-for-subset-tp999254p1008892.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
--
View this message in context: http://n4.nabble.com/mean-for-subset-tp999254p1009194.html
Sent from the R help mailing list archive at Nabble.com.
More information about the R-help
mailing list