[R] Summary information by groups programming assistance

Tue Dec 23 03:29:13 CET 2008

The sorting should have been by Lake, psd and vol (not what I had)
so it should be revised to:

DFo <- DF[order(DF$Lake, DF$psd, DF$vol), ]
aggregate(DFo[c("Length", "vol")], DFo[c("Lake", "psd")], tail, 1)

This is the same as before except DF$psd is used in place of DF$Length
in the first line.
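
To see why the sort order matters, here is a minimal sketch on a small toy data frame. The values are invented for illustration and are not the original smb data; only the column names Lake, psd, Length and vol come from this thread. Sorting by vol within each Lake/psd group puts the row with the largest vol last, so tail(x, 1) returns that vol together with its own Length. The split/which.max version is just one possible cross-check that does not depend on sorting.

```r
# Toy data, invented for illustration only (not the original smb data).
DF <- data.frame(
  Lake   = c("Clear", "Clear", "Clear", "Clear",
             "Pickerel", "Pickerel", "Pickerel"),
  psd    = c("substock", "substock", "S-Q", "S-Q",
             "substock", "substock", "S-Q"),
  Length = c(150, 152, 180, 185, 170, 160, 248),
  vol    = c(0.0105, 3.2655, 0.0210, 0.0105, 3.4965, 0.0420, 10.69),
  stringsAsFactors = FALSE
)

# Sort so that, within each Lake/psd group, the largest vol comes last.
DFo <- DF[order(DF$Lake, DF$psd, DF$vol), ]

# tail(x, 1) takes the last element of each column within each group,
# i.e. the maximum vol and the Length found on that same row.
res <- aggregate(DFo[c("Length", "vol")], DFo[c("Lake", "psd")], tail, 1)
print(res)

# Cross-check that does not rely on sort order: pick the row with the
# largest vol inside each group directly.
groups <- split(DF, list(DF$Lake, DF$psd), drop = TRUE)
chk <- do.call(rbind, lapply(groups, function(g) {
  g[which.max(g$vol), c("Lake", "psd", "Length", "vol")]
}))
print(chk)
```

If two rows in a group tie for the maximum vol, the two approaches may return different Lengths: tail takes the last tied row after sorting, while which.max takes the first.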

On Mon, Dec 22, 2008 at 9:14 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Just sort the data first and then apply any of the solutions but with tail(x, 1)
> instead of max, e.g.
>
> DFo <- DF[order(DF$Lake, DF$Length, DF$vol), ]
> aggregate(DFo[c("Length", "vol")], DFo[c("Lake", "psd")], tail, 1)
>
>
> On Mon, Dec 22, 2008 at 8:15 PM, Ranney, Steven
> <steven.ranney at montana.edu> wrote:
>> Thank you all for your help.  I appreciate the assistance. I'm thinking I should have been more specific in my original question.
>>
>> Unless I'm mistaken, all of the suggestions so far have been for maximum vol and maximum Length by Lake and psd.  I'm trying to extract the max vol by Lake and psd along with the corresponding value of Length.  So, instead of maximum vol and maximum Length, I'd like to find the max vol and the Length associated with that value.
>>
>> Sorry for any confusion,
>>
>> SR
>>
>> Steven H. Ranney
>> Graduate Research Assistant (Ph.D)
>> USGS Montana Cooperative Fishery Research Unit
>> Montana State University
>> P.O. Box 173460
>> Bozeman, MT 59717-3460
>>
>> phone: (406) 994-6643
>> fax: (406) 994-7479
>>
>> http://studentweb.montana.edu/steven.ranney
>> ________________________________
>>
>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>> Sent: Mon 12/22/2008 5:15 PM
>> To: Ranney, Steven
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Summary information by groups programming assistance
>>
>>
>> Here are two solutions assuming DF is your data frame:
>>
>> # 1. aggregate is in the base of R
>>
>> aggregate(DF[c("Length", "vol")], DF[c("Lake", "psd")], max)
>>
>> or the following which is the same except it labels psd as Category:
>>
>> aggregate(DF[c("Length", "vol")], with(DF, list(Lake = Lake, Category = psd)), max)
>>
>>
>> # 2. sqldf.  The sqldf package allows specification using SQL notation:
>>
>> library(sqldf)
>> sqldf("select Lake, psd as Category, max(Length), max(vol) from DF
>> group by Lake, psd")
>>
>> There are many other good solutions too, using various packages which
>> have already been mentioned on this thread.
>>
>> On Mon, Dec 22, 2008 at 4:51 PM, Ranney, Steven
>> <steven.ranney at montana.edu> wrote:
>>> All -
>>>
>>> I have data that looks like
>>>
>>>          psd Species  Lake Length Weight St.weight        Wr      Wr.1    vol
>>> 432 substock     SMB Clear    150  41.00      0.01  95.12438  95.10118 0.0105
>>> 433 substock     SMB Clear    152  39.00      0.01  86.72916  86.70692 0.0105
>>> 434 substock     SMB Clear    152  40.00      3.11  88.95298  82.03689 3.2655
>>> 435 substock     SMB Clear    159  48.00      0.04  92.42095  92.34393 0.0420
>>> 436 substock     SMB Clear    159  48.00      0.01  92.42095  92.40170 0.0105
>>> 437 substock     SMB Clear    165  47.00      0.03  80.38023  80.32892 0.0315
>>> 438 substock     SMB Clear    171  62.00      0.21  94.58105  94.26070 0.2205
>>> 439 substock     SMB Clear    178  70.00      0.01  93.91912  93.90571 0.0105
>>> 440 substock     SMB Clear    179  76.00      1.38 100.15760  98.33895 1.4490
>>> 441      S-Q     SMB Clear    180  75.00      0.01  97.09330  97.08035 0.0105
>>> 442      S-Q     SMB Clear    180  92.00      0.02 119.10111 119.07522 0.0210
>>> ...
>>> [truncated]
>>>
>>> where psd and Lake are categorical variables, with five and four
>>> categories, respectively.  I'd like to find the maximum vol and the
>>> lengths associated with each maximum vol by each category by each lake.
>>> In other words, I'd like to have a data frame that looks something like
>>>
>>> Lake            Category        Length  vol
>>> Clear           substock        152             3.2655
>>> Clear           S-Q             266             11.73
>>> Clear           Q-P             330             14.89
>>> ...
>>> Pickerel        substock        170             3.4965
>>> Pickerel        S-Q             248             10.69
>>> Pickerel        Q-P             335             25.62
>>> Pickerel        P-M             415             32.62
>>> Pickerel        M-T             442             17.25
>>>
>>>
>>> In order to originally get this, I used
>>>
>>> with(smb[Lake=="Clear",], tapply(vol, list(Length, psd),max))
>>> with(smb[Lake=="Enemy.Swim",], tapply(vol, list(Length, psd),max))
>>> with(smb[Lake=="Pickerel",], tapply(vol, list(Length, psd),max))
>>> with(smb[Lake=="Roy",], tapply(vol, list(Length, psd),max))
>>>
>>> and pulled the values I needed out by hand and put them into a .csv.
>>> Unfortunately, I've got a number of other data sets upon which I'll need
>>> to do the same analysis.  Finding a programmable alternative would
>>> provide a much easier (and likely less error prone) method to achieve
>>> the same results.  Ideally, the "Length" and "vol" data would be in a
>>> data frame such that I could then analyze with nls.
>>>
>>> Does anyone have any thoughts as to how I might accomplish this?
>>>
>>> Thanks in advance,
>>>
>>> Steven Ranney
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>