[R] averaging between rows with repeated data
David Winsemius
dwinsemius at comcast.net
Tue Nov 15 14:53:17 CET 2011
On Nov 15, 2011, at 6:46 AM, R. Michael Weylandt wrote:
> Good morning Rob,
>
> First off, thank you for providing a reproducible example. This is one
> of those little tasks that R is pretty great at, but there exist
> >> \infty ways to do so and it can be a little overwhelming for the
> beginner: here's one with the base function ave():
>
> cbind(ave(example[,2:4], example[,5]), id = example[,5])
>
> This splits example according to the fifth column (id) and averages
> the other values: we then stick another copy of the id back on the end
> and are good to go.
>
> The base function aggregate can do something similar:
>
> aggregate(example[,2:4], by = example[,5, drop = F], mean)
>
> Note that you need the little-publicized but super useful drop = F
> argument to make this one work.
The way I usually deal with that is to wrap list() around the by=
argument ... since I usually forget about this aggregate quirk and
get an error message complaining: "'by' must be a list". (drop=FALSE
has the effect of keeping data.frame columns as lists too, so I am not
disagreeing here.)
aggregate(example[,2:4], by = list(example[,5]), mean)
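
A small sketch of how the two spellings differ in their output, assuming the example data frame from the original post. The set.seed() call and the names res1 and res2 are additions for illustration only and are not part of the thread. With an unnamed list the grouping column comes back as Group.1, while a named list keeps a readable name:

```r
# Rebuild the example data from the original post (values are random).
set.seed(1)  # assumption: added only so the sketch is repeatable
example <- data.frame(Letters = letters[1:10])
example$numb1 <- rnorm(10, 1, 1)
example$numb2 <- rnorm(10, 1, 1)
example$numb3 <- rnorm(10, 1, 1)
example$id <- c("CG234", "CG232", "CG441", "CG128", "CG125",
                "CG182", "CG232", "CG441", "CG232", "CG125")

# Unnamed list: the grouping column is called "Group.1".
res1 <- aggregate(example[, 2:4], by = list(example[, 5]), mean)
names(res1)

# Named list: the grouping column is called "id".
res2 <- aggregate(example[, 2:4], by = list(id = example$id), mean)
names(res2)
```

Either way the result has one row per distinct id, and the original data frame is left untouched.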
--
David.
>
> There are other ways to do this with the plyr or doBy packages as
> well, but this should get you started.
>
> Hope it helps,
>
> Michael
>
> On Tue, Nov 15, 2011 at 5:52 AM, robgriffin247
> <robgriffin247 at hotmail.com> wrote:
>> *The situation (or an example at least!)*
>>
>> example<-data.frame(rep(letters[1:10]))
>> colnames(example)[1]<-("Letters")
>> example$numb1<-rnorm(10,1,1)
>> example$numb2<-rnorm(10,1,1)
>> example$numb3<-rnorm(10,1,1)
>> example$id<-c("CG234","CG232","CG441","CG128","CG125",
>>               "CG182","CG232","CG441","CG232","CG125")
>>
>> *this produces something like this:*
>> Letters numb1 numb2 numb3 id
>> 1 a 0.8139130 -0.9775570 -0.002996244 CG234
>> 2 b 0.8268700 0.4980661 1.647717998 CG232
>> 3 c 0.2384088 1.0249684 0.120663273 CG441
>> 4 d 0.8215922 0.5686534 1.591208307 CG128
>> 5 e 0.7865918 0.5411476 0.838300185 CG125
>> 6 f 2.2385522 1.2668070 1.268005020 CG182
>> 7 g 0.7403965 -0.6224205 1.374641549 CG232
>> 8 h 0.2526634 1.0282978 -0.110449844 CG441
>> 9 i 1.9333444 1.6667486 2.937252363 CG232
>> 10 j 1.6996701 0.5964623 1.967870617 CG125
>>
>> *The Problem:*
>> Some of these ids are repeated. I want to average the values for those
>> rows within each column, but obviously they have different numbers in the
>> numb columns, and they also have different letters in the Letters column.
>> The letters are not necessary for my analysis; only the duplicated ids
>> and the numb columns are important.
>>
>> I also need to keep the existing dataframe, so I would like to build a
>> new dataframe that averages the repeated values and keeps their id. My
>> actual dataset is much more complex (271*13890), but the solution to this
>> can be expanded out to my main data set because there are just more
>> columns of numbers and still only one alphanumeric id to keep. In my
>> example data, id CG232 occurs 3 times, CG441 & CG125 occur twice, and
>> everything else once, so in the new dataframe (from this example) there
>> would be 3 number columns (numb1, numb2, numb3) and an id. The numb
>> column values would be the averages of the rows which had the same id.
>>
>> So for example the new dataframe would contain an entry for CG125 which
>> would be something like this:
>>
>> numb1 numb2 numb3 id
>> 1.2431 0.5688 1.403 CG125
>>
>> Just as a thought, all of the IDs start with CG, so could I use grep (?)
>> to delete CG and replace it with 0? That way duplicated ids could be
>> averaged as a number (they would be the same), but I still don’t know
>> how to produce the new dataframe with the averaged rows in it...
>>
>> I hope this is clear enough! Email me if you need further detail or, even
>> better, if you have a solution!!
>> Also, sorry to be posting my second question in under 24 hours, but I seem
>> to have become more than a little stuck – I was making such good progress
>> with R!
>>
>> Rob
>>
>> (Also, I'm sorry if this appears more than once on the mailing list. I'm
>> having some network & Windows Live issues, so I'm not convinced previous
>> attempts to send this have worked, but have no way of telling if they are
>> just milling around in the internet somewhere as we speak and will decide
>> to come out of hiding later!)
>>
>> --
>> View this message in context: http://r.789695.n4.nabble.com/averaging-between-rows-with-repeated-data-tp4042513p4042513.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
David Winsemius, MD
West Hartford, CT