[R] identify duplicate entries in data frame and calculate mean

Matthew mccormack at molbio.mgh.harvard.edu
Tue May 24 22:36:26 CEST 2016


Thanks, Tom.  I was making a mistake looking at your example and that's 
what my problem was.

Cool answer, works great. Thank you very much.

Matthew

On 5/24/2016 4:23 PM, Tom Wright wrote:
> Don't see that as being a big problem. If your data grows then dplyr 
> supports connections to external databases. Alternately if you just 
> want a mean, most databases can do that directly in SQL.
>
> On Tue, May 24, 2016 at 4:17 PM, Matthew 
> <mccormack at molbio.mgh.harvard.edu 
> <mailto:mccormack at molbio.mgh.harvard.edu>> wrote:
>
>     Thank you very much, Tom.
>     This gets me thinking in the right direction.
>     One thing I should have mentioned that I did not is that the
>     number of rows in the data frame will be a little over 40,000 rows.
>
>
>     On 5/24/2016 4:08 PM, Tom Wright wrote:
>>     Using dplyr
>>
>>     $ library(dplyr)
>>     $ x<-data.frame(Length=c(321,350,340,180,198),
>>     ID=c(rep('A234',3),'B123','B225') )
>>     $ x %>% group_by(ID) %>% summarise(m=mean(Length))
>>
>>
>>
>>     On Tue, May 24, 2016 at 3:46 PM, Matthew
>>     <mccormack at molbio.mgh.harvard.edu
>>     <mailto:mccormack at molbio.mgh.harvard.edu>> wrote:
>>
>>         I have a data frame with 10 columns.
>>         In the last column is an alphaneumaric identifier.
>>         For most rows, this alphaneumaric identifier is unique to the
>>         file, however some of these alphanemeric idenitifiers occur
>>         in duplicate, triplicate or more. When they do occur more
>>         than once they are in consecutive rows, so when there is a
>>         duplicate or triplicate or quadruplicate (let's call them
>>         multiplicates), they are in consecutive rows.
>>
>>         In column 7 there is an integer number (may or may not be
>>         unique. does not matter).
>>
>>         I want to identify each multiple entries (multiplicates)
>>         occurring in column 10 and then for each multiplicate
>>         calculate the mean of the integers column 7.
>>
>>         As an example, I will show just two columns:
>>         Length  Identifier
>>         321     A234
>>         350     A234
>>         340     A234
>>         180     B123
>>         198     B225
>>
>>         What I want to do (in the above example) is collapse all the
>>         A234's and report the mean to get this:
>>         Length  Identifier
>>         337     A234
>>         180     B123
>>         198     B225
>>
>>
>>         Matthew
>>
>>         ______________________________________________
>>         R-help at r-project.org <mailto:R-help at r-project.org> mailing
>>         list -- To UNSUBSCRIBE and more, see
>>         https://stat.ethz.ch/mailman/listinfo/r-help
>>         PLEASE do read the posting guide
>>         http://www.R-project.org/posting-guide.html
>>         and provide commented, minimal, self-contained, reproducible
>>         code.
>>
>>
>
>


	[[alternative HTML version deleted]]



More information about the R-help mailing list