[BioC] Average based on group
Steve Lianoglou
mailinglist.honeypot at gmail.com
Thu May 12 18:11:41 CEST 2011
Hi,
On Thu, May 12, 2011 at 11:20 AM, Fabrice Tourre <fabrice.ciup at gmail.com> wrote:
> Dear list,
> I have a data frame whose second column is a group factor; each group has
> 10 items. The data are as follows.
> chr10 rs9971029 71916552 0.1
> chr10 rs9971029 71916553 0.4
> chr10 rs9971029 71916554 0.3
> chr10 rs9971029 71916555 0.9
> chr10 rs9971029 71916556 1
> chr10 rs9971029 71916557 2
> chr10 rs9971029 71916558 4
> chr10 rs9971029 71916559 0.8
> chr10 rs9971029 71916560 0.9
> chr10 rs9971029 71916561 0.8
> chr10 rs9971030 71916726 0.6
> chr10 rs9971030 71916727 0.5
> chr10 rs9971030 71916728 0.4
> chr10 rs9971030 71916729 0.7
> chr10 rs9971030 71916730 0
> chr10 rs9971030 71916731 0
> chr10 rs9971030 71916732 0.6
> chr10 rs9971030 71916733 0.8
> chr10 rs9971030 71916734 0.9
> chr10 rs9971030 71916735 1
>
> I want to get an average of each item based on the group factor. So in
> the end I want to get a vector whose length is 10.
> The values are calculated like this:
>
> (0.1+0.6)/2
> (0.4+0.5)/2
> …
> (0.8+1)/2
>
> Thank you very much in advance.
In addition to the great plyr package, if your data.frame is at all
large you could also look into using the data.table package -- it's
generally much faster[*].
I don't see how your data.frame corresponds to what you say, though --
i.e. you mention that the second column is the group factor and that
you expect an answer of length 10, but I only see 2 distinct snp ids in
your 2nd column ...
Anyway. Assuming your data.frame was named `df` and had columns like:
seqnames, snp.id, position, score.
To get the average score for each snp using data.table, you can do:
R> library(data.table)
R> dt <- data.table(df, key='snp.id')
R> avg <- dt[, list(avg=mean(score)), by=snp.id]
(instead of mean(score), you might want to do .Internal(mean(score)),
since mean() is an S3 generic and its dispatch overhead apparently adds
up when it is called once per group)
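For reference, here is a minimal, self-contained sketch of the grouped mean described above. It assumes the column names listed earlier (seqnames, snp.id, position, score) and uses only the first three rows of each snp from the quoted message as toy data. The second half is a guess at what the question may actually be asking for (averaging the i-th item of every group); the `idx` column and the `avg.by.idx` name are made up for illustration.

```r
library(data.table)

# Toy data: the first three rows of each snp from the quoted message.
# Column names are the assumed ones: seqnames, snp.id, position, score.
df <- data.frame(
  seqnames = "chr10",
  snp.id   = rep(c("rs9971029", "rs9971030"), each = 3),
  position = c(71916552:71916554, 71916726:71916728),
  score    = c(0.1, 0.4, 0.3, 0.6, 0.5, 0.4),
  stringsAsFactors = FALSE
)

dt <- data.table(df, key = "snp.id")

# One average per snp: a result with one row per distinct snp.id.
avg <- dt[, list(avg = mean(score)), by = snp.id]

# Alternative reading of the question: number the items within each
# group (idx is a hypothetical helper column), then average the i-th
# item across all groups. This gives one value per within-group index.
dt[, idx := seq_len(.N), by = snp.id]
avg.by.idx <- dt[, list(avg = mean(score)), by = idx]
```

With the full data from the question (10 rows per snp), `avg` would have 2 rows, while `avg.by.idx` would have 10 rows, the first being (0.1+0.6)/2.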
HTH,
-steve
[*] A disclaimer is that I help develop the data.table package. I'm
not trying to proselytize for it over plyr, as I like and use both.
It's just that for (really) large data.frame-like objects, you'll
notice the speed differences between the two are quite dramatic.
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact