[BioC] Average based on group

Thu May 12 18:11:41 CEST 2011

Hi,

On Thu, May 12, 2011 at 11:20 AM, Fabrice Tourre <fabrice.ciup at gmail.com> wrote:
> Dear list,
> I have a data frame whose second column is a grouping factor; each
> group has 10 items. The data are as follows.
> chr10   rs9971029   71916552    0.1
> chr10   rs9971029   71916553    0.4
> chr10   rs9971029   71916554    0.3
> chr10   rs9971029   71916555    0.9
> chr10   rs9971029   71916556    1
> chr10   rs9971029   71916557    2
> chr10   rs9971029   71916558    4
> chr10   rs9971029   71916559    0.8
> chr10   rs9971029   71916560    0.9
> chr10   rs9971029   71916561    0.8
> chr10   rs9971030   71916726    0.6
> chr10   rs9971030   71916727    0.5
> chr10   rs9971030   71916728    0.4
> chr10   rs9971030   71916729    0.7
> chr10   rs9971030   71916730    0
> chr10   rs9971030   71916731    0
> chr10   rs9971030   71916732    0.6
> chr10   rs9971030   71916733    0.8
> chr10   rs9971030   71916734    0.9
> chr10   rs9971030   71916735    1
>
> I want to get the average of each item across the groups, so in the
> end I want a vector of length 10.
> The values are calculated like this:
>
> (0.1+0.6)/2
> (0.4+0.5)/2
> …
> (0.8+1)/2
>
> Thank you very much in advance.

In addition to the great plyr package, if your data.frame is at all
large you could also look into using the data.table package -- it's
generally much faster[*].

I don't see how your data.frame corresponds to what you say, though --
i.e. you mention that the second column is the group factor and that
you expect an answer of length 10, but I only see 1 snp_id in your 2nd
column ...

Anyway. Assuming your data.frame was named `df` and had columns like:
seqnames, snp.id, position, score.

To get the average score for each SNP using data.table, you can do:

R> library(data.table)
R> dt <- data.table(df, key='snp.id')
R> avg <- dt[, list(avg=mean(score)), by=snp.id]

(instead of mean(score), you might want to do .Internal(mean(score))
since apparently doing it the "normal" way is somehow slow)
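
For reference, here is a self-contained sketch of the above, run against
the 20 rows quoted in the question. The column names (seqnames, snp.id,
position, score) are the assumed ones from earlier, and the `item` column
and `item.avg` name are made up for illustration. The last step is only a
guess at the item-wise average the question describes (first row of each
group averaged together, then the second rows, and so on); it assumes the
rows are already in the intended order within each SNP.

```r
library(data.table)

# Rebuild the posted data; column names are assumptions, not from the post.
df <- data.frame(
  seqnames = "chr10",
  snp.id   = rep(c("rs9971029", "rs9971030"), each = 10),
  position = c(71916552:71916561, 71916726:71916735),
  score    = c(0.1, 0.4, 0.3, 0.9, 1, 2, 4, 0.8, 0.9, 0.8,
               0.6, 0.5, 0.4, 0.7, 0, 0, 0.6, 0.8, 0.9, 1)
)

dt <- data.table(df, key = "snp.id")

# One mean per SNP (two rows for this data).
avg <- dt[, list(avg = mean(score)), by = snp.id]

# Item-wise mean across SNPs: number the rows within each SNP, then
# average the rows that share the same within-group index.
# `item` is a hypothetical helper column.
dt[, item := seq_len(.N), by = snp.id]
item.avg <- dt[, list(avg = mean(score)), by = item]$avg
```

With this data, `avg` has one row per SNP, while `item.avg` is a vector
of length 10 whose first element is (0.1+0.6)/2 and whose last is
(0.8+1)/2, matching the calculation in the question.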

HTH,
-steve

[*] A disclaimer is that I help develop the data.table package. I'm
not trying to proselytize for it over plyr, as I like and use both.
It's just that for (really) large data.frame-like objects, you'll
notice the speed differences between the two are quite dramatic.

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact