[BioC] Average based on group

Kevin R. Coombes kevin.r.coombes at gmail.com
Thu May 12 19:36:24 CEST 2011


It would probably be better to construct a meaningful factor that 
reflects the correct interpretation. (I tend to dislike code that 
assumes that the order of things is always preserved and no rows got 
accidentally omitted....) I assume you really want to relate things 
based on their offset from the actual SNP position.  So you might want 
to compute "min" based on the SNP id grouping factor and compute 
"offset" relative to that minimum position.  You could then use the 
offset as the new grouping factor for the averages you want.

Here is (completely untested and written on the fly) pseudo-code to do this:

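# earliest position within each SNP, named by SNP id
startPos <- tapply(df$position, df$snp.id, min)
# index by name so the lookup does not depend on factor level order
offset <- df$position - startPos[as.character(df$snp.id)]
# average the scores of rows that share the same offset
myavg <- tapply(df$score, offset, mean)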
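
As a rough check of the offset idea on the example data quoted below, here is a self-contained sketch. It assumes the data frame can be rebuilt in the shorter form shown, and it swaps in ave() to get the per-SNP minimum already aligned with the rows; the name pos_offset is just an illustrative choice that avoids masking stats::offset.

```r
# Rebuild the example data from the quoted message in a shorter form
# (assumption: same values as the structure() call in the thread).
df <- data.frame(
  seqnames = factor(rep("chr10", 20)),
  snp.id   = factor(rep(c("rs9971029", "rs9971030"), each = 10)),
  position = c(71916552:71916561, 71916726:71916735),
  score    = c(0.1, 0.4, 0.3, 0.9, 1, 2, 4, 0.8, 0.9, 0.8,
               0.6, 0.5, 0.4, 0.7, 0, 0, 0.6, 0.8, 0.9, 1)
)

# ave() returns the minimum position of each SNP, repeated for every
# row of that SNP, so the subtraction is row-aligned by construction.
pos_offset <- df$position - ave(df$position, df$snp.id, FUN = min)

# Average the scores of all rows that share the same offset.
myavg <- tapply(df$score, pos_offset, mean)
print(myavg)
```

Because the grouping comes from the positions themselves, the result should not depend on row order, and a SNP with a missing row would simply contribute fewer values at that offset.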

     Kevin

> Ok I get it now,
> If your data is as shown i.e. sorted, then can you just create a dummy
> variable:
> rep(1:10,n) where n is the number of groups and then use by or tapply?
> So in your example:
> by(df[,4],rep(1:10,2),mean)
>
> cheers,
> Achilleas
>
> On Thu, May 12, 2011 at 12:38 PM, Fabrice Tourre <fabrice.ciup at gmail.com> wrote:
>
>> Thanks for your reply, but it does not serve my purpose. In fact, there
>> are two SNPs in the example, rs9971029 and rs9971030.
>>
>> I expect the following output from the following data:
>>
>> 0.35 0.45 0.35 0.80 0.50 1.00 2.30 0.80 0.90 0.90
>>
>> You can run this example to get the above values:
>>
>> -----------------------------R code------------------------------------
>> df<-structure(list(seqnames = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
>> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label =
>> "chr10", class = "factor"),
>>     snp.id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>>     1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("rs9971029",
>>     "rs9971030"), class = "factor"), position = c(71916552L,
>>     71916553L, 71916554L, 71916555L, 71916556L, 71916557L, 71916558L,
>>     71916559L, 71916560L, 71916561L, 71916726L, 71916727L, 71916728L,
>>     71916729L, 71916730L, 71916731L, 71916732L, 71916733L, 71916734L,
>>     71916735L), score = c(0.1, 0.4, 0.3, 0.9, 1, 2, 4, 0.8, 0.9,
>>     0.8, 0.6, 0.5, 0.4, 0.7, 0, 0, 0.6, 0.8, 0.9, 1)), .Names =
>> c("seqnames",
>> "snp.id", "position", "score"), class = "data.frame", row.names = c(NA,
>> -20L))
>>
>> a<-df[1:10,]
>> b<-df[11:20,]
>> cbind(a,b)->c
>> (c[,4]+c[,8])/2
>> ----------------------------------------------------------------
>>
>> The data is :
>>
>> chr10   rs9971029   71916552    0.1
>> chr10   rs9971029   71916553    0.4
>> chr10   rs9971029   71916554    0.3
>> chr10   rs9971029   71916555    0.9
>> chr10   rs9971029   71916556    1
>> chr10   rs9971029   71916557    2
>> chr10   rs9971029   71916558    4
>> chr10   rs9971029   71916559    0.8
>> chr10   rs9971029   71916560    0.9
>> chr10   rs9971029   71916561    0.8
>> chr10   rs9971030   71916726    0.6
>> chr10   rs9971030   71916727    0.5
>> chr10   rs9971030   71916728    0.4
>> chr10   rs9971030   71916729    0.7
>> chr10   rs9971030   71916730    0
>> chr10   rs9971030   71916731    0
>> chr10   rs9971030   71916732    0.6
>> chr10   rs9971030   71916733    0.8
>> chr10   rs9971030   71916734    0.9
>> chr10   rs9971030   71916735    1
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>


