[R] Help on averaging sets of rows defined by row name

Liaw, Andy andy_liaw at merck.com
Fri Apr 20 16:09:11 CEST 2007


You might want to check which of the following scales better for the
size of data you have.

## Make up some data to try.
R> dat <- data.frame(gene=rep(letters[1:3], each=3), s1=runif(9),
+                   s2=runif(9))
R> dat
  gene        s1        s2
1    a 0.9959172 0.9531052
2    a 0.2064497 0.4257022
3    a 0.4791100 0.5977923
4    b 0.1307096 0.8256453
5    b 0.7887983 0.8904983
6    b 0.7841745 0.6901540
7    c 0.3356583 0.7125086
8    c 0.5859311 0.0509323
9    c 0.7681325 0.8677725

## Use aggregate():
R> aggregate(dat[-1], dat[1], mean)
  gene        s1        s2
1    a 0.5604923 0.6588666
2    b 0.5678941 0.8020992
3    c 0.5632407 0.5437378
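
If there are NAs (the original code used na.rm=TRUE), extra arguments given to aggregate() are passed on to the function being applied. A minimal sketch, where dat.na and its values are made up purely for illustration:

```r
## Hypothetical data with one missing value in s1 (not from the original post).
dat.na <- data.frame(gene = rep(letters[1:3], each = 3),
                     s1 = c(1, 2, NA, 4, 5, 6, 7, 8, 9),
                     s2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))

## na.rm = TRUE is handed through to mean(), so the NA is skipped
## and gene "a" is averaged over its two observed s1 values.
res <- aggregate(dat.na[-1], dat.na[1], mean, na.rm = TRUE)
res
```

Without na.rm=TRUE the s1 mean for gene "a" would come out as NA.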

## Do it "by hand": needs a bit more work if there are NAs.
R> rowsum(dat[-1], dat[[1]]) / table(dat[[1]])
         s1        s2
a 0.5604923 0.6588666
b 0.5678941 0.8020992
c 0.5632407 0.5437378
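
The "bit more work" for NAs could look something like the following. This is only a sketch: dat.na is made-up data, and the idea is to divide the NA-free sums by the number of non-missing values per gene rather than by table(), which counts every row.

```r
## Hypothetical data with one missing value in s1 (not from the original post).
dat.na <- data.frame(gene = rep(letters[1:3], each = 3),
                     s1 = c(1, 2, NA, 4, 5, 6, 7, 8, 9),
                     s2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))

x <- as.matrix(dat.na[-1])

## Per-gene sums, ignoring NAs.
sums <- rowsum(x, dat.na[[1]], na.rm = TRUE)

## Per-gene counts of non-missing values: sum a 0/1 indicator matrix.
n.ok <- rowsum((!is.na(x)) * 1, dat.na[[1]])

## Element-wise division gives the NA-aware means
## (NaN where a gene has no observed value in a column).
means <- sums / n.ok
means
```

The result is a matrix with gene names as row names; it should agree with aggregate(..., mean, na.rm=TRUE) on the same data. Wrapping each approach in system.time() on data of realistic size is one way to see which scales better.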

Andy
 

From: Booman, M
> 
> Dear all,
> 
> This is my problem: I have a table of gene expression data, 
> where the 1st column is the gene name, and the 2nd to 39th columns 
> are expression data for 38 samples. There are multiple 
> measurements per sample for each gene, so there are multiple 
> rows for each gene name. I want to average these measurements 
> so I end up with one value per sample for each gene name. The 
> output data frame (table.averaged) is further used in another R 
> script. The code I use now (see below) takes 20 secs for each 
> loop, so it takes 45 minutes to average my files of 13500 
> unique genes. Can anyone help me do this faster?
> 
> Cheers, marije
> 
> Code I use: 
> 
> 
> # table.imputed is a data.frame: 1st column = gene name (class factor),
> # rest of columns = expression data (class numeric)
> table.imputed[,1] <- as.character(table.imputed[,1])
> 
> # To make list of unique genes in the set
> genesunique <- unique(table.imputed[,1])
> 
> # Defined before the loop so it exists when the loop calls it
> givemean <- function (gene, table.imputed) {
>    # make a subset containing only the rows for one gene name
>    thisgene <- table.imputed[table.imputed[,1]==gene,]
>    # calculate the average for each sample (column) and output one row
>    # of average values and the gene name
>    data.frame(gene, t(sapply(thisgene[,2:ncol(thisgene)], mean, na.rm=TRUE)))
> }
> 
> table.averaged <- NULL
> for (j in 1:length(genesunique)) {
>    # To report progress
>    if (j%%100 == 0) {
>       cat(j, "genes finished", sep=" ", fill=TRUE)
>    }
>    # collect all rows of average values and bind them back into
>    # one data frame
>    table.averaged <- rbind(table.averaged,
>                            givemean(genesunique[j], table.imputed))
> }