[BioC] aggregate_summarizing expression values over entrez gene ids

Thu Nov 13 14:23:01 CET 2008

Hi Vanessa,

Have a look at "tapply" and "by".
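
As a minimal sketch of that suggestion, here is how "tapply" could be applied column by column. It uses a small made-up table in place of av.data; the names expr, ids, avg and avg2 and all values are hypothetical. The last call also shows what is probably behind the "arguments must have same length" error: the "by" argument of "aggregate" should be a list holding one grouping vector as long as the number of rows, whereas as.list(entrezids) produces one list element of length 1 per probeset.

```r
# Toy stand-in for av.data: 5 probesets x 3 arrays (values are made up).
expr <- data.frame(
  a1 = c(9.0, 7.0, 6.0, 6.5, 3.0),
  a2 = c(9.5, 6.5, 6.5, 7.0, 3.5),
  a3 = c(9.2, 4.5, 5.5, 7.5, 4.0),
  row.names = c("p1", "p2", "p3", "p4", "p5")
)

# Hypothetical Entrez IDs, one per probeset, in the same row order as expr.
ids <- c("780", "5982", "780", "3310", "5982")

# tapply handles one vector at a time, so apply it to each column in turn.
# The result is a matrix with one row per Entrez ID and one column per array.
avg <- sapply(expr, function(col) tapply(col, ids, mean))

# The same averaging with aggregate: 'by' is a list with ONE grouping vector.
# as.list(ids) would instead give 5 list elements of length 1 each.
avg2 <- aggregate(expr, by = list(entrez = ids), FUN = mean)
```

Both calls silently drop probesets whose ID is NA. It may also be worth checking that the IDs are in the same order as the rows of the expression table, for example by looking them up with the row names of the table rather than with ls(), which returns the probeset names sorted.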

But you could also think a bit more about the rationale for summarizing. 
The different probesets for the same Entrez gene ID are not replicates, 
and they are not equivalent. Some may be more valid or useful than others.

An approach that I find useful is to determine the probeset that shows 
the most variability across the samples, and then to believe that one. 
Of course, one can also look at the actual mapping of the probes to the 
transcript and to the gene structure, and make a decision based on that. 
For important results, this is what I would recommend (besides, of 
course, wet-lab follow-up).
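
The "most variable probeset" idea can be sketched as follows, again on made-up data. The names expr, ids, v, keep and selected are hypothetical, and using the variance as the measure of variability is an assumption; the IQR or MAD would be alternatives.

```r
# Toy expression matrix: rows are probesets, columns are arrays (made-up values).
expr <- matrix(
  c(9.0, 9.5, 9.2,
    7.0, 6.5, 4.5,
    6.0, 6.5, 5.5,
    6.5, 7.0, 7.5,
    3.0, 3.5, 4.0),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3", "p4", "p5"), c("a1", "a2", "a3"))
)

# Hypothetical Entrez IDs, one per probeset, in the same row order as expr.
ids <- c("780", "5982", "780", "3310", "5982")

# Variability of each probeset across the arrays (variance is an assumed choice).
v <- apply(expr, 1, var)

# For each Entrez ID, keep the name of the probeset with the largest variance.
keep <- sapply(split(names(v), ids), function(p) p[which.max(v[p])])

# One row per Entrez ID, taken from the selected probeset.
selected <- expr[keep, , drop = FALSE]
rownames(selected) <- names(keep)
```

Here keep maps each Entrez ID to the chosen probeset, so the choice can still be inspected afterwards. Probesets without an Entrez ID (NA) are dropped by split.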

  Best wishes
	Wolfgang
-- 
----------------------------------------------------
Wolfgang Huber  EMBL-EBI  http://www.ebi.ac.uk/huber

Vanessa Vermeirssen wrote:
> Hi,
> 
> I have a data frame, av.data, containing RMA-normalized and summarized 
> expression values for Affymetrix probesets.
> I have looked up the Entrez gene IDs for the probesets in the annotation 
> package and stored them in entrezids.
> Multiple probesets of course map to the same Entrez ID, and I would like 
> to combine these into one row per Entrez ID
> by averaging the expression values of its probesets within each 
> experiment.
> I tried the function "aggregate" to do this, but it gives an error that 
> the arguments are not of the same length, although they are.
> How can I solve this, or is there another way to do this?
> 
> See my code below...
> 
> av.data <- read.table("humanGPL570avdata.txt", row.names = 1, sep = 
> "\t", header = T, na.strings = "NA", fill = T)
> av.data[1:5,1:5]
>          X1_Schwann_p1 X1_Schwann_p3 X2_accumbens X2_adipose
> 1007_s_at      9.281857      9.340795     9.151775   8.319741
> 1053_at        7.000684      6.867318     4.633061   5.101534
> 117_at         6.007608      6.124562     5.425565   5.692270
> 121_at         6.543294      6.728119     7.651856   7.692947
> 1255_g_at      3.077289      2.989938     4.622865   2.955812
>          X2_adipose_omental
> 1007_s_at           7.909480
> 1053_at             4.509407
> 117_at              6.298798
> 121_at              7.598834
> 1255_g_at           3.040816
> 
> probes <- ls(hgu133plus2ENTREZID)
> entrezids <- unlist(mget(probes,hgu133plus2ENTREZID))
> newdata <- data.frame(entrezids,av.data)
> 
> sum <- aggregate(av.data,as.list(entrezids),mean)
> Error in FUN(X[[1L]], ...) : arguments must have same length
> 
>  > length(as.list(entrezids))
> [1] 54675
>  > dim(av.data)
> [1] 54675    69
> 
> sumdata <- aggregate(newdata,as.list(newdata$entrezids),mean)
> Error in FUN(X[[1L]], ...) : arguments must have same length
>  > length(as.list(newdata$entrezids))
> [1] 54675
>  > dim(newdata)
> [1] 54675    70
> 
> 
> Thank you so much!
> Vanessa
>