[BioC] Assess inter-study consistency

Thu Sep 4 22:17:47 CEST 2008

This is one of my favorite topics.

cor.test returns a p-value for a correlation, so you could consider that.
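
As a rough sketch of what that looks like, here is cor.test on two simulated vectors standing in for per-probe-set means from two studies. The data and the names study_a and study_b are made up for illustration; they are not from Scott's ExpressionSet.

```r
# Simulated stand-ins for two studies' per-probe-set mean expression
# (hypothetical data, not from the thread).
set.seed(1)
n_probes <- 1000
baseline <- rnorm(n_probes, mean = 8, sd = 2)    # shared probe-level signal
study_a  <- baseline + rnorm(n_probes, sd = 0.5) # study-specific noise
study_b  <- baseline + rnorm(n_probes, sd = 0.5)

ct <- cor.test(study_a, study_b, method = "pearson")
ct$estimate  # Pearson correlation coefficient
ct$p.value   # p-value for the null hypothesis of zero correlation
```

Because the shared baseline dominates, the correlation is high and the p-value tiny even though nothing "happened" in either simulated study.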

But as you are intimating, most genes need to be expressed at the  
same level regardless of what you do to the system, or the cells
will die, so this doesn't really answer your question of concordance,  
because you want to report on the genes that did something, not the  
ones that just sat there.

There are many ways you could score this. My favorite is to select
the top N most regulated genes in each experiment. If you pick a small
number, then you will be focusing on the part of your experiment you
are likely to report as a result.
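
A minimal sketch of the top-N idea, under some assumptions of mine: "most regulated" is taken to mean largest absolute log fold change, and lfc is a hypothetical probe-by-study matrix of simulated log fold changes with five probe sets deliberately planted as consistent responders.

```r
set.seed(2)
n_probes  <- 2000
n_studies <- 7

# Hypothetical matrix of log fold changes (drug minus control),
# one row per probe set, one column per study.
lfc <- matrix(rnorm(n_probes * n_studies), nrow = n_probes,
              dimnames = list(paste0("probe", seq_len(n_probes)),
                              paste0("study", seq_len(n_studies))))
lfc[1:5, ] <- lfc[1:5, ] + 10  # plant 5 consistently regulated probe sets

# IDs of the n probe sets with the largest absolute log fold change.
top_n <- function(x, n) names(sort(abs(x), decreasing = TRUE))[seq_len(n)]

n_top     <- 20  # top 1% of 2000 probe sets
top_lists <- lapply(seq_len(n_studies), function(j) top_n(lfc[, j], n_top))

in_all <- Reduce(intersect, top_lists)  # probe sets in every study's top list
hits   <- table(unlist(top_lists))      # number of top lists each probe set made
```

Here in_all holds the probe sets that made the top list in all seven studies, and hits gives a per-probe-set count you could use as a looser score.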

The basic idea of a very simple statistic is: how likely is a
particular gene to make it into the top 1%, all 7 times? And the answer
to that is, not very likely, under the null hypothesis. If you model
this as picking genes at random out of an urn, with the studies
independent, then it would be 0.01^7 = 1e-14.

With a tiny number of genes picked, you could use dbinom, or get  
fancier and use dhyper.
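
Here is one way those calls could be set up; treat it as a sketch. The 22277 probe sets come from the dim() output in the quoted message, while the thresholds (at least 3 of 7 studies, an overlap of at least 10) are arbitrary choices of mine for illustration.

```r
p_top     <- 0.01   # chance a given gene lands in one study's top 1%
n_studies <- 7
n_genes   <- 22277  # probe sets on the array, per dim() in the quoted message

# Binomial: chance one gene makes the top list in all 7 independent studies.
p_all <- dbinom(n_studies, size = n_studies, prob = p_top)  # same as 0.01^7

# Chance one gene makes the top list in at least 3 of the 7 studies.
p_at_least_3 <- pbinom(2, size = n_studies, prob = p_top, lower.tail = FALSE)

# Expected number of genes in all 7 top lists under the null.
expected_all <- n_genes * p_all

# Hypergeometric: overlap between the top lists of two studies, each of
# size n_top, when both are random draws from the same n_genes.
n_top <- round(p_top * n_genes)
p_overlap_10 <- phyper(9, m = n_top, n = n_genes - n_top, k = n_top,
                       lower.tail = FALSE)  # P(overlap >= 10)
```

The binomial version treats each study as an independent coin flip per gene; the hypergeometric version accounts for drawing a fixed-size list without replacement.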

All this assumes independence, which is kind of strange on an array
where you have multiple probes for the same gene, so all this is just
a starting point...

Cheers

T

On Sep 4, 2008, at 1:40 PM, Ochsner, Scott A wrote:

> Dear BioC,
>
> I would like to use simple correlation to assess the consistency
> between seven independent expression array datasets.  All
> datasets are on the same platform, hgu133a.
>
> In the materials and methods section from
> http://cancerres.aacrjournals.org/cgi/content/full/67/21/10296#top
> they state,
> "To assess for consistency between the three studies, Pearson
> correlation was computed pair-wise between the mean values of
> common genes. The three studies showed significant positive
> pair-wise correlation."
>
> I'm having trouble following their statement.  I don't have to  
> worry about common genes as all of the seven studies I'm looking at  
> are on the same platform.
>
> I thought of doing something as below:
>
> #eset is your standard ExpressionSet object
> #treatment is a vector describing which group each array belongs  
> to.  There are two groups, cont. and drug.
>
>> avg<-function(eset,treatment){
> + tmp<-aggregate(t(exprs(eset)),by=list(treatment),mean)
> + rownames(tmp)<-tmp[,1]
> + t(tmp[,-1])
> + }
>> groupAverage<-avg(eset,treatment)
>> dim(groupAverage)
> [1] 22277    14
>
>> cor(groupAverage)
>           c.d3529   c.d3834   c.d4006   c.d4025   c.d6800    
> c.d8540   c.d9936   e.d3529   e.d3834   e.d4006   e.d4025    
> e.d6800   e.d8540
> c.d3529 1.0000000 0.9659532 0.7933771 0.7498652 0.8957816 0.8874096  
> 0.9041292 0.9917589 0.9535454 0.7964003 0.7577108 0.8889499 0.8904473
> c.d3834 0.9659532 1.0000000 0.8071949 etc....
>
>
> Questions:
> 1. Since I'm expecting most of the probe sets on these arrays to  
> not change, shouldn't I expect high correlation even between the  
> cont. and drug groups?  Or in other words, how informative is doing  
> cor across all of the probe sets?
>
> 2. How might I assess the significance of these correlations?
>
>> sessionInfo()
> R version 2.7.0 (2008-04-22)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United  
> States.1252;LC_MONETARY=English_United States. 
> 1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] splines   tools     stats     graphics  grDevices utils      
> datasets  methods   base
>
> other attached packages:
>  [1] affycoretools_1.12.0 annaffy_1.12.1       KEGG.db_2.2.0         
> gcrma_2.12.1         matchprobes_1.12.0   biomaRt_1.14.0
>  [7] RCurl_0.9-3          GOstats_2.6.0        Category_2.6.0        
> RBGL_1.16.0          GO.db_2.2.0          graph_1.18.1
> [13] limma_2.14.2         affy_1.18.1          preprocessCore_1.2.0  
> affyio_1.8.0         MLInterfaces_1.14.1  annotate_1.18.0
> [19] xtable_1.5-2         AnnotationDbi_1.2.1  RSQLite_0.6-8         
> DBI_0.2-4            rda_1.0              rpart_3.1-41
> [25] genefilter_1.20.0    survival_2.34-1      MASS_7.2-41           
> Biobase_2.0.1
>
> loaded via a namespace (and not attached):
> [1] class_7.2-41    cluster_1.11.10 XML_1.95-2
>
> Scott A. Ochsner, Ph.D.
> NURSA Bioinformatics
> Molecular and Cellular Biology
> Baylor College of Medicine
> Houston, TX. 77030
> phone: 713-798-6227