[BioC] Looking for strongly correlated gene expression data

Tue Mar 13 11:12:35 CET 2007

On Tuesday 13 March 2007 06:01, Kim, K.I. wrote:
> Let me explain in more detail. I am working on multiple testing with
> gene expression data.
> In the usual two-group multiple testing setup, if we assume the true null
> p-values are independent and, for example, 90% of the p-values are truly
> null, then about 90% of the p-values are uniformly distributed (see,
> for example, the "golub" dataset in the R multtest package). But if
> there are strong correlations among the p-values (or genes), we can't
> expect this. I guess that under dependence the histograms are curved
> rather than flat, even for the large p-values.
>
> Actually, I am looking for gene expression datasets whose p-value
> histograms look "very" different from those expected under the usual
> independence assumption, and I want to do multiple testing on such datasets.
>
> I also thought of downloading some gene expression files from a large
> database and then doing multiple testing, but I would need to preprocess
> the downloaded files, which would take some time.
> Instead I hoped to find an "easy" dataset (already preprocessed, like the
> "golub" dataset in the multtest package) in Bioconductor. If there is no
> other convenient way to do it, I may need to try NCBI GEO.

Just sticking to the NCBI GEO idea (I have a not-so-hidden agenda as the author
of GEOquery), you can simply use the GDSs from GEO.  They are already
preprocessed and can be easily converted into Bioconductor objects like
exprSets and used for t-testing.  It would take only a few lines of code to
do what you are suggesting for as many GDSs as you like.  So, before writing
off all the data in GEO, you might look at the GEOquery vignette to see if it
might serve your needs.
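
On the dependence point in the quoted message, a small simulation can show
what a single dataset's null p-value histogram might look like when the test
statistics are strongly correlated.  The following is only a rough sketch
under assumptions that are not in the original question: it uses z-statistics
rather than t-tests, a one-factor (equicorrelated) dependence model, plain
Python rather than R, and a hypothetical helper name,
null_pvalue_histogram.

```python
import math
import random
from statistics import NormalDist


def null_pvalue_histogram(n_tests, rho, bins=10, seed=0):
    """Bin counts of two-sided null p-values for equicorrelated z-statistics.

    Assumed model (illustration only): z_i = sqrt(rho)*f + sqrt(1-rho)*e_i,
    where f is one latent factor shared by all tests and e_i is independent
    noise. Each z_i is N(0, 1) marginally; any two have correlation rho.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    shared = rng.gauss(0.0, 1.0)  # one draw per simulated "dataset"
    counts = [0] * bins
    for _ in range(n_tests):
        noise = rng.gauss(0.0, 1.0)
        z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * noise
        p = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # two-sided p-value
        # p == 1.0 would index past the last bin, so clamp it.
        counts[min(int(p * bins), bins - 1)] += 1
    return counts


if __name__ == "__main__":
    for rho in (0.0, 0.9):
        print(f"rho={rho}: {null_pvalue_histogram(5000, rho, seed=1)}")
```

With rho = 0 the ten counts should come out roughly equal.  With rho = 0.9
they pile up in a few bins, and which bins those are depends on the shared
draw, so a different seed can give a differently shaped histogram.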

Sean