[BioC] sample size

James W. MacDonald jmacdon at med.umich.edu
Tue May 1 15:40:42 CEST 2007


Hi Lev,

Lev Soinov wrote:
> Dear List, I have sent a lot of e-mails recently, for which I
> apologise, and I hope that my questions were of interest to you. I would
> be very much interested in your views on the following. It is
> accepted that microarray measurements contain technical and
> biological variation and that, for single-channel (e.g. Affy)
> normalised log2 data, biological variation is ~2-4 times higher than
> technical variation.

I'm not sure I would agree with that. The technical variability is 
highly correlated with the abilities of the lab techs who do the RNA 
extraction and the Affy molecular biology steps, and there is a huge 
range of abilities in the world of lab techs.

In addition, biological variability is dependent on what system you are 
considering. If for instance you take some cell lines and pour bleach on 
one set and leave the other alone, then you are correct. The biological 
variability will likely be larger than the technical variability. On the 
other hand, I have seen plenty of experiments where the biological 
variability was much smaller than the technical variability, and as a 
result the number of 'significant' genes was smaller than you would expect 
by chance, given no difference between the sample types.
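
As a rough illustration of what "expected by chance" means here, the 
sketch below counts how many genes pass a p-value cutoff when there is 
no real difference at all. The gene count (20000), the cutoff (0.05) and 
the function name are hypothetical values chosen for illustration, and 
the simulation assumes independent tests with uniformly distributed 
null p-values, which real arrays only approximate.

```python
import random


def count_chance_hits(n_genes=20000, alpha=0.05, seed=0):
    """Count genes called 'significant' when no gene truly differs.

    Assumption: under the null hypothesis each gene's p-value is an
    independent draw from Uniform(0, 1).
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p < alpha for p in p_values)


n_genes, alpha = 20000, 0.05          # hypothetical array size and cutoff
expected = n_genes * alpha            # expected number of false positives
observed = count_chance_hits(n_genes, alpha)
print(f"expected by chance: {expected:.0f}, simulated: {observed}")
```

If an experiment yields clearly fewer hits than this baseline, that is a 
hint that the variance estimates are dominated by noise rather than by 
the comparison of interest.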

> Now, let's say we have N arrays representing N separate
> individuals (or animals) of a certain type (wild type or under the
> same treatment condition). Thus, for a given gene, variations in its
> expression level represent variations across individuals. Suppose
> that for each gene we calculate its mean expression level across all
> arrays (across population) and then in each array classify genes as
> expressed above/below their average across population levels. Now for
> each gene we have a binary profile (above/below) across population of
> N individuals. We can then perform, say, a simple correlation analysis
> and find those genes which are concurrently expressed above/below their
> population means within the given set of N arrays. This may provide
> some information on their connectivity, etc. What would you say about
> this? Is it a valid suggestion? Or will the technical component of
> variation not allow this? If it is valid, what sample size would
> be sufficient for finding accurate correlations? Do
> you know of any Bioconductor tools that may be applied here for
> sample size calculations? Looking forward to hearing your opinions, 

I don't like the idea of reducing data to a binary distribution without 
a clear reason to do so. For instance, some classifiers work better if 
you can reduce the phenotypes to a binary outcome (e.g., good/bad 
survival outcome). However, by reducing to a binary distribution you are 
throwing out (possibly valuable) information, so I think you should 
first show why the continuous data you started with aren't workable.
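
To make the information loss concrete, here is a minimal sketch that 
simulates two correlated "genes" and compares the Pearson correlation 
of the continuous values with the correlation of their above/below-mean 
profiles. Everything in it is an assumption for illustration: the data 
are simulated bivariate normal values, not real expression measures, 
and the sample size, the true correlation (0.6) and the function names 
are made up.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def binarize(values):
    """1 if a value is above the mean across samples, else 0."""
    m = sum(values) / len(values)
    return [1 if v > m else 0 for v in values]


def simulate(n_samples=5000, rho=0.6, seed=1):
    """Return (continuous r, binarized r) for two simulated genes."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n_samples)]
    # y is built to have correlation rho with x
    y = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x]
    return pearson(x, y), pearson(binarize(x), binarize(y))


r_cont, r_bin = simulate()
print(f"continuous r = {r_cont:.2f}, binarized r = {r_bin:.2f}")
```

For bivariate normal data the binarized correlation is attenuated to 
roughly (2/pi) * asin(rho), so a true correlation of 0.6 shows up as 
about 0.41. A weaker apparent correlation needs more arrays to detect, 
which is one practical cost of dichotomizing.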

If you want to detect connectivity of genes, you might want to look at 
the work Steve Horvath has done:

http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/

Best,

Jim


> Lev.




-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

