[BioC] replicates and low expression levels

Mon Jun 2 15:22:13 MEST 2003

>On Fri, May 30, 2003 at 05:28:45PM +0100, Crispin Miller wrote:
> > Hi,
> > Just a quick question about low expression levels on Affy systems - I
> > hope it's not too off-topic; it is about normalisation and data analysis...
> > I've heard a lot of people advocating that it's a good idea to perform
> > an initial filtering on either Present, Marginal or Absent calls, or on
> > gene-expression levels (so that only genes with an expression > 40, say,
> > after scaling to a TGT of 100 using the MAS5.0 algorithm, are part of the
> > further analysis). Firstly, am I right in thinking that this is to
> > eliminate data that are too close to the background noise level of the
> > system?
> >
> > I wanted to canvass opinion as to whether people feel we need to do this
> > if we have replicates and are using statistical tests - rather than just
> > fold-changes - to identify 'interesting' genes. Does the statistical
> > testing do this job for us?
>
>Hi,
>   In my opinion you should always do some sort of non-specific
>   filtering. What you have described is one form of it, others include
>   removing genes that show little or no variability across samples.
>   I think of non-specific filtering as filtering without reference to
>   phenotype (of any sort).
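
As a rough illustration of non-specific filtering, here is a minimal Python sketch that keeps a gene only if it is expressed above a threshold in enough samples and varies enough across samples. No group labels are used anywhere. The threshold of 40 comes from the question above; the function name, the `min_frac` and `min_sd` cut-offs, and the data are all made up for illustration.

```python
from statistics import stdev

def nonspecific_filter(expr, min_level=40.0, min_frac=0.25, min_sd=5.0):
    """Keep genes that pass an expression filter and a variability filter.

    expr maps a gene name to its expression values across samples.
    Phenotype (group membership) is never consulted.
    """
    kept = {}
    for gene, values in expr.items():
        # Fraction of samples in which the gene exceeds the expression threshold.
        frac_expressed = sum(v > min_level for v in values) / len(values)
        # Sample standard deviation across all samples.
        if frac_expressed >= min_frac and stdev(values) >= min_sd:
            kept[gene] = values
    return kept

# Hypothetical expression values for three genes across six samples.
expr = {
    "gene_off": [12.0, 15.0, 11.0, 14.0, 13.0, 12.0],         # never above 40
    "gene_flat": [200.0, 201.0, 199.0, 200.0, 202.0, 198.0],  # expressed, no variation
    "gene_var": [80.0, 95.0, 300.0, 310.0, 90.0, 290.0],      # expressed and variable
}

kept = nonspecific_filter(expr)
print(sorted(kept))
```

Only `gene_var` survives here: `gene_off` fails the expression filter and `gene_flat` fails the variability filter.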
>
>   There are a number of reasons for doing this, some motivated by the
>   biology and some by the statistics.
>
>   First off, especially for Affy, the chip is designed for all tissue
>   types but a commonly held belief is that only about 40% of the genome
>   is expressed in any specific tissue type. So, for any experiment you
>   will have a pretty large number of probes for genes that are not
>   expressed in the tissue you are looking at.
>
>   From a statistical perspective you need to be a little bit cautious
>   if you are going to standardize genes across samples (this is pretty
>   common). If you do not remove those genes that show little
>   variability before standardization then you have just elevated the
>   noise to the same status as the signal (and if the 40% estimate is
>   right then you actually have more noise than signal - not too
>   pleasant).
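
The standardization point can be seen in a small Python sketch. It standardizes one gene that only shows measurement noise and one that shows a real change; afterwards both have unit standard deviation and look equally "interesting". The values and the helper name `standardize` are assumptions made up for this example.

```python
from statistics import mean, stdev

def standardize(values):
    """Centre a gene at zero and scale it to unit sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical genes measured across six samples.
noise_gene = [100.0, 100.4, 99.7, 100.2, 99.8, 99.9]     # essentially constant
signal_gene = [100.0, 110.0, 95.0, 400.0, 420.0, 390.0]  # clear change across samples

z_noise = standardize(noise_gene)
z_signal = standardize(signal_gene)

# Before standardization the spreads differ by orders of magnitude.
print("raw sd:", stdev(noise_gene), stdev(signal_gene))
# After standardization both genes have the same spread.
print("standardized sd:", stdev(z_noise), stdev(z_signal))
```

Any clustering or classification run on the standardized values can no longer tell the noise gene from the signal gene by scale alone, which is the reason for filtering first.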
>
>   Using a test statistic (such as a t-test) does not help, since that
>   measures the between-group differences relative to the variation. So
>   if there is very little variation and a small difference in means,
>   you get an enormous t-statistic and a small p-value. Of course,
>   in this case looking at the "fold-change" or the size of the effect
>   will indicate a problem, but not many people check all the things
>   that need checking (and what to check depends on the test that
>   you have just carried out). It seems to me to be much easier to just
>   filter those genes with no expression or little variation out at the
>   very start.
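
A small Python sketch of the t-statistic problem described above, using an ordinary pooled-variance two-sample t-statistic. The data are invented: one gene barely changes but has almost no variation, the other changes more than three-fold with realistic variation. The function name `pooled_t` is a hypothetical helper, not taken from any package.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Ordinary two-sample t-statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1 / na + 1 / nb))

# Hypothetical gene with tiny variation and a tiny difference between groups.
flat_a = [50.0, 50.1, 49.9]
flat_b = [51.0, 51.1, 50.9]

# Hypothetical gene with a large difference and realistic variation.
big_a = [100.0, 160.0, 130.0]
big_b = [400.0, 520.0, 460.0]

t_flat = pooled_t(flat_a, flat_b)
t_big = pooled_t(big_a, big_b)
fold_flat = mean(flat_b) / mean(flat_a)
fold_big = mean(big_b) / mean(big_a)

print("flat gene: t =", t_flat, "fold change =", fold_flat)
print("big gene:  t =", t_big, "fold change =", fold_big)
```

The nearly constant gene gets the larger t-statistic even though its fold change is close to 1, so ranking by t alone would put it ahead of the gene with the much larger effect.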

All good points. One thing that does help though is to use a t-statistic 
(or F or posterior odds or whatever) in which some form of shrinkage to a 
common value has been applied to the standard deviations. This has the 
effect of offsetting the smaller sample variances to be not less than a 
certain size. We have found that empirical Bayes t-statistics do a good job 
of eliminating the low-signal, low-variability genes without needing an 
explicit filtering step.
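
To illustrate the idea of shrinking the standard deviations, here is a minimal Python sketch of a moderated t-statistic in which each gene's pooled variance is averaged with a common prior variance, weighted by degrees of freedom. This is only one common form of shrinkage, and it is a simplification: the prior variance and prior degrees of freedom are fixed, made-up numbers here, whereas an empirical Bayes method would estimate them from all genes. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def moderated_t(a, b, prior_var, prior_df):
    """Two-sample t-statistic with the pooled variance shrunk toward prior_var.

    shrunk = (prior_df * prior_var + df * pooled) / (prior_df + df)
    With prior_df = 0 this reduces to the ordinary pooled t-statistic.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    shrunk = (prior_df * prior_var + df * pooled) / (prior_df + df)
    return (mean(b) - mean(a)) / sqrt(shrunk * (1 / na + 1 / nb))

# Hypothetical low-variability gene with a tiny difference between groups.
flat_a = [50.0, 50.1, 49.9]
flat_b = [51.0, 51.1, 50.9]

# Hypothetical gene with a large difference and realistic variation.
big_a = [100.0, 160.0, 130.0]
big_b = [400.0, 520.0, 460.0]

# Assumed prior values, chosen by hand for illustration only.
PRIOR_VAR = 500.0
PRIOR_DF = 4.0

ordinary_flat = moderated_t(flat_a, flat_b, PRIOR_VAR, 0.0)
mod_flat = moderated_t(flat_a, flat_b, PRIOR_VAR, PRIOR_DF)
mod_big = moderated_t(big_a, big_b, PRIOR_VAR, PRIOR_DF)

print("flat gene, ordinary t: ", ordinary_flat)
print("flat gene, moderated t:", mod_flat)
print("big gene, moderated t: ", mod_big)
```

Because the shrunk variance can never fall below prior_df * prior_var / (prior_df + df), the nearly constant gene loses its inflated statistic while the gene with a large effect keeps a large one.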

I have also wondered about the biological argument that many genes might 
not be represented in a particular sample, and whether this means that 
non-specific filtering should be applied. I guess the reason that I don't 
do it at the moment is that I'm somewhat uneasy about possible selection 
bias in the filtered intensities and standard deviations. Another factor 
which allows us to avoid non-specific filtering is the use of background 
correction methods which ensure that the lower intensities are not 
especially variable.

Just some other thoughts.

Cheers
Gordon

>   If they don't show any variation across samples they can't help to
>   classify or to cluster (there is no information about any phenotype
>   contained in them).
>
>   Robert
>
>
> >
> > Crispin
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
>
>--
>+---------------------------------------------------------------------------+
>| Robert Gentleman                 phone : (617) 632-5250                   |
>| Associate Professor              fax:   (617)  632-2444                   |
>| Department of Biostatistics      office: M1B20                            |
>| Harvard School of Public Health  email: rgentlem at jimmy.harvard.edu     |
>+---------------------------------------------------------------------------+