[BioC] replicates and low expression levels
Gordon Smyth
smyth at wehi.edu.au
Mon Jun 2 15:22:13 MEST 2003
>On Fri, May 30, 2003 at 05:28:45PM +0100, Crispin Miller wrote:
> > Hi,
> > Just a quick question about low expression levels on Affy systems - I
> hope it's not too off-topic; it is about normalisation and data analysis...
> > I've heard a lot of people advocating that it's a good idea to perform
> an initial filtering on either Present Marginal or Absent calls, or on
> gene-expression levels (so that only genes with an expression > 40, say,
> after scaling to a TGT of 100 using the MAS5.0 algorithm, are part of the
> further analysis). Firstly, am I right in thinking that this is to
> eliminate data that are too close to the background noise level of the system?
> >
> > I wanted to canvass opinion as to whether people feel we need to do this
> if we have replicates and are using statistical tests - rather than just
> fold-changes - to identify 'interesting' genes. Does the statistical
> testing do this job for us?
>
>Hi,
> In my opinion you should always do some sort of non-specific
> filtering. What you have described is one form of it, others include
> removing genes that show little or no variability across samples.
> I think of non-specific filtering as filtering without reference to
> phenotype (of any sort).
>
> There are a number of reasons for doing this, some motivated by the
> biology and some by the statistics.
>
> First off, especially for Affy, the chip is designed for all tissue
> types but a commonly held belief is that only about 40% of the genome
> is expressed in any specific tissue type. So, for any experiment you
> will have a pretty large number of probes for genes that are not
> expressed in the tissue you are looking at.
>
> From a statistical perspective you need to be a little bit cautious
> if you are going to standardize genes across samples (this is pretty
> common). If you do not remove those genes that show little
> variability before standardization then you have just elevated the
> noise to the same status as the signal (and if the 40% estimate is
> right then you actually have more noise than signal - not too
> pleasant).
>
> Using a test statistic (such as a t-test) does not help, since that
> measures the between group differences relative to the variation. So
> if there is very little variation and a small difference in mean,
> you get an enormous t-statistic and a small p-value. Of course,
> in this case looking at the "fold-change" or the size of the effect
> will indicate a problem, but not many people check all the things
> that need checking (and what to check depends on the test that
> you have just carried out). It seems to me to be much easier to just
> filter those genes with no expression or little variation out at the
> very start.
All good points. One thing that does help though is to use a t-statistic
(or F or posterior odds or whatever) in which some form of shrinkage to a
common value has been applied to the standard deviations. This has the
effect of offsetting the smaller sample variances to be not less than a
certain size. We have found that empirical Bayes t-statistics do a good job
of eliminating the low-signal, low-variability genes without needing an
explicit filtering step.
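
To make the shrinkage idea concrete, here is a minimal sketch in plain Python. It is only an illustration, not the code we actually use. The prior variance and prior degrees of freedom (prior_var, prior_df) are fixed by hand here as an assumption; in a real empirical Bayes analysis they would be estimated from all the genes on the array. The gene names and intensities are made up.

```python
import math
import statistics


def pooled_variance(group_a, group_b):
    """Pooled within-group variance and its residual degrees of freedom."""
    df = len(group_a) + len(group_b) - 2
    ss = ((len(group_a) - 1) * statistics.variance(group_a)
          + (len(group_b) - 1) * statistics.variance(group_b))
    return ss / df, df


def t_statistics(group_a, group_b, prior_var, prior_df):
    """Return (ordinary t, moderated t) for a two-group comparison.

    The moderated version replaces the gene's own variance with a
    weighted average of that variance and a common prior value.
    """
    var, df = pooled_variance(group_a, group_b)
    diff = statistics.mean(group_b) - statistics.mean(group_a)
    scale = math.sqrt(1 / len(group_a) + 1 / len(group_b))
    # Shrink the gene-wise variance towards the common value.
    shrunk_var = (prior_df * prior_var + df * var) / (prior_df + df)
    ordinary = diff / (math.sqrt(var) * scale)
    moderated = diff / (math.sqrt(shrunk_var) * scale)
    return ordinary, moderated


# Hypothetical log-intensities: three replicates per group.
genes = {
    "low_signal": ([5.00, 5.01, 5.02], [5.05, 5.06, 5.07]),
    "real_change": ([8.0, 8.3, 7.8], [10.1, 9.7, 10.4]),
}

# Assumed prior values, chosen by hand for illustration only.
PRIOR_VAR = 0.05
PRIOR_DF = 4

results = {
    name: t_statistics(a, b, PRIOR_VAR, PRIOR_DF)
    for name, (a, b) in genes.items()
}

for name, (ordinary, moderated) in results.items():
    print(f"{name}: ordinary t = {ordinary:.2f}, moderated t = {moderated:.2f}")
```

The low-signal gene has a tiny difference in means but an even tinier variance, so its ordinary t-statistic is large. After shrinkage its variance can no longer be much smaller than the common value, and the statistic drops back towards zero, while the gene with a substantial change remains clearly significant.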
I have also wondered about the biological argument that many genes might
not be represented in a particular sample, and whether this means that
non-specific filtering should be applied. I guess the reason that I don't
do it at the moment is that I'm somewhat uneasy about possible selection
bias in the filtered intensities and standard deviations. Another factor
which allows us to avoid non-specific filtering is the use of background
correction methods which ensure that the lower intensities are not
especially variable.
Just some other thoughts.
Cheers
Gordon
> If they don't show any variation across samples they can't help to
> classify or to cluster (there is no information about any phenotype
> contained in them).
>
> Robert
>
>
> >
> > Crispin
> >
>
>--
>+---------------------------------------------------------------------------+
>| Robert Gentleman phone : (617) 632-5250 |
>| Associate Professor fax: (617) 632-2444 |
>| Department of Biostatistics office: M1B20 |
>| Harvard School of Public Health email: rgentlem at jimmy.harvard.edu |
>+---------------------------------------------------------------------------+