[BioC] replicates and low expression levels
Robert Gentleman
rgentlem at jimmy.harvard.edu
Sun Jun 1 17:50:20 MEST 2003
On Fri, May 30, 2003 at 05:28:45PM +0100, Crispin Miller wrote:
> Hi,
> Just a quick question about low expression levels on Affy systems - I hope it's not too off-topic; it is about normalisation and data analysis...
> I've heard a lot of people advocating that it's a good idea to perform an initial filtering on either Present, Marginal or Absent calls, or on gene-expression levels (so that only genes with an expression > 40, say, after scaling to a TGT of 100 using the MAS5.0 algorithm, are part of the further analysis). Firstly, am I right in thinking that this is to eliminate data that are too close to the background noise level of the system?
>
> I wanted to canvass opinion as to whether people feel we need to do this if we have replicates and are using statistical tests - rather than just fold-changes - to identify 'interesting' genes. Does the statistical testing do this job for us?
Hi,
In my opinion you should always do some sort of non-specific
filtering. What you have described is one form of it; others include
removing genes that show little or no variability across samples.
I think of non-specific filtering as filtering without reference to
phenotype (of any sort).
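
To make the idea concrete, here is a minimal sketch of a non-specific filter in plain Python. The function name nonspecific_filter, the thresholds (an expression floor of 40, a minimum fraction of samples, a minimum standard deviation) and the toy numbers are all assumptions chosen for illustration, not recommended values. The point is only that the filter never looks at group labels.

```python
from statistics import pstdev


def nonspecific_filter(expr, min_level=40.0, min_fraction=0.25, min_sd=5.0):
    """Keep genes that are expressed and variable, ignoring phenotype.

    expr maps a probe ID to its expression values across samples.
    All three thresholds are illustrative assumptions.
    """
    kept = {}
    for probe, values in expr.items():
        # Fraction of samples in which the probe exceeds the expression floor.
        expressed = sum(v > min_level for v in values) / len(values)
        # Spread across samples, computed without any group labels.
        spread = pstdev(values)
        if expressed >= min_fraction and spread >= min_sd:
            kept[probe] = values
    return kept


# Toy data on a MAS5-like scale (made-up numbers).
expr = {
    "probe_off": [12.0, 15.0, 11.0, 14.0, 13.0, 12.0],          # not expressed
    "probe_flat": [200.0, 201.0, 199.0, 200.0, 202.0, 198.0],   # expressed, no variation
    "probe_var": [80.0, 95.0, 310.0, 290.0, 85.0, 305.0],       # expressed and variable
}
filtered = nonspecific_filter(expr)
print(sorted(filtered))
```

With these made-up values only the probe that is both expressed and variable passes; the other two are dropped before any test involving phenotype is run.
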
There are a number of reasons for doing this, some motivated by the
biology and some by the statistics.
First off, especially for Affy, the chip is designed for all tissue
types but a commonly held belief is that only about 40% of the genome
is expressed in any specific tissue type. So, for any experiment you
will have a pretty large number of probes for genes that are not
expressed in the tissue you are looking at.
From a statistical perspective you need to be a little bit cautious
if you are going to standardize genes across samples (this is pretty
common). If you do not remove those genes that show little
variability before standardization then you have just elevated the
noise to the same status as the signal (and if the 40% estimate is
right then you actually have more noise than signal - not too
pleasant).
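
A small sketch of the standardization point, again in plain Python with made-up numbers. The helper standardize and the two example genes are hypothetical. Before standardization the two genes differ enormously in spread; afterwards both have unit standard deviation, so anything downstream treats the wobble of the flat gene as seriously as the real change.

```python
from statistics import mean, pstdev


def standardize(values):
    """Centre a gene at 0 and scale it to unit standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


noise_gene = [100.0, 101.0, 99.0, 100.5, 99.5, 100.0]   # tiny wobble around 100
signal_gene = [50.0, 60.0, 55.0, 400.0, 420.0, 410.0]   # large change across samples

# Ratio of spreads on the original scale.
raw_ratio = pstdev(signal_gene) / pstdev(noise_gene)

# After standardization both genes have the same spread by construction.
z_noise = standardize(noise_gene)
z_signal = standardize(signal_gene)

print("spread ratio before:", raw_ratio)
print("spread after:", pstdev(z_noise), pstdev(z_signal))
```

Filtering out the flat gene first avoids the problem, because it is never standardized in the first place.
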
Using a test statistic (such as a t-test) does not help, since that
measures the between-group differences relative to the variation. So
if there is very little variation and a small difference in mean,
you get an enormous t-statistic and a small p-value. Of course in
this case looking at the "fold-change" or the size of the effect
will indicate a problem, but not many people check all the things
that need checking (and what to check depends on the test that
you have just carried out). It seems to me to be much easier to just
filter those genes with no expression or little variation out at the
very start.
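
The t-statistic behaviour can be sketched with a pooled two-sample t computed by hand. The function pooled_t and all the numbers are illustrative assumptions: one gene sits near background with a small shift and almost no spread, the other is clearly expressed with roughly a two-fold change.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t-statistic with a pooled (equal-variance) estimate."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1 / na + 1 / nb))


# Near-background gene: a 2-unit shift with almost no spread.
quiet_a = [20.0, 20.1, 19.9, 20.0]
quiet_b = [22.0, 22.1, 21.9, 22.0]

# Clearly expressed gene: roughly a two-fold change with larger spread.
loud_a = [200.0, 260.0, 180.0, 240.0]
loud_b = [400.0, 520.0, 390.0, 470.0]

t_quiet = pooled_t(quiet_a, quiet_b)
t_loud = pooled_t(loud_a, loud_b)
fold_quiet = mean(quiet_b) / mean(quiet_a)
fold_loud = mean(loud_b) / mean(loud_a)

print("quiet gene: t =", t_quiet, "fold =", fold_quiet)
print("loud gene:  t =", t_loud, "fold =", fold_loud)
```

In this toy case the near-background gene gets the larger t-statistic even though its fold-change is much smaller, which is exactly the situation where checking the effect size, or filtering beforehand, catches the problem.
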
If they don't show any variation across samples they can't help to
classify or to cluster (there is no information about any phenotype
contained in them).
Robert
>
> Crispin
>
--
+---------------------------------------------------------------------------+
| Robert Gentleman phone : (617) 632-5250 |
| Associate Professor fax: (617) 632-2444 |
| Department of Biostatistics office: M1B20 |
| Harvard School of Public Health email: rgentlem at jimmy.harvard.edu |
+---------------------------------------------------------------------------+