[BioC] replicates and low expression levels

Robert Gentleman rgentlem at jimmy.harvard.edu
Sun Jun 1 17:50:20 MEST 2003


On Fri, May 30, 2003 at 05:28:45PM +0100, Crispin Miller wrote:
> Hi,
> Just a quick question about low expression levels on Affy systems - I hope it's not too off-topic; it is about normalisation and data analysis...
> I've heard a lot of people advocating that it's a good idea to perform an initial filtering on either Present, Marginal or Absent calls, or on gene-expression levels (so that only genes with an expression > 40, say, after scaling to a TGT of 100 using the MAS5.0 algorithm, are part of the further analysis). Firstly, am I right in thinking that this is to eliminate data that are too close to the background noise level of the system?
> 
> I wanted to canvass opinion as to whether people feel we need to do this if we have replicates and are using statistical tests - rather than just fold-changes - to identify 'interesting' genes. Does the statistical testing do this job for us?

Hi,
  In my opinion you should always do some sort of non-specific
  filtering. What you have described is one form of it; others include
  removing genes that show little or no variability across samples.
  I think of non-specific filtering as filtering without reference to
  phenotype (of any sort).
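
  A minimal sketch of such a non-specific filter, in plain Python.
  The expression values, the gene names, the coefficient-of-variation
  cutoff and the helper nonspecific_filter are all made up for
  illustration; only the intensity cutoff of 40 comes from the
  question above. Note that no phenotype labels are used anywhere.

```python
from statistics import mean, stdev

# Hypothetical expression values: gene -> intensities across six samples.
expr = {
    "geneA": [12.0, 15.0, 11.0, 14.0, 13.0, 12.0],        # near background
    "geneB": [210.0, 205.0, 208.0, 207.0, 209.0, 206.0],  # expressed, nearly flat
    "geneC": [150.0, 420.0, 160.0, 390.0, 140.0, 410.0],  # expressed and variable
}

MIN_EXPR = 40.0  # intensity cutoff mentioned in the question
MIN_CV = 0.10    # assumed coefficient-of-variation cutoff (illustrative only)

def nonspecific_filter(expr, min_expr=MIN_EXPR, min_cv=MIN_CV):
    """Keep genes that are expressed and variable, without using phenotype."""
    kept = []
    for gene, values in expr.items():
        m = mean(values)
        cv = stdev(values) / m if m > 0 else 0.0
        if m > min_expr and cv > min_cv:
            kept.append(gene)
    return kept

kept_genes = nonspecific_filter(expr)
print(kept_genes)
```

  With these made-up numbers only the expressed and variable gene
  survives; the cutoffs themselves would need to be chosen for the
  data at hand.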

  There are a number of reasons for doing this, some motivated by the
  biology and some by the statistics.

  First off, especially for Affy, the chip is designed for all tissue
  types but a commonly held belief is that only about 40% of the genome
  is expressed in any specific tissue type. So, for any experiment you
  will have a pretty large number of probes for genes that are not
  expressed in the tissue you are looking at.

  From a statistical perspective you need to be a little bit cautious
  if you are going to standardize genes across samples (this is pretty
  common). If you do not remove those genes that show little
  variability before standardization then you have just elevated the
  noise to the same status as the signal (and if the 40% estimate is
  right then you actually have more noise than signal - not too
  pleasant).
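
  The standardization point can be illustrated with a small sketch
  (plain Python, invented numbers, and a hypothetical helper called
  standardize). A gene that only wobbles around a constant level and
  a gene with a large real spread end up with exactly the same
  variance once each is centered and scaled.

```python
from statistics import mean, stdev

def standardize(values):
    """Center to mean 0 and scale to standard deviation 1 (a z-score per gene)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical intensities across six samples.
noise_gene = [50.0, 50.4, 49.8, 50.1, 49.9, 50.2]        # essentially flat
signal_gene = [100.0, 400.0, 120.0, 380.0, 90.0, 410.0]  # large real spread

z_noise = standardize(noise_gene)
z_signal = standardize(signal_gene)

# Before standardization the spreads differ by orders of magnitude ...
print(stdev(noise_gene), stdev(signal_gene))
# ... afterwards both genes have unit spread and look equally "variable".
print(stdev(z_noise), stdev(z_signal))
```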

  Using a test statistic (such as a t-test) does not help, since that
  measures the between-group difference relative to the variation. If
  there is very little variation and a small difference in means, you
  get an enormous t-statistic and a small p-value. Of course, in this
  case looking at the "fold-change" or the size of the effect will
  indicate a problem, but not many people check all the things that
  need checking (and what to check depends on the test that you have
  just carried out). It seems to me to be much easier to just filter
  out those genes with no expression or little variation at the very
  start.
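
  As a rough numerical illustration of the t-statistic problem, here
  is a sketch in plain Python. The two groups are invented, and
  t_statistic is a hypothetical helper that computes the ordinary
  pooled-variance two-sample t statistic (equal variances assumed).

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(x, y):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(y) - mean(x)) / sqrt(pooled * (1.0 / nx + 1.0 / ny))

# Hypothetical gene: almost no variation within groups, tiny shift between them.
group1 = [100.0, 100.1, 99.9, 100.0]
group2 = [101.0, 101.1, 100.9, 101.0]

t = t_statistic(group1, group2)
fold_change = mean(group2) / mean(group1)

# The t statistic is very large even though the fold-change is about 1%.
print(t, fold_change)
```

  The test flags the gene as strongly "different" while the effect
  size is negligible, which is the situation a filter on expression
  or variation is meant to avoid.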

  If they don't show any variation across samples they can't help to
  classify or to cluster (there is no information about any phenotype
  contained in them).

  Robert


> 
> Crispin
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

-- 
+---------------------------------------------------------------------------+
| Robert Gentleman                 phone : (617) 632-5250                   |
| Associate Professor              fax:   (617)  632-2444                   |
| Department of Biostatistics      office: M1B20                            |
| Harvard School of Public Health  email: rgentlem at jimmy.harvard.edu        |
+---------------------------------------------------------------------------+


