[BioC] replicates and low expression levels
Robert Gentleman
rgentlem at jimmy.harvard.edu
Mon Jun 2 08:16:19 MEST 2003
On Mon, Jun 02, 2003 at 11:17:26AM +0100, Claire Wilson wrote:
> >On Fri, May 30, 2003 at 05:28:45PM +0100, Crispin Miller wrote:
> > > Hi,
> > > Just a quick question about low expression levels on Affy systems - I
> > hope it's not too off-topic; it is about normalisation and data analysis...
> > > I've heard a lot of people advocating that it's a good idea to perform
> > an initial filtering on either Present Marginal or Absent calls, or on
> > gene-expression levels (so that only genes with an expression > 40, say,
> > after scaling to a TGT of 100 using the MAS5.0 algorithm, are part of the
> > further analysis). Firstly, am I right in thinking that this is to
> > eliminate data that are too close to the background noise level of the system?
> > >
> > > I wanted to canvass opinion as to whether people feel we need to do this
> > if we have replicates and are using statistical tests - rather than just
> > fold-changes - to identify 'interesting' genes. Does the statistical
> > testing do this job for us?
> >
> >Hi,
> > In my opinion you should always do some sort of non-specific
> > filtering. What you have described is one form of it, others include
> > removing genes that show little or no variability across samples.
> > I think of non-specific filtering as filtering without reference to
> > phenotype (of any sort).
> >
> > There are a number of reasons for doing this, some motivated by the
> > biology and some by the statistics.
> >
> > First off, especially for Affy, the chip is designed for all tissue
> > types but a commonly held belief is that only about 40% of the genome
> > is expressed in any specific tissue type. So, for any experiment you
> > will have a pretty large number of probes for genes that are not
> > expressed in the tissue you are looking at.
> > From a statistical perspective you need to be a little bit cautious
> > if you are going to standardize genes across samples (this is pretty
> > common). If you do not remove those genes that show little
> > variability before standardization then you have just elevated the
> > noise to the same status as the signal (and if the 40% estimate is
> > right then you actually have more noise than signal - not too
> > pleasant).
>
> Hi,
>
> Just to clarify a couple of points. This suggests to me that filtering of genes with low expression is required prior to normalization, and I was just wondering how this is achieved in Bioconductor without the use of Present/Absent calls. And following on from a later point:
>
> > you have just carried out). It seems to me to be much easier to just
> > filter those genes with no expression or little variation out at the
> > very start.
>
> what would be your filter for no expression or little variation?
>
Nope, they are important (and I'm not sure that they have been well
dealt with yet, as the variety of opinion shows).
I explored the gap filter as one way of filtering out those that were
unlikely to be informative. Otherwise simply looking for a decent
interquartile range could be helpful (decent is of course in the eye
of the beholder). (see the man pages in genefilter for more info and
examples)
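
To make the interquartile-range idea concrete, here is a minimal sketch in
plain Python. It is not the genefilter implementation; the function names,
the probe names, the data, and the 0.5 cutoff are all made up for
illustration, and the values are assumed to be on a log2 scale. It does not
attempt the gap filter.

```python
import statistics

def iqr(values):
    """Interquartile range (75th minus 25th percentile) of one probe's values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def filter_by_iqr(expr, min_iqr):
    """Keep probes whose IQR across samples is at least min_iqr.

    expr maps a probe ID to its list of expression values (one per sample).
    """
    return {probe: vals for probe, vals in expr.items() if iqr(vals) >= min_iqr}

# Hypothetical log2-scale values for three probes across six samples.
expr = {
    "probe_flat": [7.0, 7.1, 7.0, 6.9, 7.1, 7.0],
    "probe_variable": [5.0, 9.0, 6.0, 10.0, 5.5, 9.5],
    "probe_low_noise": [3.0, 3.2, 2.9, 3.1, 3.0, 3.1],
}

# The cutoff is an assumption; "decent" is for the analyst to decide.
kept = filter_by_iqr(expr, min_iqr=0.5)
print(sorted(kept))
```

The filter never looks at phenotype, so it is non-specific in the sense
used earlier in the thread.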
As for not expressed, my current thinking is as follows. Suppose that
the smallest meaningful group (by phenotype) has k samples in it (e.g.
we have 100 ALL samples, 10 have a t(4;11) translocation, and all
other subgroups of interest are larger). Then I
would want to require that some number, like 8 or 9, of the samples
had high expression values for the probe. I would definitely not be
interested in a probe that had, say, only 3 (of the 100) samples
registering an expression value larger than 100 (in Affy terms). I
just don't think that there is enough information in it to draw
conclusions. That said, my view changes pretty substantially if the
probes are identified with genes that are implicated in some of the
basic mechanisms of disease -- then I would poke around a little in
any event.
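
The rule described above (keep a probe only if at least some number of
samples exceed an expression cutoff) can be sketched in a few lines of
plain Python. This is an illustration under stated assumptions, not a
prescribed method: the function names, probe names, and data are
hypothetical, and the choices of 8 samples and a cutoff of 100 simply
mirror the numbers in the example.

```python
def at_least_k_over_a(values, k, a):
    """True if at least k of the values are strictly greater than a."""
    return sum(1 for v in values if v > a) >= k

def filter_expressed(expr, k, a):
    """Keep probes with at least k samples whose expression exceeds a.

    expr maps a probe ID to its list of expression values (one per sample).
    """
    return {probe: vals for probe, vals in expr.items()
            if at_least_k_over_a(vals, k, a)}

# Hypothetical data: 100 samples per probe, values on the Affy scale.
expr = {
    "probe_rare": [250.0] * 3 + [20.0] * 97,      # high in only 3 samples
    "probe_subgroup": [300.0] * 9 + [30.0] * 91,  # high in 9 samples
    "probe_common": [400.0] * 60 + [50.0] * 40,   # high in most samples
}

# k and a are assumptions taken from the example in the text.
expressed = filter_expressed(expr, k=8, a=100.0)
print(sorted(expressed))
```

A probe dropped this way (like the hypothetical probe_rare) can still be
rescued by hand when the biology says it matters.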
This remains a bit of an art and there are tradeoffs between
including irrelevant probes and excluding relevant ones. As we learn
more about the biology (or get more biological meta-data) this will
become simpler.
For example, suppose that I am studying T-cells and I know that gene
X is not normally expressed in these cells. Then in 3 (of my 100)
samples gene X is expressed. Well, I'd be pretty interested...
Mileage depends on the conditions, the vehicle and the driver; yours
may vary.
Regards,
Robert
> Sorry if these questions are a little basic
>
> Thanks
>
> Claire
--
+---------------------------------------------------------------------------+
| Robert Gentleman phone : (617) 632-5250 |
| Associate Professor fax: (617) 632-2444 |
| Department of Biostatistics office: M1B20 |
| Harvard School of Public Health email: rgentlem at jimmy.harvard.edu |
+---------------------------------------------------------------------------+