[BioC] Some Genefilter questions

Fri Dec 1 02:33:10 CET 2006

Robert,
    there are two sets of studies that, from what I remember, have suggested the ~40% expression level.  Classic Cot curve studies from several decades ago suggested roughly this level.  More recently, MPSS (Massively Parallel Signature Sequencing) studies have also suggested this is a reasonable cutoff.  Based on these studies I use the same rule of thumb that you do - the median.
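
For concreteness, the median rule of thumb can be sketched in base R as follows. This is only an illustration under stated assumptions: the matrix `exprs_mat` is simulated, its row and column names are invented, and the IQR is used as the variability measure. With an ExpressionSet the same idea would be applied to the matrix returned by exprs().

```r
# Simulated log2 expression matrix: 1000 probesets x 6 arrays (made-up data)
set.seed(1)
exprs_mat <- matrix(rnorm(6000, mean = 8, sd = 1), nrow = 1000,
                    dimnames = list(paste0("probe", 1:1000),
                                    paste0("chip", 1:6)))

# Per-probeset interquartile range across the arrays
iqr <- apply(exprs_mat, 1, IQR)

# Keep the probesets whose IQR is above the median IQR (about half of them)
keep <- iqr > median(iqr)
filtered <- exprs_mat[keep, , drop = FALSE]
dim(filtered)
```

The cutoff here is relative to the data rather than a fixed number, so the fraction retained stays near 50% whatever the scale of the arrays.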

David Pritchard

On Thu, 30 Nov 2006, Robert Gentleman wrote:

> Hi,
>
> Lourdusamy A Anbarasu wrote:
>> Dear Dr. Robert,
>>
>> You have mentioned that filtering on variability is preferred to
>> filtering on raw intensity values. I have also read your previous post
>> on this issue. For filters based on CV, are there any recommended cut-off values?
>
>  Not really. A widely held, but AFAIK undocumented, belief is that in
> any given tissue/cell about 40% of the genome is expressed at any time.
> So, I usually choose the median - that is somewhat conservative with
> respect to the above cited statistic - but this is a personal
> preference. I have not seen any research (and I think it would be hard).
>
>
>   best wishes
>    Robert
>
>>
>> Thanks in advance.
>>
>> Best regards,
>> Anbarasu
>>
>> On 11/30/06, Robert Gentleman <rgentlem at fhcrc.org> wrote:
>>
>>     Hi,
>>
>>     Amy Mikhail wrote:
>>     > Dear Bioconductors,
>>     >
>>     > I am analysing 6 PlasmodiumAnopheles genechips, which have only
>>     Anopheles
>>     > mosquito samples hybridised to them (i.e. they are not infected
>>     > mosquitoes).  The 6 chips include 3 replicates, each consisting
>>     of two
>>     > time points.  The design matrix is as follows:
>>     >
>>     >> design
>>     >      M15d M43d
>>     > [1,]    1    0
>>     > [2,]    0    1
>>     > [3,]    1    0
>>     > [4,]    0    1
>>     > [5,]    1    0
>>     > [6,]    0    1
>>     >
>>     >
>>     > I have tried both gcRMA (in AffyLMGUI), and RMA, MBEI and MAS5
>>     (in affy).
>>     > Looking at the (BH) adjusted p values <0.05, this gave me 2, 12,
>>     0 and 0
>>     > DE genes, respectively... much less than I was expecting.
>>     >
>>     > As this affy chip contains probesets for both mosquito and malaria
>>     > parasite genes, I am wondering:
>>     >
>>     > (a) if it is better to remove all the parasite probesets before
>>     my analysis;
>>
>>       Yes, if you don't intend to use them, and they are not relevant to
>>     your analysis. There is no point in doing p-value corrections for tests
>>     you know are not interesting/relevant a priori.
>>
>>     >
>>     > (b) if so at what stage I should do this (before or after
>>     normalisation
>>     > and background correction, or does it matter?)
>>
>>       After both and prior to analysis - otherwise you are likely to
>>     need to
>>     do some serious tweaking of the normalization code.
>>
>>     >
>>     > (c) how would I filter out these probesets using genefilter (all the
>>     > parasite affy IDs begin with Pf. - could I use this prefix in the
>>     affy IDs
>>     > to filter out the probesets, and if so how?)
>>
>>        you don't need genefilter at all, this is a subsetting problem.
>>       If you had an ExpressionSet you would do something like:
>>
>>        parasites = grepl("^Pf", featureNames(myExpressionSet))
>>
>>        mySubset = myExpressionSet[!parasites,]
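
A minimal, self-contained sketch of this kind of prefix-based subsetting, using a plain matrix so it runs without Bioconductor. The probeset IDs and the names `exprs_mat`, `is_parasite`, `is_control` and `mosquito_only` are invented for illustration; with an ExpressionSet, featureNames() would take the place of rownames(). Note that grep() returns integer positions, for which `!` does not give the complement, so the sketch uses grepl(), which returns a logical vector. It also drops control probesets, assuming their IDs start with AFFX.

```r
# Toy expression matrix; the row names are invented probeset IDs
ids <- c("Ag.2L.1_at", "Pf.1.1_at", "AFFX-BioB-5_at", "Ag.3R.7_at", "Pf.9.3_s_at")
exprs_mat <- matrix(seq_len(length(ids) * 2), nrow = length(ids),
                    dimnames = list(ids, c("chip1", "chip2")))

# grepl() gives TRUE/FALSE per probeset, so it can be negated with `!`
is_parasite <- grepl("^Pf", rownames(exprs_mat))
is_control  <- grepl("^AFFX", rownames(exprs_mat))

# Drop parasite and control probesets in one step
mosquito_only <- exprs_mat[!(is_parasite | is_control), , drop = FALSE]
rownames(mosquito_only)
```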
>>
>>     >
>>     > Secondly, I did not add any of the polyA controls to my
>>     samples.  I would
>>     > like to know:
>>     >
>>     > (d) Do any of the bg correct / normalisation methods I tried utilise
>>     > affymetrix control probesets, and if so, how?
>>
>>        I doubt it.
>>
>>     >
>>     > (e) Should I also filter out the control sets - again, if so at
>>     what stage
>>     > in the analysis and what would be an appropriate code to use?
>>     >
>>
>>        same place as you filter the parasite genes and pretty much in the
>>     same way. They are likely to start with AFFX.
>>
>>     > I did try the code for non-specific filtering (on my RMA dataset)
>>     from pg.
>>     > 232 of the bioconductor monograph, but the reduction in the number of
>>     > probesets was quite drastic;
>>     >
>>     >> f1 <- pOverA(0.25, log2(100))
>>     >> f2 <- function(x) (IQR(x) > 0.5)
>>
>>       that is a typo in the text - you probably want to filter out those
>>     with IQR below the median, not for some fixed value.
>>
>>     >> ff <- filterfun(f1, f2)
>>     >> selected <- genefilter(Baseage.transformed , ff)
>>     >> sum(selected)
>>     > [1] 404   ###(The original no. of probesets is 22,726)###
>>     >> Baseage.sub <- Baseage.transformed[selected, ]
>>     >
>>     > Also, I understood from the monograph that "100" was to filter out
>>     > fluorescence intensities less than this, but I am not clear if
>>     this is
>>     > from raw intensities or log2 values?
>>
>>       raw - 100 on the log2 scale is larger than can be represented in the
>>     image file formats used. And don't do that - it is not a good idea -
>>     filter on variability.
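
To make the scale argument concrete, here is a small base R sketch. It assumes 16-bit scanner images (an assumption, not something stated in this thread), and the intensity vector `x` is made up. The last lines write out by hand the rule that pOverA(0.25, log2(100)) applies to one probeset: at least 25% of the arrays must exceed the cutoff.

```r
# The cutoff used in the monograph code, on the log2 scale
cutoff <- log2(100)            # about 6.64

# Largest log2 value a 16-bit image could hold (assumed bit depth)
max_16bit <- log2(2^16 - 1)    # just under 16, so a log2 value of 100 cannot occur

# Made-up log2 intensities for one probeset across 6 arrays
x <- c(5.9, 7.2, 6.1, 6.5, 6.0, 6.3)

# Hand-written version of the pOverA(0.25, log2(100)) rule for this probeset
passes <- mean(x > cutoff) >= 0.25
passes
```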
>>
>>
>>     >
>>     > All the parasite probesets have raw intensities <35 .... so could
>>     I apply
>>     > this as a simple filter, and would this have to be on raw (rather
>>     than
>>     > normalised data)?
>>
>>
>>       Best wishes
>>         Robert
>>
>>     >
>>     > Apologies for the long posting...
>>     >
>>     > Looking forward to any replies,
>>     > Regards,
>>     > Amy
>>     >
>>     >> sessionInfo()
>>     > R version 2.4.0 (2006-10-03)
>>     > i386-pc-mingw32
>>     >
>>     > locale:
>>     > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>     > States.1252;LC_MONETARY=English_United
>>     > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>     >
>>     > attached base packages:
>>     >  [1] "tcltk"     "splines"   "tools"     "methods"   "stats"
>>     > "graphics"  "grDevices" "utils"     "datasets"  "base"
>>     >
>>     > other attached packages:
>>     > plasmodiumanophelescdf    tkWidgets       DynDoc  widgetTools  agahomology
>>     >               "1.14.0"     "1.12.0"     "1.12.0"     "1.10.0"     "1.14.2"
>>     >                affyPLM        gcrma  matchprobes     affydata      annaffy
>>     >               "1.10.0"      "2.6.0"      "1.6.0"     "1.10.0"      "1.6.0"
>>     >                   KEGG           GO        limma  geneplotter     annotate
>>     >               "1.14.0"     "1.14.0"      "2.9.1"     "1.12.0"     "1.12.0"
>>     >                   affy       affyio   genefilter     survival      Biobase
>>     >               "1.12.0"      "1.2.0"     "1.12.0"       "2.29"     "1.12.0"
>>     >
>>     >
>>     > -------------------------------------------
>>     > Amy Mikhail
>>     > Research student
>>     > University of Aberdeen
>>     > Zoology Building
>>     > Tillydrone Avenue
>>     > Aberdeen AB24 2TZ
>>     > Scotland
>>     > Email: a.mikhail at abdn.ac.uk
>>     > Phone: 00-44-1224-272880 (lab)
>>     >        00-44-1224-273256 (office)
>>     >
>>     > _______________________________________________
>>     > Bioconductor mailing list
>>     > Bioconductor at stat.math.ethz.ch
>>     > https://stat.ethz.ch/mailman/listinfo/bioconductor
>>     > Search the archives:
>>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>>     >
>>
>>     --
>>     Robert Gentleman, PhD
>>     Program in Computational Biology
>>     Division of Public Health Sciences
>>     Fred Hutchinson Cancer Research Center
>>     1100 Fairview Ave. N, M2-B876
>>     PO Box 19024
>>     Seattle, Washington 98109-1024
>>     206-667-7700
>>     rgentlem at fhcrc.org
>>
>>
>>
>>
>>
>> --
>> Lourdusamy A Anbarasu
>> Dipartimento Medicina Sperimentale e Sanita Pubblica
>> Via Scalzino 3
>> 62032 Camerino (MC)
>