[BioC] Some Genefilter questions

Robert Gentleman rgentlem at fhcrc.org
Thu Nov 30 19:12:00 CET 2006


Hi,

Lourdusamy A Anbarasu wrote:
> Dear Dr. Robert,
> 
> You have mentioned that filtering on variability is preferred to 
> filtering on raw intensity values. I have also read your previous post 
> on this issue. For filters based on CV, are there any recommended 
> cut-off values?

  Not really. A widely held, but AFAIK undocumented, belief is that in 
any given tissue/cell about 40% of the genome is expressed at any time. 
So I usually use the median of the variability measure as the cut-off, 
which keeps about half of the probesets and is therefore somewhat 
conservative with respect to the figure cited above. This is a personal 
preference, though: I have not seen any research on the question (and I 
think it would be hard to do).
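
A minimal sketch of a median-based variability filter, in base R only. The matrix, its values and the probeset names are invented for illustration, and IQR is used as the variability measure (a CV-based filter would follow the same pattern):

```r
# Toy log2 expression matrix: 6 probesets (rows) x 6 arrays (columns).
# Row i alternates between 8 - spread[i] and 8 + spread[i], so its IQR
# grows with spread[i]. All values are invented for illustration.
spread  <- c(0.05, 0.1, 0.2, 0.4, 0.8, 1.6)
pattern <- c(-1, 1, -1, 1, -1, 1)
exprs_mat <- 8 + outer(spread, pattern)
rownames(exprs_mat) <- paste0("ps", 1:6)   # hypothetical probeset IDs

# Per-probeset variability, then a data-driven cut-off at its median.
iqr_per_probe <- apply(exprs_mat, 1, IQR)
cutoff <- median(iqr_per_probe)

# Keep only probesets whose IQR exceeds the median IQR.
keep <- iqr_per_probe > cutoff
filtered <- exprs_mat[keep, , drop = FALSE]
```

With an ExpressionSet, the same steps would presumably be applied to the matrix returned by exprs(), and the logical vector used to subset the object's rows.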


   best wishes
    Robert

> 
> Thanks in advance.
> 
> Best regards,
> Anbarasu
> 
> On 11/30/06, Robert Gentleman <rgentlem at fhcrc.org> wrote:
> 
>     Hi,
> 
>     Amy Mikhail wrote:
>      > Dear Bioconductors,
>      >
>      > I am analysing 6 PlasmodiumAnopheles genechips, which have only
>      > Anopheles mosquito samples hybridised to them (i.e. they are not
>      > infected mosquitoes).  The 6 chips include 3 replicates, each
>      > consisting of two time points.  The design matrix is as follows:
>      >
>      >> design
>      >      M15d M43d
>      > [1,]    1    0
>      > [2,]    0    1
>      > [3,]    1    0
>      > [4,]    0    1
>      > [5,]    1    0
>      > [6,]    0    1
>      >
>      >
>      > I have tried both gcRMA (in AffyLMGUI), and RMA, MBEI and MAS5
>      > (in affy). Looking at the (BH) adjusted p values <0.05, this gave
>      > me 2, 12, 0 and 0 DE genes, respectively... much less than I was
>      > expecting.
>      >
>      > As this affy chip contains probesets for both mosquito and malaria
>      > parasite genes, I am wondering:
>      >
>      > (a) if it is better to remove all the parasite probesets before
>      > my analysis;
> 
>       Yes, if you don't intend to use them, and they are not relevant to
>     your analysis. There is no point in doing p-value corrections for tests
>     you know are not interesting/relevant a priori.
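
A small worked example of why the number of tests matters for the BH adjustment, using base R's p.adjust(). The p-values are invented: three hypothetical probesets of interest and 97 irrelevant ones that show no signal.

```r
# Invented p-values: 3 tests of interest plus 97 irrelevant tests.
p_interest <- c(0.001, 0.004, 0.01)
p_ignored  <- rep(0.9, 97)

# BH adjustment over all 100 tests, then look at the 3 of interest.
adj_all <- p.adjust(c(p_interest, p_ignored), method = "BH")[1:3]

# BH adjustment over the 3 relevant tests only.
adj_sub <- p.adjust(p_interest, method = "BH")

# Number of the 3 tests passing an adjusted p < 0.05 cut-off in each case.
n_sig_all <- sum(adj_all < 0.05)
n_sig_sub <- sum(adj_sub < 0.05)
```

Here none of the three tests survives the adjustment when the irrelevant tests are included, while all three do once those tests are dropped beforehand.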
> 
>      >
>      > (b) if so at what stage I should do this (before or after
>      > normalisation and background correction, or does it matter?)
> 
>       After both and prior to analysis - otherwise you are likely to
>     need to do some serious tweaking of the normalization code.
> 
>      >
>      > (c) how would I filter out these probesets using genefilter (all the
>      > parasite affy IDs begin with Pf. - could I use this prefix in the
>      > affy IDs to filter out the probesets, and if so how?)
> 
>        you don't need genefilter at all, this is a subsetting problem.
>       If you had an ExpressionSet you would do something like:
> 
>        parasites = grepl("^Pf", featureNames(myExpressionSet))
> 
>        mySubset = myExpressionSet[!parasites,]
> 
>      >
>      > Secondly, I did not add any of the polyA controls to my
>      > samples.  I would like to know:
>      >
>      > (d) Do any of the bg correct / normalisation methods I tried utilise
>      > affymetrix control probesets, and if so, how?
> 
>        I doubt it.
> 
>      >
>      > (e) Should I also filter out the control sets - again, if so at
>      > what stage in the analysis and what would be an appropriate code
>      > to use?
>      >
> 
>        same place as you filter the parasite genes and pretty much in the
>     same way. They are likely to start with AFFX.
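
A self-contained sketch of prefix-based subsetting on a plain matrix, assuming the "Pf" and "AFFX" prefixes mentioned in this thread. The probeset IDs and values are hypothetical stand-ins for featureNames() of a real ExpressionSet. Note that grepl() returns a logical vector, which is what negation with ! requires, whereas grep() returns integer positions.

```r
# Hypothetical probeset IDs standing in for featureNames(myExpressionSet).
probe_ids <- c("Ag.1_at", "Pf.1_at", "AFFX-ctrl_at", "Ag.2_at", "Pf.2_s_at")

# Toy expression matrix: 5 probesets x 2 chips, values are placeholders.
toy <- matrix(seq_len(10), nrow = 5,
              dimnames = list(probe_ids, c("chip1", "chip2")))

# Logical flags for the two groups of probesets to drop.
is_parasite <- grepl("^Pf", probe_ids)
is_control  <- grepl("^AFFX", probe_ids)

# Keep everything that is neither a parasite nor a control probeset.
kept <- toy[!(is_parasite | is_control), , drop = FALSE]
```

Row subsetting of an ExpressionSet with a logical vector should work the same way as for the matrix used here.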
> 
>      > I did try the code for non-specific filtering (on my RMA dataset)
>      > from pg. 232 of the bioconductor monograph, but the reduction in
>      > the number of probesets was quite drastic;
>      >
>      >> f1 <- pOverA(0.25, log2(100))
>      >> f2 <- function(x) (IQR(x) > 0.5)
> 
>       that is a typo in the text - you probably want to filter out those
>     with IQR below the median, rather than below some fixed value.
> 
>      >> ff <- filterfun(f1, f2)
>      >> selected <- genefilter(Baseage.transformed , ff)
>      >> sum(selected)
>      > [1] 404   ###(The original no. of probesets is 22,726)###
>      >> Baseage.sub <- Baseage.transformed[selected, ]
>      >
>      > Also, I understood from the monograph that "100" was to filter out
>      > fluorescence intensities less than this, but I am not clear if
>      > this is from raw intensities or log2 values?
> 
>       raw - a value of 100 on the log2 scale (2^100 raw) is far larger
>     than can be represented in the image file formats used. And don't
>     filter on intensity - it is not a good idea - filter on variability
>     instead.
> 
> 
>      >
>      > All the parasite probesets have raw intensities <35 .... so could
>      > I apply this as a simple filter, and would this have to be on raw
>      > (rather than normalised) data?
> 
> 
>       Best wishes
>         Robert
> 
>      >
>      > Apologies for the long posting...
>      >
>      > Looking forward to any replies,
>      > Regards,
>      > Amy
>      >
>      >> sessionInfo()
>      > R version 2.4.0 (2006-10-03)
>      > i386-pc-mingw32
>      >
>      > locale:
>      > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>      > States.1252;LC_MONETARY=English_United
>      > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>      >
>      > attached base packages:
>      >  [1] "tcltk"     "splines"   "tools"     "methods"   "stats"
>      > "graphics"  "grDevices" "utils"     "datasets"  "base"
>      >
>      > other attached packages:
>      > plasmodiumanophelescdf   "1.14.0"
>      > tkWidgets                "1.12.0"
>      > DynDoc                   "1.12.0"
>      > widgetTools              "1.10.0"
>      > agahomology              "1.14.2"
>      > affyPLM                  "1.10.0"
>      > gcrma                    "2.6.0"
>      > matchprobes              "1.6.0"
>      > affydata                 "1.10.0"
>      > annaffy                  "1.6.0"
>      > KEGG                     "1.14.0"
>      > GO                       "1.14.0"
>      > limma                    "2.9.1"
>      > geneplotter              "1.12.0"
>      > annotate                 "1.12.0"
>      > affy                     "1.12.0"
>      > affyio                   "1.2.0"
>      > genefilter               "1.12.0"
>      > survival                 "2.29"
>      > Biobase                  "1.12.0"
>      >
>      >
>      > -------------------------------------------
>      > Amy Mikhail
>      > Research student
>      > University of Aberdeen
>      > Zoology Building
>      > Tillydrone Avenue
>      > Aberdeen AB24 2TZ
>      > Scotland
>      > Email: a.mikhail at abdn.ac.uk
>      > Phone: 00-44-1224-272880 (lab)
>      >        00-44-1224-273256 (office)
>      >
>      > _______________________________________________
>      > Bioconductor mailing list
>      > Bioconductor at stat.math.ethz.ch
>      > https://stat.ethz.ch/mailman/listinfo/bioconductor
>      > Search the archives:
>      > http://news.gmane.org/gmane.science.biology.informatics.conductor
>      >
> 
>     --
>     Robert Gentleman, PhD
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M2-B876
>     PO Box 19024
>     Seattle, Washington 98109-1024
>     206-667-7700
>     rgentlem at fhcrc.org
> 
> -- 
> Lourdusamy A Anbarasu
> Dipartimento Medicina Sperimentale e Sanita Pubblica
> Via Scalzino 3
> 62032 Camerino (MC)

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org


