[BioC] Some Genefilter questions

Thu Nov 30 20:06:34 CET 2006

Hi Amy,

Don't you just love it when you get one response suggesting you do one 
thing (remove malarial genes after pre-processing) and another response 
suggesting the opposite?  Although I think in this case Robert was 
suggesting you remove them after pre-processing because it was easier than 
trying to modify either the normalization code or the cdf environment, 
which is what Jim pointed out to you. I ran into the same problem with 
probesets for other species on the soybean array, which is why I used 
Ariel's code. I think that if you're using a mixed-species array but have 
hybridized only one of the species to it, then you should remove the other 
species' probesets BEFORE doing the normalization because they really have 
no bearing on the transcriptome you're trying to measure. On the other 
hand, if you also want to filter your species' probesets based on 
presence/absence, minimum cutoff, variation, etc.* , then you should filter 
these genes AFTER doing the pre-processing because these probesets do 
contain information about the transcriptome, even if it is just 'not 
detectably expressed'.
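
To see why the order can matter, here is a minimal base-R sketch, not the actual workflow. It uses a toy matrix with made-up `Ag.`/`Pf.` row names and a hand-written quantile normalization (`quantile_normalize` is a hypothetical helper, assumed here as a stand-in for the normalization step of RMA). In practice, removal before pre-processing would be done at the probe level, for example in the cdf environment, which this sketch does not attempt.

```r
# Toy probeset-by-chip matrix: 6 "mosquito" rows and 4 "parasite" rows.
# All names and values are invented for illustration.
set.seed(1)
x <- matrix(rnorm(30, mean = 8), nrow = 10,
            dimnames = list(c(paste0("Ag.", 1:6), paste0("Pf.", 1:4)),
                            paste0("chip", 1:3)))

# Hypothetical helper: plain quantile normalization across columns.
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")  # rank within each chip
  target <- rowMeans(apply(m, 2, sort))               # mean of sorted columns
  out    <- apply(ranks, 2, function(r) target[r])    # map ranks to target
  dimnames(out) <- dimnames(m)
  out
}

is_parasite <- grepl("^Pf", rownames(x))

# Option 1: drop the parasite rows first, then normalize.
removed_before <- quantile_normalize(x[!is_parasite, ])

# Option 2: normalize everything, then drop the parasite rows.
removed_after <- quantile_normalize(x)[!is_parasite, ]

# The mosquito values are generally not the same under the two orderings.
isTRUE(all.equal(removed_before, removed_after))
```

The comparison at the end only shows that the two orderings give different numbers on this toy data; which ordering is preferable is the judgment call discussed above.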

Cheers,
Jenny

* Contrary to Robert, I prefer to filter on presence/absence (using Affy's 
calls) rather than variability :) I don't know if there is any 
documentation on which may be "better"...

At 05:15 PM 11/29/2006, Robert Gentleman wrote:
>Hi,
>
>Amy Mikhail wrote:
> > Dear Bioconductors,
> >
> > I am analysing 6 PlasmodiumAnopheles genechips, which have only Anopheles
> > mosquito samples hybridised to them (i.e. they are not infected
> > mosquitoes).  The 6 chips include 3 replicates, each consisting of two
> > time points.  The design matrix is as follows:
> >
> >> design
> >      M15d M43d
> > [1,]    1    0
> > [2,]    0    1
> > [3,]    1    0
> > [4,]    0    1
> > [5,]    1    0
> > [6,]    0    1
> >
> >
> > I have tried both gcRMA (in AffyLMGUI), and RMA, MBEI and MAS5 (in affy).
> > Looking at the (BH) adjusted p values <0.05, this gave me 2, 12, 0 and 0
> > DE genes, respectively... far fewer than I was expecting.
> >
> > As this affy chip contains probesets for both mosquito and malaria
> > parasite genes, I am wondering:
> >
> > (a) if it is better to remove all the parasite probesets before my
> > analysis;
>
>   Yes, if you don't intend to use them, and they are not relevant to
>your analysis. There is no point in doing p-value corrections for tests
>you know are not interesting/relevant a priori.
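
The effect on multiple-testing correction can be sketched with `p.adjust` from base R. The p-values below are invented purely for illustration: six tests of interest plus twenty irrelevant ones that are not expected to be significant.

```r
# Invented p-values: 6 mosquito probesets, 20 parasite probesets.
p_mosquito <- c(0.0004, 0.002, 0.009, 0.03, 0.2, 0.6)
p_parasite <- rep(c(0.3, 0.5, 0.7, 0.9), times = 5)

# BH adjustment over all 26 tests, keeping only the mosquito results.
adj_all <- p.adjust(c(p_mosquito, p_parasite), method = "BH")[seq_along(p_mosquito)]

# BH adjustment over the 6 relevant tests only.
adj_sub <- p.adjust(p_mosquito, method = "BH")

sum(adj_all < 0.05)  # 2
sum(adj_sub < 0.05)  # 4
```

With the irrelevant tests included, fewer of the same six p-values survive the 0.05 cutoff, simply because the correction is spread over more tests.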
>
> >
> > (b) if so at what stage I should do this (before or after normalisation
> > and background correction, or does it matter?)
>
>   After both and prior to analysis - otherwise you are likely to need to
>do some serious tweaking of the normalization code.
>
> >
> > (c) how would I filter out these probesets using genefilter (all the
> > parasite affy IDs begin with Pf. - could I use this prefix in the affy IDs
> > to filter out the probesets, and if so how?)
>
>    you don't need genefilter at all, this is a subsetting problem.
>   If you had an ExpressionSet you would do something like:
>
>    parasites = grepl("^Pf", featureNames(myExpressionSet))
>
>    mySubset = myExpressionSet[!parasites,]
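
One detail worth checking when subsetting by prefix: `grep()` returns integer positions, while `grepl()` returns a logical vector, and only a logical vector can be negated with `!` to mean "everything that does not match". A small base-R sketch on a plain character vector (the probeset IDs are made up):

```r
# Made-up probeset IDs; only the "Pf" prefix matters here.
ids <- c("Ag.1.0_at", "Pf.2.1_at", "Ag.7.3_s_at", "Pf.9.9_at", "Ag.4.4_at")

pos  <- grep("^Pf", ids)    # integer positions of the matches: 2 4
mask <- grepl("^Pf", ids)   # logical vector, one entry per ID

!pos                        # FALSE FALSE: not a usable mask over ids
ids[!mask]                  # the three non-parasite IDs
ids[-pos]                   # same result, but only safe if pos is non-empty
```

The same `!mask` (or `-pos`) index can be used as the row index of an expression matrix or an ExpressionSet.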
>
> >
> > Secondly, I did not add any of the polyA controls to my samples.  I would
> > like to know:
> >
> > (d) Do any of the bg correct / normalisation methods I tried utilise
> > affymetrix control probesets, and if so, how?
>
>    I doubt it.
>
> >
> > (e) Should I also filter out the control sets - again, if so at what stage
> > in the analysis and what would be an appropriate code to use?
> >
>
>    same place as you filter the parasite genes and pretty much in the
>same way. They are likely to start with AFFX.
>
> > I did try the code for non-specific filtering (on my RMA dataset) from pg.
> > 232 of the bioconductor monograph, but the reduction in the number of
> > probesets was quite drastic;
> >
> >> f1 <- pOverA(0.25, log2(100))
> >> f2 <- function(x) (IQR(x) > 0.5)
>
>   that is a typo in the text - you probably want to filter out those
>with IQR below the median, not for some fixed value.
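
A minimal base-R sketch of that idea, assuming a log2-scale expression matrix with probesets in rows (the data are simulated and the object names are hypothetical): compute the IQR of each probeset and keep those above the median IQR, rather than above a fixed value such as 0.5.

```r
# Simulated log2 expression matrix: 20 probesets by 6 chips.
set.seed(2)
expr <- matrix(rnorm(20 * 6, mean = 8), nrow = 20,
               dimnames = list(paste0("probe", 1:20), paste0("chip", 1:6)))

iqrs <- apply(expr, 1, IQR)     # per-probeset interquartile range
keep <- iqrs > median(iqrs)     # data-driven cutoff: the median IQR

expr_var <- expr[keep, ]        # the more variable half of the probesets
sum(keep)
```

Because the cutoff is the median, this keeps about half of the probesets whatever the scale of the data, which avoids the drastic reduction a fixed threshold can cause.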
>
> >> ff <- filterfun(f1, f2)
> >> selected <- genefilter(Baseage.transformed, ff)
> >> sum(selected)
> > [1] 404   ###(The original no. of probesets is 22,726)###
> >> Baseage.sub <- Baseage.transformed[selected, ]
> >
> > Also, I understood from the monograph that "100" was to filter out
> > fluorescence intensities less than this, but I am not clear if this is
> > from raw intensities or log2 values?
>
>   raw - 100 on the log2 scale is larger than can be represented in the
>image file formats used. And don't do that - it is not a good idea -
>filter on variability.
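
To make the scale question concrete: `log2(100)` is about 6.64, so the cutoff in `pOverA(0.25, log2(100))` corresponds to a raw intensity of 100 applied to log2-scale data. The sketch below mimics that kind of filter in base R; `p_over_a` is a hypothetical stand-in written for illustration, not the genefilter function itself, and the intensities are invented.

```r
# Hypothetical stand-in: TRUE when at least a proportion p of the
# values in x exceed the cutoff A.
p_over_a <- function(p, A) function(x) mean(x > A) >= p

cutoff <- log2(100)             # raw intensity 100 on the log2 scale
f1 <- p_over_a(0.25, cutoff)

# Invented raw intensities for two probesets across 6 chips, log2-transformed.
probe_a <- log2(c(40, 60, 150, 90, 80, 200))   # 2 of 6 chips above 100
probe_b <- log2(c(30, 20, 35, 25, 110, 28))    # 1 of 6 chips above 100

f1(probe_a)   # TRUE:  2/6 is at least 0.25
f1(probe_b)   # FALSE: 1/6 is below 0.25
```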
>
>
> >
> > All the parasite probesets have raw intensities <35 .... so could I apply
> > this as a simple filter, and would this have to be on raw (rather than
> > normalised data)?
>
>
>   Best wishes
>     Robert
>
> >
> > Apologies for the long posting...
> >
> > Looking forward to any replies,
> > Regards,
> > Amy
> >
> >> sessionInfo()
> > R version 2.4.0 (2006-10-03)
> > i386-pc-mingw32
> >
> > locale:
> > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> > States.1252;LC_MONETARY=English_United
> > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
> >
> > attached base packages:
> >  [1] "tcltk"     "splines"   "tools"     "methods"   "stats"
> > "graphics"  "grDevices" "utils"     "datasets"  "base"
> >
> > other attached packages:
> > plasmodiumanophelescdf              tkWidgets                 DynDoc
> >      widgetTools            agahomology
> >               "1.14.0"               "1.12.0"               "1.12.0"
> >         "1.10.0"               "1.14.2"
> >                affyPLM                  gcrma            matchprobes
> >         affydata                annaffy
> >               "1.10.0"                "2.6.0"                "1.6.0"
> >         "1.10.0"                "1.6.0"
> >                   KEGG                     GO                  limma
> >      geneplotter               annotate
> >               "1.14.0"               "1.14.0"                "2.9.1"
> >         "1.12.0"               "1.12.0"
> >                   affy                 affyio             genefilter
> >         survival                Biobase
> >               "1.12.0"                "1.2.0"               "1.12.0"
> >           "2.29"               "1.12.0"
> >
> >
> > -------------------------------------------
> > Amy Mikhail
> > Research student
> > University of Aberdeen
> > Zoology Building
> > Tillydrone Avenue
> > Aberdeen AB24 2TZ
> > Scotland
> > Email: a.mikhail at abdn.ac.uk
> > Phone: 00-44-1224-272880 (lab)
> >        00-44-1224-273256 (office)
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
>
>--
>Robert Gentleman, PhD
>Program in Computational Biology
>Division of Public Health Sciences
>Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N, M2-B876
>PO Box 19024
>Seattle, Washington 98109-1024
>206-667-7700
>rgentlem at fhcrc.org

Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu