[BioC] Some Genefilter questions

Claus Mayer claus at bioss.ac.uk
Fri Dec 1 15:04:49 CET 2006


Hello,

just to throw in my own bits of wisdom: I am clearly on Robert's side in 
this argument, i.e. normalise with ALL genes, analyse just the 
species-specific ones. When you use GCRMA, you have three main steps in the 
algorithm:

1) Background correction: As Robert points out, the foreign genes should 
improve this.

2) Quantile Normalisation: Obviously the distribution across all probes 
will change (mainly it will have more mass on the low-intensity range), 
but that will be the case for all arrays in the same way, as the foreign 
genes are not expected to change, so I can't see why these extra genes 
should be harmful.

3) Summarizing the Probesets: For each gene only the values of all probes 
corresponding to that gene are used, so this step will not be influenced 
by additional genes.
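
To make step 2 concrete, here is a minimal sketch of quantile normalisation in base R. It is not the gcrma or affy implementation, the function name quantile_normalise is made up for illustration, and the intensities are simulated. It only shows that extra low-intensity probes change the common target distribution in the same way on every array.

```r
# Minimal quantile normalisation in base R (a sketch, not the gcrma/affy code).
# Rows are probes, columns are arrays.
quantile_normalise <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))   # mean of the k-th smallest values
  out    <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(1)
# Simulated log2 intensities (all values are made up): 200 "own species"
# probes and 100 "foreign" probes that only carry low-level background noise.
own     <- matrix(rnorm(200 * 3, mean = 8, sd = 2),   ncol = 3)
foreign <- matrix(rnorm(100 * 3, mean = 4, sd = 0.3), ncol = 3)
x <- rbind(own, foreign)

norm_all <- quantile_normalise(x)

# After normalisation every array holds exactly the same set of values;
# the foreign probes add mass at the low end of that shared distribution.
sorted <- apply(norm_all, 2, sort)
identical(sorted[, 1], sorted[, 2]) && identical(sorted[, 1], sorted[, 3])
```

Because the target distribution is shared by all arrays, the foreign probes shift it for every array alike rather than for some arrays more than others.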

For the analysis it's a different thing. Obviously you want to get rid of 
genes which are not of interest before p-value adjustment for multiple 
testing, because you will be more conservative than necessary otherwise.
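
A small illustration of the multiple-testing point, using only base R. The probeset IDs and p-values below are invented for the example (the "Ag." prefix in particular is hypothetical; only the "Pf" prefix comes from this thread), so treat it as a sketch of the effect, not as real results.

```r
# Hypothetical probeset IDs and raw p-values; all numbers are made up.
p_own <- c(0.001, 0.004, 0.01, 0.02, 0.3)
names(p_own) <- paste0("Ag.", 1:5, "_at")
p_foreign <- seq(0.05, 0.99, length.out = 95)  # spread out, as for null genes
names(p_foreign) <- paste0("Pf.", 1:95, "_at")
p_all <- c(p_own, p_foreign)

# Adjust over all 100 tests, then look only at the genes of interest.
adj_all <- p.adjust(p_all, method = "BH")[names(p_own)]

# Drop the foreign probesets by prefix first, then adjust over 5 tests.
keep <- !grepl("^Pf", names(p_all))
adj_subset <- p.adjust(p_all[keep], method = "BH")

sum(adj_all < 0.05)     # 0 genes pass when the foreign tests are included
sum(adj_subset < 0.05)  # 4 genes pass once they are removed
```

The raw p-values of the genes of interest are identical in both cases; only the number of tests entering the BH adjustment differs.
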
There is also a case for not wanting them to be in the limma analysis I 
think. The foreign genes will be less variable, as they only show 
background noise and thus are not affected by biological variability. 
This will reduce the average variance across all genes and as limma 
shrinks individual gene variances towards this average the denominators 
in the moderated t-statistics will be reduced too, thus leading to false 
positives. I am not sure whether it will really make a big difference 
practically, but theoretically there is certainly an issue here.
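
The shrinkage argument can be sketched in a few lines of base R. This is not limma's code: limma estimates the prior variance and prior degrees of freedom by empirical Bayes, whereas here the prior degrees of freedom d0 are fixed by assumption and the prior variance is simply taken as the mean of the gene variances. The helper moderated_var and all numbers are made up for illustration.

```r
# Posterior (moderated) variance: a weighted average of the gene's own
# variance s2 (on df degrees of freedom) and a prior variance s02 (on d0).
moderated_var <- function(s2, df, s02, d0) {
  (d0 * s02 + df * s2) / (d0 + df)
}

set.seed(2)
df <- 4   # residual df for 6 arrays and 2 groups
d0 <- 4   # prior df: fixed here by assumption, limma estimates it

# Simulated per-gene sample variances (made up): foreign probesets only
# show background noise, so their variances are much smaller.
s2_own     <- 0.10 * rchisq(1000, df) / df
s2_foreign <- 0.01 * rchisq(500,  df) / df

# Crude stand-in for the prior variance: the plain mean of the variances.
s02_own <- mean(s2_own)
s02_all <- mean(c(s2_own, s2_foreign))

# One gene of interest with sample variance 0.05 and log2 fold change 0.5.
s2_g <- 0.05
lfc  <- 0.5
se_factor <- sqrt(1/3 + 1/3)   # two groups of three arrays

t_own <- lfc / (sqrt(moderated_var(s2_g, df, s02_own, d0)) * se_factor)
t_all <- lfc / (sqrt(moderated_var(s2_g, df, s02_all, d0)) * se_factor)
c(without_foreign = t_own, with_foreign = t_all)
```

Under these assumptions the prior variance is pulled down by the foreign probesets, so the same gene gets a smaller denominator and a larger moderated t-statistic. How large the effect is on real data is a separate question.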

Interesting discussion anyway,

Claus

Jenny Drnevich wrote:
> Hi Amy,
> 
> Don't you just love it when you get one response suggesting you do one 
> thing (remove malarial genes after pre-processing) and another response 
> suggesting the opposite?  Although I think in this case Robert was 
> suggesting you remove them after pre-processing because it was easier than 
> trying to modify either the normalization code or the cdf environment, 
> which is what Jim pointed out to you. I ran into this same problem with 
> having probesets for other species on the soybean array, which is why I 
> used Ariel's code. I think that if you're using a mixed species array but 
> only put one of the species on it, then you should remove the other 
> species' probesets BEFORE doing the normalization because they really have 
> no bearing on the transcriptome you're trying to measure. On the other 
> hand, if you also want to filter your species' probesets based on 
> presence/absence, minimum cutoff, variation, etc.* , then you should filter 
> these genes AFTER doing the pre-processing because these probesets do 
> contain information about the transcriptome, even if it is just 'not 
> detectably expressed'.
> 
> Cheers,
> Jenny
> 
> * Contrary to Robert, I prefer to filter on presence/absence (using Affy's 
> calls) rather than variability :) I don't know if there is any 
> documentation on which may be "better"...
> 
> At 05:15 PM 11/29/2006, Robert Gentleman wrote:
>> Hi,
>>
>> Amy Mikhail wrote:
>>> Dear Bioconductors,
>>>
>>> I am analysing 6 PlasmodiumAnopheles genechips, which have only Anopheles
>>> mosquito samples hybridised to them (i.e. they are not infected
>>> mosquitoes).  The 6 chips include 3 replicates, each consisting of two
>>> time points.  The design matrix is as follows:
>>>
>>>> design
>>>      M15d M43d
>>> [1,]    1    0
>>> [2,]    0    1
>>> [3,]    1    0
>>> [4,]    0    1
>>> [5,]    1    0
>>> [6,]    0    1
>>>
>>>
>>> I have tried both gcRMA (in AffyLMGUI), and RMA, MBEI and MAS5 (in affy).
>>> Looking at the (BH) adjusted p values <0.05, this gave me 2, 12, 0 and 0
>>> DE genes, respectively... much less than I was expecting.
>>>
>>> As this affy chip contains probesets for both mosquito and malaria
>>> parasite genes, I am wondering:
>>>
>>> (a) if it is better to remove all the parasite probesets before my
>>> analysis;
>>
>>   Yes, if you don't intend to use them, and they are not relevant to
>> your analysis. There is no point in doing p-value corrections for tests
>> you know are not interesting/relevant a priori.
>>
>>> (b) if so at what stage I should do this (before or after normalisation
>>> and background correction, or does it matter?)
>>   After both and prior to analysis - otherwise you are likely to need to
>> do some serious tweaking of the normalization code.
>>
>>> (c) how would I filter out these probesets using genefilter (all the
>>> parasite affy IDs begin with Pf. - could I use this prefix in the affy IDs
>>> to filter out the probesets, and if so how?)
>>    you don't need genefilter at all, this is a subsetting problem.
>>   If you had an ExpressionSet you would do something like:
>>
>>    parasites = grepl("^Pf", featureNames(myExpressionSet))
>>
>>    mySubset = myExpressionSet[!parasites,]
>>
>>> Secondly, I did not add any of the polyA controls to my samples.  I would
>>> like to know:
>>>
>>> (d) Do any of the bg correct / normalisation methods I tried utilise
>>> affymetrix control probesets, and if so, how?
>>    I doubt it.
>>
>>> (e) Should I also filter out the control sets - again, if so at what stage
>>> in the analysis and what would be an appropriate code to use?
>>>
>>    same place as you filter the parasite genes and pretty much in the
>> same way. They are likely to start with AFFX.
>>
>>> I did try the code for non-specific filtering (on my RMA dataset) from pg.
>>> 232 of the bioconductor monograph, but the reduction in the number of
>>> probesets was quite drastic;
>>>
>>>> f1 <- pOverA(0.25, log2(100))
>>>> f2 <- function(x) (IQR(x) > 0.5)
>>   that is a typo in the text - you probably want to filter out those
>> with IQR below the median, not for some fixed value.
>>
>>>> ff <- filterfun(f1, f2)
>>>> selected <- genefilter(Baseage.transformed, ff)
>>>> sum(selected)
>>> [1] 404   ###(The original no. of probesets is 22,726)###
>>>> Baseage.sub <- Baseage.transformed[selected, ]
>>> Also, I understood from the monograph that "100" was to filter out
>>> fluorescence intensities less than this, but I am not clear if this is
>>> from raw intensities or log2 values?
>>   raw - 100 on the log2 scale is larger than can be represented in the
>> image file formats used. And don't do that - it is not a good idea -
>> filter on variability.
>>
>>
>>> All the parasite probesets have raw intensities <35 .... so could I apply
>>> this as a simple filter, and would this have to be on raw (rather than
>>> normalised data)?
>>
>>   Best wishes
>>     Robert
>>
>>> Apologies for the long posting...
>>>
>>> Looking forward to any replies,
>>> Regards,
>>> Amy
>>>
>>>> sessionInfo()
>>> R version 2.4.0 (2006-10-03)
>>> i386-pc-mingw32
>>>
>>> locale:
>>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>> States.1252;LC_MONETARY=English_United
>>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>>  [1] "tcltk"     "splines"   "tools"     "methods"   "stats"
>>> "graphics"  "grDevices" "utils"     "datasets"  "base"
>>>
>>> other attached packages:
>>> plasmodiumanophelescdf  "1.14.0"
>>> tkWidgets               "1.12.0"
>>> DynDoc                  "1.12.0"
>>> widgetTools             "1.10.0"
>>> agahomology             "1.14.2"
>>> affyPLM                 "1.10.0"
>>> gcrma                   "2.6.0"
>>> matchprobes             "1.6.0"
>>> affydata                "1.10.0"
>>> annaffy                 "1.6.0"
>>> KEGG                    "1.14.0"
>>> GO                      "1.14.0"
>>> limma                   "2.9.1"
>>> geneplotter             "1.12.0"
>>> annotate                "1.12.0"
>>> affy                    "1.12.0"
>>> affyio                  "1.2.0"
>>> genefilter              "1.12.0"
>>> survival                "2.29"
>>> Biobase                 "1.12.0"
>>>
>>>
>>> -------------------------------------------
>>> Amy Mikhail
>>> Research student
>>> University of Aberdeen
>>> Zoology Building
>>> Tillydrone Avenue
>>> Aberdeen AB24 2TZ
>>> Scotland
>>> Email: a.mikhail at abdn.ac.uk
>>> Phone: 00-44-1224-272880 (lab)
>>>        00-44-1224-273256 (office)
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> --
>> Robert Gentleman, PhD
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> PO Box 19024
>> Seattle, Washington 98109-1024
>> 206-667-7700
>> rgentlem at fhcrc.org
>>
> 
> Jenny Drnevich, Ph.D.
> 
> Functional Genomics Bioinformatics Specialist
> W.M. Keck Center for Comparative and Functional Genomics
> Roy J. Carver Biotechnology Center
> University of Illinois, Urbana-Champaign
> 
> 330 ERML
> 1201 W. Gregory Dr.
> Urbana, IL 61801
> USA
> 
> ph: 217-244-7355
> fax: 217-265-5066
> e-mail: drnevich at uiuc.edu
> 

-- 
***********************************************************************************
  Dr Claus-D. Mayer                    | http://www.bioss.ac.uk
  Biomathematics & Statistics Scotland | email: claus at bioss.ac.uk
  Rowett Research Institute            | Telephone: +44 (0) 1224 716652
  Aberdeen AB21 9SB, Scotland, UK.     | Fax: +44 (0) 1224 715349


