[BioC] Cutoff to use for IQR filtering in genefilter

Mark Cowley m.cowley0 at gmail.com
Wed Jun 25 09:20:58 CEST 2008


Hi Fraser,
I'm glad I made sense!
I am a bit of a fan of using the affy calls to remove genes that are
absent on all arrays, but I rarely use them as more than that initial
filter. I find that this filter alone removes all of the unchanging
genes.
I appreciate that there are disadvantages to this: when you compare the
expression levels of genes with their affy calls, some genes with an
RMA expression level of 2.5 (on the log2 scale) can be called Present,
and genes with an expression level of up to 9 (quite high) can be
called Absent.
If a collaborator really presses me for an "expression threshold", I'd
typically advise that around 5-6 on RMA data is a good delineator of
expressed vs non-expressed genes, but again, this can vary from data
set to data set.
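
As a rough illustration of that initial filter, here is a minimal sketch
in base R. It assumes a character matrix of detection calls ("P", "M",
"A"; genes in rows, arrays in columns), of the kind you would get from
exprs(mas5calls(abatch)) in the affy package, plus a matching matrix of
log2 expression values. The names calls, expr and the 5.5 threshold are
hypothetical, and both matrices are simulated here, so the numbers
themselves mean nothing.

```r
set.seed(1)
n_genes  <- 1000
n_arrays <- 6
ids <- list(paste0("gene", seq_len(n_genes)),
            paste0("array", seq_len(n_arrays)))

# Simulated stand-ins (assumptions): log2 expression and detection calls
expr  <- matrix(rnorm(n_genes * n_arrays, mean = 7, sd = 2),
                nrow = n_genes, dimnames = ids)
calls <- matrix(sample(c("P", "M", "A"), n_genes * n_arrays,
                       replace = TRUE, prob = c(0.5, 0.1, 0.4)),
                nrow = n_genes, dimnames = ids)

# Drop a gene only if it is called Absent on every array
absent_everywhere <- rowSums(calls == "A") == n_arrays
expr_kept <- expr[!absent_everywhere, , drop = FALSE]

# For comparison, a fixed expression threshold (5.5 is purely
# illustrative): a gene passes if it exceeds it on at least one array
above_threshold <- rowSums(expr > 5.5) >= 1

# Cross-tabulate the two filters to see where they disagree
table(present_filter = !absent_everywhere,
      threshold_filter = above_threshold)
```

On real data the off-diagonal cells of that table are the interesting
ones: they are the genes where the call and the expression level
disagree.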

There's quite a nice article on this:
McClintick, J. N. & Edenberg, H. J.
Effects of filtering by Present call on analysis of microarray  
experiments.
BMC Bioinformatics, 2006, 7, 49

I'd love to get someone else's opinion on this, perhaps someone who's
passionately opposed to expression level filtering?

cheers,
Mark

On 25/06/2008, at 12:05 AM, Sim, Fraser wrote:

> Hi Mark,
>
> Thanks for the informative reply; your interpretation of my question
> was right.
>
> I've not been using the IQR filter for limma-based statistics either,
> for that reason. But I have been using it as a filter prior to
> examining specific genes of interest in, say, a heatmap.
>
> I'll try the additional plots you suggest and see what comes out.
>
> In addition to filtering on variability, what are your experiences
> with intensity-based filtering or, if using mas5, present/absent-based
> filtering? I used to use them extensively when I used Genespring, but
> having read through the Bioconductor chapters I can see why some
> people do not use them. What are your thoughts?
>
> Cheers,
> Fraser
>
> Fraser Sim, PhD
> Assistant Professor of Neurology & Neurosurgery
> University of Rochester
> 601 Elmwood Avenue, Box 645
> Rochester, NY 14642
> (T) 585 275 0987
> (P) 585 220 0474
> (F) 585 276 0232
>
>
> -----Original Message-----
> From: Mark Cowley [mailto:m.cowley0 at gmail.com]
> Sent: Monday, June 23, 2008 8:08 PM
> To: Sim, Fraser
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Cutoff to use for IQR filtering in genefilter
>
> Hi Fraser,
> That's exactly right: using the median IQR as the filter will remove
> 50% of your data every time.
> An alternative would be to use the 20th percentile of the IQRs as your
> cutoff, which removes the least variable 20%.
>
> Since all of the IQRs make up a distribution of numbers, there will
> always be a median of that distribution. I think that the question
> you're asking is: what if genes at the median IQR are still not
> variable enough to matter in a biological context? Or, in a system
> with large changes, would a median IQR filter remove too many genes
> that have large variability?
> That is where plotting the data, perhaps against the t-tests as you
> have suggested, would be a good means of choosing the best filter.
> Plotting IQR vs average expression level, or IQR vs standard
> deviation, might also help.
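
A minimal sketch of the percentile cutoff and the two diagnostic plots
described above, in base R. The matrix expr is simulated (an
assumption, standing in for exprs(eset)), and the variable names are
hypothetical; only the comparison between the two cutoffs is the point.

```r
set.seed(2)
n_genes  <- 2000
n_arrays <- 8

# Simulated log2 expression (assumption): each gene gets its own mean
# and its own spread, so the per-gene IQRs differ
gene_mean <- runif(n_genes, 3, 12)
gene_sd   <- runif(n_genes, 0.05, 1)
expr <- matrix(rnorm(n_genes * n_arrays,
                     mean = rep(gene_mean, n_arrays),
                     sd   = rep(gene_sd, n_arrays)),
               nrow = n_genes)

IQRs <- apply(expr, 1, IQR)

# Median cutoff versus 20th-percentile cutoff
cutoff_median <- median(IQRs)
cutoff_p20    <- quantile(IQRs, probs = 0.20)
keep_median <- IQRs > cutoff_median
keep_p20    <- IQRs > cutoff_p20
mean(keep_median)   # fraction of genes kept by the median cutoff
mean(keep_p20)      # fraction of genes kept by the 20th percentile

# IQR against average expression, and against standard deviation,
# with both candidate cutoffs drawn as horizontal lines
plot(rowMeans(expr), IQRs, xlab = "mean log2 expression", ylab = "IQR")
abline(h = c(cutoff_median, cutoff_p20), lty = c(1, 2))
plot(apply(expr, 1, sd), IQRs, xlab = "standard deviation", ylab = "IQR")
abline(h = c(cutoff_median, cutoff_p20), lty = c(1, 2))
```

Whatever the data look like, the kept fractions come out near 50% and
80%, which is exactly why a quantile cutoff says nothing by itself
about whether the retained genes vary in a biologically useful way.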
>
> Incidentally, I rarely use a variability filter. I rely on the
> statistics with FDR < 5%, and accept that some of these will be due to
> genes with small but consistent differences.
>
> cheers,
> Mark
>
> On 24/06/2008, at 3:06 AM, Sim, Fraser wrote:
>
>> Hi Mark,
>>
>> Am I right in the interpretation that using the median cutoff of the
>> distribution of IQRs would remove 50% of the genes in every analysis?
>>
>> As below:
>>
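>> library(affy)        # ReadAffy(), rma()
>> library(Biobase)     # esApply()
>> library(genefilter)  # genefilter(), filterfun()
>> eset <- rma(ReadAffy())  # ReadAffy() (capital R) gives an AffyBatch; rma() is one way to get an ExpressionSet
>> IQRs <- esApply(eset, 1, IQR)  # per-probeset IQR across arrays
>> f1 <- function(x) IQR(x) > median(IQRs)
>> selected <- genefilter(eset, filterfun(f1))  # logical vector, one entry per probeset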
>>
>> What happens if more than 50% of genes are variable or, for that
>> matter, less than 50%? Should one plot the IQRs against some value of
>> interest, e.g. t-test statistic, and determine the IQR cut-off on that
>> basis?
>>
>> Thanks, Fraser
>>
>> -----Original Message-----
>> From: bioconductor-bounces at stat.math.ethz.ch
>> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Mark
>> Cowley
>> Sent: Sunday, June 22, 2008 7:32 PM
>> To: swhwang10 at yahoo.com
>> Cc: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] Cutoff to use for IQR filtering in genefilter
>>
>> Hi Seungwoo,
>> The range/IQR/SE/SD of your data is dependent on a number of factors,
>> including biological variability, and other sources of technical
>> variability, which can include the type of normalisation algorithm
>> (think RMA vs MAS5).
>> Basically, applying a filter on IQR of 0.1 in my study might remove
>> half the genes, whereas in your study it may remove 10% of them.
>> Suggestions such as Robert's are useful because they use the IQR of
>> YOUR data in order to set that cutoff.
>>
>> I suggest calculating the IQRs for all of your genes, and then either
>> plotting them with plot(density(IQRs)) or just trying summary(IQRs),
>> which will give you a good feel for just how variable your data are.
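
For what it's worth, a minimal base R sketch of that suggestion. The
matrix expr is simulated (an assumption, standing in for exprs(eset)),
and iqr_to_var() is a hypothetical helper that relies on the values
being roughly normal, in which case IQR = 2 * qnorm(0.75) * sd, or
about 1.349 * sd.

```r
set.seed(3)
n_genes  <- 5000
n_arrays <- 10

# Simulated log2 expression (assumption): genes in rows, arrays in
# columns, with a different spread for each gene
gene_sd <- runif(n_genes, 0.05, 0.6)
expr <- matrix(rnorm(n_genes * n_arrays, mean = 7,
                     sd = rep(gene_sd, n_arrays)),
               nrow = n_genes)

# Per-gene IQR; for an ExpressionSet, esApply(eset, 1, IQR) does the same
IQRs <- apply(expr, 1, IQR)
summary(IQRs)
plot(density(IQRs), main = "Per-gene IQR")

# Fraction of genes that each fixed cutoff mentioned in this thread
# would remove in THIS data set
cutoffs <- c(0.1, 0.18, 0.25, 0.5)
removed <- sapply(cutoffs, function(k) mean(IQRs <= k))
names(removed) <- cutoffs
removed

# Rough variance implied by a given IQR, assuming approximate normality
iqr_to_var <- function(iqr) (iqr / (2 * qnorm(0.75)))^2
iqr_to_var(c(0.1, 0.5))
```

Running the same cutoffs on your own IQRs shows directly how much of
your data a published fixed value such as 0.5 would throw away.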
>>
>> If you need help calculating the IQRs and/or variances of your genes,
>> please post back to the list.
>>
>> cheers,
>> Mark
>>
>> On 22/06/2008, at 9:05 PM, Seungwoo Hwang wrote:
>>
>>> I am wondering what cutoff value I should use for IQR filtering in
>>> genefilter. I did a literature search, and it varies from paper to
>>> paper. I have read two papers so far. One used 0.5, the other used
>>> 0.18. affylmGUI has options of 0.5, 0.25, and 0.1.
>>>
>>> I also searched the Bioconductor archive and read that Dr. Robert
>>> Gentleman suggested filtering out the genes whose IQR is below the
>>> median, rather than below some fixed value.
>>>
>>> I have two questions in this vein.
>>>
>>> (1) How small is a gene's variance (in terms of number) if its IQR
>>> is some value, say, 0.5 or 0.1? Can I calculate it?
>>> (2) When the median is used instead of a fixed number, wouldn't it
>>> be too large, since the median of a gene's expression intensities
>>> across samples can be anything?
>>>
>>> Thanks,
>>>
>>> Seungwoo
>>> ------------------------------------
>>> Seungwoo Hwang, Ph.D.
>>> Senior Research Scientist
>>> Korean Bioinformation Center
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> ----------------------------------------------------------------------
>> Mark Cowley, BSc (Bioinformatics)(Hons)
>>
>> Peter Wills Bioinformatics Centre
>> Garvan Institute of Medical Research
>>


