[BioC] Genefilter parameters for mouse 430 2 #2

Richard Friedman friedman at cancercenter.columbia.edu
Wed Mar 19 23:08:47 CET 2008


Jim,

	Thank you for your detailed and helpful reply.
On Mar 19, 2008, at 4:52 PM, James W. MacDonald wrote:

>
> That depends. If you are using rma(), then no ;-P

what about gcrma.

>
> You might try something like
>
> eset2 <- nsFilter(eset)$eset
>
> and see how many probesets you end up with.

I have tried

 > xen2nsSUB<-nsFilter(xen2dataeset)$xen2dataeset
 > sum(xen2nsSUB)
[1] 0
 > xen2nsSUB
NULL

I then tried (which I think is the correct way to do it)/

 > xen2nsSUB2<-nsFilter(xen2dataeset)
 > xen2nsSUB2
$eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 623 features, 4 samples
   element names: exprs
phenoData
   sampleNames: A_xen_1_21.cel, A_xen_2_22.cel, D_nodal_1_27.cel,  
D_nodal_2_2
   8.cel
   varLabels and varMetadata description:
     sample: arbitrary numbering
featureData
   featureNames: 1428670_at, 1457283_at, ..., 1426849_at  (623 total)
   fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
Annotation: mouse4302

$filter.log
$filter.log$numDupsRemoved
[1] 77

$filter.log$numLowVar
[1] 41507

$filter.log$feature.exclude
[1] 13

$filter.log$numRemoved.ENTREZID
[1] 2881



>
>> 2, Is the only effect of filtering to reduce the multiplier in  
>> the  false discovery
>>         analysis OR does it reduce false positives in other ways by
>> 	A. In the case of intensity filters by reducing the number of  
>> large  fold changes resulting
>> 	    from the ratios of small numbers.
>> 	B. In the case of IQR filters eliminating large t-statistics   
>> resulting for genes with small variation	
>> 	     across samples but fortuitously low standard deviations,
>
> Yes and yes, to a certain extent. If you are just doing fold  
> changes, you might consider filtering on each fold change rather  
> than overall. For instance you could create a filter
>
> filt <- filterfun(kOverA(1, 100))
>
> that you would then use for each fold change comparison to ensure  
> that at least one of the samples had an expression > 100. Shameless  
> plug - see foldFilt() in affycoretools.

I think that that is basically what I did with genefilter pOverA 
(0.25,log2(100)
described in my first note (.25 of 4 =1). Or am I getting somehing  
wrong.


>
> If you are doing t-stats with a very small number of replicates  
> (like 2 vs 2), then you should be using limma, and in which case  
> over-filtering the data can be detrimental as well. The reason for  
> that is the prior will be estimated on all the probesets that  
> remain, and if all you have are highly variable probesets then the  
> prior will be larger than you might want. I have seen cases with  
> very small numbers of replicates where using all the data on the  
> chip resulted in many more significant probesets than if I did what  
> I thought was a reasonable filter.
>

I am using Limma. I asked for at least 3 sreplicates from the  
experimentalist but she only gave me 2 (story of my life).
I got 731 with just the variance filter, 289 with the log2 filter,  
and 619 with nsFilter.

How many probesets do I  need for limma to function properly? Do the  
above numbers seem to small.


> Of course the question remains; is more better?
> And if more is better, does that mean the ideal would be to find  
> all probesets differentially expressed? Probably not, so we are  
> back to the usual prescriptions; check your data carefully. Make  
> sure your results are sensible. Do EDA to ensure that you don't  
> have some wacky chip messing things up.

What is EDA? I did all of the quality measures in simpleaffy and in  
AffyPLM and the chips look fine.

> Check your code to be sure that you haven't made the kind of errors  
> that I like to make. Consult with the experimenter to see if very  
> few genes should be changing (or be expressed at all).

I have done both of these things.

Here is my present understanding of the situation following your note:

1. Filter by variance and other nsFilter parmaeters is good unless it  
leads to too few probestes.

Q1. How many are too small?

Q2. It is advisable to use an intensity filter but log2(100) on at  
least one chip is too high on dim chips?
	Q2B. Is there a way to quantify dimness so I knwo how to adjust the  
cutoff?
	Q2C. Or am I better off filtering only on variance, or variance plus  
the nsFilter defaults.



THANKS!
Rich






>
> Best,
>
> Jim
>
>
>> 	Up until this time I have not filtered because the filtering   
>> parameters looked arbitrary and I
>> thought that it was cheating to reduce the # of tests used to  
>> compute  the FDR. From reading and
>> further reflection I now believe otherwise. But whereas I now  
>> believe  I should filter I am
>> not at all sure what parameters to use, and how much my final list  
>> of  differentially expressed genes
>> will be sensitive to a choice of those parameters. In particular,  
>> i  wonder if the
>> intensity filter cutoff should vary with chip-type and  
>> preprocessing  method (eg GCRMA).
>> 	Any thoughts and guidance would be appreciated.
>> Thanks as always,
>> Rich
>> ------------------------------------------------------------
>> Richard A. Friedman, PhD
>> Biomedical Informatics Shared Resource
>> Herbert Irving Comprehensive Cancer Center (HICCC)
>> Lecturer
>> Department of Biomedical Informatics (DBMI)
>> Educational Coordinator
>> Center for Computational Biology and Bioinformatics (C2B2)
>> National Center for Multiscale Analysis of Genomic Networks (MAGNet)
>> Box 95, Room 130BB or P&S 1-420C
>> Columbia University Medical Center
>> 630 W. 168th St.
>> New York, NY 10032
>> (212)305-6901 (5-6901) (voice)
>> friedman at cancercenter.columbia.edu
>> http://cancercenter.columbia.edu/~friedman/
>> "Sure I am willing to stop watching television
>> to get a better education."
>> -Rose Friedman, age 11
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/ 
>> gmane.science.biology.informatics.conductor
>
> -- 
> James W. MacDonald, M.S.
> Biostatistician
> Affymetrix and cDNA Microarray Core
> University of Michigan Cancer Center
> 1500 E. Medical Center Drive
> 7410 CCGC
> Ann Arbor MI 48109
> 734-647-5623



More information about the Bioconductor mailing list