[BioC] genefiltering before or after the normalization?

Wed Jul 16 17:01:54 CEST 2008

At 09:14 AM 7/16/2008, Abhilash Venu wrote:
>Hi Sean,
>Thank you for sharing the thoughts.
>I have done the filtering, using the same code, prior to the normalization,
>and the results started to change. I am providing the topTable results below:
>the log-odds (B values) have become positive, but adj.P.Val is still
>somewhat high. In this scenario, should I do more stringent filtering
>before the analysis?

Hi Abhilash,

As Sean said before, the goal of data pre-processing and filtering 
should not be to *get* the results you want, but rather to arrive at 
the most _correct_ results given the type of data that is generated. 
It's a big statistical no-no to try several different analysis 
methods and then pick the one that gives you the results you like 
best. I'm not sure why you tried filtering before doing normalization 
when you were already told that it's supposed to be done after 
normalization. I know it's frustrating to not have any "significant" 
genes, especially when you know there are expression changes due to 
the treatment. Remember that an FDR level of 0.05 is not a magical 
threshold of significance, but rather the proportion of false positives 
YOU are willing to tolerate in your gene list.  I've seen papers where 
they've used gene lists with 0.1 or even 0.2 FDR thresholds. Another 
route is to just use the top 50 or 100 genes, as these have the most 
evidence for DE, even if they don't surpass any reasonable FDR adjustment.
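
As a rough illustration of how the cutoff changes the length of the gene 
list, here is a minimal base-R sketch. It uses only the ten adj.P.Val 
values from the topTable output quoted in this thread; the names adj_p, 
cutoffs, n_pass, top_n and top_genes are made up for the example, and 
top_n = 5 is an arbitrary stand-in for "top 50 or 100" on a full table.

```r
# adj.P.Val values for the ten genes in the topTable output quoted in this
# thread, in the order in which topTable listed them
adj_p <- c(NUDT16L1 = 0.0065018, MGC4268 = 0.061246, AR = 0.0612466,
           LOC124220 = 0.1530298, A_24_P289130 = 0.156454,
           ZNF501 = 0.156454, ADAM22 = 0.156454, THC2351317 = 0.156454,
           AW276332 = 0.1564544, THC2323609 = 0.15645)

# Number of listed genes that pass each FDR cutoff
cutoffs <- c(0.05, 0.10, 0.20)
n_pass <- sapply(cutoffs, function(q) sum(adj_p <= q))
names(n_pass) <- cutoffs
print(n_pass)

# Alternative route: keep the top-ranked genes regardless of any cutoff.
# The vector is already in topTable order, so head() is enough here.
top_n <- 5   # hypothetical choice for this small example
top_genes <- head(names(adj_p), top_n)
print(top_genes)
```

On a real limma fit, topTable(fit, number = 50) returns the 50 top-ranked 
rows directly.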

Finally, remember that Affy arrays, and many other methods of 
expression measurement, are only measuring a tiny portion of the 
expected transcript. There are many known cases in which "expression" 
differences won't be reflected in that portion of the transcript. In 
these cases, the microarray data are "correct", even if they aren't 
telling you the entire story...

Best,
Jenny

>GeneName       logFC      AveExpr    t         P.Value       adj.P.Val  B
>NUDT16L1       2.7559164  14.32567  10.098560  1.520399e-07  0.0065018  4.829862
>MGC4268        1.5820444  12.06414   7.695917  3.280927e-06  0.061246   3.208160
>AR             1.7511488  10.19825   7.506490  4.296601e-06  0.0612466  3.048297
>LOC124220      0.9476445  15.51240   6.697382  1.431390e-05  0.1530298  2.302016
>A_24_P289130   1.7622555  11.07025   6.401121  2.272696e-05  0.156454   2.001432
>ZNF501         1.804305   10.69845   6.345654  2.481481e-05  0.156454   1.943447
>ADAM22        -1.650837   11.89608  -6.187425  3.195991e-05  0.156454   1.77502
>THC2351317     1.0793141  12.34347   6.179724  3.235878e-05  0.156454   1.766717
>AW276332       1.8253290  10.55792   6.147119  3.410664e-05  0.1564544  1.731409
>THC2323609     2.0122396  10.82117   6.076649  3.823291e-05  0.15645    1.654439
>
>
>Regards
>Abhilash
>
>On Sat, Jul 12, 2008 at 10:32 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
> > On Sat, Jul 12, 2008 at 11:26 AM, Abhilash Venu <abhivenu at gmail.com> wrote:
> > > Hi Sean,
> > >
> > > Yes, thank you.
> > >
> > > Yet the problem with my data did not get sorted out. I have tried
> > > different filtering methods, including gapfilter and a combination of
> > > IQR with pOverA or cv, etc. But my adjusted p-values are still above
> > > the FDR limit of 0.05 after the limma analysis. Also, the B values are
> > > generally -3. As Gordon has mentioned in one of the previous mails,
> > > this is an indication of little evidence for differential expression.
> > >
> > > What could be the reason for this? Is this really indicative of an
> > > absence of differential expression?
> >
> > It sounds like it.  Though people think of filtering as a way to
> > reduce the number of genes and improve the strength of signal after
> > multiple-testing correction, I don't think that is the correct
> > mindset.  Filtering is useful to remove probes from analysis that are
> > not measuring anything interesting (no change across experiments) or
> > are not well-measured.  So, the thought process should not be to do
> > hypothesis testing and then, if negative, to do filtering to try to
> > improve the situation, but to do filtering based on rational
> > thresholds for removing uninteresting or less-than-credible values as
> > part of a series of preprocessing steps.
> >
> > Sean
> >
> > > On Fri, Jul 11, 2008 at 4:17 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> > >
> > >> On Fri, Jul 11, 2008 at 5:32 AM, Abhilash Venu <abhivenu at gmail.com> wrote:
> > >> > Dear Dr. Huber,
> > >> >
> > >> > Thank you for the advice. I have tried the script that you advised
> > >> > me to use. As you mentioned, I used the script after the
> > >> > normalization, but it gave the following error, and I do not
> > >> > understand whether I am using it in the right way.
> > >> >
> > >> > MA<-normalizeBetweenArrays(log2(Rgene$G), method="quantile") # normalization
> > >> >  rs = rowSds(MA)
> > >> >  fx = fx[ rs > quantile(rs, 0.05), ]
> > >> > Error: object "fx" not found
> > >>
> > >> Hi, Abhilash.  I think that line should read:
> > >>
> > >> fx = x[rs > quantile(rs,0.05),]
> > >>
> > >> Wolfgang was simply suggesting subsetting x (the MA matrix in your
> > >> code) by the results of sd filtering.
> > >>
> > >> Sean
> > >>
> > >> > Can you advise me on the same.
> > >> > Thanks in advance.
> > >> >
> > >> > Abhilash
> > >> >
> > >> > On Fri, Jul 11, 2008 at 4:06 AM, Wolfgang Huber <huber at ebi.ac.uk> wrote:
> > >> >
> > >> >> Hi Abhilash
> > >> >>
> > >> >>
> > >> >>> I am working with single-color data from the Agilent platform.
> > >> >>> After the limma analysis, the adjusted p-values were higher than
> > >> >>> the 5% FDR level. At this point I am thinking of filtering the
> > >> >>> genes using genefilter. My data set contains only raw intensities
> > >> >>> of normal and test samples before normalization, and I am using
> > >> >>> the 'normalizeBetweenArrays' command after log-transforming the
> > >> >>> data.
> > >> >>> In this scenario I am quite confused whether I should use the
> > >> >>> filter functions prior to normalization, or after the
> > >> >>> normalization but before fitting the linear model.
> > >> >>> As my data is not an expressionSet I cannot use the nonfilter
> > >> >>> commands; in this case, are there any suggestions for other
> > >> >>> filtering methods?
> > >> >>>
> > >> >>> Appreciate the suggestions
> > >> >>>
> > >> >>>
> > >> >> Such filtering is performed after normalisation, but it is essential
> > >> >> that the filter criterion does *not use the sample annotations*. E.g.
> > >> >> you can use for each gene the overall variance or IQR across the
> > >> >> experiment.
> > >> >>
> > >> >> If x is a matrix with rows=genes and columns=samples, then this can
> > >> >> be as simple as:
> > >> >>
> > >> >>  rs = rowSds(x)
> > >> >>  fx = x[ rs > quantile(rs, lambda), ]
> > >> >>
> > >> >> where rowSds is in the genefilter package, and lambda is a parameter
> > >> >> between 0 and 1 that reflects your belief in what fraction of probes
> > >> >> on the array correspond to target molecules that are never expressed
> > >> >> in the conditions you study.
> > >> >>
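
The filter described above can be sketched in base R as follows. This is 
only an illustration under stated assumptions: the matrix x is simulated 
rather than real normalized data, apply(x, 1, sd) stands in for 
genefilter's rowSds so that no extra package is needed, and lambda = 0.05 
is an arbitrary choice.

```r
set.seed(1)

# Simulated stand-in for normalized log2 intensities:
# 1000 probes (rows) x 6 arrays (columns). Not real data.
x <- matrix(rnorm(1000 * 6, mean = 10, sd = 1), nrow = 1000)

# Per-probe standard deviation across all arrays. No group labels enter
# this step, so the filter does not use the sample annotations.
# (genefilter::rowSds(x) computes the same quantity.)
rs <- apply(x, 1, sd)

# lambda: assumed fraction of probes that are never expressed
lambda <- 0.05

# Keep probes whose SD lies above the lambda quantile of all SDs
keep <- rs > quantile(rs, lambda)
fx <- x[keep, , drop = FALSE]

dim(fx)
```

The filtered matrix fx, not x, is then what goes into the linear model fit.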
> > >> >> Also note that after such filtering, strictly speaking, the nominal
> > >> >> p-values from the subsequent testing could be too small - but one
> > >> >> can show that in typical microarray applications the bias is
> > >> >> negligible (compared to the impact of other effects), and in any
> > >> >> case the p-values can be used for ranking.
> > >> >>
> > >> >>  Best wishes
> > >> >>        Wolfgang
> > >> >>
> > >> >>
> > >> >> --
> > >> >> ----------------------------------------------------
> > >> >> Wolfgang Huber, EMBL-EBI, http://www.ebi.ac.uk/huber
> > >> >>
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> >
> > >> > Regards,
> > >> > Abhilash
> > >> >
> > >> >        [[alternative HTML version deleted]]
> > >> >
> > >> > _______________________________________________
> > >> > Bioconductor mailing list
> > >> > Bioconductor at stat.math.ethz.ch
> > >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > >> > Search the archives:
> > >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > >
> > > Regards,
> > > Abhilash
> > >
> > >
> >
>
>
>
>--
>
>Regards,
>Abhilash
>

Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at illinois.edu