[BioC] Deciding on a cut off after QC

Adaikalavan Ramasamy ramasamy at cancer.org.uk
Mon May 16 15:59:52 CEST 2005


See comments below.

On Mon, 2005-05-16 at 05:42 -0700, Ankit Pal wrote:
> Dear Dr Smyth,
> I guess there has been a mistake in communication.
> There are 39000 spots on the array but the number of
> probes are lesser as there are replicates(not a fixed
> number) on the array.
> The output from toptable() gives me a total of 39000
> rows, one for each spot.
> I need to give an output of the spots most significant
> in this list.

I do not understand this clearly. Do you mean you have some duplicated
spots or is there redundancy when you convert to something common like
unigene ID ? 

If the former, I believe there are some facilities within LIMMA for
equal number duplications per spots but I am not sure. For the latter,
you could use tapply or something to find the min/max of the statistics.

> What is the cut off value(p or B or any other) I need
> to use above which I can consider a set of
> differentially expressed genes significant enough for
> further use (eg :function based clustering of all
> significantly overexpressed genes).
> I cannot use all the genes in the topTable() output.
> Where do I put the cut off?
> Is there a way I can calculate it?

Selection of p-value threshold is arbitrary. However it would be a good
idea to adjust for multiple comparison. My preferred method is the FDR
by Benjamini and Hochberg but there are many other methods too each with
a different meaning. I will illustrate FDR using the slightly modified
example from help(topTable) :

set.seed(123)  # for reproducibility only - do not use this in real life

M <- matrix(rnorm(10000*6,sd=0.3), 10000, 6)
M[1:100, 1:3] <- M[1:100, 1:3] + 2
design <- cbind( First3Arrays=c(1,1,1,0,0,0),   
                 Last3Arrays=c(0,0,0,1,1,1))
fit <- lmFit(M, design=design)
fit <- eBayes(fit)

tab <- topTable( fit, adjust.method="fdr", sort.by="p", n=nrow(M) )

positives <- rownames(tab)[ which(tab[ , "P.Value"] < 0.05) ]

The "positives" are the names of genes that we declare as differentially
expression and it is expected that 5% of these genes are false positives
on average. 

Note that in this case we can explicitly calculate the FP rate as
follows because we have simulated the datasets.

mean( ifelse( as.numeric(positives) <= 100, 0, 1 ) )
0.06542056

I am sure Gordon Symth has a far more elegant and flexible version of
this documented somewhere in his user guide.

Regards, Adai


> sincerely,
> -Ankit
> 
> --- Gordon K Smyth <smyth at wehi.EDU.AU> wrote:
> > On Mon, May 16, 2005 9:49 pm, Ankit Pal said:
> > > Dear Dr Smyth,
> > > I'm sorry for not having specified which result
> > file.
> > > It is the final result summary we get after we
> > give
> > > the command
> > > Resultfile <- topTable(fit,n=200, adjust="fdr")
> > > A sample result file has been attached.
> > > The code I used for my analysis is
> > >
> > >> targets <- readTargets("target.txt")
> > >
> > > #The QC filter
> > >> myfun <- function(x,threshold=55){
> > > + okred <- abs(x[,"% > B635+2SD"]) > threshold
> > > + okgreen <- abs(x[,"% > B532+2SD"]) > threshold
> > > + okflag <- abs(x[,"Flags"]) > 0
> > > + okRGN <- abs(x[,"Rgn R"]) > 0.6
> > > + as.numeric(okgreen || okred || okflag || okRGN)
> > > + }
> > > #end of QC filter
> > >
> > >> RG_7 <- read.maimages(targets$FileName,
> > > source="genepix",wt.fun=myfun)
> > >> RG_7$genes <- readGAL()
> > 
> > As I said last week, this command is not needed with
> > GenePix data.  You should omit it.
> > 
> > >> RG_7$printer <- getLayout(RG_7$genes)
> > >> MA_7 <-
> > normalizeWithinArrays(RG_7,method="loess")
> > >> MA_7 <- normalizeBetweenArrays(MA_7)
> > >> fit_7 <- lmFit(MA_7, design=c(1,-1,1,-1))
> > >> fit_7 <- eBayes(fit_7)
> > >> options(digits=3)
> > >> Resultfile_7 <- topTable(fit_7, n=39000,
> > > adjust="fdr")
> > >> Resdat_7 <-data.frame(Resultfile_7)
> > 
> > Resultfile_7 is already a data.frame.
> > 
> > >> write.table(Resdat_,file='Result.csv',quote =
> > FALSE,
> > > sep = "\t")
> > >
> > > I understand that the spots that do not qualify
> > the QC
> > > filter are given a weight of "0" by limma and are
> > not
> > > considered for normalization and will not affect
> > the
> > > analysis.
> > > The result file I get contains all the spots
> > (38000)
> > > in my case.
> > 
> > Mmm, this doesn't make sense.  You have 4 arrays
> > with 39000 spots on each array.  Hence you have
> > 4*39000 spots, not 39000.  The output from
> > topTable() gives a summarized log-ratio for each
> > probe,
> > not a result for each spot.
> > 
> > Gordon
> > 
> > > Didn't the spots that were bad get removed from
> > the
> > > final result?
> > > If not what is the cut off value (B, p etc) that I
> > > need to use to get a set of reliable spots(I cant
> > use
> > > all the 38000) from my result file for my
> > analysis.
> > > Is there a fixed formula to derive the same as the
> > > values vary with the analysis.
> > > Waiting for your reply,
> > > Thank you,
> > > -Ankit
> > 
> > 
> >
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
>



More information about the Bioconductor mailing list