[BioC] Very low P-values in limma
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Oct 24 04:29:34 CEST 2009
Dear Paul,
Give your consensus correlation value, limma is treating your within-array
replicates as worth about 1/3 as much as replicates on independent arrays
(because 1-0.81^2 is about 1/3). This is how it is designed to work.
There are some caveats. Firstly, the limma duplicateCorrelation method
assumes replicates to be equally spaced, and yours are not. You've
actually re-ordered your data to fit it into limma. So the within-array
correlation will be underestimated, and significance over-estimated, for
transcripts for which the replicates are unusually close on the array.
Secondly, the within-array correlation is assumed to be the same for all
transcripts, which is never actually true. The approximation has proved
worthwhile when ndups=2 or 3, but it will yield over-optimistic results
when the number of within-replicates is large.
In your case, you get promising results (FDR=0.06), even if you average
over the within-array replicates. So there seems something genuine in
your data which is not just an artefact of the within-array analysis.
Given data like yours, I might personally use the within-array replicate
analysis, because it takes into account within-array as well as
between-array variation and usually gives a good ranking of transcripts,
but would treat the p-values as somewhat optimistic.
However, microarray analysis should never be just a plug-in analysis.
You need to plot your data in various ways to check quality and to check
assumptions. You should be doing some MA and MDS plots of your data pre
model fitting, then an MA-plot of the estimated fold changes, and perhaps
a sigma vs Amean plot as well (plotAS). We list members can't tell you
whether your results are reliable or not just from the p-values or the
number of DE genes. To me, your results are not impossible, but some
exploratory data analysis is needed.
Hope this helps
Best wishes
Gordon
On Fri, 23 Oct 2009, Paul Geeleher wrote:
> Hi Claus,
>
> corfit$consensus = .81 which I guess is pretty close to 1.
>
> The only differences in the code are that with the
> duplicateCorrelation method I finish with:
>
> corfit <- duplicateCorrelation(mat, design, ndups=4)
> fit <- lmFit(mat at hx, design, ndups=4, correlation=corfit$consensus)
> ebayes <- eBayes(fit)
> topTable(ebayes, coef = 2, adjust = "BH", n = 10)
>
> The huge discrepancy in p-values is strange indeed.
>
> Paul.
>
> On Fri, Oct 23, 2009 at 4:25 PM, Mayer, Claus-Dieter <c.mayer at abdn.ac.uk> wrote:
>> Dear Paul!
>>
>> What is the corfit$consensus value? If it is close to 1, using limma
>> with duplicateCorrelation or averaging (or indeed just using any of the
>> duplicates) should give you very similar results. If however
>> corfit$consensus is close to 0, limma would indeed treat the duplicates
>> similar to independent replicates. Obviously a corfit$consensus value
>> close to 0 would be quite worrying as that would indicate a very high
>> technical variability (or some subtle mistake in your code). Anyway,
>> checking that value might help to solve the puzzle.
>>
>> Claus
>>
>>> -----Original Message-----
>>> From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor-
>>> bounces at stat.math.ethz.ch] On Behalf Of Paul Geeleher
>>> Sent: 23 October 2009 15:13
>>> To: Gordon K Smyth
>>> Cc: Bioconductor mailing list
>>> Subject: Re: [BioC] Very low P-values in limma
>>>
>>> Dear list,
>>>
>>> I have a query, I've rerun my analysis by averaging out the within
>>> array duplicates using the avedups() function instead of
>>> duplicateCorrelation().
>>>
>>> aveMat <- avedups(mat at hx, ndups=4, spacing=1, weights=NULL)
>>> fit <- lmFit(aveMat, design)
>>> ebayes <- eBayes(fit)
>>> topTable(ebayes, coef = 2, adjust = "BH", n = 10)
>>>
>>> When I do this my most significant p-values drop from 10^-7 to .06!
>>>
>>> This seems dubious? It seems like to get values as low as 10^-7 the
>>> duplicateCorrelation() function must be treating the within array
>>> replicates in a similar way it would separate samples, this seems
>>> suspicious to me...
>>>
>>> -Paul.
>>>
>>>
>>>
>>> On Fri, Oct 23, 2009 at 3:45 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>> Dear Paul,
>>>>
>>>> I'm not quite sure why you're so shocked about getting significant
>>> results.
>>>> Usually people complain to me when they don't get significance :).
>>>>
>>>> limma is able to obtain much higher significance levels than a t-test
>>>> because it (i) leverages information from the within-array replicates
>>> and
>>>> (ii) borrows information across probes. These two approaches in
>>> combination
>>>> increase the effective degrees of freedom dramatically.
>>>>
>>>> We would hardly go to so much time and trouble to develop new
>>> statistical
>>>> methods if they didn't improve on ordinary t-tests.
>>>>
>>>> If I were you, I'd be looking at the sizes of the fold changes, and
>>> whether
>>>> the results seem biologically sensible and agree with known information,
>>> and
>>>> so on.
>>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>>
>>>> ---------- original message -----------
>>>> [BioC] Very low P-values in limma
>>>> Paul Geeleher paulgeeleher at gmail.com
>>>> Thu Oct 22 11:34:01 CEST 2009
>>>>
>>>> The value of df.prior is 3.208318. Hopefully somebody can help me here
>>>> because I'm still really at a loss on why these values are so low.
>>>>
>>>> Here's the script. I'm fairly sure it conforms to the documentation:
>>>>
>>>> library(affy)
>>>>
>>>> setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/')
>>>>
>>>> # create a color pallete for boxplots
>>>> cols <- brewer.pal(12, "Set3")
>>>>
>>>> # This defines the column name of the mean Cy3 foreground intensites
>>>> #Cy3 <- "F532 Median"
>>>> # background corrected value
>>>> Cy3 <- "F532 Median - B532"
>>>>
>>>> # This defines the column name of the mean Cy3 background intensites
>>>> Cy3b <- "B532 Mean"
>>>>
>>>> # Read the targets file (see limmaUsersGuide for info on how to create
>>> this)
>>>> targets <- readTargets("targets.csv")
>>>>
>>>> # read values from gpr file
>>>> # because there is no red channel, read Rb & R to be the same values as
>>> G
>>>> and Gb
>>>> # this is a type of hack that allows the function to work
>>>> RG <- read.maimages( targets$FileName,
>>>> source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b))
>>>>
>>>> # remove the extraneous values red channel values
>>>> RG$R <- NULL
>>>> RG$Rb <- NULL
>>>>
>>>> # this line of code will remove any negative values which cause errors
>>>> further on
>>>> RG$G[RG$G<0] <- 0
>>>>
>>>> # create a pData for the data just read
>>>> # this indicates which population each array belongs to
>>>> # a are "her-" and b are "her+" (almost certain)
>>>> pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b'))
>>>> rownames(pData) <- RG$targets$FileName
>>>>
>>>> # create design matrix
>>>> design <- model.matrix(~factor(pData$population))
>>>>
>>>> # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in
>>> their
>>>> name
>>>> # so the grep function can be used to extract all of these, removing
>>>> # all control signals and printing buffers etc.
>>>> # You need to check your .gpr files to find which signals you should
>>>> extract.
>>>> miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name))
>>>> RG.final <- RG[miRs, ]
>>>>
>>>> # load vsn library, this contains the normalization functions
>>>> library('vsn')
>>>>
>>>> # Do VSN normalization and output as vns object
>>>> mat <- vsnMatrix(RG.final$G)
>>>>
>>>> # insert rownames into matrix with normalized data
>>>> # this will mean that the gene names will appear in our final output
>>>> rownames(mat at hx) <- RG.final$genes$Name
>>>>
>>>> # my .gpr files contain 4 "within-array replicates" of each miRNA.
>>>> # We need to make full use of this information by calculating the
>>> duplicate
>>>> correlation
>>>> # in order to use the duplicateCorrelation() function below,
>>>> # we need to make sure that the replicates of each gene appear in
>>> sequence
>>>> in this matrix,
>>>> # so we sort the normalized values so the replicate groups of 4 miRs
>>> appear
>>>> in sequence
>>>> mat at hx <- mat at hx[order(rownames(mat at hx)), ]
>>>>
>>>> # calculate duplicate correlation between the 4 replicates on each array
>>>> corfit <- duplicateCorrelation(mat, design, ndups=4)
>>>>
>>>> # fit the linear model, including info on duplicates and correlation
>>>> fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus)
>>>>
>>>> # calculate values using ebayes
>>>> ebayes <- eBayes(fit)
>>>>
>>>> # output a list of top differnetially expressed genes...
>>>> topTable(ebayes, coef = 2, adjust = "BH", n = 10)
>>>>
>>>>
>>>> -Paul.
>>>>
>>>>
>>>> On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at wehi.edu.au> wrote:
>>>>
>>>>> Dear Paul:
>>>>>
>>>>> The low p-values you have got are not surprising to me. I have got
>>>>
>>>> even
>>>>>
>>>>> lower FDR adjusted p-values than yours. This just means you got
>>>>> significantly differentially expressed genes. But on the other hand, I
>>>>
>>>> did
>>>>>
>>>>> see much higher adjusted p-values in some of my microarray analyses. In
>>>>
>>>> that
>>>>>
>>>>> case, I will explore the data in more depth such as looking at batch
>>>>
>>>> effect
>>>>>
>>>>> etc.
>>>>>
>>>>> Cheers,
>>>>> Wei
>>>>>
>>>>>
>>>>> Paul Geeleher wrote:
>>>>>
>>>>>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering
>>>>
>>>> about
>>>>>>
>>>>>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs
>>>>>> Group
>>>>>> B' experimental design. 4 arrays in one group, 3 in the other and 4
>>>>>> duplicate spots for each miRNA on each array.
>>>>>>
>>>>>> The lowest adjusted p-values in the differential expression analysis
>>>>
>>>> are
>>>>>>
>>>>>> in
>>>>>> the region of 10^-7.
>>>>>>
>>>>>> Its been pointed out to me that plugging the point values from each
>>>>
>>>> sample
>>>>>>
>>>>>> into a regular t-test you get p=0.008, which then also needs to take
>>>>
>>>> the
>>>>>>
>>>>>> multiple test hit. Can anybody explain why limma is giving me such
>>>>
>>>> lower
>>>>>>
>>>>>> values and if they are valid?
>>>>>>
>>>>>> I can provide more information if required.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Paul.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Paul Geeleher
>>>> School of Mathematics, Statistics and Applied Mathematics
>>>> National University of Ireland
>>>> Galway
>>>> Ireland
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Paul Geeleher
>>> School of Mathematics, Statistics and Applied Mathematics
>>> National University of Ireland
>>> Galway
>>> Ireland
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> The University of Aberdeen is a charity registered in Scotland, No SC013683.
>>
>
>
>
> --
> Paul Geeleher
> School of Mathematics, Statistics and Applied Mathematics
> National University of Ireland
> Galway
> Ireland
>
More information about the Bioconductor
mailing list