[BioC] limma vs t-test

Gordon K Smyth smyth at wehi.EDU.AU
Thu May 29 09:16:03 CEST 2014


Dear Giovanni,

On Tue, 27 May 2014, Giovanni Bucci wrote:

> Hi Gordon,
>
> I have two questions regarding applying a t-test and limma to microarray
> data.
>
> 1. For large number of samples per condition (for example 100 samples used
> in the R code sent previously) do you expect a high correlation between the
> 2 sets of p values?

For every large sample sizes, one would obviously expect a high 
correspondence in DE results between ordinary and moderated t.  However 
correlating p-values on the linear scale would be a poor way to show this, 
because the correlation will be determined almost entirely by 
non-significant p-values.  Correlating t-statistics (ordinary vs moderated 
t) would be better.

> The R code shows that there is no correlation between the p values when 
> the distribution of the standard deviation is taken from a real 
> microarray experiment.

Your code doesn't run for me, and does not look like a normal microarray 
pipeline.

> However, the correlation is high when the standard deviation distribution
> is uniform.
>
> This suggests to me that the eBayes step in limma "adjusts" the standard
> deviation significantly when its distribution is similar to a microrarray
> experiment even though each condition has a large number of samples that
> already provides a good estimate of the standard deviation.

Your conclusion is incorrect.

> I would appreciate your thoughts on this since I might be missing something.
>
> 2. Most likely the number of microarray experiments that have 100 
> samples per condition is very small. The number of experiments that 
> could used a pooled variance is much higher. I would like to run some 
> simulations that show the more conditions I have in an experiment the 
> closer I get to the ground truth.

This is untrue.  The variances estimates would converge (assuming that the 
variances are indeed equal for every condition), but the fold-changes 
would not become more precise with more groups.

> The idea here is similar to what happens to the law of large numbers. 
> The more conditions you have in your experiment the better the estimate 
> of the variance, the closer you get to the ground truth. Only that I do 
> not know what the ground truth is. I thought that I could use the t-test 
> with 100 samples in each condition, but since the ranking of genes is 
> different than what limma produces, I'm not sure.
>
> So the question is, is it possible to produce a simulation which shows 
> in a "law of large numbers" kind of fashion that the more conditions are 
> included in the model the closer you get to the ground truth? Also it 
> would be nice to show that limma is closer to the ground truth than the 
> equivalent t-test.

A number of papers have shown that limma separates true DE genes from 
nulls better than ordinary t.  As the residual df becomes large, the 
relative advantage of limma will diminish.

Gordon

> Thank you,
>
> Giovanni
>
>
>
>
>
>
>
>
> On Sat, May 24, 2014 at 4:06 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> Date: Thu, 22 May 2014 16:24:53 -0500
>>> From: Giovanni Bucci <giovannifbucci at gmail.com>
>>> To: bioconductor at r-project.org
>>> Subject: [BioC] limma vs t-test
>>>
>>> Hi everybody,
>>>
>>> I have a question regarding comparing limma to the t-test. I compared the
>>> p
>>> values obtained from both algorithms using 100 samples in each condition.
>>> Given the large number of observations my expectation was to see high
>>> correlation between the p values. As shown below, I ran the same code in
>>> three different conditions for the mean and standard deviation. The mean
>>> and standard deviation loaded from the google docs files are from a real
>>> micro array experiment.
>>>
>>> 1. mean and std from microarray exeperiment: no correlation between p
>>> values
>>>
>>> 2. constant fold change and std from microarray exeperiment: very small
>>> correlation
>>>
>>> 3. constant fold change and uniform std from 0.01 to 0.2: high correlation
>>>
>>> I understand that limma uses information across genes, but shouldn't this
>>> information be weighed with the number of observations for each condition?
>>>
>>
>> limma does take account of the number of observations in each condition.
>>
>> Gordon
>>
>>  Thank you,
>>>
>>> Giovanni
>>>
>>> library(RCurl)
>>> library(genefilter)
>>> library(limma)
>>>
>>> setwd(tempdir())
>>>
>>>
>>> ##
>>> https://drive.google.com/file/d/0B__nP63GoFhMd042V09GcHFvVFE/edit?
>>> usp=sharing
>>> x = getBinaryURL("
>>> https://docs.google.com/uc?export=download&id=0B__
>>> nP63GoFhMd042V09GcHFvVFE",
>>> followlocation = TRUE, ssl.verifypeer = FALSE)
>>> writeBin(x, "std_var.txt", useBytes = TRUE)
>>> std_var =  as.matrix(read.table("std_var.txt",
>>> header = FALSE, sep = "\t", quote="\"", comment.char = ""
>>> ))
>>>
>>> ##
>>> https://drive.google.com/file/d/0B__nP63GoFhMc2tFT3hXVW52Q0E/edit?
>>> usp=sharing
>>> x = getBinaryURL("
>>> https://docs.google.com/uc?export=download&id=0B__
>>> nP63GoFhMc2tFT3hXVW52Q0E",
>>> followlocation = TRUE, ssl.verifypeer = FALSE)
>>> writeBin(x, "mean_var.txt", useBytes = TRUE)
>>> mean_var = as.matrix(read.table("mean_var.txt",
>>> header = FALSE, sep = "\t", quote="\"", comment.char = ""
>>> ))
>>>
>>>
>>> NO_OF_GROUPS = 2
>>>
>>> NO_OF_SAMP_PER_GROUP_GROUND_TRUTH = 100
>>>
>>> NO_OF_GENES = NROW(mean_var)
>>>
>>> gxexprs_GroundTruth = matrix(0, NO_OF_GENES,
>>> NO_OF_GROUPS*NO_OF_SAMP_PER_GROUP_GROUND_TRUTH)
>>>
>>>
>>> std_var_list = list(std_var, std_var,seq(.01,.2, length.out= NO_OF_GENES))
>>>
>>> simulated_mean = matrix(c(8,9),NO_OF_GENES,NO_OF_GROUPS, byrow=TRUE)
>>>
>>> mean_var_list = list(mean_var, simulated_mean, simulated_mean)
>>>
>>> main_list = list("micorarray mean and std", "means 8 and 9 and micorarray
>>> std",  "means 8 and 9 and uniform std from 0.01 to 0.2")
>>>
>>> for(count_i in 1:3)
>>> {
>>>
>>> mean_var = mean_var_list[[count_i]]
>>> std_var = std_var_list[[count_i]]
>>>
>>>
>>> for(i in 1:NO_OF_GENES)
>>> {
>>> if ((i%%1000)==1)
>>> {
>>> cat(i, "\n")
>>> }
>>> for(j in 1:NO_OF_GROUPS)
>>> {
>>> gxexprs_GroundTruth[i, NO_OF_SAMP_PER_GROUP_GROUND_TRUTH*(j-1) +
>>> 1:NO_OF_SAMP_PER_GROUP_GROUND_TRUTH]=
>>> rnorm(NO_OF_SAMP_PER_GROUP_GROUND_TRUTH, mean_var[i,j], std_var[i])
>>> }
>>> }
>>>
>>> final_choice = c(1,2)
>>>
>>> Group = factor(sprintf("S%02d",
>>> rep(final_choice,each=NO_OF_SAMP_PER_GROUP_GROUND_TRUTH)
>>> ))
>>>
>>> t_test_pvals = rowttests(gxexprs_GroundTruth, fac=Group)$p.value
>>>
>>>
>>> design <- model.matrix(~0+Group)
>>> colnames(design) <- gsub("Group","",colnames(design))
>>> contrastsX = sprintf("S01 - S%02d",2)
>>>
>>> fit <- lmFit(gxexprs_GroundTruth, design)
>>>
>>> contrast.matrix <- makeContrasts(contrasts=contrastsX,levels=design)
>>> fit2 <- contrasts.fit(fit, contrast.matrix)
>>> fit2 <- eBayes(fit2)
>>>
>>> TTable = topTable(fit2, coef=1, adjust="fdr", number =
>>> nrow(gxexprs_GroundTruth))
>>> limma_pvals = TTable$P.Value[as.numeric(rownames(TTable))]
>>>
>>> dev.new()
>>> plot(-log10(limma_pvals), -log10(t_test_pvals), main=main_list[[count_i]])
>>>
>>> }
>>>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list