[BioC] limma differences with and without an intercept
Gordon K Smyth
smyth at wehi.EDU.AU
Mon Feb 10 23:55:01 CET 2014
Dear Rory,
> Date: Sun, 9 Feb 2014 13:44:50 -0800 (PST)
> From: "Rory Kirchner [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, rory.kirchner at gmail.com
> Subject: [BioC] limma differences with and without an intercept
>
>
> When setting up a model in limma, if I include or exclude an intercept
> term, I get slightly different results.
There are a few situations in a which contrasts.fit() uses an
approximation to avoid having to refit the linear model for each gene.
The help page for contrasts.fit gives the following warning:
"Warning. For efficiency reasons, this function does not re-factorize the
design matrix for each probe. A consequence is that, if the design matrix
is non-orthogonal and the original fit included quality weights or missing
values, then the unscaled standard deviations produced by this function
are approximate rather than exact. The approximation is usually
acceptable. The results are always exact if the original fit was a oneway
model."
This approximation has generally been fine in the past with microarray
data, but the voom weights are pushing the approximation to the point that
it is having a noticeable effect on some analyses. We are planning to
rewrite the linear model code in the limma to eliminate the approximation
by storing more information from each fit.
> For example here is a simplified
> version of my model, which uses a paired design, each patient is
> represented twice with a measurement when sick and when recovered:
>
> design = model.matrix(~ patient + status, data=samples_norm)
> colnames(design)
> [1] "(Intercept)" "patientA010" "patientA011"
> [4] "patientA019" "patientA024" "patientA030"
> [7] "patientA031" "patientA034" "patientA036"
> [10] "patientA041" "patientA055" "patientA067"
> [13] "patientA073" "patientA080" "statusexacerbation"
>
> The first patient is patient A004. If I run an analysis and compare patient A010 to the baseline (patientA004):
>
> y = DGEList(counts=counts)
> y = calcNormFactors(y)
> v = voom(y, design)
> fit = lmFit(v, design)
> fit = eBayes(fit)
> a = rownames(topTable(fit, coef=2, n=Inf, p.value=0.05))
> length(a)
> [1] 70
This result is exact.
> If I do the same analysis, but use an intercept term I get a slightly different result:
> design = model.matrix(~ 0 + patient + status,
> data=samples_norm)
> y = DGEList(counts=counts)
> y = calcNormFactors(y)
> v = voom(y, design)
> fit = lmFit(v, design)
> cm = makeContrasts(
> onepatient=patientA010-patientA004,
> levels=design)
> fit = contrasts.fit(fit, cm)
> fit = eBayes(fit)
> b = rownames(topTable(fit, n=Inf, p.value=0.05))
> length(b)
> [1] 72
This result is approximate.
>> length(intersect(a, b))
> [1] 69
>
> This does not occur using the pickrell1 dataset.
Because the design matrix is orthogonal in that case. The approximation
only arises when there is an additive term (+ status in your case).
Best wishes
Gordon
> At first I thought this
> might just be random due to rounding of the p-values, but one of them is
> not really close to the cutoff so that isn't it. Should I not be
> expecting these two ways of parameterizing the model to give the same
> results?
>
> -- output of sessionInfo():
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-apple-darwin13.0.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] splines grid parallel stats graphics grDevices utils
> [8] datasets methods base
>
> other attached packages:
> [1] Vennerable_3.0 gtools_3.2.1
> [3] RBGL_1.38.0 graph_1.40.1
> [5] CHBUtils_0.1 tweeDEseqCountData_1.0.10
> [7] sva_3.8.0 statmod_1.4.18
> [9] rpart_4.1-5 RColorBrewer_1.0-5
> [11] mgcv_1.7-28 nlme_3.1-113
> [13] DESeq_1.14.0 lattice_0.20-24
> [15] locfit_1.5-9.1 corpcor_1.6.6
> [17] BiocInstaller_1.12.0 affyPLM_1.38.0
> [19] preprocessCore_1.24.0 gcrma_2.34.0
> [21] affy_1.40.0 gplots_2.12.1
> [23] biomaRt_2.18.0 knitr_1.5
> [25] devtools_1.4.1 reshape_0.8.4
> [27] plyr_1.8 DESeq2_1.2.10
> [29] RcppArmadillo_0.4.000.2 Rcpp_0.11.0
> [31] GenomicRanges_1.14.4 XVector_0.2.0
> [33] IRanges_1.20.6 vsn_3.30.0
> [35] gridExtra_0.9.1 ggplot2_0.9.3.1
> [37] HTSFilter_1.2.1 Biobase_2.22.0
> [39] BiocGenerics_0.8.0 edgeR_3.4.2
> [41] limma_3.18.11 googleVis_0.4.7
> [43] xtable_1.7-1 extrafont_0.16
> [45] dplyr_0.1.1
>
> loaded via a namespace (and not attached):
> [1] affyio_1.30.0 annotate_1.40.0 AnnotationDbi_1.24.0
> [4] assertthat_0.1 Biostrings_2.30.1 bitops_1.0-6
> [7] caTools_1.16 codetools_0.2-8 colorspace_1.2-4
> [10] compiler_3.0.2 DBI_0.2-7 dichromat_2.0-0
> [13] digest_0.6.4 evaluate_0.5.1 extrafontdb_1.0
> [16] formatR_0.10 gdata_2.13.2 genefilter_1.44.0
> [19] geneplotter_1.40.0 gtable_0.1.2 httr_0.2
> [22] KernSmooth_2.23-10 labeling_0.2 MASS_7.3-29
> [25] Matrix_1.1-2 memoise_0.1 munsell_0.4.2
> [28] proto_0.3-10 RCurl_1.95-4.1 reshape2_1.2.2
> [31] RJSONIO_1.0-3 RSQLite_0.11.4 Rttf2pt1_1.2
> [34] scales_0.2.3 stats4_3.0.2 stringr_0.6.2
> [37] survival_2.37-7 tools_3.0.2 whisker_0.3-2
> [40] XML_3.98-1.1 zlibbioc_1.8.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list