[BioC] limma, contrasts, and NA's
Gordon K Smyth
smyth at wehi.EDU.AU
Tue Aug 17 06:25:58 CEST 2010
Dear Albyn,
Many thanks for the very nice reproducible and complete data example.
I've always agreed that spot-specific NA's do have an effect on the
results from contrasts.fit(), and the results can change a bit with choice
of reference. I will try to add some extra functionality to lmFit() so
that users can extract contrasts at a stage when the genewise QR
decompositions are still available, so all calculations can be done
exactly. Thanks for sending me your version of code to do that.
Let me come back though to the theme of whether there should be so many
NAs in the data in the first place, and this is partly why I asked you for
a reproducible example. In my experience, if there are enough NAs to
cause problems for contrasts.fit(), then it is usually a symptom that too
much spot flagging is being done. I hope you won't mind if I make some
comments on the filtering in your example. I suspect that the filtering
was probably imposed on you by your collaborators anyway.
In your data example, nearly 40% of all spots on all arrays are being
marked as NA. In my opinion, this is seriously over the top, unless the
arrays are so bad as to be useless. I am very strongly opposed to the
notion of filtering faint spots. Why remove perfectly good data just
because it is a low value rather than a high value? I also don't have
much confidence in GenePix flags. They are mainly a defense against
things that we now know how to control. In my experience, biologists
overestimate their ability to flag bad spots. On the other hand,
filtering empty or non-expressed probes is a good idea, but whole probes
should be removed, not isolated spots on individual arrays. Finally,
filtering should be after the normalization rather than before.
I would probably pre-process your data like this:
library(limma)
url <- "http://people.reed.edu/~renns/jones_renn/maternal_data"
targets <- readTargets("targets_fixed.csv",path=url,sep=",")
RG <- read.maimages(targets,path=url,source="genepix.median")
RGb <-
backgroundCorrect(RG,method="normexp",normexp.method="mle",offset=50)
MA<- normalizeWithinArrays(RGb)
Empty <- MA$genes$ID %in% c("empty","Empty")
fit <- lmFit(MA[!Empty,], design)
etc. Then there would be no missing values, and the whole issue of
approximate calculation wouldn't arise.
Best wishes
Gordon
--------- original message ------------
[BioC] limma, contrasts, and NA's
Albyn Jones jones at reed.edu
Fri Aug 13 20:42:14 CEST 2010
Following up on a thread I initiated last year on missing values
in limma and contrasts.fit
(see
gmane.science.biology.informatics.conductor:26494
gmane.science.biology.informatics.conductor:26483)
I have prepared an example illustrating the fact that when there are
missing data, the contrast.fit function not only doesn't give exact
results, but the results depend on the source chosen as the
"reference" in modelMatrix(). When I pointed out to my colleagues in
biology that they could manually construct a design matrix that would
fit contrasts exactly, they rebelled at the need to construct N-1 linearly
indpendent columns for N sources. The utility of contrasts.fit() is
clear, avoiding the need for non-statisticians to learn to code design
matrices.
albyn
--
Albyn Jones
Reed College
jones at reed.edu
=========================================================================
library(limma)
### necessary files and data are at:
url <- "http://people.reed.edu/~renns/jones_renn/maternal_data/"
targets <- readTargets(paste(url, "targets_fixed.csv", sep = ""), sep =
",")
RG <- read.maimages(paste(url, targets$FileName, sep = ""), columns =
list(R="F635 Median", G="F532 Median", Rb="B635 Median", Gb="B532
Median"),
other.columns = c("Flags","B635 SD","B532 SD","F635 % Sat.","F532 %
Sat."))
names(RG$other) <- c("Flags", "RbSD", "GbSD","RSat" ,"GSat" )
RG$genes <- readGAL(paste(url,"Fishchip4.03_annotatedGAL_20090730HEM.gal",
sep = ""))
RG$printer <- getLayout(RG$genes)
### Filter features manually flagged bad and automatedly flagged "not
found"
RG$R[RG$other$Flags < 0]<-NA
RG$G[RG$other$Flags < 0]<-NA
as.numeric(apply(is.na(RG$R),2,sum))
### Filter faint features
RG$R[(RG$R < (RG$Rb + 2*RG$other$RbSD))&(RG$G < (RG$Gb +
2*RG$other$GbSD))]<-0
RG$G[RG$R == 0]<-NA
RG$R[RG$R == 0]<-NA
as.numeric(apply(is.na(RG$R),2,sum))
table(apply(is.na(RG$R),1,sum))
### Normalize
MA<- normalizeWithinArrays(RG, method = "printtiploess", bc.method =
"minimum")
### set up the design matrices
design1<- modelMatrix(targets, ref = "a1126")
design2<- modelMatrix(targets, ref = "a1217")
### contrast coding
cont1 <- makeContrasts(WvsL = ( - a1111 - a1210 + a0201 + a1217 +
a1220 - a1211 - a1219 + a1128 + a1202 - a1028 - a0128)/6,
levels=design1)
cont2 <- makeContrasts(WvsL = (a1126 - a1111 - a1210 + a0201 +
a1220 - a1211 - a1219 + a1128 + a1202 - a1028 - a0128)/6,
levels=design2)
### analysis
fit1 <- lmFit (MA, design1)
fit.cont1 <- contrasts.fit(fit1, cont1)
fit.ebayes1 <- eBayes(fit.cont1)
fit2 <- lmFit (MA, design2)
fit.cont2 <- contrasts.fit(fit2, cont2)
fit.ebayes2 <- eBayes(fit.cont2)
results1 <- decideTests(fit.ebayes1, method = "separate",
adjust.method="none", p.value=0.05)
results2 <- decideTests(fit.ebayes2, method = "separate",
adjust.method="none", p.value=0.05)
res<-cbind(results1[,1], results2[,1])
colnames(res)<-c("ref1", "ref2")
vennCounts(res, include="both")
# ref1 ref2 Counts
#[1,] 0 0 11862
#[2,] 0 1 30
#[3,] 1 0 100
#[4,] 1 1 814
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list