[BioC] Problem fitting a linear model using limma
James W. MacDonald
jmacdon at uw.edu
Thu Mar 8 17:25:37 CET 2012
Hi Joaquin,
I think you may have painted yourself into a corner with this
experiment. You have done something similar to the technical replication
examples in section 8.2, but the problem is that you haven't done enough
replicates to analyze that way. Note that in the examples in section
8.2, there are at least two identical (or dye swap) slides for each
comparison, whereas you only have one.
The experiment at the bottom of p.40 only has mu and wt samples, with
three technical replicates each, and they used 18 slides! To do
something similar, you would need 54 slides, not 21.
If this were my experiment to analyze, I would probably go with a single
channel analysis with either a blocking variable or
duplicateCorrelation() to account for the technical replicates. This is
a non-trivial analysis and I would highly recommend getting some help
from a local statistician, preferably one with microarray experience.
Best,
Jim
On 3/7/2012 6:26 PM, Joaquin Martinez wrote:
> Hi Jim,
>
> Once again, thank you for your helpful advice. Maybe I was not very
> clear in my original email. In the targets file that I sent, we have
> both biological replication (e.g. Puninf.1, Puninf.2 and Puninf.3) and
> technical replication (e.g. Puninf.1 present in slides29, 30 and 31)
> instead of including a common reference sample in all hybridizations.
> Ours is more like the "Direct Two-Color Designs" in section 7.4
> including "Technical Replication" as in section 8.2. Based on this,
> does your previous advice still hold? Should we remove the numbers
> from the sample names? Or are there any further considerations we
> should take into account in our analysis?
>
> I truly appreciate your help.
>
> Best,
> Joaquin
>
> On Wed, Mar 7, 2012 at 3:07 PM, James W. MacDonald <jmacdon at uw.edu
> <mailto:jmacdon at uw.edu>> wrote:
>
> Hi Joaquin,
>
>
> On 3/7/2012 2:12 PM, Joaquin Martinez wrote:
>
> Hi Jim,
>
> Thank you for your quick reply and advice. Actually we had
> considered doing exactly what you suggest and remove the
> numbers from the sample names. My concern then is what is then
> the reference or baseline?, i.e. Puninf.1 or an average value
> of Puninf.1, Puninf.2 and Puninf.3. Also, how does removing
> numbers take into consideration slight value differences among
> biological replicate samples?
>
>
> The baseline is the average of the Puninf samples. Which is, I
> believe, what you want. To make the comparisons you request, limma
> is going to compute a t-statistic, which simply put is
>
> difference in means between two samples/ measure of the
> variability of the means
>
> So the numerator of the statistic tells you how different the two
> samples are. The denominator (which is a measure of the slight
> differences among biological replicates) is used to determine if
> the differences between samples is 'large' or not. If there is a
> lot of variability within sample types, the numerator has to be
> pretty big. Conversely, if there isn't much variability within
> sample types then a smaller numerator may be considered
> significantly large.
>
> Note that you aren't forced to use Puninf as the reference. That's
> just the easiest way to use modelMatrix(). You can also pass in a
> matrix for the 'param' argument, selecting direct comparisons to
> make. Or you can just leave Puninf as the reference, and make
> further comparisons with a contrast matrix, which is probably simpler.
>
> Note here that the design matrix with Puninf as reference is
> computing implicit comparisons (e.g., the M163 coefficient is
> really log2(M163) - log2(Puninf)). So if you want to know if there
> is a significant difference between M163 and M86, you can just add
> a contrast matrix that looks like
>
> 1
> -1
> 0
> 0
>
> Which simply means that we will be computing log2(M163) -
> log2(Puninf) - (log2(M86) - log2(Puninf)). Algebraically, the two
> log2(Puninf) will cancel out, and you end up with log2(M163) -
> log2(M86). You can create any contrast matrix you want using
> makeContrasts().
>
>
>
> In the example we are following in the limma user's guide
> there are 3 wild-type mice (wt1,wt2,wt3) and 3 mutant mice
> (mu1, mu2, mu3), where wt1 is used as the reference. I think
> we have the same type of study but with 6 "mouse types".
>
>
> Unless I am mistaken, you are looking at the wrong example. The
> example you reference involves technical replication, where RNA
> from each mouse was hybridized to more than one slide. From what
> you said originally, each of the samples in your experiment is a
> biological replicate. Your experiment is more similar to the
> common reference design experiment on p34 of the limma User's Guide.
>
> Best,
>
> Jim
>
>
>
> Thanks,
> Joaquin
>
>
>
>
>
> On Wed, Mar 7, 2012 at 12:01 PM, James W. MacDonald
> <jmacdon at uw.edu <mailto:jmacdon at uw.edu> <mailto:jmacdon at uw.edu
> <mailto:jmacdon at uw.edu>>> wrote:
>
> Hi Joaquin,
>
>
> On 3/7/2012 11:08 AM, Joaquin Martinez wrote:
>
> Dear All,
>
> My colleague Ben and I have fitted a linear model to
> microarray data using
> the limma package but we get "coefficients not
> estimable" for
> the last two
> columns in the design matrix during the fit. As a
> consequence
> when trying
> to fit our contrast matrix we get the following error
> message,
> "Error in
> contrasts.fit(fit, contrasts = A) : trying to take
> contrast of
> non-estimable coefficient." We have discovered that if we
> simply reorder
> the design matrix columns and contrasts matrix rows, we
> encounter the same
> error. We would very much appreciate if someone could
> explain
> to us why the
> not-estimable coefficients arise, and how to correct
> the problem.
>
> We are following the guide (section 8.2) for limma 3.10.0
> using this
> version of R 2.14.0 (sessionInfo at end of email). We
> have 6
> different
> types of samples (Puninf, P86, P163, Muninf, M86,
> M163), and 3
> biological
> replicates of each (1,2,3; 4,5,6; 7,8,9; 10,11,12;
> 13,14,15;
> 16,17,18).
>
>
> Yes, but your targets file has numbers appended to the sample
> types, which makes R think they are different.
> Hypothetically you
> would have noticed this when using the modelMatrix() function:
>
>
> > modelMatrix(targets, ref = "Puninf.1")
> Found unique target names:
> M163.16 M163.17 M163.18 M86.13 M86.14 M86.15 Muninf.10
> Muninf.11
> Muninf.12 P163.7 P163.8 P163.9 P86.4 P86.5 P86.6 Puninf.1
> Puninf.2
> Puninf.3
>
> This indicates that limma thinks all of your replicates are
> different sample types. This won't work out (as you have
> already
> found) because the model will be over specified, which simply
> means that you cannot estimate all the parameters you have
> in the
> model. This is the same as saying you can't determine the
> values
> of x and y in the formula x + y = 3 (but if you also know
> that x
> - y = -1, then you can solve for both x and y).
>
> Now back to the problem at hand. If you remove the numbers from
> your sample types,
>
> targets$Cy3 <- sapply(strsplit(targets$Cy3, "\\."),
> function(x) x[1])
> targets$Cy5 <- sapply(strsplit(targets$Cy5, "\\."),
> function(x) x[1])
>
> then you will create the correct model matrix
>
> > modelMatrix(targets, ref="Puninf")
> Found unique target names:
> M163 M86 Muninf P163 P86 Puninf
> M163 M86 Muninf P163 P86
> [1,] 0 0 1 0 0
> [2,] 0 0 0 0 1
> [3,] 0 0 0 -1 0
> [4,] 0 -1 1 0 0
> [5,] 1 0 -1 0 0
> [6,] 0 1 0 0 -1
> [7,] -1 0 0 1 0
> [8,] 0 0 -1 0 0
> [9,] 0 0 0 0 -1
> [10,] 0 0 0 1 0
> [11,] 0 1 -1 0 0
> [12,] -1 0 1 0 0
> [13,] 0 -1 0 0 1
> [14,] 1 0 0 -1 0
> [15,] 0 0 1 0 0
> [16,] 0 0 0 0 1
> [17,] 0 0 0 -1 0
> [18,] 0 -1 1 0 0
> [19,] 1 0 -1 0 0
> [20,] 0 1 0 0 -1
> [21,] -1 0 0 1 0
>
> Which will compare all samples to Puninf.
>
> Best,
>
> Jim
>
>
>
>
>
>
> # The targets file looks like this
>
> targets<- structure(list(SlideNumber = c(29L, 30L, 31L,
> 32L,
> 33L, 34L,
> 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L,
> 51L, 46L,
> 47L,
> 48L, 50L), FileName = c("BE T22h slide 29", "BE T22h
> slide 30",
> "BE T22h slide 31", "BE T22h slide 32", "BE T22h
> slide 33",
> "BE T22h
> slide 34",
> "BE T22h slide 35", "BE T22h slide 36", "BE T22h
> slide 37",
> "BE T22h
> slide 38",
> "BE T22h slide 39", "BE T22h slide 40", "BE T22h
> slide 41",
> "BE T22h
> slide 42",
> "BE T22h slide 43", "BE T22h slide 44", "BE T22h
> slide 51",
> "BE T22h
> slide 46",
> "BE T22h slide 47", "BE T22h slide 48", "BE T22h
> slide 50"),
> Cy3 = c("Puninf.1", "Puninf.1", "P163.7", "M86.13",
> "Muninf.10",
> "P86.4", "M163.16", "Muninf.11", "P86.5", "Puninf.2",
> "Muninf.11",
> "M163.17", "M86.14", "P163.8", "Puninf.3", "Puninf.3",
> "P163.9",
> "M86.15", "Muninf.12", "P86.6", "M163.18"), Cy5 =
> c("Muninf.10",
> "P86.4", "Puninf.1", "Muninf.10", "M163.16", "M86.13",
> "P163.7",
> "Puninf.2", "Puninf.2", "P163.8", "M86.14",
> "Muninf.11",
> "P86.5", "M163.17", "Muninf.12", "P86.6", "Puninf.3",
> "Muninf.12",
> "M163.18", "M86.15", "P163.9")), .Names =
> c("SlideNumber",
> "FileName", "Cy3", "Cy5"), class = "data.frame",
> row.names
> = c(NA,
> -21L))
>
> # for readability we also paste it in here...
> #> targets
> # SlideNumber FileName Cy3 Cy5
> # 1 29 BE T22h slide 29 Puninf.1 Muninf.10
> # 2 30 BE T22h slide 30 Puninf.1 P86.4
> # 3 31 BE T22h slide 31 P163.7 Puninf.1
> # 4 32 BE T22h slide 32 M86.13 Muninf.10
> # 5 33 BE T22h slide 33 Muninf.10 M163.16
> # 6 34 BE T22h slide 34 P86.4 M86.13
> # 7 35 BE T22h slide 35 M163.16 P163.7
> # 8 36 BE T22h slide 36 Muninf.11 Puninf.2
> # 9 37 BE T22h slide 37 P86.5 Puninf.2
> # 10 38 BE T22h slide 38 Puninf.2 P163.8
> # 11 39 BE T22h slide 39 Muninf.11 M86.14
> # 12 40 BE T22h slide 40 M163.17 Muninf.11
> # 13 41 BE T22h slide 41 M86.14 P86.5
> # 14 42 BE T22h slide 42 P163.8 M163.17
> # 15 43 BE T22h slide 43 Puninf.3 Muninf.12
> # 16 44 BE T22h slide 44 Puninf.3 P86.6
> # 17 51 BE T22h slide 51 P163.9 Puninf.3
> # 18 46 BE T22h slide 46 M86.15 Muninf.12
> # 19 47 BE T22h slide 47 Muninf.12 M163.18
> # 20 48 BE T22h slide 48 P86.6 M86.15
> # 21 50 BE T22h slide 50 M163.18 P163.9
>
> # The design matrix is computed like this design<-
> modelMatrix(targets,
> ref = "Puninf.1")
> # but here is the structure if it is helpful
> design<- structure(c(0, 0, 0, 0, 1, 0, -1, 0, 0, 0, 0,
> 0, 0,
> 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1,
> 0, 1, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0,
> 0, 0, 0, 1, 0, -1, 0, 0, 0, -1, 0, 1, 0, 0, 0, 0, 0,
> 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 1, 0, -1,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0,
> 0, 0, 0, 0, -1, 0, 1, 0, 1, 0, 0, 1, -1, 0, 0, 0, 0,
> 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1,
> 0, 0, -1,
> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0,
> 0, 0, 0, 1, 0, 0, 1, -1, 0, 0, 0, 0, -1, 0, 0, 0, 1,
> 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 1,
> 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
> -1, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0,
> -1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, 0, 0,
> 0, 0, 0,
> 0, 1, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 1, 0, 0, 0, 0),
> .Dim =
> c(21L,
> 17L), .Dimnames = list(c("slide 29", "slide 30",
> "slide 31",
> "slide 32", "slide 33", "slide 34", "slide 35",
> "slide 36",
> "slide 37",
> "slide 38", "slide 39", "slide 40", "slide 41",
> "slide 42",
> "slide 43",
> "slide 44", "slide 51", "slide 46", "slide 47",
> "slide 48",
> "slide 50"
> ), c("M163.16", "M163.17", "M163.18", "M86.13",
> "M86.14",
> "M86.15",
> "Muninf.10", "Muninf.11", "Muninf.12", "P163.7",
> "P163.8",
> "P163.9",
> "P86.4", "P86.5", "P86.6", "Puninf.2", "Puninf.3")))
>
>
> # Fit linear model - commented out becuase we can't really
> send all the
> data in MA
> # fit<- lmFit(MA, design = design)
> # Coefficients not estimable: Puninf.2 Puninf.3
>
> # Make a contrast matrix:
> A<-
>
> makeContrasts(PuninfvsMuninf=(Muninf.10+Muninf.11+Muninf.12-Puninf.2-Puninf.3)/3,
> levels = design)
>
> # and now fit the contrast
> con.fit<- contrasts.fit(fit, contrasts = A)
> Error in contrasts.fit(fit, contrasts = A) : trying to take
> contrast of
> non-estimable coefficient
>
> # now change the column order in the design matrix and row
> order of
> contrast matrix
> # we create a new sort index - arbitrarily placing the last
> two columns in
> the original
> # design matrix in the middle of the new one. Ditto
> for the
> rows of the
> contrast matrix.
> n<- nrow(A)
> newIndex<- c(1:5, n-1, n, 6:(n-2))
> designB<- design[,newIndex]
> B<- A[newIndex,]
> attributes(B)<- attributes(A)
>
> # create a new fit, but we get the same knd of error,
> the last
> two columns
> are
> # identified as non-estimable
> fitB<- lmFit(X$MA, design = designB)
> #Coefficients not estimable: P86.5 P86.6
> #Warning message:
> #Partial NA coefficients for 16278 probe(s)
>
>
> As far as we can tell, it's the trailing columns that get
> singled out, but
> we don't know why. Hopefully the information about is
> detailed
> enough, but
> we can provide more information if necessary to help
> troubleshooting.
>
> Thanks in advance,
>
> Joaquin
>
> sessionInfo()
>
> R version 2.14.0 (2011-10-31)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1]
> en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets
> methods
> base
>
> other attached packages:
> [1] limma_3.10.0
>
> loaded via a namespace (and not attached):
> [1] tools_2.14.0
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-project.org
> <mailto:Bioconductor at r-project.org>>
>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> -- James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list