[R-sig-ME] Problems with formula for highly pseudoreplicate mixed-effects system
Matthias Gralle
matthias_gralle at eva.mpg.de
Thu Sep 17 13:23:16 CEST 2009
Hello everybody,
I have posted this request on r-help before, but as nobody has responded
over there, and as I have now followed some of the threads on this
group, I will reformulate my request here and hope it makes more sense
this way.
I have been trying for some weeks to state the correct design of my
experiment as a GLM formula, and have not been able to find something
appropriate in Pinheiro & Bates 2000 or with any of the local R users,
so I am posting it here and hope somebody can help me.
In each experimental condition, described by
1) gene (10 levels, fixed, because of high interest to me)
2) species (2 levels, fixed, because of high interest)
3) day (2 levels, random)
4) replicate (2 levels per day, random),
I have several thousand data points consisting of two variables:
5) FITC (level of transfection of a cell)
6) APC (antibody binding to the cell)
Because of intrinsic and uncontrollable cell-to-cell variation, FITC
varies quite uniformly over a wide range, and APC correlates rather well
with FITC. In some cases, I made the intrinsic nesting of day and
replicate explicit by pasting them together as day_repl.
In order to make the examples clearer, I am attaching a reduced version
of my data with only two levels of gene and ca. 400 data points per
experimental condition:
>
small<-read.csv("small.csv",colClasses=c("character",rep("integer",2),rep("factor",5)))
My biological question is the following:
Is there any gene (in my set of 10 genes) where the species makes a
difference in the relation between FITC and APC ? If yes, in what gene
does species have an effect ? And what is the effect of the species
difference ?
My first attempt was the following: fit the data points of each
experimental condition to a linear equation
APC=Intercept+Slope*FITC
and analyse the slopes :
> lm(Slope~species*gene*day_repl) # the data set is not included
here, but rather trivial
This analysis shows clear differences between the genes, but no effect
of species and no interaction gene:species.
The linear fit to the cells is reasonably good, but of course does not
represent the data set completely, so I wanted to incorporate the
complete data set. The comparison is between models containing resp.
omitting the interaction gene:species.
Model ignoring pseudoreplication:
> model1=lmer(APC~(FITC+species+gene)3+(1|day)+(1|repl),REML=F,data=small)
> model2=lmer(APC~(FITC+species+gene)2+(1|day)+(1|repl),REML=F,data=small)
> anova(model1,model2)
gives me a significant difference for this data set and a highly
significant difference (p<10-16) for the original data set, but I don't
trust these results because the analysis does not account for
pseudoreplication, and with >200000 data points in the original data
set, any interaction will be significant (I have tried out several
reduced models). In fact, lme gives >2000 000 denominator dfs. I have
followed the discussion on why lmer does not estimate denominator dfs.
My version of lme4 (0.999375-27) does not contain hatTrace to estimate a
similar parameter.
In fact, I suppose FITC should be nested in day and replicate:
> model3=lmer(APC~species*gene+(1|day/day_repl/FITC),data=small)
Error: length(f1) == length(f2) is not TRUE
In addition: Warning messages:
1: In day_repl:day :
numerical expression has 800 elements: only the first used
(and several similar ones)
Can I do nesting without incurring this kind of error ? Or should I
truncate or sample the experimental conditions with more data points in
order to have at the end the same number of data points in every
experimental condition ?
Last attempt:
> model4=lmer(APC~gene*species+(1|day) + (1|repl) +
(1+(gene:species)|FITC),data=small)
> model5=lmer(APC~gene+species+(1|day) + (1|repl) + (1|FITC),data=small)
> anova(model4,model5)
works with this smaller data set, but fails to converge (after >3 days)
with the original data set. I am unsure if this is the right kind of
analysis for the data and there is only a problem of convergence, or if
it is the wrong formula.
I would be very grateful for any advice.
Matthias Gralle
For your information: I am using R version 2.8.0 (2008-10-20) on Ubuntu
8.04 on a linux 2.6.24-24-generic kernel on different Intel systems with
lme4 0.999375-27, and the output of some commmands is given below.
> library(lme4)
Loading required package: Matrix
Loading required package: lattice
Attaching package: 'Matrix'
The following object(s) are masked from package:stats :
xtabs
The following object(s) are masked from package:base :
colMeans,
colSums,
rcond,
rowMeans,
rowSums
> sessionInfo()
function (package = NULL)
{
z <- list()
z$R.version <- R.Version()
z$locale <- Sys.getlocale()
if (is.null(package)) {
package <- grep("^package:", search(), value = TRUE)
keep <- sapply(package, function(x) x == "package:base" ||
!is.null(attr(as.environment(x), "path")))
package <- sub("^package:", "", package[keep])
}
pkgDesc <- lapply(package, packageDescription)
if (length(package) == 0)
stop("no valid packages were specified")
basePkgs <- sapply(pkgDesc, function(x) !is.null(x$Priority) &&
x$Priority == "base")
z$basePkgs <- package[basePkgs]
if (any(!basePkgs)) {
z$otherPkgs <- pkgDesc[!basePkgs]
names(z$otherPkgs) <- package[!basePkgs]
}
loadedOnly <- loadedNamespaces()
loadedOnly <- loadedOnly[!(loadedOnly %in% package)]
if (length(loadedOnly)) {
names(loadedOnly) <- loadedOnly
pkgDesc <- c(pkgDesc, lapply(loadedOnly, packageDescription))
z$loadedOnly <- pkgDesc[loadedOnly]
}
class(z) <- "sessionInfo"
z
}
<environment: namespace:utils>
str(small)
'data.frame': 800 obs. of 8 variables:
$ X : int 10846 10946 11046 11146 11246 11346 11446 11546 11646
11746 ...
$ FITC : int 740 503 576 554 698 601 752 726 523 524 ...
$ APC : int 678 495 505 393 657 607 722 693 571 572 ...
$ gene : Factor w/ 2 levels "NM_1039029","NM_12242": 1 1 1 1 1 1 1 1
1 1 ...
$ species : Factor w/ 2 levels "human","other": 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 20090806 20090806 20090806 20090806 20090806 20090806
20090806 20090806 20090806 20090806 ...
$ repl : Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
$ day_repl: Factor w/ 4 levels "20090806_a","20090806_b",..: 1 1 1 1 1
1 1 1 1 1 ...
> summary(small)
X FITC APC gene species
Min. : 1 Min. :417.0 Min. :256.0 NM_1039029:400 human:400
1st Qu.: 19921 1st Qu.:555.0 1st Qu.:478.0 NM_12242 :400 other:400
Median : 93356 Median :634.5 Median :557.0
Mean : 88387 Mean :646.7 Mean :565.7
3rd Qu.:155904 3rd Qu.:730.0 3rd Qu.:650.2
Max. :168337 Max. :973.0 Max. :844.0
day repl day_repl
Min. :20090806 a:542 20090806_a:283
1st Qu.:20090806 b:258 20090806_b:100
Median :20090812 20090812_a:259
Mean :20090809 20090812_b:158
3rd Qu.:20090812
Max. :20090812
--
Matthias Gralle, PhD
Dept. Evolutionary Genetics
Max Planck Institute for Evolutionary Anthropology
Deutscher Platz 6
04103 Leipzig, Germany
Tel +49 341 3550 519
Fax +49 341 3550 555
More information about the R-sig-mixed-models
mailing list