[R-sig-ME] Problems with formula for highly pseudoreplicate mixed-effects system

Thu Sep 17 13:23:16 CEST 2009

Hello everybody,

I have posted this request on r-help before, but as nobody has responded 
over there, and as I have now followed some of the threads on this 
group, I will reformulate my request here and hope it makes more sense 
this way.

I have been trying for some weeks to state the correct design of my 
experiment as a GLM formula, and have not been able to find something 
appropriate in Pinheiro & Bates 2000 or with any of the local R users, 
so I am posting it here and hope somebody can help me.

In each experimental condition, described by
1) gene (10 levels, fixed, because of high interest to me)
2) species (2 levels, fixed, because of high interest)
3) day (2 levels, random)
4) replicate (2 levels per day, random),

I have several thousand data points consisting of two variables:

5) FITC (level of transfection of a cell)
6) APC (antibody binding to the cell)

Because of intrinsic and uncontrollable cell-to-cell variation, FITC 
varies quite uniformly over a wide range, and APC correlates rather well 
with FITC. In some cases, I made the intrinsic nesting of day and 
replicate explicit by  pasting them together as day_repl.

In order to make the examples clearer, I am attaching a reduced version 
of my data with only two levels of gene and ca. 400 data points per 
experimental condition:

 > 
small<-read.csv("small.csv",colClasses=c("character",rep("integer",2),rep("factor",5)))

My biological question is the following:

Is there any gene (in my set of 10 genes) where the species makes a 
difference in the relation between FITC and APC ? If yes, in what gene 
does species have an effect ? And what is the effect of the species 
difference ?

My first attempt was the following: fit the data points of each 
experimental condition to a linear equation
APC=Intercept+Slope*FITC

and analyse the slopes :
 > lm(Slope~species*gene*day_repl)    # the data set is not included 
here, but rather trivial

This analysis shows clear differences between the genes, but no effect 
of species and no interaction gene:species.

The linear fit to the cells is reasonably good, but of course does not 
represent the data set completely, so I wanted to incorporate the 
complete data set. The comparison is between models containing resp. 
omitting the interaction gene:species.

Model ignoring pseudoreplication:

 > model1=lmer(APC~(FITC+species+gene)3+(1|day)+(1|repl),REML=F,data=small)
 > model2=lmer(APC~(FITC+species+gene)2+(1|day)+(1|repl),REML=F,data=small)
 > anova(model1,model2)

gives me a significant difference for this data set and a highly 
significant difference (p<10-16) for the original data set, but I don't 
trust these results because the analysis does not account for 
pseudoreplication, and with >200000 data points in the original data 
set, any interaction will be significant (I have tried out several 
reduced models). In fact, lme gives >2000 000 denominator dfs. I have 
followed the discussion on why lmer does not estimate denominator dfs. 
My version of lme4 (0.999375-27) does not contain hatTrace to estimate a 
similar parameter.

In fact, I suppose FITC should be nested in day and replicate:
 > model3=lmer(APC~species*gene+(1|day/day_repl/FITC),data=small)
Error: length(f1) == length(f2) is not TRUE
In addition: Warning messages:
1: In day_repl:day :
  numerical expression has 800 elements: only the first used
(and several similar ones)

Can I do nesting without incurring this kind of error ? Or should I 
truncate or sample the experimental conditions with more data points in 
order to have at the end the same number of data points in every 
experimental condition ?

Last attempt:
 > model4=lmer(APC~gene*species+(1|day) + (1|repl) + 
(1+(gene:species)|FITC),data=small)
 > model5=lmer(APC~gene+species+(1|day) + (1|repl) + (1|FITC),data=small)
 > anova(model4,model5)

works with this smaller data set, but fails to converge (after >3 days) 
with the original data set. I am unsure if this is the right kind of 
analysis for the data and there is only a problem of convergence, or if 
it is the wrong formula.

I would be very grateful for any advice.

Matthias Gralle

For your information: I am using R version 2.8.0 (2008-10-20) on Ubuntu 
8.04 on a linux 2.6.24-24-generic kernel on different Intel systems with 
lme4 0.999375-27, and the output of some commmands is given below.

 > library(lme4)
Loading required package: Matrix
Loading required package: lattice

Attaching package: 'Matrix'

        The following object(s) are masked from package:stats :

         xtabs

        The following object(s) are masked from package:base :

         colMeans,
         colSums,
         rcond,
         rowMeans,
         rowSums

 > sessionInfo()
function (package = NULL)
{
    z <- list()
    z$R.version <- R.Version()
    z$locale <- Sys.getlocale()
    if (is.null(package)) {
        package <- grep("^package:", search(), value = TRUE)
        keep <- sapply(package, function(x) x == "package:base" ||
            !is.null(attr(as.environment(x), "path")))
        package <- sub("^package:", "", package[keep])
    }
    pkgDesc <- lapply(package, packageDescription)
    if (length(package) == 0)
        stop("no valid packages were specified")
    basePkgs <- sapply(pkgDesc, function(x) !is.null(x$Priority) &&
        x$Priority == "base")
    z$basePkgs <- package[basePkgs]
    if (any(!basePkgs)) {
        z$otherPkgs <- pkgDesc[!basePkgs]
        names(z$otherPkgs) <- package[!basePkgs]
    }
    loadedOnly <- loadedNamespaces()
    loadedOnly <- loadedOnly[!(loadedOnly %in% package)]
    if (length(loadedOnly)) {
        names(loadedOnly) <- loadedOnly
        pkgDesc <- c(pkgDesc, lapply(loadedOnly, packageDescription))
        z$loadedOnly <- pkgDesc[loadedOnly]
    }
    class(z) <- "sessionInfo"
    z
}
<environment: namespace:utils>

str(small)
'data.frame':   800 obs. of  8 variables:
 $ X       : int  10846 10946 11046 11146 11246 11346 11446 11546 11646 
11746 ...
 $ FITC    : int  740 503 576 554 698 601 752 726 523 524 ...
 $ APC     : int  678 495 505 393 657 607 722 693 571 572 ...
 $ gene    : Factor w/ 2 levels "NM_1039029","NM_12242": 1 1 1 1 1 1 1 1 
1 1 ...
 $ species : Factor w/ 2 levels "human","other": 1 1 1 1 1 1 1 1 1 1 ...
 $ day     : int  20090806 20090806 20090806 20090806 20090806 20090806 
20090806 20090806 20090806 20090806 ...
 $ repl    : Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
 $ day_repl: Factor w/ 4 levels "20090806_a","20090806_b",..: 1 1 1 1 1 
1 1 1 1 1 ...
 > summary(small)
       X               FITC            APC                gene      species
 Min.   :     1   Min.   :417.0   Min.   :256.0   NM_1039029:400   human:400
 1st Qu.: 19921   1st Qu.:555.0   1st Qu.:478.0   NM_12242  :400   other:400
 Median : 93356   Median :634.5   Median :557.0
 Mean   : 88387   Mean   :646.7   Mean   :565.7
 3rd Qu.:155904   3rd Qu.:730.0   3rd Qu.:650.2
 Max.   :168337   Max.   :973.0   Max.   :844.0
      day           repl          day_repl
 Min.   :20090806   a:542   20090806_a:283
 1st Qu.:20090806   b:258   20090806_b:100
 Median :20090812           20090812_a:259
 Mean   :20090809           20090812_b:158
 3rd Qu.:20090812
 Max.   :20090812

-- 
Matthias Gralle, PhD
Dept. Evolutionary Genetics
Max Planck Institute for Evolutionary Anthropology
Deutscher Platz 6
04103 Leipzig, Germany
Tel +49 341 3550 519
Fax +49 341 3550 555