[R-sig-ME] [R] errors with lme4

Sat Nov 26 22:50:49 CET 2011

Investigating in more detail:

library(gdata)
fledge <- read.xls("Data_Ben.xls")

GLM.1 <- glm(fledgesucc ~ eggs + elev + hatch + lay + meanterranova +
             minpengS1 + minpengS2 + nnd +
             npd + seadist + wchillpengS1 + wchillpengS2,
             family=binomial, data=fledge)

## one definition of collinearity is that the cross product
##  of the design matrix, X, with itself, is not of full rank
##  We can analyze this by looking at its eigenvalues and eigenvectors
X <- model.matrix(GLM.1)
ee <- eigen(t(X) %*% X)
ee$values  ## last three eigenvalues are effectively zero ...
(badvals <- which(ee$values<1e-8))

## see the pattern of correlation in the eigenvectors
ev <- round(ee$vectors[,badvals],2)

## which variables are these?
badnames <- lapply(as.data.frame(ev),
       function(x) colnames(X)[x!=0])

## pick out the names, not including the intercept
badcols <- setdiff(Reduce(union,badnames),"(Intercept)")

## take a look at these columns ...
ff <- na.omit(fledge)
summary(ff[,badcols])

## identify strong correlations (we probably could have
##  done this first)
cor(ff[,badcols])>0.97

## look more carefully
with(ff,table(minpengS2,wchillpengS2))
with(ff,table(meanterranova,minpengS1))
with(ff,table(wchillpengS1,wchillpengS2))

## it looks like there are only three values for each of
##  these variables -- maybe there are more in the full data
## set -- but in any case they're extremely strongly correlated
## with each other

On Sat, Nov 26, 2011 at 4:35 PM, Ben Bolker <bbolker at gmail.com> wrote:
>  * It looks like you have 156 observations **after** dropping the 352
> observations
> with missing predictors ...  The 3 covariates are indeed (I very
> strongly suspect) due
> to collinearity.
>
>  I'm trying to work out a clever piece of linear algebra that will
> tell you exactly which
> columns of the design matrix (i.e. predictors) are multicollinear, but
> haven't figured it
> out yet. More sensibly, you should just stare at your data/use common
> sense to figure
> out which ones are completely confounded (because there definitely are
> some).  If two (or more)
> effects are completely confounded (another way of saying "perfectly
> multicollinear"), then it's
> simply impossible to get well-defined estimates of all of their
> effects in the same model.
> You have to decide which one(s) to take out.  (If you like you can run
> separate models
> with each term, but you have to recognize that you will have no way to
> choose among
> them on the basis of the data -- they should all fit equally well.)
>
> On Fri, Nov 25, 2011 at 4:19 AM, Alessio Unisi <franceschi6 at unisi.it> wrote:
>> ...I always try not to work in "a fairly dangerous way": ))
>>
>> Thanks again for help and advices.
>>
>> I run the the GLM without random effect as you suggest, attached the
>> results.
>> Briefly ...3 covariates completely NA...352 observations deleted due to
>> missingness...this is due to collinearity as you suggested?
>> In which way could i solve this problem?
>>
>> i attached a part of the dataset...sorry but can't send the original set of
>> data...sorry..
>>
>> kind regards
>> alessio franceschi
>>
>> Ben Bolker ha scritto:
>>>
>>>  [cc'ing to r-sig-mixed-models list]
>>>
>>> On 11-11-24 03:25 PM, Alessio Unisi wrote:
>>>
>>>>
>>>> Hi Ben,
>>>> thanks for answer!
>>>>
>>>> sorry but i'm a new R-user and i'm not so skilled...also in statistic! :
>>>> ))
>>>> just few answer and question to yours...
>>>>
>>>
>>>  Knowing *neither* R *nor* statistics can be a fairly dangerous
>>> combination.  If you ask politely, people on the R help lists will often
>>> help with statistical questions, but they are technically off-topic.
>>> There are other places (such as http://stats.stackexchange.com/ ) for
>>> asking statistics questions ... and it would really be very best to see
>>> if you can get some local help (classes or helpful colleagues/fellow
>>> students/professors/consultants).
>>>
>>>
>>>>>
>>>>> I can't prove it, but I strongly suspect that some of your
>>>>> coefficients are perfectly multicollinear.
>>>>
>>>> some they are...yes..but not all...hatch and lay surely they are...what
>>>> does it change i have to multiplicate instead of sum?
>>>>
>>>
>>>  By "perfectly multicollinear" I don't mean that they are strongly
>>> collinear (which ecologists often worry about, correctly, but sometimes
>>> more than they need to) but rather "perfectly".
>>>
>>>  For example, suppose you ran a 2x2 factorial design (e.g. effects of
>>> temperature and light) but ended up with a missing "corner" (e.g. no
>>> samples in the high-light/high-temperature combination).  You would then
>>> be unable to estimate an interaction term, because you would be trying
>>> to estimate 4 parameters (intercept/grand mean, light effect,
>>> temperature effect, interaction) from only three independent sets of
>>> data.  This is the same idea: some of your predictors probably line up
>>> *perfectly* with combinations of other predictors.
>>>
>>>>
>>>> the idea of using glmer() or lmer() is because i had to deal with random
>>>> factor (1|territory)...glm can't handle this? i don't think...
>>>>
>>>
>>>  You're correct that you will need to get back to glmer() eventually,
>>> but I wanted you try out glm() because the presence of NAs in your
>>> coefficient vector will confirm that collinearity is the problem, not
>>> some other issue with glmer ...
>>>
>>>
>>>>>
>>>>> How many observations are left after na.omit(fledge)?
>>>>>
>>>>>
>>>>
>>>> sorry..i don't understand...
>>>>
>>>
>>>  When you run the analysis, R will drop rows from your data set that
>>> have NAs in any of the predictors.  It looks like you have a total of
>>> 152 observations, but I wonder how many there are with complete records.
>>>  nrow(na.omit(fledge)) will tell you this.
>>>
>>>
>>>>>
>>>>>  What is the difference between your 'S1' and 'S2' temperature
>>>>> records?
>>>>>
>>>>
>>>> those are temperature recorded in different time....S1 is during egg
>>>> laying and incubation and S2 is during hatching and rearing of the chicks
>>>>
>>>
>>>  Can we please see the results of summary(fledge)?
>>>
>>>  It would be good if you were willing to post your whole data set
>>> somewhere for download (or at a pinch e-mail it to me).
>>>
>>>  Ben Bolker
>>>
>>>
>>>>
>>>> thank you
>>>> alessio
>>>>
>>>>
>>>> Ben Bolker ha scritto:
>>>>
>>>>>
>>>>> Alessio Unisi <franceschi6 <at> unisi.it> writes:
>>>>>
>>>>>
>>>>>>
>>>>>> Dear R-users,
>>>>>> i need help for this topic!
>>>>>>
>>>>>> I'm trying to determine if the reproductive success (0=fail,
>>>>>> 1=success) of a species of bird is related to a list of covariates.
>>>>>>
>>>>>> These are the covariates:
>>>>>> §    elev: elevation of nest (meters)
>>>>>> §    seadist: distance from the sea (meters)
>>>>>> §    meanterranova: records of temperature
>>>>>> §    minpengS1: records of temperature
>>>>>> §    wchillpengS1: records of temperature
>>>>>> §    minpengS2: records of temperature
>>>>>> §    wchillpengS2: records of temperature
>>>>>> §    nnd: nearest neighbour distance
>>>>>> §    npd: nearest penguin distance
>>>>>> §    eggs: numbers of eggs
>>>>>> §    lay: laying date (julian calendar)
>>>>>> §    hatch: hatching date (julian calendar)
>>>>>> I have some NAs in the data.
>>>>>>
>>>>>> I want to test the model with all the variable then i want to remove
>>>>>> some, but the ideal model:
>>>>>> GLM.1 <-lmer(fledgesucc ~ +lay +hatch +elev +seadist +nnd +npd
>>>>>> +meanterranova +minpengS1 +minpengS2 +wchillpengS1 +wchillpengS2
>>>>>> +(1|territory), family=binomial(logit), data=fledge)
>>>>>>
>>>>>> doesn't work because of these errors:
>>>>>> 'Warning message: In mer_finalize(ans) : gr cannot be computed at
>>>>>> initial par (65)'.
>>>>>> "matrix is not symmetric [1,2]"
>>>>>>
>>>>>> If i delete one or more of the T records (i.e. minpengS2
>>>>>> +wchillpengS2) the model works...below and example:
>>>>>>
>>>>>>  GLM.16 <-lmer(fledgesucc ~ lay +hatch +elev +seadist +nnd +npd
>>>>>> +meanterranova +minpengS1 +(1|territory), family=binomial(logit),
>>>>>> data=fledge)
>>>>>>
>>>>>>  > summary(GLM.16)
>>>>>> Generalized linear mixed model fit by the Laplace approximation
>>>>>> Formula: fledgesucc ~ lay + hatch + elev + seadist + nnd + npd +
>>>>>> meanterranova +      minpengS1 + (1 | territory)
>>>>>>   Data: fledge
>>>>>>  AIC   BIC logLik deviance
>>>>>>  174 204.2    -77      154
>>>>>> Random effects:
>>>>>>  Groups    Name        Variance Std.Dev.
>>>>>>  territory (Intercept) 0.54308  0.73694
>>>>>> Number of obs: 152, groups: territory, 96
>>>>>>
>>>>>>
>>>>>
>>>>>  I can't prove it, but I strongly suspect that some of your
>>>>> coefficients are perfectly multicollinear.  Try running your
>>>>> model as a regular GLM:
>>>>>
>>>>> g1 <- glm(fledgesucc ~ +lay +hatch +elev +seadist +nnd +npd
>>>>>  +meanterranova +minpengS1 +minpengS2 +wchillpengS1 +wchillpengS2
>>>>> and see if some of the coefficients are NA.
>>>>>
>>>>> coef(g1)
>>>>>
>>>>> lm() and glm() can handle this sort of "rank-deficient" or
>>>>> multicollinear input, (g)lmer can't, as of now.
>>>>>
>>>>> I suspect that you may be overfitting your model anyway:
>>>>> you should aim for not more than 10 observations per parameter
>>>>> (in your case, since all your predictors appear to be continuous,
>>>>> How many observations are left after na.omit(fledge)?
>>>>>
>>>>>  What is the difference between your 'S1' and 'S2' temperature
>>>>> records?
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>> -----
>>>>> Nessun virus nel messaggio.
>>>>> Controllato da AVG - www.avg.com
>>>>> Versione: 2012.0.1873 / Database dei virus: 2101/4636 -  Data di
>>>>> rilascio: 24/11/2011
>>>>>
>>>
>>>
>>>
>>> -----
>>> Nessun virus nel messaggio.
>>> Controllato da AVG - www.avg.com
>>> Versione: 2012.0.1873 / Database dei virus: 2101/4637 -  Data di rilascio:
>>> 24/11/2011
>>>
>>>
>>>
>>
>> --
>> Alessio Franceschi
>> Phd student
>> Dipartimento di Scienze Ambientali "G. Sarfatti"
>> Università di Siena
>> Via P.A. Mattioli, 8  - 53100 Siena (Italy)
>> Cell. +393384431806
>> email: franceschi6 at unisi.it; alfranceschi at alice.it
>>
>>
>>> GLM.1 <- glm(fledgesucc ~ eggs + elev + hatch + lay + meanterranova +
>> +   minpengS1 + minpengS2 + nnd + npd + seadist + wchillpengS1 +
>> wchillpengS2,
>> +   family=binomial(logit), data=flege)
>>
>>> summary(GLM.1)
>>
>> Call:
>> glm(formula = fledgesucc ~ eggs + elev + hatch + lay + meanterranova +
>>    minpengS1 + minpengS2 + nnd + npd + seadist + wchillpengS1 +
>>    wchillpengS2, family = binomial(logit), data = flege)
>>
>> Deviance Residuals:
>>    Min       1Q   Median       3Q      Max
>> -0.9636  -0.7540  -0.6421  -0.4146   2.2963
>>
>> Coefficients: (3 not defined because of singularities)
>>               Estimate Std. Error z value Pr(>|z|)
>> (Intercept)   16.634984  13.927833   1.194    0.232
>> eggs          -0.421601   0.575456  -0.733    0.464
>> elev           0.009926   0.024849   0.399    0.690
>> hatch         -0.023411   0.245780  -0.095    0.924
>> lay           -0.013704   0.245635  -0.056    0.956
>> meanterranova  1.438584   1.377566   1.044    0.296
>> minpengS1     -0.427313   0.398560  -1.072    0.284
>> minpengS2            NA         NA      NA       NA
>> nnd           -0.034370   0.023913  -1.437    0.151
>> npd            0.003597   0.005014   0.717    0.473
>> seadist       -0.003850   0.004092  -0.941    0.347
>> wchillpengS1         NA         NA      NA       NA
>> wchillpengS2         NA         NA      NA       NA
>>
>> (Dispersion parameter for binomial family taken to be 1)
>>
>>    Null deviance: 159.06  on 151  degrees of freedom
>> Residual deviance: 153.85  on 142  degrees of freedom
>>  (352 observations deleted due to missingness)
>> AIC: 173.85
>>
>> Number of Fisher Scoring iterations: 5
>>
>>
>>
>