[R-sig-ME] [R] errors with lme4
Ben Bolker
bbolker at gmail.com
Sat Nov 26 22:50:49 CET 2011
Investigating in more detail:
library(gdata)
fledge <- read.xls("Data_Ben.xls")
GLM.1 <- glm(fledgesucc ~ eggs + elev + hatch + lay + meanterranova +
minpengS1 + minpengS2 + nnd +
npd + seadist + wchillpengS1 + wchillpengS2,
family=binomial, data=fledge)
## one definition of collinearity is that the cross product
## of the design matrix, X, with itself, is not of full rank
## We can analyze this by looking at its eigenvalues and eigenvectors
X <- model.matrix(GLM.1)
ee <- eigen(t(X) %*% X)
ee$values ## last three eigenvalues are effectively zero ...
(badvals <- which(ee$values<1e-8))
## see the pattern of correlation in the eigenvectors
ev <- round(ee$vectors[,badvals],2)
## which variables are these?
badnames <- lapply(as.data.frame(ev),
function(x) colnames(X)[x!=0])
## pick out the names, not including the intercept
badcols <- setdiff(Reduce(union,badnames),"(Intercept)")
## take a look at these columns ...
ff <- na.omit(fledge)
summary(ff[,badcols])
## identify strong correlations (we probably could have
## done this first)
cor(ff[,badcols])>0.97
## look more carefully
with(ff,table(minpengS2,wchillpengS2))
with(ff,table(meanterranova,minpengS1))
with(ff,table(wchillpengS1,wchillpengS2))
## it looks like there are only three values for each of
## these variables -- maybe there are more in the full data
## set -- but in any case they're extremely strongly correlated
## with each other
On Sat, Nov 26, 2011 at 4:35 PM, Ben Bolker <bbolker at gmail.com> wrote:
> * It looks like you have 156 observations **after** dropping the 352
> observations
> with missing predictors ... The 3 covariates are indeed (I very
> strongly suspect) due
> to collinearity.
>
> I'm trying to work out a clever piece of linear algebra that will
> tell you exactly which
> columns of the design matrix (i.e. predictors) are multicollinear, but
> haven't figured it
> out yet. More sensibly, you should just stare at your data/use common
> sense to figure
> out which ones are completely confounded (because there definitely are
> some). If two (or more)
> effects are completely confounded (another way of saying "perfectly
> multicollinear"), then it's
> simply impossible to get well-defined estimates of all of their
> effects in the same model.
> You have to decide which one(s) to take out. (If you like you can run
> separate models
> with each term, but you have to recognize that you will have no way to
> choose among
> them on the basis of the data -- they should all fit equally well.)
>
> On Fri, Nov 25, 2011 at 4:19 AM, Alessio Unisi <franceschi6 at unisi.it> wrote:
>> ...I always try not to work in "a fairly dangerous way": ))
>>
>> Thanks again for help and advices.
>>
>> I run the the GLM without random effect as you suggest, attached the
>> results.
>> Briefly ...3 covariates completely NA...352 observations deleted due to
>> missingness...this is due to collinearity as you suggested?
>> In which way could i solve this problem?
>>
>> i attached a part of the dataset...sorry but can't send the original set of
>> data...sorry..
>>
>> kind regards
>> alessio franceschi
>>
>> Ben Bolker ha scritto:
>>>
>>> [cc'ing to r-sig-mixed-models list]
>>>
>>> On 11-11-24 03:25 PM, Alessio Unisi wrote:
>>>
>>>>
>>>> Hi Ben,
>>>> thanks for answer!
>>>>
>>>> sorry but i'm a new R-user and i'm not so skilled...also in statistic! :
>>>> ))
>>>> just few answer and question to yours...
>>>>
>>>
>>> Knowing *neither* R *nor* statistics can be a fairly dangerous
>>> combination. If you ask politely, people on the R help lists will often
>>> help with statistical questions, but they are technically off-topic.
>>> There are other places (such as http://stats.stackexchange.com/ ) for
>>> asking statistics questions ... and it would really be very best to see
>>> if you can get some local help (classes or helpful colleagues/fellow
>>> students/professors/consultants).
>>>
>>>
>>>>>
>>>>> I can't prove it, but I strongly suspect that some of your
>>>>> coefficients are perfectly multicollinear.
>>>>
>>>> some they are...yes..but not all...hatch and lay surely they are...what
>>>> does it change i have to multiplicate instead of sum?
>>>>
>>>
>>> By "perfectly multicollinear" I don't mean that they are strongly
>>> collinear (which ecologists often worry about, correctly, but sometimes
>>> more than they need to) but rather "perfectly".
>>>
>>> For example, suppose you ran a 2x2 factorial design (e.g. effects of
>>> temperature and light) but ended up with a missing "corner" (e.g. no
>>> samples in the high-light/high-temperature combination). You would then
>>> be unable to estimate an interaction term, because you would be trying
>>> to estimate 4 parameters (intercept/grand mean, light effect,
>>> temperature effect, interaction) from only three independent sets of
>>> data. This is the same idea: some of your predictors probably line up
>>> *perfectly* with combinations of other predictors.
>>>
>>>>
>>>> the idea of using glmer() or lmer() is because i had to deal with random
>>>> factor (1|territory)...glm can't handle this? i don't think...
>>>>
>>>
>>> You're correct that you will need to get back to glmer() eventually,
>>> but I wanted you try out glm() because the presence of NAs in your
>>> coefficient vector will confirm that collinearity is the problem, not
>>> some other issue with glmer ...
>>>
>>>
>>>>>
>>>>> How many observations are left after na.omit(fledge)?
>>>>>
>>>>>
>>>>
>>>> sorry..i don't understand...
>>>>
>>>
>>> When you run the analysis, R will drop rows from your data set that
>>> have NAs in any of the predictors. It looks like you have a total of
>>> 152 observations, but I wonder how many there are with complete records.
>>> nrow(na.omit(fledge)) will tell you this.
>>>
>>>
>>>>>
>>>>> What is the difference between your 'S1' and 'S2' temperature
>>>>> records?
>>>>>
>>>>
>>>> those are temperature recorded in different time....S1 is during egg
>>>> laying and incubation and S2 is during hatching and rearing of the chicks
>>>>
>>>
>>> Can we please see the results of summary(fledge)?
>>>
>>> It would be good if you were willing to post your whole data set
>>> somewhere for download (or at a pinch e-mail it to me).
>>>
>>> Ben Bolker
>>>
>>>
>>>>
>>>> thank you
>>>> alessio
>>>>
>>>>
>>>> Ben Bolker ha scritto:
>>>>
>>>>>
>>>>> Alessio Unisi <franceschi6 <at> unisi.it> writes:
>>>>>
>>>>>
>>>>>>
>>>>>> Dear R-users,
>>>>>> i need help for this topic!
>>>>>>
>>>>>> I'm trying to determine if the reproductive success (0=fail,
>>>>>> 1=success) of a species of bird is related to a list of covariates.
>>>>>>
>>>>>> These are the covariates:
>>>>>> § elev: elevation of nest (meters)
>>>>>> § seadist: distance from the sea (meters)
>>>>>> § meanterranova: records of temperature
>>>>>> § minpengS1: records of temperature
>>>>>> § wchillpengS1: records of temperature
>>>>>> § minpengS2: records of temperature
>>>>>> § wchillpengS2: records of temperature
>>>>>> § nnd: nearest neighbour distance
>>>>>> § npd: nearest penguin distance
>>>>>> § eggs: numbers of eggs
>>>>>> § lay: laying date (julian calendar)
>>>>>> § hatch: hatching date (julian calendar)
>>>>>> I have some NAs in the data.
>>>>>>
>>>>>> I want to test the model with all the variable then i want to remove
>>>>>> some, but the ideal model:
>>>>>> GLM.1 <-lmer(fledgesucc ~ +lay +hatch +elev +seadist +nnd +npd
>>>>>> +meanterranova +minpengS1 +minpengS2 +wchillpengS1 +wchillpengS2
>>>>>> +(1|territory), family=binomial(logit), data=fledge)
>>>>>>
>>>>>> doesn't work because of these errors:
>>>>>> 'Warning message: In mer_finalize(ans) : gr cannot be computed at
>>>>>> initial par (65)'.
>>>>>> "matrix is not symmetric [1,2]"
>>>>>>
>>>>>> If i delete one or more of the T records (i.e. minpengS2
>>>>>> +wchillpengS2) the model works...below and example:
>>>>>>
>>>>>> GLM.16 <-lmer(fledgesucc ~ lay +hatch +elev +seadist +nnd +npd
>>>>>> +meanterranova +minpengS1 +(1|territory), family=binomial(logit),
>>>>>> data=fledge)
>>>>>>
>>>>>> > summary(GLM.16)
>>>>>> Generalized linear mixed model fit by the Laplace approximation
>>>>>> Formula: fledgesucc ~ lay + hatch + elev + seadist + nnd + npd +
>>>>>> meanterranova + minpengS1 + (1 | territory)
>>>>>> Data: fledge
>>>>>> AIC BIC logLik deviance
>>>>>> 174 204.2 -77 154
>>>>>> Random effects:
>>>>>> Groups Name Variance Std.Dev.
>>>>>> territory (Intercept) 0.54308 0.73694
>>>>>> Number of obs: 152, groups: territory, 96
>>>>>>
>>>>>>
>>>>>
>>>>> I can't prove it, but I strongly suspect that some of your
>>>>> coefficients are perfectly multicollinear. Try running your
>>>>> model as a regular GLM:
>>>>>
>>>>> g1 <- glm(fledgesucc ~ +lay +hatch +elev +seadist +nnd +npd
>>>>> +meanterranova +minpengS1 +minpengS2 +wchillpengS1 +wchillpengS2
>>>>> and see if some of the coefficients are NA.
>>>>>
>>>>> coef(g1)
>>>>>
>>>>> lm() and glm() can handle this sort of "rank-deficient" or
>>>>> multicollinear input, (g)lmer can't, as of now.
>>>>>
>>>>> I suspect that you may be overfitting your model anyway:
>>>>> you should aim for not more than 10 observations per parameter
>>>>> (in your case, since all your predictors appear to be continuous,
>>>>> How many observations are left after na.omit(fledge)?
>>>>>
>>>>> What is the difference between your 'S1' and 'S2' temperature
>>>>> records?
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>> -----
>>>>> Nessun virus nel messaggio.
>>>>> Controllato da AVG - www.avg.com
>>>>> Versione: 2012.0.1873 / Database dei virus: 2101/4636 - Data di
>>>>> rilascio: 24/11/2011
>>>>>
>>>
>>>
>>>
>>> -----
>>> Nessun virus nel messaggio.
>>> Controllato da AVG - www.avg.com
>>> Versione: 2012.0.1873 / Database dei virus: 2101/4637 - Data di rilascio:
>>> 24/11/2011
>>>
>>>
>>>
>>
>> --
>> Alessio Franceschi
>> Phd student
>> Dipartimento di Scienze Ambientali "G. Sarfatti"
>> Università di Siena
>> Via P.A. Mattioli, 8 - 53100 Siena (Italy)
>> Cell. +393384431806
>> email: franceschi6 at unisi.it; alfranceschi at alice.it
>>
>>
>>> GLM.1 <- glm(fledgesucc ~ eggs + elev + hatch + lay + meanterranova +
>> + minpengS1 + minpengS2 + nnd + npd + seadist + wchillpengS1 +
>> wchillpengS2,
>> + family=binomial(logit), data=flege)
>>
>>> summary(GLM.1)
>>
>> Call:
>> glm(formula = fledgesucc ~ eggs + elev + hatch + lay + meanterranova +
>> minpengS1 + minpengS2 + nnd + npd + seadist + wchillpengS1 +
>> wchillpengS2, family = binomial(logit), data = flege)
>>
>> Deviance Residuals:
>> Min 1Q Median 3Q Max
>> -0.9636 -0.7540 -0.6421 -0.4146 2.2963
>>
>> Coefficients: (3 not defined because of singularities)
>> Estimate Std. Error z value Pr(>|z|)
>> (Intercept) 16.634984 13.927833 1.194 0.232
>> eggs -0.421601 0.575456 -0.733 0.464
>> elev 0.009926 0.024849 0.399 0.690
>> hatch -0.023411 0.245780 -0.095 0.924
>> lay -0.013704 0.245635 -0.056 0.956
>> meanterranova 1.438584 1.377566 1.044 0.296
>> minpengS1 -0.427313 0.398560 -1.072 0.284
>> minpengS2 NA NA NA NA
>> nnd -0.034370 0.023913 -1.437 0.151
>> npd 0.003597 0.005014 0.717 0.473
>> seadist -0.003850 0.004092 -0.941 0.347
>> wchillpengS1 NA NA NA NA
>> wchillpengS2 NA NA NA NA
>>
>> (Dispersion parameter for binomial family taken to be 1)
>>
>> Null deviance: 159.06 on 151 degrees of freedom
>> Residual deviance: 153.85 on 142 degrees of freedom
>> (352 observations deleted due to missingness)
>> AIC: 173.85
>>
>> Number of Fisher Scoring iterations: 5
>>
>>
>>
>
More information about the R-sig-mixed-models
mailing list