[R] confused on model.frame evaluation

Sat May 1 00:38:50 CEST 2010

On Apr 30, 2010, at 4:57 PM, Erik Iverson wrote:

> <snip>
>> I'm sure it's not a bug, but could someone point to a thread or offer some gentle advice on what's happening?  I think it's related to:
>> test <- data.frame(name1 = 1:5, name2 = 6:10, test = 11:15)
>> eval(expression(test[c("name1", "name2")]))
>> eval(expression(interco[c("name1", "test")]))
> 
> scratch that last one, obviously a typo was causing my confusion there!  The model.frame stuff remains a mystery to me though...

Hi Erik,

It's late on a Friday, it's grey and raining here in Minneapolis and I am short on caffeine, but, that being said, consider the following :-)

> working
  france manual famanual total working  no
1      1      1        1   107      85  22
2      1      1        0    65      44  21
3      1      0        1    66      24  42
4      1      0        0   171      17 154
5      0      1        1    87      24  63
6      0      1        0    65      22  43
7      0      0        1    85       1  84
8      0      0        0   148       6 142

> as.matrix(working[c("working", "no")])
     working  no
[1,]      85  22
[2,]      44  21
[3,]      24  42
[4,]      17 154
[5,]      24  63
[6,]      22  43
[7,]       1  84
[8,]       6 142

> with(working, as.matrix(working[c("working", "no")]))
     [,1]
[1,]   NA
[2,]   NA

When model.frame() is called, the variables in the formula are looked up first within the scope of the data frame given as the 'data' argument, and only then in the environment of the formula.
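The lookup order can be seen in a small sketch that is separate from the 'working' data. The names x, z and d here are hypothetical and chosen only for illustration: x exists both in the workspace and in the data frame, while z exists only in the workspace.

```r
# Hypothetical objects, for illustration only
x <- 1:3                      # lives in the calling environment
z <- 7:9                      # lives only in the calling environment
d <- data.frame(x = 4:6)      # has its own column named 'x'

# 'x' is found in 'd' first, so the workspace copy is ignored;
# 'z' is not a column of 'd', so it is taken from the formula's environment
mf <- model.frame(~ x + z, data = d)

print(mf)
```

The x column of the resulting model frame comes from d, not from the workspace, which is the same masking that affects the name 'working' in this thread.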

Thus, in the second case, I am asking for the as.matrix(...) call to be evaluated within the scope of the 'working' data frame, where the name 'working' resolves to the numeric column of that name rather than to the data frame itself. Indexing that unnamed vector by c("working", "no") finds neither name, so the result is a one-column matrix with only two rows, one NA for each name that was asked for and not found. That differs from the number of rows in 'working', so you get the error as soon as the 'france' column is evaluated in the formula to create the model frame:

Error in model.frame.default(formula = as.matrix(working[c("working",  :
 variable lengths differ (found for 'france')

2 rows in the response matrix versus 8 rows for 'france'...

In effect, you are asking for:

> as.matrix(working$working[c("working", "no")])
     [,1]
[1,]   NA
[2,]   NA
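The masking described above can be checked directly. This is a minimal sketch that rebuilds the data frame from the printout earlier in this message; the helper names bad and msg are introduced here only for illustration.

```r
# Rebuild the 'working' data frame from the printed values
working <- data.frame(
  france   = c(1, 1, 1, 1, 0, 0, 0, 0),
  manual   = c(1, 1, 0, 0, 1, 1, 0, 0),
  famanual = c(1, 0, 1, 0, 1, 0, 1, 0),
  total    = c(107, 65, 66, 171, 87, 65, 85, 148),
  working  = c(85, 44, 24, 17, 24, 22, 1, 6),
  no       = c(22, 21, 42, 154, 63, 43, 84, 142)
)

# At top level the name refers to the data frame ...
is.data.frame(working)

# ... but inside with() it refers to the column of the same name
with(working, is.data.frame(working))

# Indexing that column by names it does not have yields two NAs
bad <- with(working, as.matrix(working[c("working", "no")]))
dim(bad)

# The same expression as a formula response makes model.frame() fail;
# the error message is captured instead of stopping the script
msg <- tryCatch(
  {
    model.frame(as.matrix(working[c("working", "no")]) ~ france,
                data = working)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```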

Now, try this:

> with(working, matrix(c(working, no), ncol = 2))
     [,1] [,2]
[1,]   85   22
[2,]   44   21
[3,]   24   42
[4,]   17  154
[5,]   24   63
[6,]   22   43
[7,]    1   84
[8,]    6  142

and then:

> summary(glm(matrix(c(working, no), ncol = 2) ~ france + manual + famanual, data = working, family = binomial))

Call:
glm(formula = matrix(c(working, no), ncol = 2) ~ france + manual + 
    famanual, family = binomial, data = working)

Deviance Residuals: 
       1         2         3         4         5         6         7  
 0.09316  -0.14108   2.38028  -1.91838  -1.48196   1.84993  -1.61864  
       8  
 1.16747  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.6902     0.2547 -14.489  < 2e-16 ***
france        1.9474     0.2162   9.008  < 2e-16 ***
manual        2.5199     0.2168  11.625  < 2e-16 ***
famanual      0.5522     0.2017   2.738  0.00618 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 308.329  on 7  degrees of freedom
Residual deviance:  18.976  on 4  degrees of freedom
AIC: 60.162

Number of Fisher Scoring iterations: 4
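Two other ways of writing the same model may be worth comparing, sketched here under the assumption that the data are as printed above. The first uses cbind(), the usual idiom for a two-column binomial response. The second copies the data frame to a name that is not also a column (dat is a hypothetical name), so that the original as.matrix(...) indexing resolves to the data frame again.

```r
# Rebuild the 'working' data frame from the printed values
working <- data.frame(
  france   = c(1, 1, 1, 1, 0, 0, 0, 0),
  manual   = c(1, 1, 0, 0, 1, 1, 0, 0),
  famanual = c(1, 0, 1, 0, 1, 0, 1, 0),
  total    = c(107, 65, 66, 171, 87, 65, 85, 148),
  working  = c(85, 44, 24, 17, 24, 22, 1, 6),
  no       = c(22, 21, 42, 154, 63, 43, 84, 142)
)

# The response as built in this message
fit_matrix <- glm(matrix(c(working, no), ncol = 2) ~ france + manual + famanual,
                  data = working, family = binomial)

# The usual idiom: cbind(successes, failures), both looked up in 'data'
fit_cbind <- glm(cbind(working, no) ~ france + manual + famanual,
                 data = working, family = binomial)

# Hypothetical copy under a name that does not clash with a column;
# 'dat' is not found in the data, so it is taken from the workspace
dat <- working
fit_dat <- glm(as.matrix(dat[c("working", "no")]) ~ france + manual + famanual,
               data = dat, family = binomial)

# All three fits should give the same coefficients
all.equal(coef(fit_matrix), coef(fit_cbind))
all.equal(coef(fit_matrix), coef(fit_dat))
```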

Does that help to clarify?

Regards,

Marc Schwartz