[R] Formula in a model
Gerrit Eichner
Gerrit.Eichner at math.uni-giessen.de
Thu Sep 12 09:53:30 CEST 2013
Hello, Paulito,
my comments are inline below:
> Thanks for the explanation. Let me give a specific example. Assume Temp
> (column 4) is the output and the rest of the columns are the input
> training features. Note that I only use the air quality data for
> illustration purposes. The input->output mapping may not make sense in
> the real interpretation of this data.
>
> library(e1071)
>
> data(airquality)
> mytable=airquality
>
> colnames(mytable)=c('a','b','c','d','e','f')
>
> modelSVM1=svm(mytable[,6] ~ .,data=mytable)
> modelSVM2=svm(mytable[,-6],mytable[,6])
> modelSVM3=svm(f ~ ., data=mytable)
>
> predSVM1=predict(modelSVM1,newdata=mytable)
> predSVM2=predict(modelSVM2,newdata=mytable[,-6])
> predSVM3=predict(modelSVM3,newdata=mytable)
>
> Results of predSVM2 are similar to those of predSVM3 but different from predSVM1.
Well, because already modelSVM1 is different from the other two. This is
due to how the "." on the rhs of a formula is interpreted. From the help
page of formula:
"There are two special interpretations of . in a formula. The
usual one is in the context of a data argument of model fitting
functions and means 'all columns not otherwise in the formula':
see terms.formula. In the context of update.formula, only, it
means 'what was previously in this part of the formula'."
The first interpretation applies to your situation. With the formula for
your modelSVM1 the function model.matrix() (which is called inside the
formula version of svm()) creates a model matrix after looking for a
column "mytable[,6]" in the data argument. And since there is no column
with that name, it takes all columns of mytable (including the 6th, i.e.,
the one named "f"). See what model.matrix() does in that case:
> head( model.matrix(mytable[,6] ~ .,data=mytable), 3)
  (Intercept)  a   b    c  d e f
1           1 41 190  7.4 67 5 1
2           1 36 118  8.0 72 5 2
3           1 12 149 12.6 74 5 3
In the case of modelSVM3 model.matrix() does find column "f" in the data
argument, and hence omits this column in forming the terms of the rhs of
the formula:
> head( model.matrix( f ~ .,data=mytable), 3)
  (Intercept)  a   b    c  d e
1           1 41 190  7.4 67 5
2           1 36 118  8.0 72 5
3           1 12 149 12.6 74 5
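The same difference can be seen without building the whole model matrix, by asking terms() how it expands the "." for each formula. This is only a small sketch in base R; the names labels_named and labels_indexed are made up for illustration.

```r
# Rebuild the renamed data frame from the original post.
data(airquality)
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

# "f" is named on the lhs, so "." expands to the remaining columns only.
labels_named <- attr(terms(f ~ ., data = mytable), "term.labels")

# "mytable[, 6]" is not a column name of the data argument, so "." expands
# to all columns, including "f".
labels_indexed <- attr(terms(mytable[, 6] ~ ., data = mytable), "term.labels")

print(labels_named)
print(labels_indexed)

# The column that slipped into the rhs of the indexed formula:
print(setdiff(labels_indexed, labels_named))
```

The last line should report exactly the column that was meant to be the response.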
The call to svm() for modelSVM2 uses the (non-formula) default method and
does not need to call model.matrix() because (so to speak) it expects that
the user has already done that by supplying the response to its argument y
and the adequately formed data matrix to its argument x.
> Question: Which is the correct formulation?
The second and the third (for a sensible purpose), unless you want to
experiment with svm() to see what happens if one does something rather
nonsensical.
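Why the first formulation is nonsensical becomes obvious with a simpler model. The following sketch uses lm() as a stand-in for svm() (so it is not what e1071 computes, only the same formula handling): with the indexed response, "f" is also a predictor and the fit just reproduces the response from itself.

```r
data(airquality)
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

# Response given by index: "f" stays among the predictors.
fit_indexed <- lm(mytable[, 6] ~ ., data = mytable)

# Response given by name: "f" is excluded from the predictors.
fit_named <- lm(f ~ ., data = mytable)

print(names(coef(fit_indexed)))
print(names(coef(fit_named)))

# Largest absolute residual of each fit; the indexed fit is (numerically)
# perfect because the response is predicted from a copy of itself.
print(max(abs(residuals(fit_indexed))))
print(max(abs(residuals(fit_named))))
```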
> Why R doesn't detect error/discrepancy in formulation?
Because R, or in this case rather the concept of a formula and the
function model.matrix(), are not designed to replace the user who knows
what s/he is doing after having read the documentation. ;)
> If I use the same formulation with rpart using the same data:
>
> library(rpart)
>
> data(airquality)
> mytable=airquality
>
> colnames(mytable)=c('a','b','c','d','e','f')
>
> modelRP1=rpart(mytable[,6]~.,data=mytable,method='anova') # this works
> modelRP3=rpart(f ~ ., data=mytable,method='anova') # this works
>
> predRP1=predict(modelRP1,newdata=mytable)
> predRP3=predict(modelRP3,newdata=mytable)
>
>
> The results between predRP1 and predRP3 are different while the statements:
>
> predRP2=predict(modelRP2,newdata=mytable[,-6])
> modelRP2=rpart(mytable[,-6],mytable[,6],method='anova')
>
> have errors.
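The difference between predRP1 and predRP3 is presumably due to the same
reasons as described above. The modelRP2 call fails because rpart(), unlike
svm(), has no (x, y) interface: its first argument has to be a formula.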
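A possible way to check this for rpart, and to fit a tree when the data come as separate x and y, is sketched below. rpart() takes a formula as its first argument, so x and y are first put back into one data frame; the name xy is made up for this sketch.

```r
library(rpart)

data(airquality)
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

modelRP1 <- rpart(mytable[, 6] ~ ., data = mytable, method = "anova")
modelRP3 <- rpart(f ~ ., data = mytable, method = "anova")

# As with svm(), the indexed response leaves "f" among the predictors.
print(attr(modelRP1$terms, "term.labels"))
print(attr(modelRP3$terms, "term.labels"))

# Separate x and y: combine them into one data frame and use a formula.
x <- mytable[, -6]
y <- mytable[, 6]
xy <- data.frame(x, f = y)
modelRP2 <- rpart(f ~ ., data = xy, method = "anova")

# Prediction only needs the predictor columns.
predRP2 <- predict(modelRP2, newdata = x)
print(length(predRP2))
```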
Remark: It is generally - for various reasons - recommended to use "<-" as
the assignment operator, not "=". (And I like to recommend using blanks
to increase readability of code.)
[... snip ...]
I hope the fog has lifted -- Gerrit