[R] model.frame: how does one use it?

Sat Jun 16 01:27:10 CEST 2007

On 6/15/07, Philipp Benner <pbenner at uos.de> wrote:
>
> Thanks for your explanation!
>
> > With this in mind, either of the following might do what you want:
> >
> > badFunction <- function(mydata, myformula) {
> >    mydata$myweight <- abs(rnorm(nrow(mydata)))
> >    hyp <-
> >        rpart(myformula,
> >              data=mydata,
> >              weights=myweight,
> >              method="class")
> >    prev <- hyp
> > }
> >
> >
> > badFunction <- function(mydata, myformula) {
> >    myweight <- abs(rnorm(nrow(mydata)))
> >    environment(myformula) <- environment()
> >    hyp <-
> >        rpart(myformula,
> >              data=mydata,
> >              weights=myweight,
> >              method="class")
> >    prev <- hyp
> > }
>
> OK, this is what I have now:
>
> adaboostBad <- function(formula, data) {
>   ## local definition of the weight vector (won't work: the formula was
>   ## created outside this function, so rpart() does not look for `w' here)
>   w <- abs(rnorm(nrow(data)))
>   rpart(formula, data=data, weights=w)
> }
>
> adaboostGood <- function(formula, data) {
>   ## create weight vector in the data object
>   data$w <- abs(rnorm(nrow(data)))
>   rpart(formula, data=data, weights=w)
> }
>
> adaboostBest <- function(formula, data) {
>   ## associate the current environment (this function's one) with the object `formula'
>   environment(formula) <- environment()
>   w <- abs(rnorm(nrow(data)))
>   rpart(formula, data=data, weights=w)
> }
>

> As far as I understand this non-standard evaluation stuff,
> adaboostGood() and adaboostBest() are the only two possibilities to
> call rpart() with weight vectors. Now suppose that I don't know what
> `data' contains and suppose further that it already contains a
> column called `w'.  adaboostGood() would overwrite that column with
> new data which is then used as weight vector and as training data
> for rpart(). adaboostBest() would just use the wrong data as weight
> vector as it finds data$w before the real weight vector. So, in both
> cases I have to check for `"w" %in% names(data)` and stop if TRUE? Or
> is there a better way?

Well, that depends on what you want to happen when there is a column
called 'w' in data.  I don't see a situation where it makes sense to
use data$w as weights ('w' is just a name you happened to choose inside
adaboostBest), so I would just go with adaboostGood.
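The lookup order you describe can be checked without rpart. Here is a minimal sketch using model.frame() directly; lookupDemo and the toy data frame d are made up for illustration, and I am assuming rpart hands the weights argument to model.frame() in the usual way:

```r
# Hypothetical helper: mimics adaboostBest, but returns the weights that
# model.frame() actually picked up instead of fitting a tree.
lookupDemo <- function(formula, data) {
  environment(formula) <- environment()
  w <- rep(1, nrow(data))                      # the "real" weight vector
  mf <- model.frame(formula, data = data, weights = w)
  model.weights(mf)
}

d <- data.frame(y = c(1, 2, 3), x = c(4, 5, 6))
localResult <- lookupDemo(y ~ x, d)   # no column `w`: the local w is used

d$w <- c(7, 8, 9)
dataResult <- lookupDemo(y ~ x, d)    # column `w` exists: data$w wins
```

Here localResult holds the local weights and dataResult holds the values of d$w, because names are looked up in data first and in the formula's environment second.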

In case you are worried about overwriting the original data, that may
not be happening in the sense you are thinking.  When you say

data$w <- abs(rnorm(nrow(data)))

inside adaboostGood, that modifies a local copy of the data argument,
not the original (R argument semantics are call by value, not call by
reference).  You do lose data$w in the local copy inside your function,
but why would you care if you are not using it anyway?
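A small sketch to convince yourself of the copy semantics; addWeights and the data frame original are names I made up for this illustration:

```r
# Hypothetical helper: overwrites column `w` in its own copy of `data`.
addWeights <- function(data) {
  data$w <- abs(rnorm(nrow(data)))
  data
}

original <- data.frame(x = 1:5, w = c(10, 20, 30, 40, 50))
modified <- addWeights(original)

# The caller's data frame is untouched; only the returned copy differs.
unchanged <- identical(original$w, c(10, 20, 30, 40, 50))
```

After running this, unchanged is TRUE: the assignment inside the function never reaches the data frame in the calling environment.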

Of course, if your formula contains a reference to 'w' then you will
get wrong results, so checking for a unique name is always safer.
If you also use an obfuscated name like '.__myWeights' instead of 'w',
the check will almost never trigger.
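Putting the two suggestions together might look like the sketch below. The function name adaboostChecked and the use of iris are my own choices for illustration, not something from your code:

```r
library(rpart)

# Hypothetical variant of adaboostGood: stores the weights under an
# obfuscated column name and stops if that name is already in use.
adaboostChecked <- function(formula, data) {
  weightName <- ".__myWeights"
  if (weightName %in% names(data) || weightName %in% all.vars(formula)) {
    stop("the name ", weightName, " is already used in data or formula")
  }
  data[[weightName]] <- abs(rnorm(nrow(data)))
  # `weights` is looked up in `data` first, so the new column is found.
  rpart(formula, data = data, weights = .__myWeights, method = "class")
}

fit <- adaboostChecked(Species ~ Sepal.Length + Petal.Length, data = iris)
```

One caveat: a formula such as y ~ . expands to every column of data, so it would pick up the added weight column as a predictor. The sketch does not guard against that, which is why the example spells out its predictors.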

-Deepayan